# Python: First Steps
---

**Structure**

0. Why Python?
1. Install Python
2. Install and Import Modules
3. Numpy
4. Pandas


---
---

## 0. Why Python?

**Pros**: 

* It is a General Purpose Language.    
* Can be used within other languages.
* **Open Source.**
* A lot of Machine/Deep Learning Tools.

**Cons**: 
* Somewhat difficult to set up properly.
* It is possible to mess up your computer.
* Few out of the box econometric tools (IV Regression with proper Tests will be hard to find).


***When is Python a "must"?***
> For state-of-the-art Deep Learning Models (e.g. in Natural Language Processing (NLP) or Computer Vision).



---
---

## 1. Install Python
More: [Beginners Guide](https://wiki.python.org/moin/BeginnersGuide)


0. **Download and Install [Anaconda with Python](https://www.anaconda.com/products/individual).** 
1. Open the Anaconda Navigator and create a virtual environment following the official [Instructions](https://docs.anaconda.com/anaconda/navigator/getting-started/).
2. Open "Anaconda Navigator > Home" and install an IDE (e.g. Spyder, PyCharm or JupyterLab).
3. Open the IDE and you are ready to go.

Online Alternative: Use [Google Colab](https://colab.research.google.com/).


---
---

## 2. Install and Import Modules 


> Modules turn Python into a Data Science language.

* [**Numpy**](https://numpy.org/): provides an array-structure and powerful linear algebra tools.
* [**Pandas**](https://pandas.pydata.org/): provides a dataframe-structure comparable to R-dataframes.
* [**Matplotlib**](https://matplotlib.org/): provides plots similar to the MATLAB plotting functions.
* [**Scipy**](https://www.scipy.org/): builds on numpy and provides statistical functions.
* [**scikit-learn**](https://scikit-learn.org/stable/): basic ML and estimation toolbox.
* [**Statsmodels**](https://www.statsmodels.org/stable/index.html): more "econometric" estimators like 2SLS, GMM and ARIMA
* [**Pytorch**](https://pytorch.org/): powerful deep learning framework.

Before Python modules can be used, they have to be installed in the environment.

In [None]:
# The following line of code installs numpy, pandas and matplotlib
!conda install numpy pandas matplotlib -y 
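
Whether the installation succeeded can be checked from within Python. The following is a minimal sketch; the helper name `missing_modules` is made up for this illustration and is not part of any library. It relies on `importlib.util.find_spec` from the standard library, which returns `None` when a top-level module cannot be found.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means that every listed module is available for import.
print(missing_modules(["numpy", "pandas", "matplotlib"]))
```

If the printed list is not empty, the listed modules are probably installed in a different environment than the one the IDE is using.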

---
---
## 3. Numpy

Numpy provides a large linear algebra toolbox.

In [None]:
# import numpy
import numpy as np

### Some Basics

In [None]:
# generate an array
nparray = np.array([1, 2, 3, 21, 12, 4, 2, 5, 6])
print(nparray)

In [None]:
# by default operations are elementwise.
nparray*nparray

In [None]:
# calculate the inner product.
nparray @ nparray
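
The difference between the two operators is easier to see on a small array. The sketch below uses a separate three-element array `a` (an illustrative name) so that the results can be checked by hand.

```python
import numpy as np

a = np.array([1, 2, 3])

elementwise = a * a   # multiplies entry by entry: [1, 4, 9]
inner = a @ a         # sums the entry-wise products: 1 + 4 + 9 = 14

print(elementwise)
print(inner)
# the inner product equals the sum of the elementwise products
print(inner == elementwise.sum())
```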

In [None]:
# There are multiple ways to do things (functions vs. methods)
np.sum(nparray) == nparray.sum()

### Indexing
Indexing starts at 0.

In [None]:
# remind us of the array
print(nparray)

In [None]:
# indexing starts with 0
nparray[3]

In [None]:
# indexing with [a:b] includes values at position "a" to position "b-1"
nparray[0:3]
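
A few more indexing patterns, shown as a small sketch on a separate array `a` (an illustrative name) whose values make the selected positions easy to recognize.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

first = a[0]          # 10, the first element
last = a[-1]          # 50, negative indices count from the end
middle = a[1:4]       # [20, 30, 40], positions 1 to 3
every_other = a[::2]  # [10, 30, 50], a third value in the slice sets the step size

print(first, last, middle, every_other)
```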

#### A Caveat
A simple (one-dimensional) array in numpy is neither a row nor a column vector.

In [None]:
# remind us of the array
print(nparray)

In [None]:
# for a 1-D array .T has no effect, so both sides are the inner product
nparray.T @ nparray == nparray @ nparray.T  # for a column vector the right hand side would be a matrix

In [None]:
# The shape of the array shows that a second dimension is missing
nparray.shape

In [None]:
# reshape the array to a column vector
vector = nparray.reshape(-1, 1)
vector.shape

In [None]:
# now linear algebra works as expected
vector @ vector.T
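
The shapes involved can be compared side by side. This is a minimal sketch on a three-element array; the names `a` and `col` are chosen for illustration only. It also uses `np.outer`, which computes the outer product directly from 1-D arrays without reshaping.

```python
import numpy as np

a = np.array([1, 2, 3])
col = a.reshape(-1, 1)       # column vector, shape (3, 1)

print(a.shape, a.T.shape)    # transposing a 1-D array changes nothing: (3,) (3,)

inner = col.T @ col          # (1, 3) @ (3, 1) gives a 1x1 matrix
outer = col @ col.T          # (3, 1) @ (1, 3) gives a 3x3 matrix
print(inner.shape, outer.shape)

# np.outer gives the same 3x3 matrix directly from the 1-D array
print(np.array_equal(outer, np.outer(a, a)))
```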

---
---
## 4. Pandas

Pandas provides a powerful dataframe structure.

In [None]:
# import pandas
import pandas as pd
# import matplotlib
import matplotlib.pyplot as plt

### Some Basics

In [None]:
# generate some random data
data = np.c_[np.ones((100,1)), np.random.randn(100,1)]
# read them into dataframe
df = pd.DataFrame(data, columns=['constant', 'randn'])
# first five rows in the dataframe
df.head(5)
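
Because the data are random, every run of the cell above produces different numbers. One possible way to make the example reproducible is to draw from a seeded generator. In the sketch below the seed value 0 and the name `df_seeded` are arbitrary choices, and `np.column_stack` is used in place of `np.c_` to join the two columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded generator, the seed 0 is an arbitrary choice

# a column of ones next to a column of standard normal draws
data = np.column_stack([np.ones(100), rng.standard_normal(100)])
df_seeded = pd.DataFrame(data, columns=['constant', 'randn'])

print(df_seeded.shape)
print(df_seeded.head(5))
```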

In [None]:
# save to csv
df.to_csv('csv_example.txt', sep='\t', index=False)

In [None]:
# read from csv
df = pd.read_csv('csv_example.txt', sep='\t')
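
A quick check that writing and reading preserve the data can look like the sketch below. It writes to an in-memory buffer (`io.StringIO`) instead of a file on disk, which is an assumption made only to keep the example self-contained; the small dataframe `original` is made up for illustration.

```python
import io

import numpy as np
import pandas as pd

original = pd.DataFrame({'constant': [1.0, 1.0, 1.0], 'randn': [0.5, -1.25, 2.0]})

buffer = io.StringIO()                           # in-memory stand-in for a file on disk
original.to_csv(buffer, sep='\t', index=False)   # same arguments as in the cell above
buffer.seek(0)                                   # rewind before reading
restored = pd.read_csv(buffer, sep='\t')

print(list(restored.columns) == list(original.columns))
print(np.allclose(restored.values, original.values))
```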

In [None]:
# There are a lot of useful tools
df.hist(bins=10)
plt.show()

### Normal Indexing
The first four values of the "randn" column can be selected by slicing the column.

In [None]:
df['randn'][0:4]
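
The same selection can be written in other ways. The sketch below uses a small made-up dataframe `df_demo` and shows label-based selection with `.loc`, position-based selection with `.iloc`, and `.head`. Note that `.loc` includes the end label while `.iloc` excludes the end position.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'constant': np.ones(6),
                        'randn': [0.3, -1.2, 0.8, 2.1, -0.4, 0.0]})

by_label = df_demo.loc[0:3, 'randn']    # label-based, the end label 3 is included
by_position = df_demo.iloc[0:4, 1]      # position-based, the end position 4 is excluded
by_head = df_demo['randn'].head(4)      # first four rows of the column

# all three selections contain the same four values
print(by_label.equals(by_position) and by_label.equals(by_head))
```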

### Bool Indexing
Boolean values can be used for indexing.

In [None]:
df[df['randn'] > 0].head(4)
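
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The sketch below uses a small made-up dataframe `df_demo` so that the selected rows can be verified by eye.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [-2, -1, 0, 1, 2], 'y': [1, 0, 1, 0, 1]})

both = df_demo[(df_demo['x'] > 0) & (df_demo['y'] == 1)]     # both conditions hold
either = df_demo[(df_demo['x'] > 0) | (df_demo['y'] == 1)]   # at least one condition holds
negated = df_demo[~(df_demo['x'] > 0)]                       # the condition does not hold

print(both)
print(either)
print(negated)
```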

#### A Caveat
Be careful when copying dataframes - df2 = df does not create a copy, both names refer to the same dataframe.

In [None]:
# This causes df2 to point to the same position in memory as df.
df2 = df

In [None]:
# Generate new column with dummy for a positive value
df2['group'] = 1 * (df2['randn'] > 0)
df2.head(5)

In [None]:
# THE GROUP COLUMN IS ADDED TO BOTH df AND df2
df.head(5)

In [None]:
# It has to be made explicit that a copy should be made
del df['group']  # deleting the group column
df2 = df.copy()

In [None]:
# Generate new column with dummy for a positive value
df2['group'] = 1 * (df2['randn'] > 0)

In [None]:
# THE GROUP COLUMN IS ONLY ADDED TO df2
df.head(5)
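
Whether two names refer to the same dataframe can be tested with Python's `is` operator. The following sketch uses a small made-up dataframe; the names `original`, `alias` and `duplicate` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({'randn': [0.5, -1.0, 2.0]})

alias = original               # second name for the same object
duplicate = original.copy()    # independent copy of the data

print(alias is original)       # True
print(duplicate is original)   # False

# a column added to the copy does not show up in the original
duplicate['group'] = 1 * (duplicate['randn'] > 0)
print('group' in original.columns)   # False
```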