# Why Python?

## Python vs Stata/SAS/...

We choose to use Python by forming matrices, which are usually simpler to work with, since our estimators are often given in matrix form.

Caveat: memory usage. Your data may not fit in the memory as a matrix.
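
A rough way to judge whether a matrix will fit in memory is to multiply the number of entries by the bytes per entry. The sketch below does this with numpy; the dimensions are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical data set: 1,000 observations and 50 variables, stored as 64-bit floats.
n_obs, n_vars = 1_000, 50
X = np.zeros((n_obs, n_vars), dtype=np.float64)

bytes_per_entry = X.itemsize      # 8 bytes for a float64
total_bytes = X.nbytes            # n_obs * n_vars * bytes_per_entry
print(total_bytes / 1e6, 'MB')

# The same arithmetic for a much larger (hypothetical) data set, without allocating it:
big_bytes = 100_000_000 * 50 * bytes_per_entry
print(big_bytes / 1e9, 'GB')
```

The second number is the kind of size where 'it may not fit as a matrix' starts to bite on a typical laptop.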

SAS/Stata (also SQL) execute tasks line by line, and defining operations that are 'vertical' can be relatively difficult. For example, in Stata you use egen, loops, or the matrix language 'territory' (Mata) created for this reason. However, these elements are either not optimized or quite foreign to the original way of thinking in those programs, which can create problems and make code hard to read (and write).

That being said, Stata is your best choice if you are doing applied economics research and do not want to venture out of the area of well-developed .ado files of econometrics papers. The econometrician wants to make your life easy, so unless Stata is really inefficient/cumbersome, they will implement the empirical strategy there.

Interestingly, Stata suffers from the memory usage problem regardless, and (to my knowledge) we have more tools available in Python and the like to handle it when the problem arises and we want to 'stick to matrices'. On the other hand, I believe Stata has better built-in algorithms to reduce unnecessary memory usage than other tools, so you will be able to fit 'more' data into your RAM with it. Theoretically, line-by-line processing lets you get rid of the memory problem, and SAS and SQL, for example, have that definite advantage.

## Python vs R/Matlab

Say you would like to incorporate a machine learning element into your dissertation. In my opinion, that is a good reason to leave Stata, as canned implementations are limited in this area. However, there is a 'language' dedicated to statistics: R. Matlab is good for macro people.

Python is a general-purpose language, which means that you have more direct access to practical solutions coming straight from CS. The downside is that many implementations do not take advantage of the specific structure of the problems you encounter (which are very special to us vs engineers, for example). The upside is that if you have a 'technical' problem (for example, you need a specific data structure or file format to be able to run your code at all) or you are looking for speed, Python will generally be better. R partly, and Matlab definitely, are famous for having some speed issues once things get tough.

That being said, a lot of people write stats-like code in Python nowadays, and there is a huge community. A caveat, however: more people post econometrically unsound code on their blogs, because the Python community is not as tight-knit as Stata's (with the ado files checked by Stata) or R's (everybody can be a data scientist nowadays, phew).

All in all, I would say that if you start from scratch, it does not matter which one you do. Python is easier to learn in my opinion, and it is a much more beautifully laid out language than R. If you are not very good at abstractions and do not want to learn CS jargon, R might be a better choice. If your dissertation will be the most technical thing you will ever do in your life, I do not think it matters which one you choose, and if you actually consider a career in this field, you will probably end up having to learn both tools anyway.

One of the reasons why Python (or C++/...) is preferred in 'real-life production' is that it allows you to organize notions/objects in a big project efficiently. Larger IO projects nowadays use the language as it was intended: for object-oriented programming (i.e. defining classes). However, much of the econometrics/statistics code I have seen (not to mention labor projects) keeps it simple and only defines functions (on already predefined objects). There is no chance to talk about OOP in this intro, but it is one of the reasons why you would encounter Python often in the 'real world'.

# Why are we here?

I am going to
- give you some information on how to start interacting with python
- introduce you to some of the most important packages (from our point of view)
- give an example code of a regression.

# Tools to access the processors via Python code

What we often refer to as 'Python' really consists of the Standard Library (https://docs.python.org/3/library/) and a collection of other packages, often developed independently of the people who created the language.

Python is a resource shared with developers from other fields, and it is very hard to keep track of which package is developed when, and which versions of packages need specific versions of other packages. Therefore, when installing, you should not go to python.org, but to one of the firms that maintain a 'distribution' of packages. These take care of the dependencies between the packages you need. The distribution I use, which is also installed on your computer in the lab, is Anaconda, by Continuum (now Anaconda, Inc.). An alternative is 'pip', Python's own package installer (Unix users). If you have Anaconda, it is key that you only update your packages (including the IDE!) by using conda.

1. The simplest tool to reckon with is the notebook you are using now, __Jupyter__. It is great, since you can edit and execute Julia, Python and R code with it (in different tabs); hence the name. However, I mostly use Jupyter for teaching, or maybe for exploring a data set. You can write 'Code' and 'Markdown' in cells. (See below.) Collaboration and version control are not that great, although if you really like it, there is now JupyterLab, which is like a more data-science-project version of Jupyter (I haven't used it).

2. The other tool that is included with Anaconda is the __Spyder__ IDE. It looks like Matlab, as it keeps track of all the objects you have created in memory. This can be incredibly useful for us. (Note: I am pretty sure you can use Jupyter for development only if you are _really_ good, but I always use Spyder for initial development, like 80% of the time, and then emacs/vim/Notepad++ later.) An alternative to Spyder would be PyCharm, which is better suited for larger projects.

3. Further, you can run Python code from the command line or shell. This is the next step, which we are not going to cover here, but it is necessary for a mid-sized project (like your dissertation, probably).

## Important packages

- __ipython__: 'Interactive Python'. This package makes Python more convenient to use (any IDE and interactive shell will use this), and it enhances parallel computing capabilities and how your figures and typesetting look. If you can, choose an IPython kernel rather than just a Python kernel. (You most probably will by default anyway.) You do not have to know anything more about this.

- __numpy__: This is where arrays live, which are kind of like precursors of matrices. This is probably the most important package for us. It works well with Intel chips.
- Scipy: You use this package for optimization and generating some random numbers.

- __pandas__: This package defines the 'DataFrame', which should be familiar from R. This is what most people with small or mid-sized projects use for basic data set manipulation. While it can be slower than well-optimized numpy code, it is convenient and many of its operations are vectorized under the hood, so in your first year or so, pandas is the way to go (unless... you guessed it, you run out of memory; the DataFrame obviously has more overhead).

- __statsmodels__: This is where people run regressions and fit simple statistical models (some time series, very limited IV, etc).

- scikit-learn: This is where you would fit simple machine learning tools. You should go through this library: https://scikit-learn.org/stable/index.html. Very nice overview, but be careful: some of the language is not how we talk in economics, and if you do weird 'preprocessing' steps, you will have a problem with your advisor. 
- nltk: The natural language processing toolkit. Not today, S, not today.
- PIL/pillow: if you need to work with pictures (PIL is the original library, pillow is its maintained fork).

- __matplotlib__: This is how our type of people will create figures. To be specific, I have only used __matplotlib.pyplot__ in my life. That being said, I am not a plot-person.

- re: Regular expressions for data cleaning. Good examples: https://www.w3schools.com/python/python_regex.asp

IPython is usually included in your environment by default. The rest of the packages need to be _imported_ if you would like to use them, for example by writing:

In [None]:
import numpy as np  # You want to import all your packages at the very beginning of the code.

# Basic elements of the language

## Simple data structures

On the one hand, we have the usual stuff:

str (string)
boolean
int
float

which are sequences of characters, truth values, and the two number formats, respectively.

Strings are actually already defined as sequences. Of the other sequence types, we mention the two main ones:
__lists__: L=[2,3,6,4, ['list', 'of' , 'a','list']]
__tuples__: T=(3,4,55)

Members of a list/tuple are delimited with commas, and numbered by starting with ZERO (the weird thing compared to R).
This becomes important when you would like to access one or more elements of the list ('slicing'), which is done by writing brackets after the name of the list/tuple. For example, L[0:2] gives the first __2__ elements of L (the end index is excluded). Of course there are already some operators defined on lists. What do you think will happen and what happened? Play around a bit:

In [None]:
L=[3,4, 'this_is_a_string', 'Hello world!', ['another', 'list']]

In [None]:
print(L) # The print function!

In [None]:
L[0:2]

In [None]:
L[3]

In [None]:
L[0:4:2]

In [None]:
print(L*2)

In [None]:
L2= L[4]
print(L+L2)    # list addition (appending)

Note that after slicing, you get back a list automatically. Lists are mutable (I can change their elements), but tuples are not. Try for example

In [None]:
tup1=(2,3,4,(3,4))

In [None]:
tup1[2]=55

You could not change the 3rd element of the tuple, but the same would work with lists:

In [None]:
L[1]=55
print(L)

Note that lists are not the same as sets, as two lists are not the same if the order of the items is not the same, and you can of course repeat the same elements:

In [None]:
L1=[1,2,3]
L2=[3,2,1]
if L1==L2:
    print('They are the same')
else:
    print('They are not the same')
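
To make the contrast with sets concrete, here is a small sketch using Python's built-in `set` type. The third list, `L3`, is a hypothetical addition to show what happens to repeated elements.

```python
L1 = [1, 2, 3]
L2 = [3, 2, 1]
L3 = [1, 1, 2, 3]   # hypothetical extra list with a repeated element

lists_equal = (L1 == L2)               # order matters for lists
sets_equal = (set(L1) == set(L2))      # order does not matter for sets
dupes_dropped = (set(L3) == set(L1))   # repeated elements collapse in a set

print(lists_equal, sets_equal, dupes_dropped)
```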

## If statements and loops

If statements are written as 'if' followed by a boolean, usually created with a relation ($a==b$ or $a!=b$ or $a>=b$), and finished with a colon (:). You can of course use the usual operators (or, and, not) as well. If needed, the second branch can be created with an 'else:'

If you would like to branch 3 ways, you need to add an 'elif' line (with its own condition and a colon) between the 'if' and 'else' lines.

You signal python what is the part of code that belongs to the branch by __indenting__ (with a tab or 4 spaces); this is instead of all those curly brackets (or worse) in other languages.

What do these commands do? Run them! Were you right?

In [None]:
x= int(input('Give an integer!'))

if x%3 ==0 or x<=0:
    print(x, ' is negative or divisible by 3.')
elif x%2 ==0:
    print(x, ' is even and not divisible by 3')
else:
    print( x, ' is odd and not divisible by 3')

Loops allow you to repeat the same set of lines as many times as needed. There are while and for loops available; I will only show the for loop here. If you can, use the range() object to iterate in for loops.

What does the following code do?

In [None]:
for i in range(1,20,1):
    print(i)
    print(' Mississippi, ')

In [None]:
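raw = input('Give a list of numbers delimited with commas!')
L = [int(s) for s in raw.split(',')]   # safer than eval() on whatever the user types

total = 0   # not called 'sum', because that would hide the built-in sum() function
for i in range(len(L)):
    total = total + L[i]**2

print(total)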
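
The sum-of-squares loop above walks over positions with range(len(L)). As a sketch of an alternative, a for loop can also walk over the elements themselves. The list `numbers` is a hypothetical input, hard-coded so that the cell runs without typing anything.

```python
numbers = [1, 2, 3, 4]   # hypothetical input

# Loop directly over the elements instead of over their positions.
total = 0
for value in numbers:
    total = total + value**2
print(total)

# enumerate() hands you the position and the element together when you need both.
squares = []
for i, value in enumerate(numbers):
    squares.append((i, value**2))
print(squares)
```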

# Numpy and arrays/matrices

Arrays are kind of like lists, but you can better control what type of data you put into them, they have advanced slicing capabilities (which we are not going to explore here), and they also do something called broadcasting. Moreover, the mathematical operations are now defined as the usual elementwise operations on vectors. Arrays can be 1-dimensional; an array of arrays is a 2-dimensional array, an array of arrays of arrays is 3-dimensional, etc. A 2-dimensional array with the special 'matrix' designation is a matrix.

Arrays are defined by the numpy module/package. They behave like n-dimensional vectors, except for multiplication:

In [None]:
arr1= np.array([1, 2,3, 4])
arr2= np.array([34,5,1,43])
print(arr1*arr2)

In [None]:
arr1+arr2

In [None]:
3*arr1

Multidimensional arrays include 2-dimensional arrays. Indexing now naturally requires 2 numbers. Slicing works almost the same way (we will not go into the differences today).

In [None]:
arr3=np.array([[1,2,3],[4,5,6]])
arr4=np.array([[5,3,4],[8,5,3]])
print(arr4)

In [None]:
arr3*arr4

In [None]:
arr3[1,:]

In [None]:
arr2[0:3]*arr3

This last line is an example of broadcasting, which can be a wonderful thing. Unfortunately, we cannot get into the details.
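
Just to fix ideas, here is a minimal sketch of what broadcasting did in that cell. The names `row` and `table` are hypothetical stand-ins for arr2[0:3] and arr3.

```python
import numpy as np

row = np.array([34, 5, 1])                 # shape (3,)
table = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

# Shapes are compared from the right: (3,) against (2, 3) matches on the last axis,
# so `row` is (virtually) repeated for each of the 2 rows of `table`.
product = row * table
print(product)

# The explicit version of the same computation, stacking `row` twice by hand.
explicit = np.vstack([row, row]) * table
print(np.array_equal(product, explicit))
```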

Arrays are _almost_ matrices in 2 dimensions. There is a separate data structure called 'matrix', and it is easy to create one with numpy:

In [None]:
x=np.matrix([[1,2,3],[4,5,6]])

In [None]:
x

In [None]:
arr3

In [None]:
y=np.matrix([[ 5,  6, 12],
       [32, 25, 18]])

In [None]:
x*y

In [None]:
x*np.transpose(y)

So this is matrix multiplication. You can also make Python treat an array as a matrix _temporarily_ by wrapping it in np.asmatrix(arr3) (which then equals x, for that specific usage).
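
A side note: the NumPy documentation no longer recommends the matrix class, and plain 2-dimensional arrays can be matrix-multiplied with the `@` operator (available since Python 3.5). A small sketch, with `a` and `b` holding the same numbers as x and y above:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])       # shape (2, 3)
b = np.array([[5, 6, 12], [32, 25, 18]])   # shape (2, 3)

elementwise = a * b    # entry-by-entry product, shape (2, 3)
matprod = a @ b.T      # matrix product of (2, 3) and (3, 2), shape (2, 2)

print(elementwise)
print(matprod)
```

With arrays, `*` always stays elementwise and `@` is always the matrix product, so there is no need to remember which type an object currently has.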

# Pandas and importing data
We are going to do a very simple regression. First, you need to import your data set. The data structure closest to a 'Data Set' type like in Stata or the DataFrame in R is defined by the pandas package/module. Since we are using it, of course we need to import it first.

Then you want to locate the file and tell Python the directory where you are working (in fact, you are using IPython capabilities now!). Since I am sloppy, I want the data folder to be where my Python code is stored. You can change directories with the usual 'cd' command (brought to you by IPython).

Then you can use the import function from the pandas help that matches your file type. Todd gave me this small csv data set, so we need to use...

In [None]:
import pandas
%cd "C:\Users\ptoth\Dropbox\UNR 2019 Fall\Misc"

thadata=pandas.read_csv('state_level.csv', header=0)

Now let us see what is in the data set (first 10 rows):

In [None]:
thadata[:10]

The DataFrame is just some kind of list of records (observations), so what is the length?

In [None]:
len(thadata)

How do you refer to the 'incwage' variable?

In [None]:
thadata.incwage

Note: this returned a pandas Series (a single column), not a new DataFrame. The point is: this is like slicing. Now let us get the incwage value of the observation with index label 26 (the 27th row, since counting starts at zero) the way you should be doing it in pandas. It turns out that the type of slicing above is only used by terrible people like me.

In [None]:
thadata.loc[26, 'incwage']

In [None]:
thadata.loc[:,'incwage']
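
Note that .loc selects by _label_, while its sibling .iloc selects by _position_. The two coincide only when the row labels happen to be 0, 1, 2, ... The sketch below uses a hypothetical mini data set (not the state-level file) whose row labels deliberately start at 10.

```python
import pandas as pd

# Hypothetical mini data set; the column names mimic the ones used above.
df = pd.DataFrame({'incwage': [100, 200, 300], 'age': [30, 40, 50]},
                  index=[10, 11, 12])   # row labels deliberately do not start at 0

by_label = df.loc[11, 'incwage']   # the row whose label is 11
by_position = df.iloc[1, 0]        # second row, first column (positions start at 0)
print(by_label, by_position)
```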

Let us add a constant variable to the 'X-s'

In [None]:
thadata['constant']=1

For further details on how to name variables, how to do the equivalent of bysort (groupby), and all sorts of slicing, please see the pandas documentation. The point is that this environment is very similar to the R you are used to.
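
As a rough sketch of the pandas counterpart of Stata's bysort, the example below uses groupby on a hypothetical data set with a `state` column. The Stata commands in the comments are loose analogies, not exact equivalents.

```python
import pandas as pd

# Hypothetical data: two states, a few individuals each.
df = pd.DataFrame({'state': ['NV', 'NV', 'CA', 'CA', 'CA'],
                   'incwage': [10, 30, 20, 40, 60]})

# Roughly like Stata's `bysort state: egen mean_wage = mean(incwage)`:
df['mean_wage'] = df.groupby('state')['incwage'].transform('mean')

# One row per group, roughly like `collapse (mean) incwage, by(state)`:
state_means = df.groupby('state')['incwage'].mean()
print(df)
print(state_means)
```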

You can read in text files, .dta and Excel files pretty simply. Not so much Matlab files. For exporting, if you only need the data for pandas usage, you should use 'pickle', Python's own serialization format, which gives you a nice and really fast data file type (.pkl). You can very simply pickle and unpickle DataFrames.
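
A minimal sketch of a pickle round trip, assuming a hypothetical two-row DataFrame and a temporary folder so that nothing is written into your working directory:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'age': [30, 40], 'incwage': [100.0, 200.0]})   # hypothetical data

# Write to a temporary folder so the example does not clutter the working directory.
path = os.path.join(tempfile.mkdtemp(), 'example.pkl')
df.to_pickle(path)               # 'pickle' the DataFrame
df_back = pd.read_pickle(path)   # 'unpickle' it again

print(df_back.equals(df))
```

Only unpickle files you created yourself or trust: loading a pickle can execute arbitrary code.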

There is also a later 'version' (a separate package building on pandas) that can accommodate somewhat larger data sets (although I am pretty sure that, with a little effort, 95% of you will not reach the limit of pandas).

## Honorary mention: regex for data cleaning
Some of your data set is in Ukrainian? Some people entered phone number as (512) 3342-566 while others +3612772848? Welcome to the re module, which can solve your problem. For data cleaning problems like these, go and check it out. Fascinating.

In [None]:
import re
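
For the phone number example, a first cleaning step could be to keep only the digits. This is just a sketch: what the 'right' normalized format is (country codes, etc.) depends on your data.

```python
import re

raw_numbers = ['(512) 3342-566', '+3612772848']   # the two formats mentioned above

# \D matches any character that is not a digit; replace each one with nothing.
digits_only = [re.sub(r'\D', '', s) for s in raw_numbers]
print(digits_only)
```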

# Running regression in Statsmodels
To be fair, linear regression is present in scikit-learn as well (of course). However, not all the little nuances you may need for your paper are implemented there. Moreover, Statsmodels is 'pandas-aware', so you can use your data frame, and if you like R, it offers a similar syntax. (However, you can also use numpy arrays!)

Let us import the package, and then fit an OLS model. I give a one-liner and the generic procedure with numpy arrays

In [None]:
import statsmodels.api as sm

In [None]:
#y= np.array(thadata.loc[:,'incwage'])
#X= np.array(thadata.loc[:,'age']) #same as thadata['age'] here - you can be sloppy for this line.
#X=sm.add_constant(X)
#model= sm.OLS(endog=y, exog=X)
#results=model.fit()
#print(results.params)
results= sm.OLS(thadata['incwage'], thadata[['constant','age']]).fit()

In [None]:
print(results.summary())

In [None]:
results2= sm.OLS(thadata['incwage'], thadata[['constant','age']]).fit(cov_type='HC1')
print(results2.summary())

## Now let us calculate the coefficient with pure numpy

In [None]:
y= np.matrix(thadata['incwage']).transpose()   # creates annoying row matrices otherwise
X= np.matrix(thadata[['constant', 'age']]) # but this doesn't!
               
print(np.shape(y), np.shape(X))
params=np.linalg.inv(X.transpose()* X)*X.transpose()*y

print('constant parameter:', params[0])
print('slope parameter:', params[1])
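
The same coefficients can also be computed with plain arrays and the `@` operator. The sketch below uses synthetic data (an assumption, not the state-level data set) constructed so that the true constant is 2 and the true slope is 3, and compares the textbook formula with two numerically more careful alternatives.

```python
import numpy as np

# Synthetic data (an assumption, not the state-level data set): y = 2 + 3*x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x
X = np.column_stack([np.ones_like(x), x])   # constant and regressor, shape (5, 2)

# Textbook formula (X'X)^{-1} X'y, written with arrays and @.
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solving the normal equations X'X b = X'y avoids forming the inverse explicitly.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares routine; rcond=None selects NumPy's current default cutoff.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv, beta_solve, beta_lstsq)
```

All three agree here; with badly conditioned regressors, the explicit inverse is usually the first to lose accuracy.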

Homework: Calculate the standard errors assuming homoskedasticity, then the heteroskedasticity-robust (White) standard errors!

We are going to do some post-estimation work and plotting instead. First, let us calculate the residuals in the old-fashioned way, with numpy. Note that you could actually get them very simply from statsmodels (results.resid).

In [None]:
resids= np.array(y-X*params) # later we need this thing as an array

# Some plotting

For plotting, you should start by using just matplotlib.pyplot. So let us import it, and do a scatter plot of the residuals against age.

In [None]:
import matplotlib.pyplot as plt
plt.scatter(thadata['age'], resids)

So why did we do this? Which assumptions of the multivariate model hold, and which do not?

Now let us see a histogram, the histogram of the residuals. Which assumption can we check this way?

In [None]:
plt.hist(resids) 
plt.title("histogram of residuals") 
plt.show()

Manipulating figures (saving them to the computer, arranging groups of figures) is a bit more lengthy, so we do not talk about it today. Matplotlib has excellent help. The only thing you have to remember is that everything (including the picture above) is an object for Python. You have to be able to put it into a bin: it is going to be a 'figure'. You need to make this picture a figure (an object belonging to the 'figure' class), and then you can save it as a png to the working folder.
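
A minimal sketch of the 'figure as an object' idea, assuming made-up residuals and a hypothetical file name, and saving into a temporary folder rather than the working folder:

```python
import os
import tempfile
import numpy as np
import matplotlib
matplotlib.use('Agg')   # non-interactive backend so this runs as a script; skip it in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fake_resids = rng.normal(size=200)   # hypothetical residuals, for illustration only

fig, ax = plt.subplots()             # `fig` is the Figure object, `ax` the plotting area
ax.hist(fake_resids, bins=20)
ax.set_title('histogram of residuals')

out_path = os.path.join(tempfile.mkdtemp(), 'resid_hist.png')   # hypothetical file name
fig.savefig(out_path)                # the file format follows the extension
plt.close(fig)
print(os.path.exists(out_path))
```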

# Spyder
Ok, we have this interactive environment. But how do you write a 'do-file'? I suggest you write it in Spyder. There you have a more suitable environment to check your progress with creating new variables and results.

Exercise (if time): I gave you the Wooldridge Data Set. Please read in the data from the csv file, run a regression on hprice (houseprice) with statsmodels using at first the 'basic' standard errors, then the heteroskedasticity robust standard errors. (Use every variable for the RHS.) Plot the residuals' histogram and the scatter plot between the residuals and 1 RHS variable chosen by you. What do you conclude?

# Further steps

The packages I gave you above are very nice because they have good documentation; maybe the only exception is Statsmodels (but it is still accessible). You should be prepared that those people call certain things in econometrics by different names.

I think a great resource to start learning Python is this free book: Learn Python the Hard Way, by Zed A. Shaw. There must be a PDF that is free online. However, it seems UNR subscribed to the 'premium' content of the book as well, so you only need your NetID to access the interactive thing as well. You lucky. I think you can get done with this book in a couple of weeks, and then you are good to go to get acquainted with statsmodels, scikit-learn and the others. The only thing you may want to be careful with is that some of these resources are written for Python 2, not Python 3. You should start coding in Python 3. There are not many differences at this level (only the print function, division, and xrange vs range come to mind), so it will not affect the learning as much as it seems at first.
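
For reference, here is a short sketch of how those three differences look in Python 3, with the Python 2 behaviour noted in the comments:

```python
# Python 3 behaviour for the three differences mentioned above.
print('print is a function')   # Python 2 also allowed the statement form: print 'text'

true_div = 7 / 2               # 3.5 in Python 3 (Python 2 gave 3 for two integers)
floor_div = 7 // 2             # 3: explicit integer (floor) division

r = range(3)                   # lazy, like Python 2's xrange; no list is built yet
as_list = list(r)              # turn it into an actual list when you need one
print(true_div, floor_div, as_list)
```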

Thank you for listening.