# Introduction to machine learning I

Thursday, September 26

### Content

- [1. Machine learning and social science: an example](#1.-Machine-learning-and-social-science:-an-example)
- [2. Supervised learning](#2.-Supervised-learning)
- [3. Typical machine learning algorithms in `sklearn`](#3.-Typical-machine-learning-algorithms-in-sklearn)

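In this Notebook we will motivate the use of machine learning in social science with a published example, introduce the basic ideas of supervised learning, and fit a few standard regression algorithms with `sklearn`.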


### 1. Machine learning and social science: an example

We will use a recently published paper [Gebru et al. (2017)](https://www.pnas.org/content/114/50/13108), written by Stanford computer scientists, to show how machine learning techniques can be used to deliver powerful insights for social scientists.

#### 1.1 Two social science questions

Demographers and political scientists would typically be interested in the following questions:

> **How can we estimate the demographic makeup of neighborhoods across the United States?**

> **How does the demographic makeup of neighborhoods affect the presidential election?**

Social scientists have been studying such problems for decades. The machine learning community has found creative ways to tackle these questions, bringing new data, new ideas and new methods.

#### 1.2. Answers from machine learning researchers

The surprising answer given by the group of Stanford scholars is:

> **Cars**


Why?

* *Popularity*: cars are everywhere in every district of the United States, and they are easily observable in public spaces
* *Diversity*: cars are diverse enough to reflect people's different tastes (saving/consumption, attitude toward foreign goods, sensitivity to income changes, etc.)

Next question: **how do we get data about cars?**

A naive answer would be a census of the cars owned in every district of the US, which would cost a huge amount of money and resources (every year, the US federal government spends over $1 billion on the census).

Another answer, taking advantage of recent advances in machine learning:
> Use the **auto-detection** and **classification** techniques from machine learning and computer vision

More specifically,
> Leverage **deep learning-based computer vision techniques** to estimate socioeconomic characteristics of neighbourhoods across the US by using 50 million images of street scenes gathered with Google Street View cars

![Google View](https://github.com/arnauddyevre/Python-for-Social-Scientists/blob/master/introduction%20to%20machine%20learning/figures/car.jpg?raw=true)

**[AD: we need a source for this image]**

#### 1.3. The effect of machine learning techniques

Gebru et al. (2017) assemble a massive data set of 50 million Google Street View images, and identify all the cars in these images using deep learning techniques. The data set is free to download here: [Visual Census: Fine-Grained Car Dataset](https://ai.stanford.edu/~tgebru/car_data.html).

Based solely on the distributions of car models across US neighbourhoods, the authors manage to provide answers to the following questions:

***Q1: How green is each state?***

![green](https://github.com/arnauddyevre/Python-for-Social-Scientists/blob/master/introduction%20to%20machine%20learning/figures/green.jpg?raw=true)

***Q2: Can we predict average income?***

![income](https://github.com/arnauddyevre/Python-for-Social-Scientists/blob/master/introduction%20to%20machine%20learning/figures/income.jpg?raw=true)

***Q3: Can we predict the race distribution?***

![race](https://github.com/arnauddyevre/Python-for-Social-Scientists/blob/master/introduction%20to%20machine%20learning/figures/race.jpg?raw=true)

***Q4: Can we predict voting patterns?***

![vote](https://github.com/arnauddyevre/Python-for-Social-Scientists/blob/master/introduction%20to%20machine%20learning/figures/vote.jpg?raw=true)

Below, we study the techniques behind their success in predicting important socio-demographic variables with such precision and geographical granularity. All of them belong to the class of **supervised learning** algorithms.

### 2. Supervised learning

> Supervised learning is the machine learning task of inferring a function from labeled training data.

With supervised learning problems, we observe both the features $X_i$ of an observation and its label, or outcome $Y_i$.

#### 2.1. Mathematical definition

Mathematically, we assume there is an unobservable function $f$ that maps an input $X_i$ to its label $Y_i$, i.e. $Y_i = f(X_i)$, possibly up to noise.
Given the data $(X_1, Y_1), \dots, (X_n, Y_n)$, we want to use an algorithm to output an estimate $\hat{f}$ of the function $f$.
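
As a minimal sketch of this idea, the cell below simulates data from a known function and recovers an estimate $\hat{f}$ with a linear regression. The function `f`, the noise level and the sample size are illustrative assumptions, not part of the example studied in this Notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical "true" function f; in practice f is unobservable
def f(x):
    return 2.0 * x + 1.0

# Simulate n labelled observations (X_i, Y_i) with a little noise
n = 200
X = rng.uniform(-1, 1, size=(n, 1))
Y = f(X[:, 0]) + rng.normal(scale=0.1, size=n)

# The algorithm outputs an estimate f_hat of f
f_hat = LinearRegression().fit(X, Y)
slope, intercept = f_hat.coef_[0], f_hat.intercept_
print("Estimated slope: %.2f, estimated intercept: %.2f" % (slope, intercept))
```

The estimated slope and intercept should be close to the values 2 and 1 used in the simulation, because the linear model matches the form of `f` here.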

#### 2.2. Basic ideas

**Training, testing data**

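Usually, we divide the data set into three parts:

* _training data_: used to learn the parameters of the model (for example, the coefficients of a regression)
* _validation data_: used to choose the hyperparameters, i.e. the settings that are fixed before training rather than learned from the training data (for example, the penalty $\lambda$ in LASSO or the number of neighbors $k$ in $k$-nearest neighbor)
* _testing data_: used to estimate the performance of the final model

In this Notebook we only use a training set and a test set; when a hyperparameter needs tuning, cross-validation within the training set replaces a separate validation set.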

**Prediction accuracy**

Supervised learning can be divided into two categories based on the output of this function:

* regression problem: $Y \in \mathbb{R}$
* classification problem: $Y \in \{0, 1, \dots, K\}$

In machine learning, prediction accuracy is usually measured by the mean squared error between the predicted and the real values in a regression problem, and by the share of correctly classified observations in a classification problem. For $m$ test observations with predictions $\hat{Y}_i = \hat{f}(X_i)$:

$$\text{MSE} = \frac{1}{m} \sum_{i=1}^{m}\left(Y_{i}-\hat{Y}_{i}\right)^{2}, \qquad \text{Accuracy} = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{Y_{i}=\hat{Y}_{i}\}$$
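
The two accuracy measures can be computed by hand with `numpy` and checked against `sklearn.metrics`. The small arrays below are made-up values chosen for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: mean squared error between real and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
mse_manual = np.mean((y_true - y_pred) ** 2)

# Classification: share of observations whose label is predicted correctly
labels_true = np.array([0, 1, 2, 1, 0])
labels_pred = np.array([0, 1, 1, 1, 0])
acc_manual = np.mean(labels_true == labels_pred)

print("MSE: %.3f (sklearn: %.3f)" % (mse_manual, mean_squared_error(y_true, y_pred)))
print("Accuracy: %.2f (sklearn: %.2f)" % (acc_manual, accuracy_score(labels_true, labels_pred)))
```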

**Generalisation error and overfitting**

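A model that is too flexible can fit the training data almost perfectly and still predict new observations poorly: this is *overfitting*. The error on the training data typically keeps falling as the model becomes more complex, whereas the error on unseen data (the *generalisation error*, which we estimate on the test data) eventually stops improving and often rises again. This is why performance should be assessed on data that was not used for training. The figure below, taken from the [DS-100 textbook](https://github.com/DS-100/textbook), contrasts training and test error.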

![overfitting](https://raw.githubusercontent.com/DS-100/textbook/master/assets/feature_train_test_error.png)
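
The following sketch illustrates overfitting on simulated data by fitting polynomials of increasing degree. The data-generating process, the sample sizes and the degrees are arbitrary assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Simulated data: a smooth signal plus noise (illustrative assumption)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=60)
X = x.reshape(-1, 1)

# First half for training, second half for testing
X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]

errors = {}
for degree in (1, 3, 15):
    # A higher polynomial degree means a more flexible model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    errors[degree] = (train_mse, test_mse)
    print("degree %2d: train MSE %.3f, test MSE %.3f" % (degree, train_mse, test_mse))
```

The training error cannot increase with the degree, since each polynomial nests the previous one. The test error carries no such guarantee, and a large gap between the two is the usual symptom of overfitting.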

Machine learning is a broad term that encompasses many techniques:
* nearest neighbor and clustering algorithms
* naive Bayes
* kernel functions
* support vector machine
* random forest
* (stochastic) gradient descent
* gradient boosting machine
* neural network (aka deep learning)

All these methods can be found in the `sklearn` package (scikit-learn), one of the most popular machine learning packages in Python.

Anaconda already ships with the `sklearn` module, so to use it we simply need to import it at the beginning of our Notebook.

### 3. Typical machine learning algorithms in `sklearn`

**3.1 Data set**

We use the `Guerry` data set, loaded through `statsmodels`, to study a regression problem: can we predict the sales of the lottery? The data set comes from the R package `HistData` and contains the "moral statistics" of France compiled by André-Michel Guerry in the 1830s (crime, literacy, suicides, donations, etc.) for 86 French departments. The outcome `Lottery` ranks the departments by their per capita wager on the Royal Lottery. Detailed information about the data set can be found [here](https://rdata.pmagunia.com/dataset/r-dataset-package-histdata-guerry).

In [1]:
import statsmodels.api as sm

# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
# remove all the non-numerical columns
dat.drop(["dept", "Region", "Department", "MainCity"], axis=1, inplace=True)
# list of the variables
dat.head(5)


In [4]:
dat.describe()

| | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | Wealth | Commerce | Clergy | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 |
| mean | 19754.406977 | 7843.05814 | 39.255814 | 7075.546512 | 19049.906977 | 36522.604651 | 43.5 | 42.802326 | 43.430233 | 43.5 | 43.511628 | 43.5 | 43.5 | 43.5 | 43.127907 | 141.872093 | 207.95314 | 6146.988372 | 378.628721 |
| std | 7504.703073 | 3051.352839 | 17.364051 | 5834.595216 | 8820.233546 | 31312.532649 | 24.969982 | 25.02837 | 24.999549 | 24.969982 | 24.948297 | 24.969982 | 24.969982 | 24.969982 | 24.799809 | 520.969318 | 109.320837 | 1398.24662 | 148.77723 |
| min | 2199.0 | 1368.0 | 12.0 | 1246.0 | 2660.0 | 3460.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 762.0 | 129.1 |
| 25% | 14156.25 | 5933.0 | 25.0 | 3446.75 | 14299.75 | 15463.0 | 22.25 | 21.25 | 22.25 | 22.25 | 22.25 | 22.25 | 22.25 | 22.25 | 23.25 | 6.0 | 121.383 | 5400.75 | 283.005 |
| 50% | 18748.5 | 7595.0 | 38.0 | 5020.0 | 17141.5 | 26743.5 | 43.5 | 42.5 | 43.5 | 43.5 | 43.5 | 43.5 | 43.5 | 43.5 | 41.5 | 33.0 | 200.616 | 6070.5 | 346.165 |
| 75% | 25937.5 | 9182.25 | 51.75 | 9446.75 | 22682.25 | 44057.5 | 64.75 | 63.75 | 64.75 | 64.75 | 64.75 | 64.75 | 64.75 | 64.75 | 64.75 | 113.75 | 289.6705 | 6816.5 | 444.4075 |
| max | 37014.0 | 20235.0 | 74.0 | 37015.0 | 62486.0 | 163241.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 4744.0 | 539.213 | 10000.0 | 989.94 |


In [5]:
Y = dat['Lottery']
dat.drop('Lottery', axis=1, inplace=True)

We randomly divide the data set into `train_X`, `train_Y`, `test_X` and `test_Y`, with a ratio of training to test observations of roughly $4:1$. There is no universal rule for this ratio: a common heuristic is to keep 70% to 80% of the observations for training, which leaves enough data to fit the model while keeping enough test observations for a reliable estimate of its performance.

In [17]:
import numpy as np

train_X = []
train_Y = []
test_X = [] 
test_Y = []

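# Assign each observation to the training set with probability 4/5
for idx, row in dat.iterrows():
    r = np.random.randint(5)  # uniform draw from {0, 1, 2, 3, 4}
    if r < 4: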
        train_X.append(row.values)
        train_Y.append(Y[idx])
    else:
        test_X.append(row.values)
        test_Y.append(Y[idx])

print("There are " + str(len(train_Y)) + " training data")
print("There are " + str(len(test_Y)) + " test data")

There are 70 training data
There are 16 test data
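
The loop above draws a new split every time it is run. A possible alternative is `train_test_split` from `sklearn.model_selection`, which fixes the test share and, with a `random_state`, makes the split reproducible. The toy arrays below are placeholders standing in for the features and the outcome; they are not the Guerry data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 86 observations: row i of X is [3i, 3i+1, 3i+2], Y[i] = i
X = np.arange(86 * 3).reshape(86, 3)
Y = np.arange(86)

# Keep 20% of the observations for testing; random_state fixes the shuffle
train_X, test_X, train_Y, test_Y = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print("There are " + str(len(train_Y)) + " training observations")
print("There are " + str(len(test_Y)) + " test observations")
```

Features and outcomes are shuffled together, so each row of `train_X` still lines up with its entry in `train_Y`.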


**3.2. Linear model**


In [18]:
# OLS
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
ols = linear_model.LinearRegression()
ols.fit(train_X, train_Y)
ols_pred_Y = ols.predict(test_X)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(test_Y, ols_pred_Y))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(test_Y, ols_pred_Y))

Mean squared error: 428.82
Variance score: 0.42
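
To interpret the score returned by `r2_score`, it can help to compare the model with a naive benchmark that always predicts the mean of the training outcomes. The sketch below uses simulated data (an assumption made for illustration) to recompute the score by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and to show that the naive benchmark cannot score above zero on the test data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Simulated regression data (illustrative assumption)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=100)
X_tr, X_te, y_tr, y_te = X[:80], X[80:], y[:80], y[80:]

ols = LinearRegression().fit(X_tr, y_tr)
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
pred = ols.predict(X_te)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_baseline = r2_score(y_te, baseline.predict(X_te))
print("OLS R^2: %.2f" % r2_manual)
print("Mean-only benchmark R^2: %.2f" % r2_baseline)
```

A negative score therefore means that a model predicts the test outcomes worse than a constant equal to their mean would.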


LASSO minimises the sum of squared residuals plus a penalty term that increases with the absolute size of the regression coefficients. In a sense, LASSO penalises model complexity: a larger $\lambda$ shrinks the coefficients towards zero and sets some of them exactly to zero.

$$\arg \min _{\beta} \sum_{i=1}^{N}\left(Y_{i}-\beta^{\top} X_{i}\right)^{2}+\lambda\|\beta\|_{1}, \qquad \|\beta\|_{1}=\sum_{j}\left|\beta_{j}\right|$$

In the code below, `LassoCV` chooses $\lambda$ (called `alpha` in `sklearn`) by 5-fold cross-validation on the training data.
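
The effect of the penalty can be seen on simulated data in which only the first three of ten features matter. This is a sketch under assumed values for the coefficients, the sample size and the penalties; note also that `sklearn` scales the sum of squared residuals by $1/(2N)$, so its `alpha` is not numerically identical to $\lambda$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulated data: 10 features, only the first 3 have non-zero coefficients
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(size=n)

# Standardise features so that the penalty treats all coefficients alike
X_std = StandardScaler().fit_transform(X)

nonzero = {}
for alpha in (0.01, 0.1, 1.0):
    model = Lasso(alpha=alpha).fit(X_std, y)
    nonzero[alpha] = int(np.sum(model.coef_ != 0))
    print("alpha = %.2f: %d non-zero coefficients" % (alpha, nonzero[alpha]))
```

As the penalty grows, fewer coefficients remain different from zero, which is why LASSO is often used for variable selection.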


In [19]:
# Lasso
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
lasso = LassoCV(cv=5, random_state=0)
lasso.fit(train_X, train_Y)
lasso_pred_Y = lasso.predict(test_X)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(test_Y, lasso_pred_Y))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(test_Y, lasso_pred_Y))

Mean squared error: 524.26
Variance score: 0.29


**3.3. $k$-Nearest Neighbor**

In [22]:
from sklearn import neighbors
n_neighbors = 10
knn = neighbors.KNeighborsRegressor(n_neighbors, weights='uniform')
knn.fit(train_X, train_Y)
knn_pred_Y = knn.predict(test_X)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(test_Y, knn_pred_Y))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(test_Y, knn_pred_Y))

Mean squared error: 587.54
Variance score: 0.20
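
With `weights='uniform'`, a $k$-nearest neighbor regressor predicts the outcome of a new observation as the plain average of the outcomes of its $k$ closest training observations (Euclidean distance is the default). The sketch below reproduces this by hand on simulated data, which is an assumption made for illustration, and compares the result with `KNeighborsRegressor`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Simulated training data and one new observation (illustrative assumption)
X_train = rng.normal(size=(50, 2))
y_train = rng.normal(size=50)
x_new = np.array([[0.0, 0.0]])
k = 10

# By hand: find the k closest training points and average their outcomes
dist = np.linalg.norm(X_train - x_new, axis=1)
nearest = np.argsort(dist)[:k]
pred_manual = y_train[nearest].mean()

# Same prediction with sklearn
knn = KNeighborsRegressor(n_neighbors=k, weights="uniform").fit(X_train, y_train)
pred_sklearn = knn.predict(x_new)[0]

print("By hand: %.4f, sklearn: %.4f" % (pred_manual, pred_sklearn))
```

Because the method relies on distances, features measured on large scales dominate the choice of neighbors unless the features are standardised first.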


**3.4. Support vector regression**

In [23]:
from sklearn import svm
clf = svm.SVR()
clf.fit(train_X, train_Y)
clf_pred_Y = clf.predict(test_X)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(test_Y, clf_pred_Y))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(test_Y, clf_pred_Y))

Mean squared error: 771.00
Variance score: -0.05
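
Support vector regression with the default RBF kernel is sensitive to the scale of the features, so one possible remedy for a poor fit is to standardise the features inside a pipeline. Whether this explains the result above has not been checked on the Guerry data; the sketch below only illustrates the mechanism on simulated data, where the outcome depends on a small-scale feature and a large-scale feature is pure noise.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Simulated data: y depends only on the small-scale feature (assumption)
n = 300
x_small = rng.uniform(-3, 3, size=n)
x_large = rng.uniform(0, 10000, size=n)
X = np.column_stack([x_small, x_large])
y = np.sin(x_small) + rng.normal(scale=0.1, size=n)
X_tr, X_te, y_tr, y_te = X[:240], X[240:], y[:240], y[240:]

# SVR on raw features versus SVR on standardised features
raw = SVR().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), SVR()).fit(X_tr, y_tr)

r2_raw = r2_score(y_te, raw.predict(X_te))
r2_scaled = r2_score(y_te, scaled.predict(X_te))
print("R^2 without scaling: %.2f" % r2_raw)
print("R^2 with scaling: %.2f" % r2_scaled)
```

Putting the scaler inside the pipeline ensures that the means and standard deviations are computed on the training data only and then applied to the test data.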





### References

**Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E. L., & Fei-Fei, L.** (2017). Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. *Proceedings of the National Academy of Sciences*, 114(50), 13108-13113.

**Athey, S., & Imbens, G. W.** (2019). Machine learning methods that economists should know about. *Annual Review of Economics*, 11.

**TensorFlow.** Keras tutorial on basic text classification. https://www.tensorflow.org/tutorials/keras/basic_text_classification