# Introduction to machine learning I

### 1. Marriage of machine learning and social science: Gebru et al. (2017)

**1.1. Two social science questions**

Consider a social science question:

> **How can we estimate the demographic makeup of neighborhoods across the United States?**

and an even more interesting one:

> **How does the demographic makeup of neighborhoods affect the presidential election?**

Social scientists have been studying such problems for decades.
Now another community has become interested in them, bringing new data, new ideas and new methods:
the machine learning community.

**1.2. Answers from machine learning researchers**

The answer to the above questions from some machine learning researchers at Stanford:

> **Cars**


Why?

* *Popularity*: cars are found in every district of the United States
* *Diversity*: cars are diverse enough to reflect people's different tastes (saving versus consuming, attitude toward foreign countries, sensitivity to income changes, etc.)

Next question: **how do we get data about cars?**

A naive answer would be a census of the cars owned in every district of the USA, which would cost a huge amount of money and resources (the USA spends over 1 billion dollars on the census every year).

Another answer:
> use the **auto-detection** and **classification** techniques from machine learning and computer vision

More specifically:
> Using **deep learning-based computer vision techniques** to estimate socioeconomic characteristics of regions spanning 200 US cities by using 50 million images of street scenes gathered with Google Street View cars

![Google View](figures/car.jpg)

**1.3. The effect of machine learning techniques**

Gebru et al. (2017) built a data set of the cars found in 50 million Google Street View images using deep learning techniques. It is free to download: [Visual Census: Fine-Grained Car Dataset](https://ai.stanford.edu/~tgebru/car_data.html)

***Q1: How green is each state?***

![green](figures/green.jpg)

***Q2: Can we predict the income?***

![income](figures/income.jpg)

***Q3: Can we predict the race distribution?***

![race](figures/race.jpg)

***Q4: Can we predict the voting pattern?***

![vote](figures/vote.jpg)


Next we study the technique behind this success: **supervised learning**

### 2. Supervised learning

> Supervised learning is the machine learning task of inferring a function from labeled training data.

**2.1. Mathematical definition**

Mathematically, there is an unobservable function $f: X \rightarrow Y$, where $X$ is the input and $Y$ is the label.
Given the data $(X_1, Y_1), \dots, (X_n, Y_n)$, we want an algorithm that outputs an estimate $\hat{f}$ of this function.
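
A minimal sketch of this setup on synthetic data follows. The linear form of $f$, the noise added to the labels and the choice of `LinearRegression` as the algorithm are all assumptions made for illustration; in a real problem $f$ is never observed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)


# Hypothetical "unobservable" function f (assumed linear for this sketch).
def f(x):
    return 2.0 * x + 1.0


# Observed data (X_i, Y_i): inputs and noisy labels (the noise is an assumption).
X = rng.uniform(-1, 1, size=(200, 1))
Y = f(X[:, 0]) + rng.normal(scale=0.1, size=200)

# The algorithm outputs an estimate f_hat of f.
f_hat = LinearRegression().fit(X, Y)

print("estimated slope:    ", f_hat.coef_[0])
print("estimated intercept:", f_hat.intercept_)
```

With enough data the estimated slope and intercept should land close to the values used to generate the labels.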

**2.2. Basic ideas**

**Training and testing data**

Usually, we divide the data set into two parts:

* _training data_ : learn the parameters of the model (sometimes, also hyperparameters)
* _testing data_ : estimate the performance of the model
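
One common way to make such a split is `train_test_split` from `sklearn.model_selection`. The sketch below uses made-up toy data and an assumed 80/20 split; the seed is fixed only so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 examples with 3 features each (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = rng.normal(size=100)

# Hold out 20% of the examples for testing; fix the seed for reproducibility.
train_X, test_X, train_Y, test_Y = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(train_X.shape, test_X.shape)
```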

**Prediction accuracy**

Supervised learning can be divided into two categories based on the output of this function:

* regression problem: $Y \in \mathbb{R}$
* classification problem: $Y \in \{0, 1, \dots, K\}$

In machine learning, prediction accuracy is measured by the mean squared error between the predicted and true values in a regression problem, and by the classification accuracy (the fraction of correctly classified examples) in a classification problem.
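
In symbols, the mean squared error is $\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{f}(X_i))^2$ and the classification accuracy is the fraction of examples with $\hat{f}(X_i) = Y_i$. The following sketch computes both by hand and with the corresponding `sklearn.metrics` functions; the small arrays are made-up values, not results from any model in these notes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: mean squared error between true and predicted values.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
mse = np.mean((y_true - y_pred) ** 2)
print("MSE by hand:", mse, " sklearn:", mean_squared_error(y_true, y_pred))

# Classification: fraction of correctly predicted labels.
labels_true = np.array([0, 1, 2, 1, 0])
labels_pred = np.array([0, 1, 1, 1, 0])
acc = np.mean(labels_true == labels_pred)
print("accuracy by hand:", acc, " sklearn:", accuracy_score(labels_true, labels_pred))
```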

**Generalisation error and overfitting**

![overfitting](https://raw.githubusercontent.com/DS-100/textbook/master/assets/feature_train_test_error.png)
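
Overfitting shows up as a training error that keeps falling while the test error stops improving. The sketch below illustrates this on synthetic data by fitting polynomials of increasing degree; the sine target, the noise level, the sample sizes and the degrees are illustrative assumptions and are not taken from the figure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)


def make_data(n):
    # Noisy samples from an assumed target function sin(3x).
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y


train_x, train_y = make_data(30)
test_x, test_y = make_data(200)

errors = {}
for degree in [1, 3, 15]:
    # A more flexible model (higher degree) can fit the training data more closely.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(train_x, train_y)
    train_mse = mean_squared_error(train_y, model.predict(train_x))
    test_mse = mean_squared_error(test_y, model.predict(test_x))
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The training error can only go down as the degree grows, because each model contains the previous one as a special case. The test error is the quantity that approximates the generalisation error, and it typically starts to rise once the model begins fitting noise.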


Many supervised learning methods are available today:
* linear regression
* nearest neighbor
* support vector machine
* random forest
* gradient boosting machine
* neural network (aka deep learning)

All of these methods are available in the `sklearn` package (scikit-learn), one of the most popular machine learning packages in Python.

The Anaconda distribution comes with `sklearn` preinstalled.
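
As a rough guide, the methods listed above correspond to the following `sklearn` regressor classes. This mapping is one reasonable choice for regression problems, not the only one (each method also has classifier counterparts), and all hyperparameters are left at their defaults.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# One regressor per method in the list; defaults are used throughout.
models = {
    "linear regression": LinearRegression(),
    "nearest neighbor": KNeighborsRegressor(),
    "support vector machine": SVR(),
    "random forest": RandomForestRegressor(random_state=0),
    "gradient boosting machine": GradientBoostingRegressor(random_state=0),
    "neural network": MLPRegressor(random_state=0),
}

for name, model in models.items():
    print(f"{name}: {type(model).__name__}")
```

Every one of these objects exposes the same `fit(X, Y)` and `predict(X)` methods, which is why the code in the next section looks almost identical from one model to the next.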

### 3. Typical machine learning algorithms in `sklearn`

**3.1. Data set**

We use the `Guerry` data set (from the R package `HistData`, loaded through `statsmodels`) to study a regression problem: can we predict the sales of the lottery?

In [3]:
import statsmodels.api as sm

# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
# remove all the non-numerical columns
dat.drop(["dept", "Region", "Department", "MainCity"], axis=1, inplace=True)
# list of the variables
dat.head(5)

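| | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | Wealth | Commerce | Clergy | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28870 | 15890 | 37 | 5098 | 33120 | 35039 | 73 | 58 | 11 | 71 | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 |
| 1 | 26226 | 5521 | 51 | 8901 | 14572 | 12831 | 22 | 10 | 82 | 4 | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.0 |
| 2 | 26747 | 7925 | 13 | 10973 | 17044 | 114121 | 61 | 66 | 68 | 46 | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 |
| 3 | 12935 | 7289 | 46 | 2733 | 23018 | 14238 | 76 | 49 | 5 | 70 | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.9 |
| 4 | 17488 | 8174 | 69 | 6962 | 23076 | 16171 | 83 | 65 | 10 | 22 | 23 | 64 | 79 | 35 | 7 | 1 | 320.28 | 5549 | 129.1 |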


Detailed information about the data set can be found [here](https://rdata.pmagunia.com/dataset/r-dataset-package-histdata-guerry).

In [4]:
dat.describe()

| | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | Wealth | Commerce | Clergy | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 |
| mean | 19754.406977 | 7843.05814 | 39.255814 | 7075.546512 | 19049.906977 | 36522.604651 | 43.5 | 42.802326 | 43.430233 | 43.5 | 43.511628 | 43.5 | 43.5 | 43.5 | 43.127907 | 141.872093 | 207.95314 | 6146.988372 | 378.628721 |
| std | 7504.703073 | 3051.352839 | 17.364051 | 5834.595216 | 8820.233546 | 31312.532649 | 24.969982 | 25.02837 | 24.999549 | 24.969982 | 24.948297 | 24.969982 | 24.969982 | 24.969982 | 24.799809 | 520.969318 | 109.320837 | 1398.24662 | 148.77723 |
| min | 2199.0 | 1368.0 | 12.0 | 1246.0 | 2660.0 | 3460.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 762.0 | 129.1 |
| 25% | 14156.25 | 5933.0 | 25.0 | 3446.75 | 14299.75 | 15463.0 | 22.25 | 21.25 | 22.25 | 22.25 | 22.25 | 22.25 | 22.25 | 22.25 | 23.25 | 6.0 | 121.383 | 5400.75 | 283.005 |
| 50% | 18748.5 | 7595.0 | 38.0 | 5020.0 | 17141.5 | 26743.5 | 43.5 | 42.5 | 43.5 | 43.5 | 43.5 | 43.5 | 43.5 | 43.5 | 41.5 | 33.0 | 200.616 | 6070.5 | 346.165 |
| 75% | 25937.5 | 9182.25 | 51.75 | 9446.75 | 22682.25 | 44057.5 | 64.75 | 63.75 | 64.75 | 64.75 | 64.75 | 64.75 | 64.75 | 64.75 | 64.75 | 113.75 | 289.6705 | 6816.5 | 444.4075 |
| max | 37014.0 | 20235.0 | 74.0 | 37015.0 | 62486.0 | 163241.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 4744.0 | 539.213 | 10000.0 | 989.94 |


In [5]:
Y = dat['Lottery']
dat.drop('Lottery', axis=1, inplace=True)

We randomly divide the data set into `train_X`, `train_Y`, `test_X` and `test_Y`, where the ratio of training to test data is approximately $4:1$.

In [17]:
import numpy as np

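# Containers for the training and test examples
train_X = []
train_Y = []
test_X = []
test_Y = []

# Send each row to the training set with probability 4/5, otherwise to the test set
for idx, row in dat.iterrows():
    r = np.random.randint(5)  # uniform draw from {0, 1, 2, 3, 4}
    if r < 4:
        train_X.append(row.values)
        train_Y.append(Y[idx])
    else:
        test_X.append(row.values)
        test_Y.append(Y[idx])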

print("There are " + str(len(train_Y)) + " training data")
print("There are " + str(len(test_Y)) + " test data")

There are 70 training data
There are 16 test data


**3.2. Linear model**

In [18]:
# OLS
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
ols = linear_model.LinearRegression()
ols.fit(train_X, train_Y)
ols_pred_Y = ols.predict(test_X)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(test_Y, ols_pred_Y))
# Coefficient of determination R^2 (labelled "Variance score" in the output): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(test_Y, ols_pred_Y))

Mean squared error: 428.82
Variance score: 0.42
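
The "Variance score" printed above comes from `r2_score`, which computes the coefficient of determination $R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$. The sketch below reproduces it by hand on made-up numbers (not the Guerry data) and shows why the score can be zero or even negative.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares and total sum of squares around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print("R^2 by hand:", r2, " sklearn:", r2_score(y_true, y_pred))

# Always predicting the mean of the true values gives R^2 = 0.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

# Predictions worse than that baseline give a negative R^2.
r2_bad = r2_score(y_true, -y_true)
print("baseline R^2:", r2_baseline, " bad model R^2:", r2_bad)
```

A negative score on the test set therefore means the model predicts worse than a constant equal to the test-set mean.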


In [19]:
# Lasso
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
lasso = LassoCV(cv=5, random_state=0)
lasso.fit(train_X, train_Y)
lasso_pred_Y = lasso.predict(test_X)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(test_Y, lasso_pred_Y))
# Coefficient of determination R^2 (labelled "Variance score" in the output): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(test_Y, lasso_pred_Y))

Mean squared error: 524.26
Variance score: 0.29
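
`LassoCV` picks the regularisation strength by cross-validation and can shrink some coefficients exactly to zero. After fitting, the chosen strength is stored in `alpha_` and the coefficients in `coef_`. The sketch below inspects both on synthetic data in which, by assumption, only the first two of ten features influence the target.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Assumed ground truth: only features 0 and 1 matter.
Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = LassoCV(cv=5, random_state=0).fit(X, Y)

print("selected alpha:", lasso.alpha_)
print("coefficients:", np.round(lasso.coef_, 2))
print("indices of nonzero coefficients:", np.flatnonzero(lasso.coef_))
```

The same two attributes can be read from the `lasso` object fitted on the Guerry data to see which variables the model kept.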


**3.3. $k$-Nearest Neighbor**

In [22]:
from sklearn import neighbors
n_neighbors = 10
knn = neighbors.KNeighborsRegressor(n_neighbors, weights='uniform')
knn.fit(train_X, train_Y)
knn_pred_Y = knn.predict(test_X)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(test_Y, knn_pred_Y))
# Coefficient of determination R^2 (labelled "Variance score" in the output): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(test_Y, knn_pred_Y))

Mean squared error: 587.54
Variance score: 0.20
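
The number of neighbors is a hyperparameter, and the value 10 used above is only one possible choice. A common way to choose it is cross-validation on the training data, sketched here with `GridSearchCV` on synthetic data; the candidate values of $k$ and the sine-shaped target are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
Y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=120)

# Try several values of k and keep the one with the lowest cross-validated MSE.
search = GridSearchCV(
    KNeighborsRegressor(weights="uniform"),
    param_grid={"n_neighbors": [1, 3, 5, 10, 20]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, Y)

print("best k:", search.best_params_["n_neighbors"])
print("cross-validated MSE:", -search.best_score_)
```

Because the search only sees the training data, the test set stays untouched for the final performance estimate.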


**3.4. Support vector regression**

In [23]:
from sklearn import svm
clf = svm.SVR()
clf.fit(train_X, train_Y)
clf_pred_Y = clf.predict(test_X)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(test_Y, clf_pred_Y))
# Coefficient of determination R^2 (labelled "Variance score" in the output): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(test_Y, clf_pred_Y))

Mean squared error: 771.00
Variance score: -0.05
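
One possible reason for a poor score is that `SVR` with its default RBF kernel is sensitive to the scale of the features, and the Guerry columns differ in scale by several orders of magnitude. Whether scaling helps on this data set is not tested here; the sketch below only illustrates the effect on synthetic data with two features on deliberately different scales.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Two informative features on very different scales (an assumption of this sketch).
X = np.column_stack([
    rng.normal(scale=1.0, size=200),
    rng.normal(scale=10000.0, size=200),
])
Y = 2.0 * X[:, 0] + X[:, 1] / 10000.0 + rng.normal(scale=0.1, size=200)

train_X, test_X = X[:150], X[150:]
train_Y, test_Y = Y[:150], Y[150:]

# SVR on the raw features versus SVR on standardised features.
raw = SVR().fit(train_X, train_Y)
scaled = make_pipeline(StandardScaler(), SVR()).fit(train_X, train_Y)

raw_r2 = raw.score(test_X, test_Y)
scaled_r2 = scaled.score(test_X, test_Y)
print("R^2 without scaling:", raw_r2)
print("R^2 with scaling:   ", scaled_r2)
```

Putting the scaler inside a pipeline means it is fitted on the training data only, so no information from the test set leaks into the model.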


