# Recording of class
Background: people join from Americas/East Asia

Solution: recording of class
- Will be put on secret youtube links - not searchable
- Delete mid June

Consent: via poll

# Session 3:
## Non-linear ML

- A tour of the most essential supervised models

*Andreas Bjerre-Nielsen*

# Review


# Universal Approximation

Definition: Given enough data, the algorithm comes as close to the sampling distribution as possible
- Note: does not care about causation or selection! Purely prediction
    
Examples?

- We can also make input non-linear using `PolynomialFeatures` of any order.
  - Follows from iterative Taylor expansion
  - Problem: spurious coefficients and large coefficients
- Other approaches? Non-linear?
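
As a minimal sketch of the polynomial approach, the example below fits linear regressions on `PolynomialFeatures` of increasing order. The sine-curve data and the names `x`, `y` and `r2_by_degree` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: a noiseless sine curve on [-3, 3]
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel()

# Fit a linear model on polynomial features of increasing order
r2_by_degree = {}
for degree in [1, 3, 5]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x, y)
    r2_by_degree[degree] = model.score(x, y)  # in-sample R^2

print(r2_by_degree)
```

The in-sample fit improves with the order, but so does the risk of large and spurious coefficients, which is why the order should be chosen by validation.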

## Agenda
More supervised tools

1. [Measuring classification performance](#Measuring-classification-performance)
1. [Nested cross validation](#Nested-cross-validation)

Non-linear ML models

1. [Kernel-based models](#Kernel-based-models)
1. [Tree-based modelling](#Tree-based-models)
1. [Ensemble learning](#Ensemble-learning)
1. [Neural networks and deep learning](#Neural-networks-and-deep-learning)

## Loading up

In [2]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
import seaborn as sns

## Data for this lecture

We begin by loading the titanic dataset, *survived* is target. 

In [3]:
from sklearn.model_selection import train_test_split

df = sns.load_dataset('titanic')\
        .assign(male=lambda df: df.sex=='male',
                age_null = lambda df: df.age.isnull())
drop_cols = ['survived','adult_male','class','who','alive','embark_town','sex', 'deck']
y,X = df['survived'], pd.get_dummies(df.drop(drop_cols, axis=1), dummy_na=True, drop_first=True).fillna(-99)
X_train, X_test, y_train, y_test = train_test_split(X,y)

auc_scores = {}

# Measuring classification performance



## Breakdown by error type (1)

We measure the accuracy as the rate of true predictions, i.e. \begin{align}ACC&=\frac{True}{True+False}\end{align}

Can we decompose?

## Breakdown by error type (2)
Yes, we can decompose into
$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$

<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch06/images/06_08.png' alt="Drawing" style="width: 400px;"/></center>

## Breakdown by error type (3)

Some powerful measures:

- Precision: share of *predicted positive* that are true positives
    - PRE = $\frac{TP}{TP+FP}$    
    - = positive predictive value = 1 - false discovery rate
- Recall: share of *actual positive* that are predicted positive    
   - REC = $\frac{TP}{TP+FN}=\frac{TP}{AP}$ 
   - = true positive rate = 1 - false negative rate
- F1: mix recall and precision: $\frac{2\cdot PRE\cdot REC}{PRE+ REC}$


In [1]:
from sklearn.metrics import precision_score, recall_score, f1_score
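
A small sketch of how these measures can be computed. The label vectors `y_true` and `y_pred` are made up for illustration; the hand-computed values from the confusion matrix should agree with the scikit-learn functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Measures computed by hand from the four counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The same measures from scikit-learn
precision_sk = precision_score(y_true, y_pred)
recall_sk = recall_score(y_true, y_pred)
f1_sk = f1_score(y_true, y_pred)
```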

## Breakdown by error type (4)

Classification models provide a predicted likelihood of being in the class or not:
- The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate as we vary the threshold for predicting the positive class.
    - ROC is a *theoretical* measure of model performance based on probabilities.
    - AUC: Area Under the (ROC) Curve.
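
A minimal sketch of the ROC curve and AUC, using four made-up observations (`y_true`, `y_score`). It also illustrates one common reading of AUC: the share of (positive, negative) pairs where the positive observation gets the higher score.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve: false/true positive rates for each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_from_curve = auc(fpr, tpr)           # area under the curve
auc_direct = roc_auc_score(y_true, y_score)

# Share of (positive, negative) pairs that are ranked correctly
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
auc_pairs = np.mean(pos[:, None] > neg[None, :])
```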

## Breakdown by error type (5)

Example of Area Under the (ROC) Curve.

<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch06/images/06_10.png' alt="Drawing" style="width: 800px;"/></center>

# Nested cross validation

## Nested cross validation (1)

- Model test does not consider uncertainty from the fact that we are also tuning hyperparameters:
  - Leads to overfitting (Varma & Simon 2006; Cawley, Talbot 2010).
- Solution is **nested cross validation**.
  - Validation step should not be modelled as 1) train; 2) test.
  - Better way is 1) model selection: train, validate; 2) test.
  - Implement as pp 204-205 in Python Machine Learning (Raschka):
      - first inner loop: `GridSearchCV` 
      - second outer loop: `cross_val_score`
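
The two loops can be sketched as follows. This is an illustration under assumptions, not the book's exact code: the simulated data (`X_demo`, `y_demo`), the KNN model and the small grid are all chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical simulated data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Inner loop: model selection (train, validate) over the hyperparameter grid
inner = GridSearchCV(estimator=KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 5, 15]},
                     scoring="roc_auc",
                     cv=2)

# Outer loop: each outer fold re-runs the whole grid search on its training part
# and tests the selected model on data that played no role in the tuning
outer_scores = cross_val_score(inner, X_demo, y_demo, scoring="roc_auc", cv=5)
print(outer_scores.mean(), outer_scores.std())
```

The mean and spread of `outer_scores` then describe the performance of the whole procedure, including the tuning step.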

## Nested cross validation (2)
A depiction of the process:

<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch06/images/06_07.png' alt="Drawing" style="width: 450px;"/></center>

## Nested cross validation (3)
Example of application: [Bjerre-Nielsen et al. (2021, PNAS)](https://doi.org/10.1073/pnas.2020258118)
  - Show that student surveillance using phones does not lead to better models of academic performance
    - Important for algorithmic policy making!

# Kernel-based models

### *Learn from others like you*

## Kernels

What is a kernel? A mapping $k$ that computes *similarity* between two vectors. 

- E.g. how similar are the two observations $x_i$ and $x_j$ from space $\mathcal{X}$? 
- Formally a kernel is a mapping $k: \mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$

How can we use it as a supervised model?


Let $y$ be a binary target, $y\in\{1,-1\}$. 


We can use a kernel to weight observations from training data-set:


$${\hat {y}_i}=\operatorname {sgn} \sum _{j=1}^{n}y_{j}k(x _{i},x_j )$$
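
A minimal sketch of this kernel-weighted vote. The radial basis function kernel, its `gamma` parameter and the toy data are assumptions made for illustration; any similarity kernel could be plugged in.

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    """Similarity that decays with squared Euclidean distance (an assumed choice)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def kernel_classify(x_new, X_train, y_train, kernel=rbf_kernel):
    """Sign of the kernel-weighted sum of training labels in {-1, 1}."""
    votes = sum(y_j * kernel(x_new, x_j) for x_j, y_j in zip(X_train, y_train))
    return int(np.sign(votes))

# Hypothetical training data: two points per class
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_toy = np.array([-1, -1, 1, 1])

pred_near_origin = kernel_classify(np.array([0.1, 0.0]), X_toy, y_toy)
pred_far = kernel_classify(np.array([2.9, 3.0]), X_toy, y_toy)
```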

## Kernels

One example of a kernel is the **k-nearest neighbor**. 

1. Compute $||x_i-x_j||$, i.e. distance norm between $x_i$ and all $x_j$, e.g. Euclidean distance ($L_2$ norm)
2. The kernel $k(x_i,x_j)$ equals a constant weight $1/w$ if $x_j$ is among the $k$ nearest neighbors of $x_i$ (the $k$ smallest distances); otherwise $k(x_i,x_j)=0$.

Many alternative kernels:
- Radius neighbors, which weights all observations within a threshold $r$ evenly, i.e. if $||x_i-x_j||<r$.
- E.g. polynomial, radial basis function and many more


## Application of kernels


We can use KNN to aggregate over the $k$ most similar observations. 
- The KNN classifier computes the mode
- The KNN regressor computes the mean
- Note that neighbor kernels are common in econometrics too 

Generally, we may use the Nadaraya-Watson estimator for a kernel $k$ (aka. kernel regression)

$$\widehat{y}_{k}(x)={\frac {\sum _{i=1}^{n}k(x-x_{i})y_{i}}{\sum _{i=1}^{n}k(x-x_{i})}}$$
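
A minimal sketch of the Nadaraya-Watson estimator in one dimension. The Gaussian kernel, the `bandwidth` parameter and the sine data are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(u, bandwidth=0.5):
    """Gaussian weight for the distance u (an assumed kernel choice)."""
    return np.exp(-0.5 * (u / bandwidth) ** 2)

def nadaraya_watson(x_new, x_obs, y_obs, bandwidth=0.5):
    """Kernel-weighted average of the observed targets around x_new."""
    weights = gaussian_kernel(x_new - x_obs, bandwidth)
    return np.sum(weights * y_obs) / np.sum(weights)

# Hypothetical data: a sine curve observed on a grid
x_obs = np.linspace(0, 2 * np.pi, 50)
y_obs = np.sin(x_obs)

# Prediction at the peak of the curve, slightly below 1 due to smoothing
y_hat = nadaraya_watson(np.pi / 2, x_obs, y_obs, bandwidth=0.3)
```

A smaller bandwidth tracks the data more closely (less bias, more variance), so the bandwidth plays the same role as the number of neighbors in KNN.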

## Application of kernels

In [4]:
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_knn.fit(X_train,y_train)
auc_scores['5_nearest_neighbors'] = round(roc_auc_score(y_test, clf_knn.predict_proba(X_test)[:, 1]), 3)
auc_scores

{'5_nearest_neighbors': 0.737}

## Application of kernels


Example of the k-nearest neighbor classifier where the binary values are illustrated as red (= -1) / blue (= 1). 



<center><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/419px-KnnClassification.svg.png' alt="Drawing" style="width: 400px;"/></center>


*(artist: [Antti Ajanki](https://commons.wikimedia.org/wiki/File:KnnClassification.svg), CC-SA 1.0, no changes)*

## Application of kernels


Another application is the **Support Vector Machine** (SVM), which, like logistic regression, separates two classes (green/blue):
<center><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/SVM_margin.png/926px-SVM_margin.png' alt="Drawing" style="width: 450px;"/></center>

*(artist: [Larhmam](https://commons.wikimedia.org/wiki/File:SVM_margin.png), CC-SA 4.0, no changes)*

## Application of kernels

SVMs have two major advantages over logistic regression:
- Computational efficiency (optimization under constraints)
- Non-linear versions via kernels that decompose as an inner product in a feature space, i.e. $k(x_i,x_j)=\left\langle\varphi(x_i),\varphi(x_j) \right\rangle$

<center><img src='https://upload.wikimedia.org/wikipedia/commons/d/d8/Kernel_yontemi_ile_veriyi_daha_fazla_dimensiyonlu_uzaya_tasima_islemi.png' alt="Drawing" style="width: 500px;"/></center>



*(artist: [Shehzadex](https://commons.wikimedia.org/wiki/File:Kernel_yontemi_ile_veriyi_daha_fazla_dimensiyonlu_uzaya_tasima_islemi.png), CC-SA 4.0, no changes)*
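
To illustrate the value of a non-linear kernel, the sketch below compares a linear and a radial basis function SVM on simulated concentric circles. The data and the variable names are assumptions for illustration; they are not the lecture data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data: an inner circle (one class) inside an outer circle (other class)
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_c, y_c, random_state=0)

# Fit an SVM with a linear kernel and one with an RBF kernel
acc = {}
for kernel in ["linear", "rbf"]:
    clf_svm = SVC(kernel=kernel).fit(Xc_train, yc_train)
    acc[kernel] = clf_svm.score(Xc_test, yc_test)  # test accuracy

print(acc)
```

No straight line separates the two circles, so the linear kernel performs poorly, while the RBF kernel separates them in the implicit higher-dimensional space.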

# Tree-based models

### *Divide and conquer*

## Decision Tree 

Situation - we want to predict who will become a criminal
<center><img src='fig/decision_tree/0001.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree 

How can we make a tree to determine who becomes criminal?
<center><img src='fig/decision_tree/0002.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree 

Situation - want to predict who will become a criminal
<center><img src='fig/decision_tree/0003.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree 

Split 1: by place of birth
<center><img src='fig/decision_tree/0004.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree 

Split 2: by alcoholic mother
<center><img src='fig/decision_tree/0005.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree 

Alternative split 1: by alcoholic mother
<center><img src='fig/decision_tree/0006.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree 

How can we automate the splitting process of data?



We can do as follows:

- Use measure/criterion to evaluate the value of all potential splits
- Apply the criterion iteratively (greedy approach)

## Decision Tree 

Splitting criterion: entropy
<center><img src='fig/decision_tree/0007.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree 

Applying the entropy criterion - split by alcoholic mother
<center><img src='fig/decision_tree/0008.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree 

Applying the entropy criterion - split by place of birth
<center><img src='fig/decision_tree/0009.jpg' alt="Drawing" style="width: 1000px;"/></center>

## Decision Tree - a generalization

Generally, we estimate trees using the Classification And Regression Tree (**CART**) approach:

- At each step evaluate parent (before split) against children (after split)

  - Can use other measures for deciding to split, e.g. gini impurity, variance (sum of squared pairwise differences over observations)
  - When used for regression, a different criterion is required, e.g. variance
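
The parent-versus-children comparison can be sketched with the entropy criterion. The function names and the toy labels below are hypothetical; the information gain is the parent's entropy minus the size-weighted entropy of the children.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Hypothetical node with four observations of each class
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# Split A separates the classes perfectly; split B leaves both children mixed
gain_a = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
gain_b = information_gain(parent, [1, 1, 0, 0], [1, 1, 0, 0])
```

A greedy tree would choose split A here, since it delivers the larger information gain.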


## Decision Tree - a generalization

Some terminology of a CART:

<center><img src='fig/decision_tree/CART_general.png' alt="Drawing" style="width: 700px;"/></center>

## Decision Tree - a generalization

What are some of the properties of the CART?
- Main **advantage**: no *underfitting*, as it can universally approximate
- Main **disadvantage**: prone to heavy *overfitting* 
- We can trade off the two by tuning hyperparameters, e.g. maximal depth
  - Another solution is to grow forests rather than trees (next topic)
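
A sketch of the trade-off, assuming simulated data with some label noise (`X_demo`, `y_demo` are made up). It compares training accuracy with cross-validated AUC for different values of `max_depth`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical simulated data where 10% of labels are flipped at random
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=4,
                                     flip_y=0.1, random_state=0)

train_acc, cv_auc = {}, {}
for depth in [1, 3, None]:  # None lets the tree grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_auc[depth] = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    train_acc[depth] = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

print(train_acc)
print(cv_auc)
```

The fully grown tree fits the training data perfectly, noise included, which is the overfitting that depth limits or forests aim to reduce.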

## Decision Tree - python demonstration

We now load the Decision Tree classification model and train it

In [5]:
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier()
clf_dt.fit(X_train,y_train)
auc_scores['deciscion_tree'] = round(roc_auc_score(y_test, clf_dt.predict_proba(X_test)[:, 1]),3)
auc_scores

{'5_nearest_neighbors': 0.737, 'deciscion_tree': 0.81}

# Ensemble learning

### *Leveraging wisdom of the crowd*

## Ensemble learning
Main idea
- Create and train many supervised models 

How is this implemented?
- Aggregation of independent models (e.g. bagging)
  - E.g. `mode` in classification and `mean` in regression
  - Each model is weighted equally in aggregation
- Sequential improvement of models (e.g. boosting)
  - Next model learns from current model's errors


## Bagging
One way to implement model aggregation:

- What is it?
  - Acronym for **B**ootstrap **Agg**regate **'ing**
  - I.e. repeatedly estimate model on bootstrap samples
  
- Why should we use it?
  - Advantage: Decreases overfitting, implies better generalization performance
  - Disadvantage: Takes longer to train/estimate model!

## Bagging
A quick reminder of **bootstrap**:

- Given a table of length $n$, randomly select $n$ rows with replacement $B$ times

Example with five draws from a one-dimensional array, $X=[0, 1, 2, 3, 4, 5]$: 

In [6]:
import numpy as np

n,B = 6, 5
X = np.arange(n)
bootstrap = np.random.RandomState(0).choice(X,size=[B,n])
bootstrap

array([[4, 5, 0, 3, 3, 3],
       [1, 3, 5, 2, 4, 0],
       [0, 4, 2, 1, 0, 1],
       [5, 1, 5, 0, 1, 4],
       [3, 0, 3, 5, 0, 2]])

## Bagging 


Example of bagging algorithm. Assume we have training data set with features $X$ and target $Y$:
  - For $b = 1, ..., B$:
    1. Draw bootstrap sample from training data $X, Y$; denote $X_b, Y_b$.
    2. Train a classification or regression model $f_b$ on $X_b, Y_b$.

- After training, predictions for unseen samples $x'$ can be made by averaging the predictions from all the individual models on $x'$:


$$\hat{f}(x')=\frac{1}{B}\sum_{b=1}^{B}f_b(x')$$
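
The algorithm can be sketched by hand as follows. The simulated data, the choice of decision trees as the base model and $B=25$ are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical simulated training data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

rng = np.random.RandomState(0)
B = 25
models = []
for b in range(B):
    # 1. Draw a bootstrap sample: n row indices with replacement
    idx = rng.choice(len(X_demo), size=len(X_demo), replace=True)
    # 2. Train a model on the bootstrap sample
    models.append(DecisionTreeClassifier(random_state=b).fit(X_demo[idx], y_demo[idx]))

# Aggregate: average the predicted probabilities across the B models
proba = np.mean([m.predict_proba(X_demo)[:, 1] for m in models], axis=0)
y_bagged = (proba > 0.5).astype(int)
```

scikit-learn offers the same idea ready-made in `sklearn.ensemble.BaggingClassifier`.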

## Random forest


*Random forest* is a  popular machine learning method that uses bagging by design:


- Train/estimate the forest:
    1. Draw $B$ bootstrap samples on training data, one for each CART
    1. Grow a decision tree from the bootstrap sample. At each node:
      - Randomly select $d$ features without replacement.
      - Split the node using the best split among these $d$ features.
- Apply the forest: 
   - For each observation we aggregate predictions from CARTs 

NOTE: bootstrapping from rows and subsampling from columns



## Random forest

How does a random forest work? Example of $N=B$ trees:

<center><img src='fig/RF.png' alt="Drawing" style="width: 1100px;"/></center>

*(by [Machado et al. 2015](http://dx.doi.org/10.1186/s13567-015-0219-7) licensed under [CC-BY](https://www.researchgate.net/figure/Random-forest-model-Example-of-training-and-classification-processes-using-random_fig5_280533599))*

## Random forest


Why do we subsample features? 
  - Reduces overfitting by making trees more independent - key part of the procedure.


Random forest is nice to use: it works off the shelf and generally performs well. However, it is not transparent.

- When fitting CARTs, we estimate the information gain for every feature at each split
- We can summarize the *importance* of a feature as its relative amount of information gain delivered during classification
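
A sketch of feature importance on simulated data where, by construction, only the first two columns are informative. The data are an assumption for illustration. Note that scikit-learn's `feature_importances_` is impurity-based (Gini by default) rather than entropy-based, but the idea is the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: with shuffle=False the two informative features come first,
# followed by four pure-noise features
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=2,
                                     n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Relative impurity reduction delivered by each feature; sums to one
importances = forest.feature_importances_
print(importances.round(3))
```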

## Random forest

We now load the Random Forest classification model and apply it

In [7]:
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train,y_train)
auc_scores['random_forest'] = round(roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:, 1]), 3)
auc_scores

{'5_nearest_neighbors': 0.737, 'deciscion_tree': 0.81, 'random_forest': 0.853}

## Boosting

*Social learning*

- Background
  - [Kearns (1988)](http://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf) posed the fundamental question: how can we turn weak learners (not universal approximators) into strong learners?
- Answer to Kearns' problem came by the following procedure:
  - Train models repeatedly 
  - For each new model set weights to focus on previous models' errors.
  
- Intuition: Many weak models that learn from each other’s mistakes, combine into one strong model

  

## Boosting

How can we implement boosting?
- ***Ada***ptive ***Boost***: Sequentially update training weights, upweighting observations that previous models got wrong 
- ***Gradient*** ***Boost***: Sequentially train on the previous models' errors using a gradient approach

Why should we use these methods? They often work even better than random forest. 

## Boosting

Implementation of adaptive boosting: 

1. Create a weight vector $w$ that encodes the importance of each training sample
2. For $j$ out of $m$ boosting iterations:
  1. Train a weighted weak classifier (e.g. a shallow decision tree such as a depth-1 stump)
  1. Predict class labels
  1. Update $w$ based on the errors that the classifier makes (steps *c* to *f* in Raschka page 248)
3. To make predictions apply weighted voting, i.e. giving more prediction weight to less error-prone classifiers
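
The steps above can be sketched as follows. This is a simplified version of discrete AdaBoost under assumptions: simulated data, targets recoded to $\{-1,1\}$, decision stumps as weak classifiers and $m=30$ iterations. It is not the exact procedure from the book.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical simulated data; recode the target from {0, 1} to {-1, 1}
X_demo, y01 = make_classification(n_samples=400, n_features=6, random_state=0)
y_demo = 2 * y01 - 1

n, m = len(y_demo), 30
w = np.full(n, 1 / n)  # 1. equal weight on every training sample
stumps, alphas = [], []
for j in range(m):     # 2. boosting iterations
    # a. train a weighted weak classifier (a depth-1 decision stump)
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=w)
    # b. predict class labels
    pred = stump.predict(X_demo)
    # c. weighted error rate and the classifier's voting weight
    err = np.clip(np.sum(w * (pred != y_demo)), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    # d. upweight misclassified samples, then renormalize
    w = w * np.exp(-alpha * y_demo * pred)
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# 3. weighted vote: less error-prone classifiers get a larger say
scores = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
y_boosted = np.sign(scores)
boosted_acc = np.mean(y_boosted == y_demo)
```

In practice one would use `sklearn.ensemble.AdaBoostClassifier`, whose default weak classifier is also a decision stump.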



# Neural networks and deep learning

### Models inspired by biology

## Neural background

What happens when we combine multiple neurons? Network of neurons?

<center><img src='https://www.publicdomainpictures.net/pictures/380000/velka/kunstliche-intelligenz-201103.jpg' alt="Drawing" style="width: 680px;"/></center>

(artist: [Gerd Altmann](https://www.publicdomainpictures.net/en/view-image.php?image=372933&picture=artificial-intelligence-201103), CC-0)

## Neural background

All the linear and logistic models we saw can be expressed in the following form:

<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch12/images/12_01.png' alt="Drawing" style="width: 800px;"/></center>

(Image from [Raschka, 2017](https://github.com/rasbt/python-machine-learning-book-2nd-edition))

## Neural background

Recall: logistic regression was one of the simple linear models / neurons

How does it perform on the titanic data?

In [8]:
from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(max_iter=100000)
clf_lr.fit(X_train,y_train)
auc_scores['random_log_reg'] = round(roc_auc_score(y_test, clf_lr.predict_proba(X_test)[:, 1]), 3)
auc_scores

{'5_nearest_neighbors': 0.737,
 'deciscion_tree': 0.81,
 'random_forest': 0.853,
 'random_log_reg': 0.84}

## Neural example

What about slightly more advanced data? 
- Suppose we have circular data: 
  - $y$ is <font style="color:blue;">blue</font> if $x$ near origin; 
  - otherwise, $y$ is <font style="color:red;">red</font>. 
- Let $P(y=red)=0.7$.


## Neural example

What do the model predictions look like before seeing $x$?

<center><img src='fig/neural/circle_even_likelihood.png' alt="Drawing" style="width: 700px;"/></center>

## Neural example

What if we depict the predictions of logistic regression?

<center><img src='fig/neural/circle_lr_likelihood.png' alt="Drawing" style="width: 700px;"/></center>

## Neural example

Could we somehow create intermediate features that help us model?

We create four logistic regressions, two horizontal (left) and two vertical (right)
- none of them are accurate, what about together?
<center><img src='fig/neural/circle_direction_likelihood.png' alt="Drawing" style="width: 1200px;"/></center>

## Neural example

When we combine the four models, taking the maximal value of $P(y=red)$:

<center><img src='fig/neural/circle_combine_direction_likelihood.png' alt="Drawing" style="width: 700px;"/></center>

## Neural networks

What happened in the previous example? Manually created a neural network. 
- General structure of an (artificial) neural network, 1 hidden layer:

<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch12/images/12_02.png' alt="Drawing" style="width: 750px;"/></center>

(Image from [Raschka, 2017](https://github.com/rasbt/python-machine-learning-book-2nd-edition))

## Neural networks

Can we compare the neural network with a single neuron?
- Yes, the neuron has no hidden layers.
- But what is the intermediate (hidden) layer?
  - It consists of $h$ extra auxiliary models that are used to predict the output. 

Why might neural architecture be useful?

## Neural networks

Fundamental results in the 1980s and early 1990s:

- Methods for estimating neural networks were developed 
- [Cybenko (1989)](https://doi.org/10.1007%2FBF02551274) and [Hornik (1991)](https://doi.org/10.1016%2F0893-6080%2891%2990009-T) demonstrate that neural networks with a **single hidden layer** can universally approximate for $h\rightarrow\infty$

  
Is this important?

- Not at the time - models e.g. non-linear SVM, random forest could be trained with less power for similar accuracy.
- Consequence: popularity of neural networks fell in 1990's and early 2000's.

## Neural networks

Today neural networks have changed computing, e.g. to infer content of images:

<br>
<center><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Detected-with-YOLO--Schreibtisch-mit-Objekten.jpg/1200px-Detected-with-YOLO--Schreibtisch-mit-Objekten.jpg' alt="Drawing" style="width: 800px;"/></center>

(author: [MTheiler](https://commons.wikimedia.org/wiki/User:MTheiler), license: CC-BY-SA 4.0)

## Neural networks

What happened since the early 2000's?

Breakthrough in 2008-12, once three earlier obstacles were overcome:
- Data were not as abundant and accessible as today
- Computers were not powerful enough, in particular in the number of cores 
  - ... solved by modern Graphics Processing Units (GPUs) developed for gaming
- Changes in architecture: 
  - Multiple layers rather than 1 hidden layer, known as ***deep learning***
  - Many more advances, e.g. ***convolution*** of layers that divides input into independent parts

## Mechanics of neural networks 

Model works with **feed-forward**: start with input, then proceed through layers


<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch12/images/12_02.png' alt="Drawing" style="width: 750px;"/></center>

(Image from [Raschka, 2017](https://github.com/rasbt/python-machine-learning-book-2nd-edition))

## Mechanics of neural networks 

The models in both layers consist of linear models with activation, $\phi$:

\begin{align}
a_k^{(hid)} &= \phi\Big(\sum_{i=0}^m a_i^{(in)}\cdot w_{i,k}^{(hid)}\Big), \quad k\in\{1,..,h\} \\
a_l^{(out)} &= \phi\Big(\sum_{i=0}^h a_i^{(hid)}\cdot w_{i,l}^{(out)}\Big), \quad l\in\{1,..,t\} \\
            &= \phi\left(\sum_{i=0}^h \phi\Big(\sum_{j=0}^m a_j^{(in)}\cdot w_{j,i}^{(hid)}\Big)\cdot w_{i,l}^{(out)}\right), \quad l\in\{1,..,t\} \\
\end{align}

What happens if $\phi$ is linear?
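
A minimal NumPy sketch of the feed-forward step, with randomly drawn weights and the bias handled as unit $0$ with value one (all names and sizes are assumptions for illustration). It also checks the answer to the question above: with a linear (identity) activation the two layers collapse into a single linear model.

```python
import numpy as np

rng = np.random.RandomState(0)
m, h, t = 3, 4, 2                    # number of input, hidden and output units
W_hid = rng.normal(size=(m + 1, h))  # row 0 holds the bias weights
W_out = rng.normal(size=(h + 1, t))

def add_bias(a):
    """Prepend a column of ones (the bias unit with index 0)."""
    return np.hstack([np.ones((a.shape[0], 1)), a])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def identity(z):
    return z

def forward(a_in, phi):
    """Feed-forward: input -> hidden layer -> output layer."""
    a_hid = phi(add_bias(a_in) @ W_hid)
    return phi(add_bias(a_hid) @ W_out)

a_in = rng.normal(size=(5, m))       # five hypothetical observations
out_sigmoid = forward(a_in, sigmoid)
out_linear = forward(a_in, identity)

# With a linear activation, the network equals one linear model
W_collapsed = W_hid @ W_out[1:]      # combined weights, shape (m + 1, t)
W_collapsed[0] += W_out[0]           # the output bias joins the combined bias
out_single_layer = add_bias(a_in) @ W_collapsed
```

The non-linear activation is therefore what gives the hidden layer its extra expressive power.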

## Mechanics of neural networks 

Estimation of neural network is possible using ***backpropagation***

- Idea, use chain rule: $\frac{\partial y}{\partial x}=\frac{\partial y}{\partial u} \frac{\partial u}{\partial x}$ where $y=f(u)$, $u=g(x)$ (Leibniz notation)
- Fix initial parameters drawn randomly
- Apply gradient descent repeatedly for a given number of epochs as follows
  1. Compute the errors in output stage using feed-forward
  1. Optimize model parameters of output layer given output errors
  1. Optimize model parameters of hidden layer given output errors, using the chain rule 
    - (keeping parameters of output layer fixed)
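
A minimal NumPy sketch of backpropagation for one hidden layer. The toy circle data, sigmoid activations, cross-entropy loss, learning rate `eta` and layer size `h` are all assumptions for illustration. As is common in practice, the gradients of both layers are computed from the same feed-forward pass before any parameter is updated.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.RandomState(0)

# Hypothetical toy data: the label is 1 inside the unit circle
X_demo = rng.uniform(-2, 2, size=(200, 2))
y_demo = (np.sum(X_demo ** 2, axis=1) < 1).astype(float).reshape(-1, 1)

# Fix initial parameters drawn randomly
h, eta, epochs = 8, 0.2, 500
W_hid, b_hid = rng.normal(scale=0.5, size=(2, h)), np.zeros(h)
W_out, b_out = rng.normal(scale=0.5, size=(h, 1)), np.zeros(1)

losses = []
for epoch in range(epochs):
    # Feed-forward and the resulting cross-entropy loss
    a_hid = sigmoid(X_demo @ W_hid + b_hid)
    a_out = sigmoid(a_hid @ W_out + b_out)
    losses.append(-np.mean(y_demo * np.log(a_out) + (1 - y_demo) * np.log(1 - a_out)))

    # Output errors: derivative of the loss w.r.t. the output layer's net input
    delta_out = (a_out - y_demo) / len(X_demo)
    # Chain rule: propagate the output errors back through the hidden activations
    delta_hid = (delta_out @ W_out.T) * a_hid * (1 - a_hid)

    # Gradient descent step for both layers
    W_out -= eta * a_hid.T @ delta_out
    b_out -= eta * delta_out.sum(axis=0)
    W_hid -= eta * X_demo.T @ delta_hid
    b_hid -= eta * delta_hid.sum(axis=0)

print(losses[0], losses[-1])
```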
 

## Mechanics of neural networks 

Drawbacks of neural networks

- Minimal transparency of how input affects output, discrimination?
- Optimization problem is non-linear and non-convex, so gradient descent may end in a local minimum

<br>
<center><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Extrema_example.svg/1280px-Extrema_example.svg.png' alt="Drawing" style="width: 450px;"/></center>

(By *KSmrq*, licensed under CC-BY-SA 3.0, [source](https://commons.wikimedia.org/w/index.php?title=User:KSmrq))

## Application in economics

Methodological
- [Hartford et al. (2017)](http://proceedings.mlr.press/v70/hartford17a/hartford17a.pdf) develop framework for using neural networks in instrumental variables to parse non-linear, high dimensional treatment 
  - show application to treatment of URL shown in search advertisement
- [Farrell, Liang, Misra (2021)](https://doi.org/10.3982/ECTA16901) develop a generalized framework for using neural networks in estimation and inference
- [Chernozhukov et al. (2017)](https://doi.org/10.1257/aer.p20171038) and [Chernozhukov et al. (2018)](https://doi.org/10.1111/ectj.12097) develop a two-step framework called Double Machine Learning to apply machine learning in estimation.
  - Allows for applications of machine learning 
 




## Application in economics

Data imputation
- Input: Google Street View (360° images)
  - Infer neighborhood safety ([Naik, Raskar & Hidalgo, 2016](https://doi.org/10.1257/aer.p20161030))
  - To measure and track development in sociodemographic composition ([Gebru et al. 2017](https://doi.org/10.1073/pnas.1700035114); [Naik et al. 2017](https://doi.org/10.1073/pnas.1619003114))
- [Layout Parser](https://layout-parser.github.io/), which can automatically parse the structure of historical records ([Shen et al., 2021](https://arxiv.org/abs/2103.15348))
  - M. Wust and C.M. Dahl use these techniques to extract information about health visitors in the 1960s
  

## Application in economics

Reinforcement learning
- Dynamic structural models that are solved approximately.
- Work in progress by colleagues in Copenhagen: Estimate choice models using neural networks. Can increase state space drastically!
- [Calvano et al. 2020](https://doi.org/10.1257/aer.20190623) demonstrate that pricing algorithms implicitly collude (i.e. coordinate on price setting) even when each is optimized only for its own pricing
  - How should we regulate price setting online? (and offline)

## Training a neural network

We first load the MNIST data with its training and test sets

In [10]:
import pickle
with open("../base/data/mnist/mnist.pkl",'rb') as f:
    mnist = pickle.load(f)
X_train,y_train,X_test,y_test = \
    mnist["training_images"], mnist["training_labels"], mnist["test_images"], mnist["test_labels"]

## Training a neural network

We load the code from Raschka chapter 12. Available in this [auxiliary notebook](neural_network_auxiliary.ipynb).

In [11]:
from neuralnet import NeuralNetMLP

clf_nn = {}
for epoch in [2, 20, 200]:
    clf_nn[epoch] = \
        NeuralNetMLP(n_hidden=100,
                     l2=0.01,
                     epochs=epoch,
                     eta=0.0005,
                     minibatch_size=100,
                     shuffle=True,
                     seed=1)

## Training a neural network

Let's fit the network with 2 epochs

In [12]:
clf_nn[2].fit(X_train=X_train[:55000],
              y_train=y_train[:55000],
              X_valid=X_train[55000:],
              y_valid=y_train[55000:])

2/2 | Cost: 44472.59 | Train/Valid Acc.: 88.74%/91.50% 

<neuralnet.NeuralNetMLP at 0x7fdee89b4f10>

## Training a neural network

Let's fit the network with 20 epochs

In [13]:
clf_nn[20].fit(X_train=X_train[:55000],
               y_train=y_train[:55000],
               X_valid=X_train[55000:],
               y_valid=y_train[55000:])

20/20 | Cost: 26911.76 | Train/Valid Acc.: 92.89%/94.60% 

<neuralnet.NeuralNetMLP at 0x7fdefb7c6670>

## Training a neural network

Let's fit the network with 200 epochs

In [14]:
clf_nn[200].fit(X_train=X_train[:55000],
                y_train=y_train[:55000],
                X_valid=X_train[55000:],
                y_valid=y_train[55000:])

200/200 | Cost: 14340.23 | Train/Valid Acc.: 96.47%/96.80% 

<neuralnet.NeuralNetMLP at 0x7fdefb7c6340>

## Training a neural network

What happened to errors?
- Fell by more than a third for each tenfold increase in epochs (validation error: 8.5% → 5.4% → 3.2%).

# Outro supervised learning

## Outro supervised ML

We have seen that by sacrificing the 'unbiased' property, we get much stronger models
- General tradeoff: overfitting/underfitting
- Issue: non-linear models are less ***transparent***


## Outro - non-linear ML

Tree based models
- Simple, in particular decision tree
- Ensemble version performed best
- Later in course we extend to allow for estimating causal effects (causal forest and related models)

Kernel/support vector machines
- Less common nowadays, replaced by neural networks

## Outro - non-linear ML

Very big potential for research of neural networks
- Understanding causal effects of non-linear data, e.g.
  - Visual and audible input
  - Human life trajectory
- Understanding consequences of algorithms 
- Potentially for estimation

## Outro - non-linear ML

We have only scratched the surface of neural networks
- Architectures can include:
  - Temporal structure, convolution, regularization, activation
- Deep learning is half engineering/half science 

The world is your oyster
- Deep learning courses are available online and in most universities
- Two big frameworks: PyTorch or TensorFlow
  - May want to invest in GPU!
- Melissa Dell from Harvard has a course called *Unleashing Novel Data at Scale*; see here for the [reading list](https://dell-research-harvard.github.io/teaching/economics-2355) that may be of interest