Code:
    
1. Install packages + read data 
2. Generate training set (split data)
3. Generate model
4. Train the model 


# Outline

1. Data collection
2. Data preparation
3. Choosing a model
4. Training the model
5. Evaluate the model
6. Parameter tuning -> hyperparameters
7. Prediction


**1. Our data:**
   - Is it categorical or continuous?
       * Continuous data: the possible values cannot be counted and can only be described using intervals on the real number line. For example, the exact amount of gas purchased at the pump for cars with 20-gallon tanks would be continuous data from 0 gallons to 20 gallons, represented by the interval [0, 20], inclusive. You might pump 8.40 gallons, or 8.41, or 8.414863 gallons, or any possible number from 0 to 20. In this way, continuous data can be thought of as being uncountably infinite.
       * Categorical data: these represent characteristics such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as “1” indicating male and “2” indicating female), but those numbers don’t have mathematical meaning. You couldn’t add them together, for example. Similarly, movie, music and video game genres, weather, country names, and food and cuisine types are other examples of nominal categorical attributes.
       
**2. Models:**     
We would like to predict a restaurant's rating from different factors such as weather, location, cuisine, price, month, ...
Model: 
$$ \text{score} = \beta_0 + \beta_1 \cdot \text{price} + \beta_2 \cdot \text{location} + \dots + \varepsilon $$


- Linear regression
- Lasso
- Ridge


# Module 10 - Modeling and machine learning

Suppose we have some data $y$ we want to model/predict from input $x$.  

The aim is to find a function $f$ such that the distance between the actual values $y$ and the predicted values $f(x)$ is minimized.

Suppose we have the model $y=x^T\beta$.

We distinguish by type of the `target` variable `y`:
- **regression**: predict a numeric value
- **classification**: distinguish between target categories (non-numeric data)

## Definitions

ML lingo and econometric equivalents (in italic)

- `feature` vector, $\textbf{x}_i$, i.e. a row of input variables
  - = explanatory *variables* in econometrics
- `weight` vector, $\textbf{w}$, i.e. model parameters
  - = *coefficients* in econometrics, where they are denoted $\beta$
- `bias` term, $w_0$, i.e. the model intercept
  - = the *constant* term in econometrics, denoted $\beta_0$
  

How do we estimate the model parameters?

Step 1. initialize the weights,  $\hat{w}$, with small random numbers

Step 2. for each (training) observation, $i=1,\dots,n$
  1. compute predicted target, $\hat{y}_i$
  1. update weights $\hat{w}$ based on perceptron rule (explanation follows)
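
The two steps above can be sketched in NumPy. The notes defer the explanation of the perceptron rule, so this sketch assumes the classic form $w \leftarrow w + \eta\,(y_i - \hat{y}_i)\,x_i$ with targets coded as -1/1 and a unit step as activation. The names `fit_perceptron`, `predict`, `eta` and `n_epochs`, as well as the toy data, are illustrative and not taken from the course material.

```python
import numpy as np

def fit_perceptron(X, y, eta=0.1, n_epochs=10, seed=0):
    """Estimate perceptron weights; y must be coded as -1 or 1 (assumption)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the weights with small random numbers (w[0] is the bias)
    w = rng.normal(scale=0.01, size=X.shape[1] + 1)
    for _ in range(n_epochs):
        # Step 2: loop over the training observations
        for x_i, y_i in zip(X, y):
            z_i = w[0] + x_i @ w[1:]           # net input
            y_hat_i = 1 if z_i >= 0 else -1     # predicted target
            update = eta * (y_i - y_hat_i)      # zero when the prediction is correct
            w[1:] += update * x_i
            w[0] += update
    return w

def predict(X, w):
    """Apply the unit step to the net input of every row in X."""
    return np.where(w[0] + X @ w[1:] >= 0, 1, -1)

# Toy data: two well-separated clusters (made up for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)),
               rng.normal(2, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

w = fit_perceptron(X, y)
accuracy = (predict(X, w) == y).mean()
```

The weights only change when an observation is misclassified, which is why the rule settles down on linearly separable data such as this toy example.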

How do we compute the predicted target $\hat{y}$?

We apply a transformation to the net input:
- single observation, expanded notation:
\begin{align*}
\hat{y}_i= \phi(z_i),\quad z_i=w_0+w_1x_{i,1}+...+w_kx_{i,k}
\end{align*}

- single observation, vector notation:
\begin{align*}
\hat{y}_i= \phi(z_i),\quad z_i=\boldsymbol{w}^{T}\boldsymbol{x}_i
\end{align*}

- multiple observations, matrix notation:
\begin{align*}
\hat{\boldsymbol{y}}= & \phi(\boldsymbol{z}),\quad\boldsymbol{z}=\boldsymbol{X}\boldsymbol{w}
\end{align*}
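
As a small numerical check, the three notations give the same net input. This is a sketch under two assumptions: each $\boldsymbol{x}_i$ carries a leading 1 so that $\boldsymbol{w}^T\boldsymbol{x}_i$ absorbs the bias $w_0$, and $\phi$ is a unit step returning -1 or 1. The numbers in `X` and `w` are made up.

```python
import numpy as np

def phi(z):
    """Unit step activation (an assumed choice): 1 if z >= 0, else -1."""
    return np.where(z >= 0, 1, -1)

# Three observations, two features; the leading column of ones multiplies w_0
X = np.array([[1.0,  2.0,  0.5],
              [1.0, -1.0,  1.5],
              [1.0,  0.0, -2.0]])
w = np.array([0.5, 1.0, -1.0])   # [w_0, w_1, w_2]

# Single observation, expanded notation
z_0 = w[0] + w[1] * X[0, 1] + w[2] * X[0, 2]

# Single observation, vector notation
z_0_vec = w @ X[0]

# Multiple observations, matrix notation
z = X @ w
y_hat = phi(z)
```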

- Idea: Use some of our sample for model evaluation.
- Implementation - divide data randomly into two subsets:
    - `training data` for estimation; 
    - `test data` for evaluation.
- Note: random splitting does not work for time series, where the test data should come after the training data in time.
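
A minimal sketch of such a split, assuming scikit-learn's `train_test_split` is used. The simulated data and the 75/25 proportion are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # made-up features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Random split: 75 pct. for estimation, 25 pct. for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# For ordered data, one common alternative is a chronological split:
# shuffle=False keeps the row order, so the test set is the last 25 rows.
X_train_ts, X_test_ts, y_train_ts, y_test_ts = train_test_split(
    X, y, test_size=0.25, shuffle=False)
```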



# Module 11 - Regression and regularization 

## Two agendas (1)

What are the objectives of empirical research? 

1. *causation*: what is the effect of a particular variable on an outcome? 
2. *prediction*: find some function that provides a good prediction of $y$ as a function of $x$

## Two agendas (2)

How might we express the agendas in a model?

$$ y = \alpha + \beta x + \varepsilon $$

- *causation*: interested in $\hat{\beta}$ 

- *prediction*: interested in $\hat{y}$ 

## Fitting a polynomial (1)
Polynomial: $f(x) = 2+8x^4$

Try models of increasing order polynomials. 

- Split data into train and test (50/50)
- For polynomial order 0 to 9:
    - Iteration n: $y = \sum_{k=0}^{n}(\beta_k\cdot x^k)+\varepsilon$. (Taylor expansion)
    - Estimate order n model on training data
    - Evaluate on test data with $\log RMSE$ ($= \log \sqrt{SSE/n}$)
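
The procedure above can be sketched with NumPy alone. The sample size, the range of $x$ and the noise level are assumptions, since the notes only give the polynomial itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from f(x) = 2 + 8x^4 plus noise (noise level is an assumption)
n = 200
x = rng.uniform(-1, 1, size=n)
y = 2 + 8 * x**4 + rng.normal(scale=0.5, size=n)

# Split data into train and test (50/50)
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

log_rmse = {}
for order in range(10):
    # Estimate the order-n model on the training data (least squares)
    coefs = np.polyfit(x[train], y[train], deg=order)
    # Evaluate on the test data with log RMSE
    resid = y[test] - np.polyval(coefs, x[test])
    log_rmse[order] = np.log(np.sqrt(np.mean(resid**2)))

best_order = min(log_rmse, key=log_rmse.get)
```

Low orders underfit and give a high test error, while orders above the true one tend not to improve it further.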

Why do we regularize?
- To mitigate overfitting -> better model predictions

How do we regularize?
- We make models which are less complex:
  - reducing the **number** of coefficients;
  - reducing the **size** of the coefficients.

We add a penalty term to our optimization procedure:
    
$$ \text{arg min}_\beta \, \underset{\text{MSE=SSE/n}}{\underbrace{E[(y_0 - \hat{f}(x_0))^2]}} + \underset{\text{penalty}}{\underbrace{\lambda \cdot R(\beta)}}$$

Introducing a penalty implies that any increase in model complexity has to be met with a sufficiently large increase in the precision of the estimates.

The two most common penalty functions are L1 and L2 regularization.

- L1 regularization (***Lasso***): $R(\beta)=\sum_{j=1}^{p}|\beta_j|$ 
    - Makes coefficients sparse, i.e. selects variables by removing some (if $\lambda$ is high)
    
    
- L2 regularization (***Ridge***): $R(\beta)=\sum_{j=1}^{p}\beta_j^2$
    - Reduces coefficient size
    - Fast due to analytical solution
    
*To note:* The *Elastic Net* uses a combination of L1 and L2 regularization.
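
A rough sketch of both penalties on simulated data. For Ridge, setting the gradient of $SSE/n + \lambda\sum_j \beta_j^2$ to zero gives the analytical solution $\hat{\beta} = (X^TX + n\lambda I)^{-1}X^Ty$ (no intercept, an assumption made here for brevity). For Lasso, scikit-learn's `Lasso` is used; its `alpha` plays the role of $\lambda$, but it scales the loss as $SSE/(2n)$, so the values are not directly comparable. The helper `ridge_closed_form` and all numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
# Only the first two features matter (made-up coefficients)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=n)

def ridge_closed_form(X, y, lam):
    """Minimize SSE/n + lam * sum(beta_j^2) analytically, without intercept."""
    n_obs, n_feat = X.shape
    return np.linalg.solve(X.T @ X + n_obs * lam * np.eye(n_feat), X.T @ y)

# Ridge: a larger lambda shrinks the size of the coefficients
beta_small = ridge_closed_form(X, y, lam=0.01)
beta_large = ridge_closed_form(X, y, lam=10.0)

# Lasso: no closed form, solved iteratively; a high penalty sets some
# coefficients exactly to zero
lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))
```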

# Module 12 - Model selection and cross-validation

From [Wikipedia](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) 2019:

- model **bias**: _an error from erroneous assumptions in the learning algorithm_
  - high bias can cause an algorithm to miss the relevant relations between features and target outputs (**underfitting**)
   

- model **variance**: _an error from sensitivity to small fluctuations in the training set_
  -  high variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (**overfitting**).
 

Overfitting is: low bias / high variance

- in training, our model captures all patterns, but also some irrelevant ones
- reacts too much to training sample errors 
    - some errors are just noise, and thus we find too many spurious relations 
- examples of causes: 
    - too much polynomial expansion of variables (`PolynomialFeatures`)
    - non-linear models without properly tuned hyperparameters: 
        - Decision Trees, Support Vector Machines or Neural Networks

Underfitting is: high bias / low variance
- the model is oversimplified and cannot approximate all the patterns in the data
- examples of causes: 
    - linear and logistic regression (without polynomial expansion)

**The hold-out method, extended (cross-validation):**
We reuse the data in the development set repeatedly:
- We test on all the data.
- We rotate which parts of the data are used for test and train.

**Leave-one-out CV (LOOCV)**
The most robust approach:
- For each single observation in the training data, we use the remaining data to train and test on that observation.
- The number of models equals the number of observations.
- Very computationally intensive, so it does not scale!

**K-fold method**
We split the sample into $K$ evenly sized test bins.
- For each test bin $k$ we use the remaining data for training.

Advantages:
- We use all our data for testing.
- Training is done with 100-(100/K) pct. of the data, i.e. 90 pct. for K=10.

We compute MSE for every lambda and every fold (nested for loop)
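
A minimal sketch of that nested loop, assuming scikit-learn's `KFold` and a `Ridge` model (whose `alpha` plays the role of $\lambda$). The data, the grid of lambdas and $K=10$ are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                          # made-up data
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

lambdas = np.logspace(-3, 3, 7)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# One row per lambda, one column per fold
mse = np.zeros((len(lambdas), kfold.get_n_splits()))
for i, lam in enumerate(lambdas):
    for k, (train_idx, test_idx) in enumerate(kfold.split(X)):
        # Train on the other K-1 bins, test on bin k
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        mse[i, k] = mean_squared_error(y[test_idx], y_pred)

mean_mse = mse.mean(axis=1)              # average over folds for each lambda
best_lambda = lambdas[mean_mse.argmin()]
```

With $K=10$ and 100 observations, each model is trained on 90 observations and tested on the remaining 10.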

- Getting more evaluations of our model performance.
- We can cross validate at two levels:
    - Outer: we make multiple splits of test and train/dev.
    - Inner: within each train/dev. dataset we make cross validation to choose hyperparameters
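
The two levels can be sketched with scikit-learn by wrapping a `GridSearchCV` (inner level) in `cross_val_score` (outer level). This is one possible setup, not necessarily the one used in the course; the model, the grid and the number of folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))                          # made-up data
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=120)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# Inner: within each train/dev. dataset, cross validate to choose the
# hyperparameter (lambda is called alpha in scikit-learn)
search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": np.logspace(-3, 1, 9)},
    scoring="neg_mean_squared_error",
    cv=inner_cv,
)

# Outer: multiple splits of test and train/dev.; the search is refit
# from scratch on the train/dev. part of every outer split
outer_scores = cross_val_score(
    search, X, y, scoring="neg_mean_squared_error", cv=outer_cv)
outer_mse = -outer_scores
```

Because the hyperparameter is chosen without ever seeing the outer test bin, `outer_mse` gives several evaluations of the whole procedure rather than of one fixed lambda.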