# Supervised machine learning
## A recap and overview

# Agenda
- Whether or not to use machine learning
- How to apply machine learning
    - Determine model framework 
    - Implementation details

# Whether to use ML

## When to use a model
*When should I model data?*

Look at your research question. You should use a model when:

1. you are trying to explain or predict a certain variable, and

2. a conclusion cannot be made without a model.
      

Great analyses can be made without models, e.g. the graph of inequality in the 20th century by Piketty and Saez:
      
<center><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/2008_Top1percentUSA.png/450px-2008_Top1percentUSA.png' alt="Drawing" style="width: 500px;"/></center>

  

## When to use machine learning
*Machine learning is powerful, when should I use it?*

What is the goal?


- You want to make a formal test about model parameters.
    - Use econometrics/OLS.
- You want a model with good predictive performance.
    - Use machine learning.
    - You can still investigate the partial effect of a variable by visualizing it with:
        - individual conditional expectation (ICE) plots.
        - partial dependence plots (PDP).
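
A minimal sketch of how ICE and PDP curves can be computed, assuming scikit-learn's `partial_dependence` and a random forest fitted on synthetic data (the data and the model choice are illustrative assumptions, not part of the course material). The PDP is the average of the ICE curves.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

# Synthetic data (illustrative assumption): y depends nonlinearly on feature 0.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 3))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# kind="both" returns one ICE curve per observation and their average (the PDP).
pd_result = partial_dependence(
    model, X, features=[0], kind="both", grid_resolution=20
)

ice_curves = pd_result["individual"][0]  # shape: (n_observations, n_grid_points)
pdp_curve = pd_result["average"][0]      # shape: (n_grid_points,)

print(ice_curves.shape, pdp_curve.shape)
```

For plotting, `sklearn.inspection.PartialDependenceDisplay.from_estimator` accepts the same `kind` argument, given that matplotlib is installed.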


## What to use machine learning for
*What analysis can you do with machine learning?*

Ask questions like:

- Which model performs best (with or without regularization)?
    - Which level of regularization?
- Does this new group of features help us predict our target?
- Is your model biased/fair? 
    - E.g. you have not explicitly included information about minority groups, but you can see that minority groups are on average treated differently.

# Determine model framework

## Step A: What problem

What kind of problem am I working on? My target is:
- Continuous:
  - We want to use a regression model.
  - We aim for the model with the lowest mean squared error (MSE).
- Categorical / finite integers:
  - We want to use a classification model.
  - We aim for the model with the highest accuracy (ACC).
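
As a small illustration of the two performance measures, here is a sketch using `sklearn.metrics` on made-up predictions (the numbers are purely illustrative):

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: continuous target, lower MSE is better.
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.5, 5.0, 4.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # (0.25 + 0 + 4) / 3

# Classification: categorical target, higher accuracy is better.
y_true_clf = [0, 1, 1, 0]
y_pred_clf = [0, 1, 0, 0]
acc = accuracy_score(y_true_clf, y_pred_clf)  # 3 of 4 labels correct

print(f"MSE: {mse:.3f}, ACC: {acc:.2f}")  # MSE: 1.417, ACC: 0.75
```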


## Step B: Which model
Depending on the problem we pick a specific model. If regression, pick:
- A linear model (least squares, lasso, ridge)

If classification use:
- Logistic regression (w/ or w/o regularization)

Note: You are welcome to try out other, more complicated models. However:
- You are not given any points for this.
- Be sure to explain how the model works in your project.


## Step C: Determine hyperparameters
What hyperparameters exist for the model I have chosen?
  - Ridge/Lasso: $\lambda$;
  - Elastic net: $\lambda_1$, $\lambda_2$;
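  - Logistic regression with L1 or L2 regularization: the regularization strength (in scikit-learn set through `C`, the inverse of the regularization strength).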
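
A sketch of where these hyperparameters live, assuming scikit-learn's linear models. Note that scikit-learn names them differently from the slides: $\lambda$ is called `alpha`, and the elastic net is parameterized by an overall strength `alpha` and a mixing weight `l1_ratio` rather than by $\lambda_1$, $\lambda_2$ directly. The values below are arbitrary placeholders.

```python
from sklearn.linear_model import ElasticNet, Lasso, LogisticRegression, Ridge

# Ridge/Lasso: the penalty strength (lambda in the slides) is `alpha`.
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)

# Elastic net: overall strength `alpha` plus mixing weight `l1_ratio`
# (l1_ratio=1 is pure L1, l1_ratio=0 is pure L2).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)

# Logistic regression: `C` is the INVERSE of the regularization strength,
# so a smaller C means stronger regularization (L2 penalty by default).
logit = LogisticRegression(C=1.0)

for model in (ridge, lasso, enet, logit):
    params = model.get_params()
    shown = {k: params[k] for k in ("alpha", "l1_ratio", "C") if k in params}
    print(type(model).__name__, shown)
```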


# Applying supervised machine learning




## Step 1: data split

Split into test and development (train) data.
- A normal split is 30 pct. for test and 70 pct. for train if you have ~ observations.
- If you have more than 10,000 observations then use 20 pct. for test, 80 pct. for train.

Polynomial transformation of features:
  - This step is optional; it only makes sense for linear and logistic models (e.g. lasso).
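
The split and the optional polynomial expansion could look like the sketch below, assuming scikit-learn and synthetic data (the variable names `X_dev`, `X_test` etc. are illustrative choices). The expansion is fitted on the development data only and then applied to the test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (illustrative assumption): 1,000 observations, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)

# Hold out 30 pct. as test data; the remaining 70 pct. is development data.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Optional: degree-2 polynomial expansion (x1, x2, x1^2, x1*x2, x2^2).
poly = PolynomialFeatures(degree=2, include_bias=False)
X_dev_poly = poly.fit_transform(X_dev)  # fit on development data only
X_test_poly = poly.transform(X_test)    # reuse the fitted transformation

print(X_dev.shape, X_test.shape, X_dev_poly.shape)
```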


## Step 2: model pipeline
Construct a model building pipeline.
- Preprocessing phase
  - Preprocessing: polynomial expansion, variable scaling (optional)
- Supervised model (classification or regression)

Note: Using `make_pipeline` is optional. We recommend it, as it leads to fewer mistakes and less code.
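
A minimal pipeline sketch, assuming a lasso regression with polynomial expansion and scaling as preprocessing (the synthetic data and the value of `alpha` are placeholders):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

# Preprocessing steps followed by the supervised model.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Lasso(alpha=0.1),
)

# fit() fits every step in order; predict() applies the fitted
# transformations before predicting with the final model.
pipe.fit(X, y)
print(list(pipe.named_steps))
print(pipe.predict(X[:3]))
```

Because the pipeline behaves like a single estimator, the preprocessing is refitted automatically whenever the pipeline is fitted on a new subset of the data.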


## Step 3: model selection (1)

Main idea:
- We want to select the optimal model.
- We measure model performance with out-of-sample prediction on validation data.

Implementation:
- Pick the model which performed the best on the validation data during cross validation.

## Step 3: model selection (2)

Cross validation (CV):
- We only use the training/development data.
- We use 10-fold CV and split this data into 10 evenly sized validation bins.
- For each validation bin:
  - We fit our model on the data outside the validation bin, i.e. on the remaining 9 bins.
  - We transform and predict the target in the validation bin using our model.
  - Note: we must perform our whole model building process, including fitting the transformations, in each fold.

Finally, we compute the mean performance across the 10 validation bins for each hyperparameter combination we are testing and pick the combination that maximizes out-of-sample performance.
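
The procedure above can be sketched as an explicit loop, assuming a lasso pipeline, MSE as the performance measure (so we minimize rather than maximize) and a hypothetical grid of $\lambda$ values on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic development data (illustrative assumption).
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(300, 2))
y_dev = X_dev[:, 0] - 2 * X_dev[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

lambdas = np.logspace(-4, 1, 6)  # hypothetical grid of candidate values
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
mean_mse = {}

for lam in lambdas:
    fold_mse = []
    for train_idx, val_idx in kfold.split(X_dev):
        # Rebuild and refit the WHOLE pipeline inside each fold.
        pipe = make_pipeline(
            PolynomialFeatures(degree=2, include_bias=False),
            StandardScaler(),
            Lasso(alpha=lam, max_iter=10_000),
        )
        pipe.fit(X_dev[train_idx], y_dev[train_idx])
        pred = pipe.predict(X_dev[val_idx])
        fold_mse.append(mean_squared_error(y_dev[val_idx], pred))
    # Mean validation MSE across the 10 bins for this lambda.
    mean_mse[lam] = np.mean(fold_mse)

best_lambda = min(mean_mse, key=mean_mse.get)
print(f"best lambda: {best_lambda:g}")
```

In practice `GridSearchCV` wraps the same loop; writing it out once makes it clear that nothing is fitted on the validation bin.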


## Step 4: check list

- Check that we have NO data leakage
    - I.e. NEVER fit any step of our model building process on validation/test data.
- Ensure that the model has converged (do not suppress warnings).
- If the data is a time series: am I using a static model and a split that ignores the time ordering?
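
For the time series point, one option is a split that respects time ordering. A sketch assuming scikit-learn's `TimeSeriesSplit` and 20 observations ordered in time (both are illustrative choices, not a course requirement):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 observations ordered in time (illustrative assumption).
X = np.arange(20).reshape(-1, 1)

# Each validation bin lies strictly after the data used for fitting.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print(
        f"train: 0..{train_idx.max()}, "
        f"validate: {val_idx.min()}..{val_idx.max()}"
    )
```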

## Step 5: final model training and evaluation

We train the model with the optimal hyperparameters on ALL the training/development data.

Evaluate the model out-of-sample on the test set.
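
A sketch of this final step, assuming `GridSearchCV` is used for model selection. With its default `refit=True` it retrains the best hyperparameter combination on all the development data, so only the test evaluation remains. The data and the grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Lasso(max_iter=10_000),
)
search = GridSearchCV(
    pipe,
    param_grid={"lasso__alpha": np.logspace(-4, 1, 6)},
    scoring="neg_mean_squared_error",
    cv=10,
)
# Model selection on development data only; the best model is then
# refitted on ALL development data (refit=True is the default).
search.fit(X_dev, y_dev)

# The test set is used once, for the final out-of-sample evaluation.
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(search.best_params_, f"test MSE: {test_mse:.3f}")
```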


## A note on analysis with ML models

Are you interested in testing whether certain models or features are predictive? 
- You can use nested cross validation to help you gauge the uncertainty of estimates.
    - This is formalized in the advanced course Social Data Science - Econometrics and Machine Learning, see [here](https://github.com/abjer/sds_eml_2020/blob/master/material/session_1/lecture1.ipynb).
- Or simply use cross validation at the outer level but keep the model fixed
    - (i.e. hyperparameters are constant).
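
Both options can be sketched with `cross_val_score`, assuming a lasso pipeline on synthetic data (the grids, fold counts and the fixed `alpha` are placeholders). In the nested version the hyperparameter search runs inside each outer fold; in the fixed version the same model is evaluated in every outer fold.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

inner = KFold(n_splits=5, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=1)

# Nested CV: the inner search is rerun within every outer fold.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
search = GridSearchCV(
    pipe,
    {"lasso__alpha": np.logspace(-3, 1, 5)},
    scoring="neg_mean_squared_error",
    cv=inner,
)
nested_mse = -cross_val_score(
    search, X, y, scoring="neg_mean_squared_error", cv=outer
)

# Simpler alternative: outer CV with the hyperparameters held constant.
fixed = make_pipeline(StandardScaler(), Lasso(alpha=0.01, max_iter=10_000))
fixed_mse = -cross_val_score(
    fixed, X, y, scoring="neg_mean_squared_error", cv=outer
)

# The spread across outer folds gives a rough sense of the uncertainty.
print(f"nested: {nested_mse.mean():.3f} (sd {nested_mse.std():.3f})")
print(f"fixed:  {fixed_mse.mean():.3f} (sd {fixed_mse.std():.3f})")
```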