## Recap


So far, we've described the general problem of supervised learning, where we have data $x$ and a label $y$, and we want to learn a model $f(x) \to y$.

If $y$ is a continuous variable, this is known as regression, and if $y$ is discrete, the problem is known as classification.

In addition, we've seen that the general structure of these problems is to define the form of the model with some parameters, and then to estimate the best values of those parameters given a loss function.

So if we have formulated a supervised learning problem, and we have the data in a *tidy* format (1 observation per row, 1 variable per column), we can begin to fit the model.

For example, let's say that we have the following ridge regression example:

$$ \hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

Our goal here is to estimate the *parameters* of the model, which are the $\beta$ values. We have seen that we can fit these types of models using gradient descent, although there are other algorithms which exist to solve these types of problems. 
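As a minimal sketch of the ridge objective above, the snippet below fits the $\beta$ values on synthetic data two ways: with the closed-form solution and with plain gradient descent. The data, the value of `lam`, the step size, and the number of iterations are all assumptions chosen for illustration. The intercept $\beta_0$ is left unpenalized by centering the data first.

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = 1.0 + X @ true_beta + rng.normal(scale=0.1, size=n)

lam = 1.0  # the hyperparameter lambda, fixed before fitting

# Center X and y so that the intercept is not penalized
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Closed form: beta = (Xc^T Xc + lam * I)^{-1} Xc^T yc
beta_closed = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Gradient descent on the same objective:
#   sum_i (yc_i - xc_i^T beta)^2 + lam * sum_j beta_j^2
beta_gd = np.zeros(p)
# Step size 1/L, where L = 2 * (largest singular value^2 + lam)
step = 1.0 / (2 * (np.linalg.norm(Xc, 2) ** 2 + lam))
for _ in range(500):
    grad = -2 * Xc.T @ (yc - Xc @ beta_gd) + 2 * lam * beta_gd
    beta_gd -= step * grad

# Recover the intercept from the centered fit
beta0 = y_mean - X_mean @ beta_closed
```

Both routes should agree up to numerical tolerance, which is a useful sanity check when writing an optimizer by hand.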

However, even once the model and loss function are specified, several practical questions remain. Namely:
  * How do you estimate your test error? 
  * How do you structure your training and testing splits? 
  * How do you pick a value for $\lambda$?

### Hyperparameters

A **hyperparameter** is a parameter that is set before any model fitting actually occurs. In the ridge regression and LASSO settings, the regularization parameter $\lambda$ is a hyperparameter.

This means, *prior* to using gradient descent or some other method for estimating the $\beta$ coefficients, we must first choose a value of $\lambda$. In practice, hyperparameters do not even have to be explicit parameters in the model. For example, we can think of a hyperparameter as being whether or not we choose to standardize the covariates before starting the fitting procedure. 


### 2 Simultaneous problems

1. Estimate $\beta$ coefficients.
2. Select a good value of $\lambda$.

## Train/Test Splits

The most common approach in machine learning to estimating how well your model works is to use the Training and Test set procedure.

Often, this is done the following way:

  1. Randomly select a percentage (usually ~80%) of your dataset, and use those rows for training
  2. Evaluate the model on the remaining ~20% of the data to test model performance (accuracy, MSE, etc.)
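The two steps above can be sketched with scikit-learn's `train_test_split`. The synthetic data and the choice of `LinearRegression` are assumptions for illustration; any estimator could stand in its place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Step 1: randomly hold out 20% of the rows; train on the other 80%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Step 2: evaluate on the held-out rows only
test_mse = mean_squared_error(y_test, model.predict(X_test))
```

Fixing `random_state` makes the split reproducible; changing it gives a different split and, in general, a somewhat different error estimate.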


#### Pros:
 * easy to implement

#### Cons:
 * a single random split may place unrepresentative examples in the training or test set, so the error estimate depends on which split you happened to draw
 * only part of the data is used to fit the model, so depending on the size of your data, the test error can be overestimated
 
#### Practical Considerations:
 * What percentage of data to choose?
 * How to assess differences in training vs. test?

## Cross-Validation

Cross-Validation is a technique that addresses some of the downsides of training/test splits. 

Instead of splitting the data once, cross-validation splits the data many times and averages the error over the splits.

## k-Fold

The most common form of cross-validation is known as *k-fold* cross-validation. In *k-fold* cross-validation, the data is split into $k$ roughly equal-sized groups, where $k$ is usually 5 or 10. The general idea is to fit the model on $k-1$ groups, and then test on the remaining group. You then cycle through the groups so that each one serves as the test group exactly once, and average the error over the groups (folds).

The Cross-Validated error is estimated by:

$$ CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i $$

![](./assets/kfold.png)
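A minimal sketch of the $CV_{(k)}$ computation, written out as an explicit loop over folds with scikit-learn's `KFold`. The synthetic data and the `Ridge` estimator with `alpha=1.0` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.0, -1.5, 0.5]) + rng.normal(scale=0.5, size=200)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_mse = []
for train_idx, test_idx in kf.split(X):
    # Fit on k-1 folds, evaluate on the held-out fold
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# CV_(k): the average of the per-fold MSEs
cv_error = np.mean(fold_mse)
```

In practice, `sklearn.model_selection.cross_val_score` wraps this loop; writing it out once makes the correspondence with the formula explicit.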

## LOOCV

In the extreme case, where the number of folds is equal to the number of data points, this process is known as *leave-one-out* cross-validation. The scheme then looks like this:

![](./assets/loocv.png)
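As a rough sketch, LOOCV can be run in scikit-learn by passing `LeaveOneOut()` as the `cv` argument of `cross_val_score`. The synthetic data and the `LinearRegression` estimator are assumptions for illustration. Note that one model is fit per data point, which is why the cost grows quickly with dataset size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption for illustration only)
rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.3, size=n)

# One fit per observation: each split holds out exactly one row
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",  # scikit-learn reports negated MSE
)
loocv_mse = -scores.mean()
```

Because there is no random shuffling involved, rerunning this on the same data gives the same estimate.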

#### Pros:
 * Use almost all the data to train the model
 * Repeated iterations will always result in the same answer
 
#### Cons:
 * Computationally very expensive

There is also a bias-variance tradeoff! Although LOOCV has lower bias, it can have a much higher variance. This is because the fitted models are highly correlated with one another (their training sets differ by only one data point), and the average of highly correlated quantities generally has higher variance than the average of less correlated ones.

[Explanation](https://stats.stackexchange.com/a/223461)

# What about hyperparameters?

If we used a training/test split and wanted to vary $\lambda$ in a LASSO setting, one procedure would be to try different values of $\lambda$, fitting each model on the training set and evaluating it on the test set.

## Overfitting

The problem with this method is that you may overfit to your test set. The more times you evaluate a model on the same test set, the higher the chance that you obtain a good result by chance. This is analogous to the multiple comparison problem in hypothesis testing.
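The effect can be illustrated with a small simulation, sketched here under artificial assumptions: the labels are pure coin flips, and each of the 200 "models" simply guesses at random, so every model's true accuracy is 50%. Picking the model with the best score on one small shared test set still produces an optimistic number.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_models = 50, 200

# Labels are coin flips, so no model can truly beat 50% accuracy
y_test = rng.integers(0, 2, size=n_test)

# Each row is one "model" making random guesses on the same test set
test_preds = rng.integers(0, 2, size=(n_models, n_test))
test_acc = (test_preds == y_test).mean(axis=1)

# Selecting the best model by test accuracy is what overfits the test set
best_test_acc = test_acc.max()

# The same kind of random guessing on a large fresh sample
y_fresh = rng.integers(0, 2, size=10_000)
fresh_preds = rng.integers(0, 2, size=y_fresh.size)
fresh_acc = (fresh_preds == y_fresh).mean()
```

Comparing `best_test_acc` with `fresh_acc` shows the gap between the selected score and what the selected model would achieve on new data.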

Case study: [Kaggle](http://gregpark.io/blog/Kaggle-Psychopathy-Postmortem/)

## Validation Set

In this case, if there is enough data, one strategy is to use a validation set in addition to a test set.

In this procedure, we have 3 splits: a training split, validation split, and test split. 

The general procedure is as follows:

 1. Use the training split and evaluate different values of hyperparameters on the validation split. 
 2. Whatever the best values of the hyperparameters are, select those and train on the combined training and validation split. 
 3. Evaluate the performance **once** on the test set
 
In practice, steps 1 and 2 are usually done in a cross-validation setting. Hyperparameter selection is therefore a second purpose of cross-validation, in addition to estimating test error.
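The three-split procedure can be sketched as follows, using a single validation split rather than cross-validation. The synthetic data, the 60/20/20 proportions, and the candidate grid are assumptions for illustration. Note that scikit-learn calls the regularization strength `alpha` rather than $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
beta = np.array([2.0, -1.0, 0.5] + [0.0] * 7)
y = X @ beta + rng.normal(scale=0.5, size=600)

# Hold out the test set first, then carve a validation set out of the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Step 1: fit on the training split, score each candidate on validation
alphas = [0.001, 0.01, 0.1, 1.0]  # hypothetical candidate grid
val_mse = {}
for a in alphas:
    model = Lasso(alpha=a).fit(X_train, y_train)
    val_mse[a] = mean_squared_error(y_val, model.predict(X_val))
best_alpha = min(val_mse, key=val_mse.get)

# Step 2: refit on training + validation with the selected value
final_model = Lasso(alpha=best_alpha).fit(X_trainval, y_trainval)

# Step 3: touch the test set exactly once
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
```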

## Healthcare considerations

In healthcare, there are other considerations that may affect how you decide to split the data.

#### Generalization over time

Oftentimes, if we train a model on data from a particular time period, we want to evaluate how well it works in a different time period. Research has shown that the performance of models trained on one time period often deteriorates over time, either due to shifting populations or other dynamic factors.

In this setting, it may be useful to consider training and test splits that are separated in time. For example, train on 12 months of data and assess performance on the subsequent 12 months.
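A minimal sketch of a time-based split with pandas. The column names `admit_date` and `x` and the cutoff date are hypothetical; the key point is that rows are assigned by date rather than at random, so the test period lies strictly after the training period.

```python
import numpy as np
import pandas as pd

# Hypothetical daily records spanning two years
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", "2021-12-31", freq="D")
df = pd.DataFrame({"admit_date": dates, "x": rng.normal(size=len(dates))})

# Train on the first 12 months, test on the subsequent 12 months
cutoff = pd.Timestamp("2021-01-01")
train = df[df["admit_date"] < cutoff]
test = df[df["admit_date"] >= cutoff]
```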

#### Generalization by site

Models trained on one population often do not generalize to other populations. It is critical to validate a model on other patient populations, which is a standard in clinical care. In these settings, it may make sense to have a training set of one patient population and then test on another population.

This is also an area of active research.
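One way to approximate site-level validation within a single dataset is to hold out one site at a time. The sketch below uses scikit-learn's `LeaveOneGroupOut`; the site labels and data are synthetic assumptions, and in this toy example all sites share the same underlying relationship, so it only demonstrates the mechanics of the split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic data with a hypothetical site label per row
rng = np.random.default_rng(0)
n = 300
site = rng.choice(["site_A", "site_B", "site_C"], size=n)
X = rng.normal(size=(n, 4))
y = X @ np.array([1.0, 0.5, -1.0, 0.0]) + rng.normal(scale=0.3, size=n)

# Each split trains on all sites but one and tests on the held-out site
site_mse = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=site):
    held_out = str(site[test_idx][0])
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    site_mse[held_out] = mean_squared_error(
        y[test_idx], model.predict(X[test_idx])
    )
```

A large spread across the entries of `site_mse` would suggest that the model does not transfer equally well to every site.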

## Deployment considerations

Deploying models into production has a whole host of issues associated with it. The most important consideration is that the data the model sees at serving time is similar to the data it was trained on.

In addition, the data that is available at serving time should determine the features that you use in your model. If a piece of data will only be available after the model needs to be run, it does not make sense to include it in the model.

We will cover common problems with model deployment in a future lecture.

## scikit-learn, LASSO, and Cross-Validation

## In-class Exercise

The following code builds a design/model matrix with the diabetes data

In [None]:
import pandas as pd
diabetes_df = pd.read_csv('./data/diabetes_df.csv')

In [None]:
model_subset = diabetes_df.loc[:, ['encounter_id', 'race', 'gender', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']]

In [None]:
id_subset = diabetes_df.loc[:, ['encounter_id', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id']]

In [None]:
ccs_subset = diabetes_df.loc[:, ['encounter_id', 'CCS Category Description 1', 'CCS Category Description 2', 'CCS Category Description 3']]

In [None]:
model_subset = pd.get_dummies(model_subset, prefix = "ind_", dummy_na = True, drop_first = True)

In [None]:
id_subset = pd.get_dummies(id_subset, columns = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id'], prefix = {x: x for x in id_subset.columns if x != 'encounter_id'}, dummy_na = True, drop_first = True)

In [None]:
ccs_subset = pd.get_dummies(ccs_subset, prefix = "ind_", dummy_na = True, drop_first = True)

#### Combine repeated columns

In [None]:
import numpy as np

In [None]:
ccs_subset = ccs_subset.groupby(ccs_subset.columns, axis = 1).sum()

#### Join all dataframes together

In [None]:
outcome = diabetes_df.loc[:, 'time_in_hospital']

In [None]:
model_dataset = (model_subset.merge(id_subset, how = "left", on = "encounter_id")
                             .merge(ccs_subset, how = "left", on = "encounter_id")
                )

In [None]:
model_matrix = model_dataset.drop('encounter_id', axis = 1).values

In [None]:
model_matrix

In [None]:
model_matrix.shape

In [None]:
outcome.shape

## Fit a LASSO using cross-validation

In [None]:
from sklearn import linear_model, model_selection

The goal here will be to fit a LASSO linear regression. Specifically, we want to find the value of $\lambda$ that minimizes the cross-validated estimate of the test MSE.

$$ \hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2 $$

### Split the data into a training and test split

Make sure to randomly select rows. You can use scikit-learn's helper functions or write your own. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

### Use the cross-validated LASSO in scikit-learn to fit the model to the data

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html

## Extract the best value for $\lambda$

Use this value to fit the final model on the full training set, and then evaluate its performance on the test set.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

### How many non-zero coefficients are in the final model?

Recall that the LASSO penalty will shrink some coefficients to exactly 0.
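For reference, here is a sketch of the overall workflow on synthetic data rather than the diabetes data, so it does not replace the exercise. The data, the 80/20 split, and `cv=5` are assumptions. One detail worth knowing: scikit-learn names the penalty `alpha`, and its LASSO objective scales the squared-error term by $\frac{1}{2n}$, so `alpha` is not numerically identical to the $\lambda$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data where only 3 of 20 features matter (an assumption)
rng = np.random.default_rng(0)
n, p = 400, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + rng.normal(scale=1.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validate over a path of penalty values on the training set only
cv_model = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
best_alpha = cv_model.alpha_

# Refit with the selected penalty and evaluate once on the test set
final_model = Lasso(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))

# Coefficients shrunk exactly to zero are dropped from the model
n_nonzero = np.count_nonzero(final_model.coef_)
```

`LassoCV` already refits on the full training set with the selected penalty, so the explicit `Lasso` refit is shown only to mirror the steps of the exercise.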