> git clone https://github.com/dsahduke/training_test_splits.git

Download dataset: https://duke.box.com/s/uvjhg4i6umrytkvzadusfbn9w4cg3znm

In [None]:
import pandas as pd

In [None]:
model_matrix = pd.read_csv("./data/model_matrix.csv")

In [None]:
model_matrix.head()

## Recap


So far, we've described the general problem of supervised learning, where we have data $x$ and label $y$ and we want to model $f(x) \to y$.

If $y$ is a continuous variable, this is known as regression, and if $y$ is discrete, then the problem is known as classification.

In addition, we've seen that the general structure of these problems is to define the form of the model with some parameters, and then to estimate the best values of those parameters given a loss function.

So if we have formulated a supervised learning problem, and we have the data in a *tidy* format (1 observation per row, 1 variable per column), we can begin to fit the model.

For example, let's say that we have the following ridge regression example:

$$ \hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

Our goal here is to estimate the *parameters* of the model, which are the $\beta$ values. We have seen that we can fit these types of models using gradient descent, although there are other algorithms which exist to solve these types of problems. 
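As a rough illustration of the objective above, here is a minimal sketch that solves ridge regression in closed form rather than by gradient descent. The synthetic data, the `fit_ridge` helper, and the chosen $\lambda$ values are all assumptions made for the example, not part of the course materials.

```python
import numpy as np

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = 3.0 + X @ true_beta + rng.normal(scale=0.5, size=n)


def fit_ridge(X, y, lam):
    """Return (beta_0, beta) minimizing the ridge objective for a given lambda."""
    # Centering X and y takes care of the unpenalized intercept beta_0.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve the penalized normal equations: (Xc^T Xc + lam * I) beta = Xc^T yc
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    beta_0 = y_mean - x_mean @ beta
    return beta_0, beta


# Larger lambda values shrink the coefficients toward zero.
for lam in [0.0, 10.0, 1000.0]:
    beta_0, beta = fit_ridge(X, y, lam)
    print(f"lambda={lam:>7}: intercept={beta_0:.3f}, beta={np.round(beta, 3)}")
```

Note that the value of `lam` has to be supplied before the coefficients can be computed, which is what motivates the questions that follow.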

However, even once the model, the loss, and a fitting algorithm are in place, several practical questions remain. Namely:
  * How do you estimate your test error? 
  * How do you structure your training and testing splits? 
  * How do you pick a value for $\lambda$?

### Hyperparameters

A **hyperparameter** is a parameter that is set before any model-fitting actually occurs. In the ridge regression and LASSO settings, the regularization parameter $\lambda$ is a hyperparameter.

This means, *prior* to using gradient descent or some other method for estimating the $\beta$ coefficients, we must first choose a value of $\lambda$. In practice, hyperparameters do not even have to be explicit parameters in the model. For example, we can think of a hyperparameter as being whether or not we choose to standardize the covariates before starting the fitting procedure. 


### 2 Simultaneous problems

1. Estimate $\beta$ coefficients.
2. Select a good value of $\lambda$.

# Goal 

Stated generally, the goal of splitting your data into training/test splits or doing cross-validation, etc. is:

 * Finding the right hyperparameter settings for your models
 * Estimating your overall performance (out of sample)

All of the different variations of splitting your data are an attempt to accomplish these two goals. However, depending on computational constraints and sample size, the way you get there may differ.

## Train/Test Splits

The most common approach in machine learning to estimating how well your model works is to use the Training and Test set procedure.

Often, this is done the following way:

  1. Randomly select a percentage (usually ~80%) of your dataset, and use those rows for training
  2. Evaluate the model on the remaining ~20% of the data to test model performance (accuracy, MSE, etc.)
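
The two steps above can be sketched as follows. This is only an illustration on synthetic data (an assumption), and it uses sklearn's `train_test_split` for brevity; the in-class exercise later asks you to build the split by hand instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=500)

# Hold out 20% of the rows at random; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows only, then score on both splits.
model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```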


#### Pros:
 * easy to implement

#### Cons:
 * the specific examples that land in the training set and testing set may be unrepresentative by chance, so the estimate depends on the particular split
 * only some of the data is used to fit the model, which, depending on the size of your data, can overestimate the error
 
#### Practical Considerations:
 * What percentage of data to choose?
 * How to assess differences in training vs. test?

## Cross-Validation

Cross-Validation is a technique that addresses some of the downsides of training/test splits. 

Instead of splitting the data once, cross-validation splits the data many times and averages the error over the splits.

## k-Fold

The most common form of cross-validation is known as *k-fold* cross-validation. In *k-fold* cross-validation, the data is split into $k$ evenly-sized groups, where $k$ is usually 5 or 10. The general idea is to fit the model on $k-1$ groups, and then test on the remaining group. Then, you cycle through the groups so that each one serves as the test group exactly once, and average the error over the groups (folds).

The Cross-Validated error is estimated by:

$$ CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i $$

![](./assets/kfold.png)
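
A minimal sketch of the k-fold procedure and the $CV_{(k)}$ average, using sklearn's `KFold`. The synthetic data and the choice of a ridge model with `alpha=1.0` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, 0.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Shuffle once, then cut the rows into k = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, test_idx in kf.split(X):
    # Fit on k-1 folds, evaluate on the held-out fold.
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# CV_(k): the average of the k per-fold MSE values.
cv_error = np.mean(fold_mse)
print(f"per-fold MSE: {np.round(fold_mse, 3)}, CV error: {cv_error:.3f}")
```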

## LOOCV

In the extreme case, where the number of folds is equal to the number of data points, this process is known as *leave-one-out* cross-validation. The scheme then looks like this:

![](./assets/loocv.png)

#### Pros:
 * Use almost all the data to train the model
 * Repeated runs will always give the same answer, since there is no randomness in how the data is split
 
#### Cons:
 * Computationally very expensive

There is also a bias-variance tradeoff here. Although LOOCV has lower bias, it has a much higher variance. This is because the fitted models are highly correlated with one another (their training sets differ by only 1 data point), and the average of highly correlated quantities generally has higher variance than the average of less correlated ones.

[Explanation](https://stats.stackexchange.com/a/223461)
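
For a concrete picture of LOOCV, here is a small sketch using sklearn's `LeaveOneOut`. The data is synthetic and deliberately tiny (an assumption), since one model is fit per data point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

# A small synthetic dataset (an assumption): LOOCV fits n models.
rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=n)

squared_errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Train on n-1 points and predict the single held-out point.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    prediction = model.predict(X[test_idx])[0]
    squared_errors.append((y[test_idx][0] - prediction) ** 2)

loocv_mse = np.mean(squared_errors)
print(f"models fit: {len(squared_errors)}, LOOCV MSE: {loocv_mse:.3f}")
```

Because the splits are fully determined by the data, rerunning the loop gives the same `loocv_mse` every time.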

# What about hyperparameters?

If we used a training/test split, and we wanted to vary $\lambda$ in a LASSO setting, one procedure could be to fit models with different values of $\lambda$ on the training set and evaluate each of them on the test set.

## Overfitting

The problem with this method is that you may overfit to your test set. The more times you evaluate a model on the same test set, the higher the chance that you obtain a good result by chance. This is analogous to the multiple comparison problem in hypothesis testing.
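
A small simulation can make this concrete. In the sketch below (entirely synthetic, and an assumption for illustration), every "model" guesses labels at random, so its true accuracy is 50%. Picking the best of many such models on one fixed test set still reports a score well above chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_models = 100, 1000

# Random binary labels for a fixed test set.
y_test = rng.integers(0, 2, size=n_test)

# Each row is one "model" that guesses at random (true accuracy: 50%).
guesses = rng.integers(0, 2, size=(n_models, n_test))
accuracies = (guesses == y_test).mean(axis=1)

print(f"average accuracy over all models: {accuracies.mean():.3f}")
print(f"best accuracy on this test set:   {accuracies.max():.3f}")
```

The gap between the average and the best score is purely an artifact of selecting on the test set.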

Case study: [Kaggle](http://gregpark.io/blog/Kaggle-Psychopathy-Postmortem/)

## Validation Set

In this case, if there is enough data, one strategy would be to use a validation set in addition to a testing set.

In this procedure, we have 3 splits: a training split, validation split, and test split. 

The general procedure is as follows:

 1. Use the training split and evaluate different values of hyperparameters on the validation split. 
 2. Whatever the best values of the hyperparameters are, select those and train on the combined training and validation split. 
 3. Evaluate the performance **once** on the test set
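
The three steps above can be sketched as follows. To avoid giving away the exercise, this example uses ridge regression on synthetic data with a 60%/20%/20% split; the data, the split sizes, and the grid of `alpha` values are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=2.0, size=600)

# Hold out 20% as the test set, then 25% of the rest as the validation set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Step 1: fit on the training split, score each hyperparameter on validation.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
valid_mse = {}
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    valid_mse[alpha] = mean_squared_error(y_valid, model.predict(X_valid))
best_alpha = min(valid_mse, key=valid_mse.get)

# Step 2: refit with the selected value on training + validation combined.
final_model = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)

# Step 3: touch the test set exactly once.
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(f"best alpha: {best_alpha}, test MSE: {test_mse:.3f}")
```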
 
In practice, steps 1 and 2 are usually done in a cross-validation setting. This is another purpose of cross-validation.

In the cross-validation setting, here is the correct procedure: 

 1. Split your data into a Training set and a Test set
 2. Split your Training set into a smaller training set and validation set
 3. Evaluate different models on the validation set and record the error
 4. Repeat steps 2-3 with different splits of smaller training and validation sets until every data point has been in the validation set once (and only once)
 5. Take the hyperparameter values which performed the best (average error/some other metric) and use them to *refit* the model on the entire larger training set
 6. Evaluate on the test set 
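
One way to run this whole procedure is sklearn's `GridSearchCV`, sketched below under the assumption of synthetic regression data and a ridge model. With `refit=True`, the search refits the best setting on the full training set (step 5) before the single test evaluation (step 6).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))
y = X @ rng.normal(size=15) + rng.normal(scale=2.0, size=500)

# Step 1: set the test set aside before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Steps 2-5: 5-fold cross-validation over the grid, then refit on X_train.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_mean_squared_error",  # sklearn maximizes scores, hence "neg"
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    refit=True,
)
search.fit(X_train, y_train)

# Step 6: evaluate the refit model once on the held-out test set.
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(f"best alpha: {search.best_params_['alpha']}, test MSE: {test_mse:.3f}")
```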
 
Deviations from this procedure can be made depending on factors such as the size of the data, computation time, the intended use of the model, etc.

## Healthcare considerations

In healthcare, there are other considerations that may play a part in deciding how to split the data.

#### Generalization over time

Oftentimes, if we train a model on data from a particular time period, we want to evaluate how well it works in a different time period. Research has shown that models trained on one time period often deteriorate over time, either due to shifting populations or other dynamic factors.

In this setting, it may be useful to consider training and test splits that occur over time. For example, train on 12 months of data and assess performance on the subsequent 12 months. Alternatively, you can use a scheme where your training splits include more and more data over time.

![](./assets/time-based-crossval.png)
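
The expanding-window scheme can be sketched with sklearn's `TimeSeriesSplit`. The 24 rows below stand in for 24 consecutive months of data, which is an assumption for illustration; the rows must already be sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# One row per month, ordered in time (hypothetical stand-in data).
X = np.arange(24).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits):
    # Training rows always come strictly before the test rows.
    print(
        f"fold {fold}: train months {train_idx[0]}-{train_idx[-1]}, "
        f"test months {test_idx[0]}-{test_idx[-1]}"
    )
```

Each successive fold trains on a longer history and tests on the period immediately after it, so the model is never evaluated on data that precedes its training window.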

#### Generalization by site

Models trained on one population often do not generalize to other populations. It is critical to validate a model on other patient populations, as is standard practice in clinical care. In these settings, it may make sense to train on one patient population and then test on another.

This is also an area of active research.
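
A site-based split can be sketched with sklearn's `LeaveOneGroupOut`, which holds out one group at a time. The data and the `site` labels below are synthetic (an assumption), standing in for something like a hospital identifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic binary outcome data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Hypothetical site label for each row (three sites, 100 rows each).
site = np.arange(n) % 3

site_auc = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=site):
    # Train on all other sites, then test on the held-out site.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    held_out = int(site[test_idx][0])
    probabilities = model.predict_proba(X[test_idx])[:, 1]
    site_auc[held_out] = roc_auc_score(y[test_idx], probabilities)

print(site_auc)
```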

## Deployment considerations

Deploying models into production has a whole host of issues associated with it. The most important consideration is that the data the model sees at serve time is similar to the data it was trained on.

In addition, the data that is available at serve time should determine the features that you use in your model. If a piece of data will only become available after the model needs to be run, it does not make sense to include it in the model.

We will cover common problems with model deployment in a future lecture.

## SKlearn, LASSO, and Cross-Validation

## In-class Exercise

In this exercise, we will fit our data using the train/test method, the train/test/validation method, and the cross-validation with test method to compare our results. We will be using the model dataset provided and try to predict the last column: `readmission`

In [None]:
model_matrix.head()

### Fit a logistic regression model

The goal here will be to fit an *l2-penalized* logistic regression (since our outcome is binary). Specifically, we want to find the right value of $\lambda$ that minimizes the test error.

$$ loss = - \frac{1}{N}\left(\sum_{i=1}^{N}y^{(i)} \log \sigma(\boldsymbol{\beta}^T\textbf{x}^{(i)}) + (1 - y^{(i)}) \log (1 - \sigma(\boldsymbol{\beta}^T\textbf{x}^{(i)}))\right) + \lambda \sum_{j=1}^{P}\beta_j^2$$
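
To make the loss concrete, here is a minimal NumPy sketch that evaluates it for a given $\boldsymbol{\beta}$ and $\lambda$. The function names and the synthetic data are assumptions for illustration; this is not how sklearn computes or parameterizes its objective internally.

```python
import numpy as np


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def penalized_log_loss(beta, X, y, lam):
    """Average negative log-likelihood plus lam times the sum of squared betas."""
    p = sigmoid(X @ beta)
    eps = 1e-12  # guards against log(0)
    nll = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return nll + lam * np.sum(beta ** 2)


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)

# With all coefficients at zero, every prediction is 0.5 and the loss is log(2).
print(penalized_log_loss(np.zeros(3), X, y, lam=0.1))
```

In sklearn, regularization strength is controlled through `C`, which is an inverse regularization strength: smaller `C` corresponds to a larger $\lambda$.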

### Train/test split

Randomly select 20% of the rows to use as the test set. Split your data into the following variables:

 * `X_train`: Train Design matrix without the readmission label
 * `X_test`: Test Design matrix without the readmission label
 * `y_train`: Train label (same length as number of rows of `X_train`)
 * `y_test`: Test label (same length as number of rows of `X_test`)
 
Do this by randomly choosing row numbers. Do not use sklearn's built-in `train_test_split` method. You may want to use the `np.random.choice`, `np.random.permutation`, `math.floor` or other functions. 

In [None]:
model_matrix = model_matrix.values

In [None]:
import numpy as np
import math

#### Fit the model. Note how difficult it is to choose the right value of lambda (or 1/C, in sklearn)

> class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression(penalty='l2', C=0.01)
# Fit the model
lr.fit(X_train, y_train)
predictions = lr.predict_proba(X_test)

In [None]:
from sklearn.metrics import log_loss, roc_auc_score

In [None]:
log_loss(y_test, predictions[:, 1])

In [None]:
roc_auc_score(y_test, predictions[:, 1])

### Train/test/validation 

Using the same logic as above, create a train, test, and validation split (70%/10%/20%). Now, find the best value of C based on the `log_loss` (lower is better) or `roc_auc_score` (higher is better) on the validation split, and then evaluate the model with that value of C on the test set.

In [None]:
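# Fill in each split; the placeholders keep this cell runnable until you do.
X_train = ...
X_valid = ...
X_test = ...
y_train = ...
y_valid = ...
y_test = ...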

In [None]:
C_list = [0.001, 0.01, 0.1, 1, 10]

### Cross-validation 

Now, apply the same procedure, but instead of just having 1 validation set, write a function that splits the non-test set indices into *k* separate folds. Run the model in *each* fold, for *each* value of C. Then, pick the best model, *refit* on the entire training set (non-test), and evaluate on the final test set.