In [8]:
import numpy as np
import pandas as pd
from IPython.display import Image
Image(url='http://datascience.uci.edu/wp-content/uploads/sites/2/2014/09/data_science_logo_with_image1.png')

In [67]:
# First, let's define a few helper functions to use later.  
# Don't worry about trying to pick these apart yet

def l2_error(y_true, y_pred):
    """
    calculate the L2 norm of the errors (the square root of the sum of
    squared errors, i.e. "L2 error") given a vector of true ys and a
    vector of predicted ys
    """
    diff = (y_true-y_pred)
    return np.sqrt(np.dot(diff, diff))

def calc_train_test_errors(untrained_model, x_train, x_test, y_train, y_test):
    """
    calculate the training and testing l2 error of a model
    """
    model = untrained_model
    model.fit(x_train, y_train)
    y_predicted_train = model.predict(x_train)
    y_predicted_test = model.predict(x_test)

    train_error = l2_error(y_train, y_predicted_train)
    test_error = l2_error(y_test, y_predicted_test)

    model_name = get_model_name(model)
    
    errors = pd.DataFrame(data=[[model_name, train_error, test_error]], 
                          columns=['Model Name', 'Train Error', 'Test Error'])
    
    return errors

def calc_true_and_predicted_ys(trained_model, x, y_true):
    y_predicted = trained_model.predict(x)    

    diff = y_true - y_predicted
    diff_squared = diff * diff
    
    ys_df = pd.DataFrame(data=np.vstack([y_true, y_predicted, diff_squared]).T, 
                         columns=['True y', 'Predicted y', 'Squared Error'])
    
    return ys_df    

def get_model_name(model):
    s = str(model).lower()
    if "linearregression" in s:
        return 'LinearRegression'
    elif "lasso" in s:
        return 'Lasso(a=%g)' % model.alpha
    elif "ridge" in s:
        return 'Ridge(a=%g)' % model.alpha
    elif "elastic" in s:
        return 'ElasticNet(a=%g, r=%g)' % (model.alpha, model.l1_ratio)
    else:
        raise ValueError("Unknown Model Type")

## Review From Before Lunch:

We created a linear model and saw that it performed well on the training half of the data but noticeably worse on the held-out test half.

In [71]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18

random_seed = 2

with np.load('data/mystery_data.npz') as data:
    x = data['X']
    y = data['y']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=random_seed)

model = linear_model.LinearRegression()
errors = calc_train_test_errors(model, x_train, x_test, y_train, y_test)
print(errors)
print('')
# train model on training data
trained_model = model
trained_model.fit(x_train, y_train)

# get predictions from model trained on training data for both training and test data
ys_train_df = calc_true_and_predicted_ys(trained_model, x_train, y_train)
ys_test_df = calc_true_and_predicted_ys(trained_model, x_test, y_test)

print('Predictions on train data:')
print(ys_train_df.head())
print('')
print('Predictions on test data:')
print(ys_test_df.head())


         Model Name  Train Error  Test Error
0  LinearRegression    18.126283   24.173135

Predictions on train data:
     True y  Predicted y  Squared Error
0  0.977557     1.711623       0.538853
1 -1.900613    -0.983559       0.840987
2  7.957678     8.284945       0.107103
3  1.817175     1.706441       0.012262
4 -3.091820    -3.502911       0.168996

Predictions on test data:
     True y  Predicted y  Squared Error
0  7.576276     5.509231       4.272678
1  6.156291     5.822688       0.111291
2  5.806589     5.616300       0.036210
3  8.961585     7.302489       2.752600
4  3.522643     3.649916       0.016198


# Why the Failure?  We're Overfitting

### Overfitting in Pictures


In [4]:
Image(url='http://radimrehurek.com/data_science_python/plot_bias_variance_examples_2.png')

In [5]:
Image(url='http://upload.wikimedia.org/wikipedia/commons/1/19/Overfitting.svg')

## How to Prevent Overfitting?
Ultimately, we don't want a model that performs well only on data we've already seen; we want a model that performs well on data we haven't seen.

There are two linked strategies to accomplish this: regularization and model selection.

## Regularization
### Linear Regression Loss Function
\begin{eqnarray*}
    Loss(\beta) = MSE &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2 \\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 \\   
\end{eqnarray*}
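
To make the loss concrete, here is a minimal sketch that evaluates the MSE for a given $\beta$ with NumPy. The synthetic data and the names `mse_loss`, `true_beta` and `beta_ols` are assumptions made for illustration; they are not part of the workshop's dataset or helpers.

```python
import numpy as np

def mse_loss(beta, X, y):
    """Mean squared error of the linear prediction X.dot(beta)."""
    residuals = y - X.dot(beta)
    return np.mean(residuals ** 2)

# Synthetic data (an assumption for illustration, not the mystery data)
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X.dot(true_beta) + rng.normal(scale=0.1, size=50)

# Ordinary least squares picks the beta that minimizes the training MSE
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(mse_loss(beta_ols, X, y))        # training loss at the minimizer
print(mse_loss(beta_ols + 0.1, X, y))  # a perturbed beta gives a larger loss
```

Note that this only measures error on the data used for fitting, which is exactly the quantity that can look deceptively good when a model overfits.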

### L2 Regularized Linear Regression Loss Function -- "Ridge"
\begin{eqnarray*}
    Loss(\beta) &=& MSE + \alpha ||\beta||_2^2\\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha ||\beta||_2^2\\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \beta^T \beta\\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \sum_{d=1}^D \beta_d^2\\
\end{eqnarray*}

In [17]:
from sklearn import linear_model
clf = linear_model.Ridge(alpha=0.5)
# TODO: ridge regression on dataset
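
As a rough sketch of what the L2 penalty does, the ridge loss written above (with the $\frac{1}{N}$ factor) can be minimized in closed form: setting the gradient to zero gives $(X^TX + N\alpha I)\beta = X^Ty$. The function `fit_ridge` and the synthetic data below are assumptions for illustration. scikit-learn's `Ridge` omits the $\frac{1}{N}$ factor in its objective, so its `alpha` corresponds to $N\alpha$ here.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Minimize (1/N) * sum((y - X beta)^2) + alpha * ||beta||_2^2 in closed form."""
    n_samples, n_features = X.shape
    A = X.T.dot(X) + n_samples * alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

# Synthetic data (an assumption for illustration, no intercept term)
rng = np.random.RandomState(0)
X = rng.normal(size=(30, 5))
y = X.dot(np.array([3.0, -2.0, 0.0, 0.0, 1.0])) + rng.normal(scale=0.5, size=30)

# Larger alpha pulls the whole coefficient vector toward zero
for alpha in [0.0, 0.1, 1.0, 10.0]:
    beta = fit_ridge(X, y, alpha)
    print("alpha=%-5g ||beta||_2 = %.3f" % (alpha, np.linalg.norm(beta)))
```

With `alpha=0.0` this reduces to ordinary least squares, so the first row is the unregularized baseline.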

### L1 Regularized Linear Regression Loss Function -- "LASSO"
\begin{eqnarray*}
    Loss(\beta) &=& MSE + \alpha ||\beta||_1\\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha ||\beta||_1\\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \sum_{d=1}^D |\beta_d|\\
\end{eqnarray*}

In [13]:
# TODO: LASSO on dataset
clf = linear_model.Lasso(alpha=0.1)

# TODO: interpretation, look at betas, plot betas
# TODO: the L1 diamond

How does this affect our $\beta$ vector?
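
One way to see the difference is a simplified special case. Assume the features are orthonormal in the sense that $X^TX / N = I$ (an assumption made only to get closed forms; real data rarely satisfies it). Then, for the $\frac{1}{N}$-scaled losses above, ridge rescales every least-squares coefficient by $1/(1+\alpha)$, while LASSO soft-thresholds each one at $\alpha/2$. The function names and the example coefficients below are hypothetical.

```python
import numpy as np

def ridge_shrink(beta_ols, alpha):
    """Ridge solution under an orthonormal design: uniform rescaling."""
    return beta_ols / (1.0 + alpha)

def lasso_shrink(beta_ols, alpha):
    """LASSO solution under an orthonormal design: soft-thresholding at alpha / 2."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha / 2.0, 0.0)

# Hypothetical least-squares coefficients
beta_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])
alpha = 1.0

print(ridge_shrink(beta_ols, alpha))  # every coefficient shrinks, none is exactly zero
print(lasso_shrink(beta_ols, alpha))  # coefficients smaller than alpha / 2 become zero
```

This is why LASSO is often described as performing feature selection: the L1 penalty drives small coefficients to exactly zero, whereas the L2 penalty only makes them smaller.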

### L1 + L2 Regularized Linear Regression Loss Function -- "ElasticNet"
\begin{eqnarray*}
    Loss(\beta) &=& \frac{1}{2N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \rho ||\beta||_1 + \frac{\alpha (1 - \rho)}{2} ||\beta||_2^2\\
    &=& \frac{1}{2N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \rho \sum_{d=1}^D |\beta_d| + \frac{\alpha (1 - \rho)}{2} \sum_{d=1}^D \beta_d^2\\
\end{eqnarray*}

In [None]:
from sklearn import linear_model
clf = linear_model.ElasticNet(alpha=1.0, l1_ratio=0.5)
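
The role of $\rho$ (the `l1_ratio` argument) may be easier to see by evaluating the ElasticNet loss directly. The sketch below uses the $\frac{1}{2N}$ scaling of the squared-error term; `elastic_net_loss` and the tiny hand-made inputs are assumptions for illustration.

```python
import numpy as np

def elastic_net_loss(beta, X, y, alpha, rho):
    """Squared error / (2N) plus a rho-weighted mix of L1 and L2 penalties."""
    n_samples = X.shape[0]
    residuals = y - X.dot(beta)
    data_term = residuals.dot(residuals) / (2.0 * n_samples)
    l1_term = alpha * rho * np.sum(np.abs(beta))
    l2_term = 0.5 * alpha * (1.0 - rho) * np.sum(beta ** 2)
    return data_term + l1_term + l2_term

# Tiny hand-made example (hypothetical values)
X = np.eye(2)
y = np.array([1.0, 2.0])
beta = np.array([1.0, -1.0])

print(elastic_net_loss(beta, X, y, alpha=1.0, rho=1.0))  # only the L1 penalty is active
print(elastic_net_loss(beta, X, y, alpha=1.0, rho=0.0))  # only the L2 penalty is active
print(elastic_net_loss(beta, X, y, alpha=1.0, rho=0.5))  # an even mix of both
```

Setting `rho=1.0` recovers a LASSO-style penalty and `rho=0.0` a ridge-style penalty, so ElasticNet has two hyperparameters to choose instead of one.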

## How to Choose $\alpha$ and/or $\rho$?  Cross Validation
There are many forms of cross validation.  The basic idea of each is to _train_ your model on some data and _estimate its future performance_ on other data.

## Types of Cross Validation
### Validation Set Cross Validation
1. Pick an amount of data to be in your validation set (e.g. 10%)
2. Randomly split datapoints into training points (90%) and validation points (10%)
3. Train your model on the training data
4. Test your model on the validation data, record the validation error
5. The estimated future error is the validation error


* **Good:** Easy and computationally cheap
* **Bad:** Statistically noisy and wastes data
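
The steps above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions: the model is ordinary least squares without an intercept, the error measure is mean squared error, and `validation_set_error` and the synthetic data are hypothetical names and values.

```python
import numpy as np

def validation_set_error(X, y, validation_fraction=0.1, seed=0):
    """Hold out a random fraction of the data and report the MSE on it."""
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    shuffled = rng.permutation(n_samples)
    n_val = max(1, int(round(validation_fraction * n_samples)))
    val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]

    # Train on the training points only
    beta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]

    # Evaluate on the held-out validation points
    residuals = y[val_idx] - X[val_idx].dot(beta)
    return np.mean(residuals ** 2)

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(1)
X = rng.normal(size=(100, 3))
y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.3, size=100)

# Each seed draws a different random split, so the estimate changes
for seed in range(3):
    print(validation_set_error(X, y, seed=seed))
```

Running it with several seeds gives a feel for the "statistically noisy" drawback: the estimate depends on which points happen to land in the validation set.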

### Leave One Out Cross Validation
1. For each datapoint in the training set
  1. Train the model on all data except that datapoint
  2. Record your error measure on the chosen datapoint
2. Estimated future error is total error across all the datapoints 


* **Good:** Doesn't waste data
* **Bad:** Computationally expensive
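
A minimal sketch of the procedure above, again assuming ordinary least squares without an intercept and squared error as the error measure. The function name `leave_one_out_error` and the synthetic data are hypothetical. It reports the mean rather than the total error; the two differ only by a factor of the number of datapoints, so they rank models identically.

```python
import numpy as np

def leave_one_out_error(X, y):
    """Refit the model once per datapoint, testing on the point left out."""
    n_samples = X.shape[0]
    squared_errors = np.empty(n_samples)
    for i in range(n_samples):
        keep = np.arange(n_samples) != i  # boolean mask: everything except point i
        beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        squared_errors[i] = (y[i] - X[i].dot(beta)) ** 2
    return squared_errors.mean()

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.3, size=40)

print(leave_one_out_error(X, y))
```

The loop fits one model per datapoint, which is where the computational cost comes from.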

### K-Fold Cross Validation
1. Partition the data into K folds
2. For each fold k:
  1. Train the model on all your data except the data in k
  2. Record the error on the data in k
3. Estimate future error as total error across all folds


* **Good:** Computationally cheaper than leave one out cross validation, only wastes 100/K% of the data
* **Bad:** K times as expensive as just training one model, wastes 100/K% of the data
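
The K-fold steps can be sketched in the same style. The assumptions are the same as before: ordinary least squares without an intercept, mean squared error, and hypothetical names (`k_fold_error`) and synthetic data. The folds are formed by shuffling the indices and splitting them into K nearly equal parts, and the result is averaged across folds rather than summed.

```python
import numpy as np

def k_fold_error(X, y, n_folds=5, seed=0):
    """Average validation MSE over K folds."""
    rng = np.random.RandomState(seed)
    # Step 1: partition the shuffled indices into K folds
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    fold_errors = []
    for k in range(n_folds):
        # Step 2A: train on every fold except fold k
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        beta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
        # Step 2B: record the error on fold k
        residuals = y[val_idx] - X[val_idx].dot(beta)
        fold_errors.append(np.mean(residuals ** 2))
    # Step 3: combine the per-fold errors
    return np.mean(fold_errors)

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.3, size=60)

print(k_fold_error(X, y, n_folds=5))
```

Every datapoint is used for validation exactly once, and only K models are trained instead of one per datapoint.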

In [8]:
Image(url='https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png')

## Hyperparameter Selection with Cross Validation
1. For each model:
  1. Estimate the model's performance on future data using cross validation
2. Pick the model with the best estimated future performance
3. Train the best model from scratch on the full dataset.  This is your model
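
Here is a rough end-to-end sketch of those three steps, where each "model" is ridge regression with a different $\alpha$. Everything in it is an assumption for illustration: the closed-form `fit_ridge` (for the $\frac{1}{N}$-scaled ridge loss, no intercept), the 5-fold `cv_error`, the candidate grid, and the synthetic data.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form minimizer of (1/N) * squared error + alpha * ||beta||_2^2."""
    n_samples, n_features = X.shape
    A = X.T.dot(X) + n_samples * alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

def cv_error(X, y, alpha, n_folds=5, seed=0):
    """K-fold estimate of future MSE for ridge with the given alpha."""
    rng = np.random.RandomState(seed)  # same seed, so every alpha sees the same folds
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    errors = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        beta = fit_ridge(X[train_idx], y[train_idx], alpha)
        errors.append(np.mean((y[val_idx] - X[val_idx].dot(beta)) ** 2))
    return np.mean(errors)

# Synthetic data: 20 features, only the first two matter (an assumption)
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=40)

candidate_alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
cv_errors = [cv_error(X, y, a) for a in candidate_alphas]  # step 1
best_alpha = candidate_alphas[int(np.argmin(cv_errors))]   # step 2
final_beta = fit_ridge(X, y, best_alpha)                   # step 3: refit on all data

print(best_alpha)
```

Using the same folds for every candidate keeps the comparison fair, since differences in the estimated error then come from $\alpha$ rather than from the split.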

In [15]:
# TODO: CV with data
from sklearn import linear_model
ridge_cv = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
lasso_cv = linear_model.LassoCV(alphas=[0.1, 1.0, 10.0])
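
As a hedged illustration of how these estimators might be used, the sketch below fits both on synthetic data (an assumption; substitute your own `x_train` and `y_train`). After fitting, each estimator exposes the selected penalty strength as `alpha_` and has been refit on all of the data passed to `fit`. By default `RidgeCV` uses an efficient form of leave one out cross validation; `cv=5` asks `LassoCV` for 5-fold cross validation.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: 10 features, only the first two matter (an assumption)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

alphas = [0.1, 1.0, 10.0]
ridge_cv = linear_model.RidgeCV(alphas=alphas).fit(X, y)
lasso_cv = linear_model.LassoCV(alphas=alphas, cv=5).fit(X, y)

print(ridge_cv.alpha_)  # alpha chosen for ridge
print(lasso_cv.alpha_)  # alpha chosen for LASSO
print(lasso_cv.coef_)   # coefficients of the refit LASSO model
```

Comparing `lasso_cv.coef_` against the two features that actually generate `y` is a quick check on which coefficients the L1 penalty has zeroed out.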

## Caveats:
* You can still overfit with intensive cross validation!
* But it's still much better than not using cross validation

## Summary:
* **The Central Thesis of Machine Learning:** We're only interested in predictive performance on unseen data, NOT seen data.
* Training set error estimates error on **seen** data
* Cross validation error estimates error on **unseen** data
* Regularization is a way to improve your error on unseen data, but it introduces new hyperparameters
* Use a cross validated estimate of future performance to choose your model / hyperparameter settings