In [6]:
import numpy as np
import pandas as pd
from IPython.display import Image
Image(url='http://datascience.uci.edu/wp-content/uploads/sites/2/2014/09/data_science_logo_with_image1.png')

In [7]:
# First, let's define a few helper functions to use later.  
# Don't worry about trying to pick these apart yet

def mean_squared_error(y_true, y_pred):
    """
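    Calculate the error between a vector of true ys and a vector of predicted ys.
    Note: despite the name, this returns the L2 norm of the residuals divided by N,
    not the textbook mean of the squared residuals.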
    """
    diff = y_true - y_pred
    return np.sqrt(np.dot(diff, diff)) / len(diff)

def calc_train_test_errors(untrained_model, x_train, x_test, y_train, y_test):
    """
    calculate the training and testing l2 error of a model
    """
    model = untrained_model
    model.fit(x_train, y_train)
    y_predicted_train = model.predict(x_train)
    y_predicted_test = model.predict(x_test)

    train_error = mean_squared_error(y_train, y_predicted_train)
    test_error = mean_squared_error(y_test, y_predicted_test)

    model_name = get_model_name(model)
    
    errors = pd.DataFrame(data=[[model_name, train_error, test_error]], 
                          columns=['Model Name', 'Train Error', 'Test Error'])
    
    return errors

def calc_true_and_predicted_ys(trained_model, x, y_true):
    y_predicted = trained_model.predict(x)    

    diff = y_true - y_predicted
    diff_squared = diff * diff
    
    ys_df = pd.DataFrame(data=np.vstack([y_true, y_predicted, diff_squared]).T, 
                         columns=['True y', 'Predicted y', 'Squared Error'])
    
    return ys_df    

def get_model_name(model):
    s = str(model).lower()
    if "linearregression" in s:
        return 'LinearRegression'
    elif "lasso" in s:
        return 'Lasso(a=%g)' % model.alpha
    elif "ridge" in s:
        return 'Ridge(a=%g)' % model.alpha
    elif "elastic" in s:
        return 'ElasticNet(a=%g, r=%g)' % (model.alpha, model.l1_ratio)
    else:
        raise ValueError("Unknown Model Type")

## Review From Before Lunch:
We created a linear model and saw that its error was lower on the past data it was trained on than on the new data.

In [23]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# load data
with np.load('data/mystery_data_old.npz') as data:
    celeb_data_old = data['celeb_data_old']
    popularity_old = data['popularity_old']
    celeb_data_new = data['celeb_data_new']

# fit a model
lmr3 = linear_model.LinearRegression()
lmr3.fit(celeb_data_old, popularity_old)

# predict popularity for old data
predicted_popularity_old = lmr3.predict(celeb_data_old)

# predict popularity for new data
predicted_popularity_new = lmr3.predict(celeb_data_new)

print("Mean Squared Error on Past Data:  %s" % mean_squared_error(popularity_old, predicted_popularity_old))
print("")

with np.load('data/mystery_data_new.npz') as data:
    popularity_new = data['popularity_new']

print("Mean Squared Error on Future Data: %s" % mean_squared_error(popularity_new, predicted_popularity_new))

Mean Squared Error on Past Data:  0.0453157064018

Mean Squared Error on Future Data: 0.0604328385826


## Predictive Modeling
What we saw above is a common setup.  We have $\mathbf{X}$ and $\mathbf{y}$ data from the past and $\mathbf{X}$ data for the present for which we want to **predict** the future $\mathbf{y}$ values.

We can generalize this notion of past / present data into what's usually called train / test data.

* **Training Data** -- A dataset that we use to train our model.  We have both $\mathbf{X}$ and $\mathbf{y}$
* **Testing Data** -- A dataset which only has $\mathbf{X}$ values and for which we need to predict $\mathbf{y}$ values.  We might also have access to the real $\mathbf{y}$ values so that we can test how well our model will do on data it hasn't seen before.

### <span style="color:red">Model Fitting Exercise: 10 Minutes</span>
1. Partner up.  On one computer:
  1. Write a function with the call signature `predict_test_values(model, X_train, y_train, X_test)` where `model` is a scikit-learn model
    1. Fit the model on `X_train` and `y_train`
    1. Predict the y values for `X_test`
    1. Return a vector of predicted y values
  1. Write a second function with the call signature `calc_train_and_test_error(model, X_train, y_train, X_test, y_test)`
    1. Fit the model on `X_train` and `y_train`
    1. Predict the y values for `X_test`
    1. Predict the y values for `X_train`
    1. Calculate the `mean_squared_error` on both the train and test data.
    1. Return the train error and test error
  1. Describe to your partner the situations in which you might use each function

In [18]:
def mean_squared_error(y_true, y_pred):
    """
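    Calculate the error between a vector of true ys and a vector of predicted ys.
    Note: despite the name, this returns the L2 norm of the residuals divided by N,
    not the textbook mean of the squared residuals.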
    """
    diff = y_true - y_pred
    return np.sqrt(np.dot(diff, diff)) / len(diff)

def predict_test_values(model, X_train, y_train, X_test):

    model.fit(X_train, y_train)
    return model.predict(X_test)


def calc_train_and_test_error(model, X_train, y_train, X_test, y_test):


    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    error_train = mean_squared_error(y_train, y_pred_train)
    error_test = mean_squared_error(y_test, y_pred_test)

    return error_train, error_test


    

## The Fundamental Theorems of Machine Learning
### <span style="color:green">**1) A predictive model is only as good as its predictions on unseen data**</span>
### <span style="color:green">**2) Error on the dataset we trained on is not a good predictor of error on future data**</span>

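Why is this true?  Overfitting: a model flexible enough to fit the noise in its training data will look better on that data than it will on data it hasn't seen.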
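
As a rough illustration of overfitting, the sketch below fits polynomials of increasing degree to a small synthetic dataset. The noisy sine curve, the sample sizes, and the helper names `make_data` and `mse` are assumptions made for this example, not part of the workshop data. The training error can only go down as the degree grows, while the error on fresh data usually stops improving and often gets worse.

```python
import numpy as np

rng = np.random.RandomState(0)

def make_data(n):
    # Synthetic data (an assumption for illustration): a noisy sine curve.
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mse(y_true, y_pred):
    # Textbook mean squared error.
    return np.mean((y_true - y_pred) ** 2)

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # fresh data the model never sees

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = mse(y_train, np.polyval(coefs, x_train))
    test_err = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_err, test_err)
    print("degree=%d  train MSE=%.4f  test MSE=%.4f" % (degree, train_err, test_err))
```

Comparing the two columns for the highest degree shows why training error alone is a poor guide to future error.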

### Overfitting in A Picture

In [4]:
Image(url='http://radimrehurek.com/data_science_python/plot_bias_variance_examples_2.png')

## How to Fight Overfitting?
Ultimately we don't want to build a model which performs well on data we've already seen; we want to build a model which will perform well on data we haven't seen.

There are two linked strategies to accomplish this: **regularization** and **model selection**.

## Regularization
The idea in regularization is that we modify our loss function to penalize the model for being too complex. All else being equal, simpler models tend to generalize better.

One way to do this is to try to keep our regression coefficients small. Why would we want to do this? One intuitive explanation is that if we have big regression coefficients we'll get large changes in the predicted value from small changes in input value.  That's bad. Intuitively, our predictions should vary smoothly with the data.

So a model with smaller coefficients makes smoother predictions.  It is simpler, which means it will have a harder time overfitting. 

We can change our linear regression loss function to help us reduce overfitting.

### Linear Regression Loss Function
\begin{eqnarray*}
    Loss(\beta) = MSE &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2 \\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 \\   
\end{eqnarray*}
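
The loss above can be written in a few lines of numpy. This is a minimal sketch: the function name `mse_loss` and the tiny dataset are made up for illustration. Note that the `mean_squared_error` helper defined at the top of this notebook computes `np.sqrt(np.dot(diff, diff)) / len(diff)`, which is the L2 norm of the residuals divided by N rather than this formula.

```python
import numpy as np

def mse_loss(beta, X, y):
    """Mean squared error of the linear model y_hat = X.dot(beta)."""
    residuals = y - X.dot(beta)
    return np.mean(residuals ** 2)

# Tiny made-up dataset: 3 observations, 2 features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 4.0])
beta = np.array([1.0, 2.0])

# Predictions are [1, 2, 3], so the residuals are [0, 0, 1] and the loss is 1/3.
loss = mse_loss(beta, X, y)
print(loss)
```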

### L2 Regularized Linear Regression Loss Function -- "Ridge"
\begin{eqnarray*}
    Loss(\beta) &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha ||\beta||_2^2\\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \beta^T \beta\\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \sum_{d=1}^D \beta_d^2\\
\end{eqnarray*}

We won't get into details, but a ridge regression model can be optimized in much the same way as an unregularized linear regression: either with some form of gradient descent or with a matrix-based (closed-form) solution.
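
For the curious, here is a minimal sketch of the matrix-based route for the ridge loss written above (with the $\frac{1}{N}$ factor and no intercept term). Setting the gradient to zero gives $(X^T X + N \alpha I)\beta = X^T y$. The function name `ridge_closed_form` and the synthetic data are assumptions for illustration; scikit-learn's `Ridge` scales its objective differently, so its `alpha` is not directly comparable.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize (1/N) * ||y - X beta||^2 + alpha * ||beta||^2 (no intercept)."""
    n_samples, n_features = X.shape
    A = X.T.dot(X) + n_samples * alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

# Synthetic data, made up for this example.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
true_beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X.dot(true_beta) + rng.normal(scale=0.1, size=50)

norms = []
for alpha in (0.0, 0.1, 1.0, 10.0):
    beta = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(beta))
    print("alpha=%-5g ||beta||_2 = %.4f" % (alpha, norms[-1]))
```

The printed norms shrink as `alpha` grows, which is the "keep the coefficients small" effect described above. With `alpha=0` the function reduces to ordinary least squares.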

In [22]:
# Ridge Regression in scikit-learn
from sklearn import linear_model
model_ridge = linear_model.Ridge(alpha = .5)

### <span style="color:red">Ridge Regression Errors: 5 Minutes</span>
1. Partner up.  On one computer:
  1. Using your `calc_train_and_test_error` function from the previous exercise:
    1. Calculate the training and testing error for a LinearRegression model on the dataset below
    1. Calculate the training and testing error for a Ridge regression model with `alpha=0.1` on the dataset below

In [29]:
# load data
with np.load('data/mystery_data_old.npz') as data:
    celeb_data_old = data['celeb_data_old']
    popularity_old = data['popularity_old']
    celeb_data_new = data['celeb_data_new']

with np.load('data/mystery_data_new.npz') as data:
    popularity_new = data['popularity_new']



model_lr = linear_model.LinearRegression()
model_ridge = linear_model.Ridge(alpha = 0.1)

print(calc_train_and_test_error(model_lr, celeb_data_old, popularity_old, celeb_data_new, popularity_new))
print(calc_train_and_test_error(model_ridge, celeb_data_old, popularity_old, celeb_data_new, popularity_new))



(0.045315706401795851, 0.060432838582568399)
(0.045315742240803411, 0.060426860720366779)


### L1 Regularized Linear Regression Loss Function -- "LASSO"
\begin{eqnarray*}
    Loss(\beta) &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha ||\beta||_1\\
    &=& \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \sum_{d=1}^D |\beta_d|\\
\end{eqnarray*}

In [13]:
# TODO: LASSO on dataset
clf = linear_model.Lasso(alpha = 0.1)

# TODO: interpretation, look at betas, plot betas
# TODO: the L1 diamond

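How does this affect our $\beta$ vector?  The L1 penalty tends to push some coefficients to exactly zero, so LASSO solutions are often sparse.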
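
One way to explore this question is to fit `Lasso` and `Ridge` on the same data and compare the fitted coefficients. The sketch below uses a synthetic dataset (an assumption for illustration, not the workshop data) in which only the first two of ten features actually influence `y`.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
# Only the first two features influence y; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = linear_model.Lasso(alpha=0.1).fit(X, y)
ridge = linear_model.Ridge(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero in each fitted model.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients: %s" % np.round(lasso.coef_, 3))
print("Ridge coefficients: %s" % np.round(ridge.coef_, 3))
print("exact zeros -> Lasso: %d, Ridge: %d" % (n_zero_lasso, n_zero_ridge))
```

Ridge shrinks the noise coefficients toward zero but leaves them nonzero, whereas the L1 penalty sets irrelevant coefficients to exactly zero, which makes LASSO usable as a rough form of feature selection.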

### L1 + L2 Regularized Linear Regression Loss Function -- "ElasticNet"
\begin{eqnarray*}
    Loss(\beta) &=& \frac{1}{2N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \rho ||\beta||_1 + \frac{\alpha (1 - \rho)}{2} ||\beta||_2^2\\
    &=& \frac{1}{2N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \alpha \rho \sum_{d=1}^D |\beta_d| + \frac{\alpha (1 - \rho)}{2} \sum_{d=1}^D \beta_d^2\\
\end{eqnarray*}
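
To see how $\rho$ trades off the two penalties, here is a small sketch that evaluates only the penalty part of the loss above. The function name `elastic_net_penalty` is hypothetical; in scikit-learn's `ElasticNet`, $\rho$ corresponds to the `l1_ratio` argument.

```python
import numpy as np

def elastic_net_penalty(beta, alpha, rho):
    """alpha * rho * ||beta||_1 + alpha * (1 - rho) / 2 * ||beta||_2^2"""
    l1 = np.sum(np.abs(beta))
    l2_squared = np.sum(beta ** 2)
    return alpha * rho * l1 + 0.5 * alpha * (1.0 - rho) * l2_squared

beta = np.array([3.0, -4.0, 0.0])   # ||beta||_1 = 7, ||beta||_2^2 = 25

print(elastic_net_penalty(beta, alpha=1.0, rho=1.0))  # pure L1 penalty: 7.0
print(elastic_net_penalty(beta, alpha=1.0, rho=0.0))  # pure L2 penalty: 12.5
print(elastic_net_penalty(beta, alpha=1.0, rho=0.5))  # mix: 3.5 + 6.25 = 9.75
```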

In [None]:
from sklearn import linear_model
clf = linear_model.ElasticNet(alpha=1.0, l1_ratio=0.5)

## How to Choose $\alpha$ and/or $\rho$?  Cross Validation
There are many forms of cross validation.  The basic idea of each is to _train_ your model on some data and _estimate its future performance_ on other data.

## Types of Cross Validation
### Validation Set Cross Validation
1. Pick an amount of data to be in your validation set (e.g. 10%)
2. Randomly split datapoints into training points (90%) and validation points (10%)
3. Train your model on the training data
4. Test your model on the validation data, record the validation error
5. The estimated future error is the validation error


* **Good:** Easy and computationally cheap
* **Bad:** Statistically noisy and wastes data
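
The validation-set steps above can be sketched with plain numpy. The function name `validation_set_error`, the synthetic data, and the use of least squares without an intercept are all assumptions made to keep the example short.

```python
import numpy as np

def validation_set_error(X, y, val_fraction=0.1, seed=0):
    """Hold out a random fraction of the rows and return the MSE on them."""
    rng = np.random.RandomState(seed)
    n = len(y)
    n_val = max(1, int(round(val_fraction * n)))   # step 1: size of the validation set
    order = rng.permutation(n)                     # step 2: random split
    val_idx, train_idx = order[:n_val], order[n_val:]

    # Step 3: train (ordinary least squares without an intercept, for brevity).
    beta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]

    # Step 4: evaluate on the held-out rows.
    residuals = y[val_idx] - X[val_idx].dot(beta)
    return np.mean(residuals ** 2)

# Synthetic data, made up for this example.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 3))
y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.1, size=200)

# Different random splits give different estimates: this is the "noisy" part.
estimates = [validation_set_error(X, y, seed=s) for s in range(5)]
print(estimates)
```

Running it with several seeds makes the statistical noise visible: each split yields a somewhat different estimate of the same model's future error.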

### K-Fold Cross Validation
1. Partition the data into K folds
2. For each fold k:
  1. Train the model on all your data except the data in fold k
  2. Record the error on the data in fold k
3. Estimate future error as the average error across all folds


* **Good:** Computationally cheaper than leave-one-out cross validation; each model is trained on all but 100/K% of the data, and every datapoint is used for validation exactly once
* **Bad:** K times as expensive as training a single model; each model still misses 100/K% of the data
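
A minimal numpy sketch of the K-fold procedure is shown below. The function name `k_fold_error` and the synthetic data are assumptions for illustration, and the model is plain least squares without an intercept. The per-fold errors are averaged here, which differs from their total only by a constant factor.

```python
import numpy as np

def k_fold_error(X, y, n_folds=5, seed=0):
    """Average held-out MSE of least-squares regression over n_folds folds."""
    rng = np.random.RandomState(seed)
    # Step 1: shuffle the row indices and partition them into folds.
    folds = np.array_split(rng.permutation(len(y)), n_folds)

    fold_errors = []
    for k in range(n_folds):
        # Step 2: train on everything except fold k, then test on fold k.
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        beta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
        residuals = y[test_idx] - X[test_idx].dot(beta)
        fold_errors.append(np.mean(residuals ** 2))

    # Step 3: combine the per-fold errors into one estimate.
    return np.mean(fold_errors), fold_errors

# Synthetic data, made up for this example.
rng = np.random.RandomState(1)
X = rng.normal(size=(100, 3))
y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.1, size=100)

avg_error, per_fold = k_fold_error(X, y, n_folds=5)
print("per-fold MSE: %s" % np.round(per_fold, 4))
print("estimated future MSE: %.4f" % avg_error)
```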

In [2]:
from IPython.display import Image
Image(url='https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png')

## Hyperparameter Selection with Cross Validation
1. For each model:
  1. Estimate the model's performance on future data using cross validation
2. Pick the model with the best estimated future performance
3. Train the best model from scratch on the full dataset.  This is your model
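
As a sketch of this recipe, the example below treats each candidate `alpha` for `Ridge` as a separate model, estimates its future error with 5-fold cross validation, and refits the winner on all of the data. The synthetic dataset and the list of candidate alphas are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, made up for this example.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Step 1: estimate each candidate's future error with cross validation.
cv_errors = {}
for alpha in candidate_alphas:
    model = linear_model.Ridge(alpha=alpha)
    # scikit-learn reports negated MSE so that larger scores are better.
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
    cv_errors[alpha] = -scores.mean()

# Step 2: pick the candidate with the lowest estimated error.
best_alpha = min(cv_errors, key=cv_errors.get)

# Step 3: retrain the chosen model from scratch on the full dataset.
final_model = linear_model.Ridge(alpha=best_alpha).fit(X, y)

print("CV error by alpha: %s" % cv_errors)
print("chosen alpha: %s" % best_alpha)
```

Using the same `KFold` object for every candidate keeps the comparison fair, since all models are scored on identical splits.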

In [15]:
# TODO: CV with data
from sklearn import linear_model
ridge_cv = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
lasso_cv = linear_model.LassoCV(alphas=[0.1, 1.0, 10.0])

## Caveats:
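* You can still overfit with intensive cross validation: if you compare enough models on the same folds, the winner may simply have gotten lucky on those folds
* But it's still much better than choosing a model by its training error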

## Summary:
* **The Central Thesis of Machine Learning:** We're only interested in predictive performance on unseen data, NOT seen data.
* Training set error estimates error on **seen** data
* Cross validation error estimates error on **unseen** data
* Regularization is a way to improve your error on unseen data, but it introduces new hyperparameters
* Use a cross validated estimate of future performance to choose your model / hyperparameter settings