# Overfitting Lesson
#### Ben Wilson

## Supervised Learning Model Steps
When building a supervised model (like regression or classification), there is a general methodology you'll want to follow.

1) Gather the data set with predictor variables and the target variable to predict in the future.
2) Split the data set into training and testing data, typically a 70/30 split.
3) Build the model based on the training data.
4) Try out the model with the data set aside for testing.
5) Fine tune your model using cross-validation to improve predictive accuracy.
6) Predict unknown data!
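Steps 2 through 4 can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the data set is made up (a straight line plus noise), the seed is arbitrary, and mean squared error is just one reasonable way to score the test predictions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Step 1: a hypothetical data set with one predictor and one target
X = np.linspace(0, 10, 200)
y = 2.0 * X + 1.0 + rng.normal(0, 1.0, X.size)

# Step 2: shuffle the row indices, then take 70% for training and 30% for testing
indices = rng.permutation(X.size)
n_train = int(0.7 * X.size)
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Step 3: fit a straight line using the training rows only
slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)

# Step 4: score the model on the rows it has never seen
test_pred = slope * X[test_idx] + intercept
test_mse = np.mean((y[test_idx] - test_pred) ** 2)
print(f"test MSE: {test_mse:.3f}")
```

Shuffling before splitting is a design choice here: it keeps the training and testing rows spread over the same range of X, so the test score measures generalization rather than extrapolation.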

## Where to watch out for overfitting...
Overfitting can happen in steps 2 through 4. 

If you do not split your data set into training and testing, your estimate of the model's accuracy will be inherently too optimistic, because you have optimized for just that one set of data.

A similar pitfall can happen if you try to optimize just on the training data. The testing data helps keep your model in check.

Also, sometimes simpler models do a better job of predicting than higher-degree polynomial models that are fitted too closely to the current data set.
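The points above can be checked with a small experiment. The sketch below is an assumption-laden toy, not part of the walkthrough: the data is a made-up noisy straight line, the seed is arbitrary, and degrees 1 and 9 are picked only to contrast a simple model with a very flexible one. It fits each model on the training rows and reports the mean squared error on both the training and the testing rows.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical data: a straight line plus noise, so degree 1 is the "true" shape
X = rng.uniform(0, 1, 40)
y = 1.0 + 2.0 * X + rng.normal(0, 0.3, X.size)

# X was drawn at random, so the first 28 rows form a random 70/30 split
train_X, test_X = X[:28], X[28:]
train_y, test_y = y[:28], y[28:]

def mse(model, x, target):
    """Mean squared error of a fitted polynomial on the given rows."""
    return float(np.mean((target - model(x)) ** 2))

results = {}
for degree in (1, 9):
    model = Polynomial.fit(train_X, train_y, degree)  # fit on training rows only
    results[degree] = (mse(model, train_X, train_y), mse(model, test_X, test_y))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The training error of the degree 9 fit can never be higher than that of the degree 1 fit, because the simpler model is a special case of the more flexible one. Only the test error tells you whether that extra flexibility actually helps on new data.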

## Walkthrough example

In [7]:
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Set seed for reproducible results
np.random.seed(414)

# Generate toy data
X = np.linspace(0, 15, 1000)
y = 3 * np.sin(X) + np.random.normal(1 + X, .2, 1000)

train_X, train_y = X[:700], y[:700]
test_X, test_y = X[700:], y[700:]

train_df = pd.DataFrame({'X': train_X, 'y': train_y})
test_df = pd.DataFrame({'X': test_X, 'y': test_y})

In [8]:
# Add in a small helper function
from IPython.display import HTML
def short_summary(est):
    return HTML(est.summary().tables[1].as_html())

In [9]:
# Linear Fit
poly_1 = smf.ols(formula='y ~ 1 + X', data=train_df).fit()
short_summary(poly_1)

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 1.9959 | 0.152 | 13.104 | 0.000 | 1.697 2.295 |
| X | 0.8896 | 0.025 | 35.405 | 0.000 | 0.840 0.939 |


In [10]:
# Quadratic Fit
poly_2 = smf.ols(formula='y ~ 1 + X + I(X**2)', data=train_df).fit()
short_summary(poly_2)

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 3.1458 | 0.221 | 14.261 | 0.000 | 2.713 3.579 |
| X | 0.2313 | 0.097 | 2.382 | 0.017 | 0.041 0.422 |
| I(X ** 2) | 0.0627 | 0.009 | 7.004 | 0.000 | 0.045 0.080 |


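On the training data, the quadratic regression fits at least as well as the linear regression: adding a term to a least-squares model can never increase its training error, and here the I(X ** 2) coefficient is clearly significant (t = 7.004, P < 0.001).

Note, though, that the standard errors and confidence intervals for the intercept and X are wider in the quadratic fit (0.221 and 0.097, versus 0.152 and 0.025 in the linear fit), so the coefficient tables alone do not show that it is the better model.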

I suspect that when we run the test data through both models, the quadratic will not predict the target variable as well as the linear regression.

__How do I run the test data set through the model I just created and optimized for?__
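One possible approach, sketched here under a few assumptions: a model fitted through the statsmodels formula API has a predict method that accepts a DataFrame and re-applies the formula to its columns. The mean_squared_error helper below is hypothetical (it is not part of statsmodels), and scoring by mean squared error is my choice of metric rather than something fixed by the walkthrough. The first few lines simply rebuild the toy data and both fits so the block runs on its own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the walkthrough's toy data and its 700/300 split
np.random.seed(414)
X = np.linspace(0, 15, 1000)
y = 3 * np.sin(X) + np.random.normal(1 + X, .2, 1000)
train_df = pd.DataFrame({'X': X[:700], 'y': y[:700]})
test_df = pd.DataFrame({'X': X[700:], 'y': y[700:]})

poly_1 = smf.ols(formula='y ~ 1 + X', data=train_df).fit()
poly_2 = smf.ols(formula='y ~ 1 + X + I(X**2)', data=train_df).fit()

def mean_squared_error(est, df):
    """Hypothetical helper: MSE of a fitted model's predictions on df."""
    predictions = est.predict(df)  # the formula is re-applied to df's X column
    return float(np.mean((df['y'] - predictions) ** 2))

for name, est in [('linear', poly_1), ('quadratic', poly_2)]:
    print(name,
          'train MSE:', mean_squared_error(est, train_df),
          'test MSE:', mean_squared_error(est, test_df))
```

Comparing the train and test numbers for each model is what reveals overfitting: a model whose test error is much worse than its training error has learned the training set rather than the underlying pattern.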