# Model selection with test sets

When we evaluated the linear regression model in section 03, we noted that a model should not be assessed on the same dataset that was used to train it.

The reason is that a model may *overfit* a specific dataset: it fits that data so closely, noise included, that it performs poorly on new data.
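To make the idea concrete, here is a minimal sketch on synthetic data (not the Boston dataset). The `noisy_line` and `mse` helpers, the noise level, and the polynomial degrees are all assumptions chosen for illustration. A degree-9 polynomial fitted to 10 training points passes through every one of them, so its training error is practically zero, yet that says nothing about how it behaves on points it has not seen.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_line(x):
    # Hypothetical ground truth: a straight line plus Gaussian noise.
    return 2.0 * x + rng.normal(scale=0.3, size=x.shape)

# Ten training points and fifty separate testing points.
x_training = np.linspace(0.0, 1.0, 10)
y_training = noisy_line(x_training)
x_testing = np.linspace(0.05, 0.95, 50)
y_testing = noisy_line(x_testing)

def mse(coefficients, x, y):
    # Mean squared error of a polynomial with the given coefficients.
    return np.mean((np.polyval(coefficients, x) - y) ** 2)

errors = {}
for degree in (1, 9):
    coefficients = np.polyfit(x_training, y_training, degree)
    errors[degree] = (
        mse(coefficients, x_training, y_training),
        mse(coefficients, x_testing, y_testing),
    )
    print('degree', degree, 'training MSE:', errors[degree][0],
          'testing MSE:', errors[degree][1])
```

The degree-9 fit always has the lower training error here, because it has enough coefficients to reproduce the noise in the training points. Its testing error is typically much worse than that of the straight line, which is the symptom of overfitting.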

One way to guard against overfitting is to split our data into a **training** dataset and a **testing** dataset: the former is used to build different models; the latter is used to select the best one.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

boston = pd.read_csv('datasets/boston.csv')

To split our data, we'll use scikit-learn's *train_test_split()* function. It takes a dataset and a *test_size* parameter giving the proportion of rows to hold out for testing; the remaining rows are used for training:

In [2]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
boston_training, boston_testing = train_test_split(boston, test_size=0.25)
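As a quick sanity check of what the split produces, the sketch below applies the same call to a small made-up DataFrame (`toy` is a hypothetical stand-in, not the Boston data). It also passes `random_state`, an optional parameter of *train_test_split()* that fixes the shuffle so the split is reproducible between runs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row dataset used only to inspect the split.
toy = pd.DataFrame({'x': np.arange(100), 'y': np.arange(100) * 2})

# Hold out 25% of the rows for testing; random_state makes the shuffle repeatable.
toy_training, toy_testing = train_test_split(toy, test_size=0.25, random_state=0)

print(toy_training.shape)  # (75, 2)
print(toy_testing.shape)   # (25, 2)

# No row ends up in both parts.
shared_rows = set(toy_training.index) & set(toy_testing.index)
print(len(shared_rows))    # 0
```

Without `random_state`, each run shuffles the rows differently, so the errors reported for a model can change from one execution to the next.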

Let's create two linear regression models using our training dataset: one using all the features and another using only two of them, crim and age.

In [3]:
# Target: the medv column, reshaped into a single-column 2D array
y_training = boston_training.medv.values.reshape(-1, 1)
y_testing = boston_testing.medv.values.reshape(-1, 1)

# Model 1 uses every column except the target
X1_training = boston_training.drop('medv', axis=1).values
X1_testing = boston_testing.drop('medv', axis=1).values
# Model 2 uses only the crim and age columns
X2_training = boston_training[['crim', 'age']].values
X2_testing = boston_testing[['crim', 'age']].values

In [4]:
from sklearn.linear_model import LinearRegression

lr1 = LinearRegression().fit(X1_training, y_training)
lr2 = LinearRegression().fit(X2_training, y_training)

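Both models were fitted on the training dataset only. We now compare them by their mean squared error on the testing dataset, which neither model has seen:

In [5]: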
from sklearn.metrics import mean_squared_error

print('All features error: ', mean_squared_error(y_testing, lr1.predict(X1_testing)))
print('Crime and age features error: ', mean_squared_error(y_testing, lr2.predict(X2_testing)))

All features error:  21.0088072501
Crime and age features error:  64.1549200059
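The model with all the features has the lower error on the testing dataset (about 21.0 versus 64.2), so it is the one we would select. The same comparison extends to any number of candidate models. The sketch below shows one way to do it on synthetic data; the `candidates` dictionary, the feature indices, and the generated data are assumptions made for illustration, not part of the Boston example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: the target depends on all five features, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 1.0]) + rng.normal(scale=0.5, size=200)

X_training, X_testing, y_training, y_testing = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Each candidate model is described by the feature columns it uses.
candidates = {
    'all features': [0, 1, 2, 3, 4],
    'first two features': [0, 1],
}

test_errors = {}
for name, columns in candidates.items():
    # Fit on the training rows only, then score on the held-out rows.
    model = LinearRegression().fit(X_training[:, columns], y_training)
    predictions = model.predict(X_testing[:, columns])
    test_errors[name] = mean_squared_error(y_testing, predictions)
    print(name, 'error:', test_errors[name])

# Select the candidate with the lowest error on the testing dataset.
best = min(test_errors, key=test_errors.get)
print('Selected model:', best)
```

Keeping the candidates in a dictionary means that adding another feature subset only requires one more entry; the fitting and scoring loop stays the same.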
