(sec-lab2)=
# Lab 2: Training vs testing error

**Goals**: 
* Explore the difference between training error and test error.
* Learn how to standardize data.

**Useful commands**:
* sklearn.model_selection.train_test_split
* sklearn.preprocessing.PolynomialFeatures
* sklearn.preprocessing.StandardScaler

## Training vs testing error

Recall that when we train a model, we select its parameters in order to minimize the prediction error on a training set. In order to measure how well the model will do on new data, we can evaluate it on a testing set (a part of the data that the model has not seen during training). 

We are going to work with the *diabetes* dataset (see the <a href="https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset" target="_blank">documentation</a>).

Let us load the dataset and create more features using polynomial transformations: 


In [23]:
from sklearn.datasets import load_diabetes
dataset = load_diabetes()
X = dataset['data']
y = dataset['target'].reshape(-1,1)

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3,include_bias = False)
pdata = poly.fit_transform(X)

Note: NumPy distinguishes between one-dimensional arrays of shape (n,) and two-dimensional column vectors of shape (n,1). When loaded, the variable y has shape (n,). The *StandardScaler* that we will use later expects two-dimensional input, i.e., shape (n,1) for a single column. The *reshape* command converts y to shape (n,1) so that we do not run into problems later.
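
The difference between the two shapes can be seen on a small example. This is a minimal sketch; the array `v` is a toy vector chosen here for illustration and is not part of the lab data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy vector (illustrative only)
v = np.array([1.0, 2.0, 3.0])
print(v.shape)    # (3,)

# reshape(-1, 1) turns it into a single column; -1 lets NumPy infer the length
col = v.reshape(-1, 1)
print(col.shape)  # (3, 1)

# StandardScaler rejects 1-D input but accepts the column version
try:
    StandardScaler().fit(v)
    rejected_1d = False
except ValueError:
    rejected_1d = True
print(rejected_1d)  # True

StandardScaler().fit(col)  # works
```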

Let us display the dimension of the transformed dataset:

In [24]:
print(pdata.shape)

(442, 285)


We can now split the data into a training and a testing set. (Note: the *random_state* is used for initializing the random number generator that decides which samples go to the training and the testing sets. Setting it to a fixed value ensures the code will always produce the same split when run again.)

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(pdata, y, test_size=0.25, random_state=42)

Our data has now been split into training and test sets. For example, the size of the training features is: 

In [26]:
print(X_train.shape)

(331, 285)
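
The effect of *random_state* and *test_size* can be illustrated on a small example. This is a sketch on toy data (the arrays `X_toy` and `y_toy` are made up here); it only shows that repeating the call with the same seed reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples, 1 feature
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.arange(20)

# Same random_state -> identical split
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
same_split = np.array_equal(a_test, b_test)
print(same_split)  # True

# test_size=0.25 puts ceil(0.25 * 20) = 5 samples in the test set
print(a_train.shape, a_test.shape)  # (15, 1) (5, 1)
```

The same rounding explains the sizes above: with 442 samples, the test set receives 111 samples and the training set the remaining 331.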


We are now going to *scale* the data (i.e., subtract the mean and divide by the standard deviation). It is usually a good idea to scale data before fitting a model, especially if some of the variables are on different scales. While we could easily do that by hand, the StandardScaler from Scikit-learn can do it automatically.

In [27]:
from sklearn.preprocessing import StandardScaler

# Fit the scalers on the training data only, then apply them to both sets,
# so that no information from the test set leaks into the preprocessing.
x_scaler = StandardScaler().fit(X_train)
X_train_scaled = x_scaler.transform(X_train)
X_test_scaled = x_scaler.transform(X_test)

y_scaler = StandardScaler().fit(y_train)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
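
For comparison, here is what scaling "by hand" could look like. This is a minimal sketch on synthetic data (the arrays `A_train` and `A_test` are made up for illustration), assuming the common convention of computing the mean and standard deviation on the training set and reusing them for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
A_train = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
A_test = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

# By hand: column means and standard deviations of the training set
mu = A_train.mean(axis=0)
sigma = A_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
train_by_hand = (A_train - mu) / sigma
test_by_hand = (A_test - mu) / sigma

# With StandardScaler fitted on the training set only
toy_scaler = StandardScaler().fit(A_train)
train_match = np.allclose(train_by_hand, toy_scaler.transform(A_train))
test_match = np.allclose(test_by_hand, toy_scaler.transform(A_test))
print(train_match, test_match)  # True True
```

After this transformation, each training column has mean 0 and standard deviation 1; the test columns are only approximately standardized, since they are scaled with the training statistics.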


Let us now compare the training and the test errors made by a linear regression model using (1) only the first 10 variables (the original features), and (2) all 285 variables.

In [35]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train[:, 0:10], y_train)

print(model.score(X_train[:, 0:10], y_train))
print(model.score(X_test[:, 0:10], y_test))

0.5190341891679049
0.4849058889476756


Recall that the *score* method returns the [coefficient of determination ($R^2$)](sec-r-squared) of the model.

Here, the training and testing $R^2$ are similar, at about $0.5$. Thus, the model is doing a reasonable job at modeling the training data and at predicting new values.
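
To make the meaning of the score concrete, $R^2$ can also be computed directly from its definition, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below does this on the 10 original diabetes features and compares the result with the *score* method; the helper `r_squared` and the variable names are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_raw, y_raw = load_diabetes(return_X_y=True)
Xa, Xb, ya, yb = train_test_split(X_raw, y_raw, test_size=0.25, random_state=42)
reg = LinearRegression().fit(Xa, ya)

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_manual = r_squared(yb, reg.predict(Xb))
r2_sklearn = reg.score(Xb, yb)
print(np.isclose(r2_manual, r2_sklearn))  # True
```

Note that $SS_{tot}$ is computed around the mean of the set being scored, so on the test set the baseline is the test mean, not the training mean.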

Let us now try with all variables:

In [36]:
model.fit(X_train, y_train)

print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

0.90983817573718
-55.921870574233004


Notice how the model does a much better job at modeling values on the training set. This is because we are including significantly more variables. However, the $R^2$ is **negative** on the test set, meaning that the test MSE of the model is worse than that of a constant prediction equal to the mean y value of the test set. This strongly suggests that the model is [overfitting](S-overfitting).

```{note}
As we saw before, when training a linear regression model, the training MSE of the model can never exceed the training MSE of a constant model (as long as we are including an intercept in the linear model). This implies $R^2 \geq 0$ on the training set. However, when testing, there is no such guarantee anymore: the model was trained on a different dataset, so its performance could be worse than a constant prediction, resulting in a negative $R^2$.
```
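
The asymmetry between training and testing $R^2$ can be reproduced on synthetic data. This is a sketch under made-up assumptions: the features and targets below are independent random noise, so there is nothing real to learn, and all variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): the target is pure noise, unrelated to the features
rng = np.random.default_rng(0)
F_train, F_test = rng.normal(size=(30, 20)), rng.normal(size=(30, 20))
t_train, t_test = rng.normal(size=30), rng.normal(size=30)

lin = LinearRegression().fit(F_train, t_train)
const = DummyRegressor(strategy="mean").fit(F_train, t_train)  # predicts the training mean

train_r2 = lin.score(F_train, t_train)       # >= 0 because the model has an intercept
test_r2 = lin.score(F_test, t_test)          # no guarantee; typically negative here
const_test_r2 = const.score(F_test, t_test)  # training mean scored on test data: <= 0

print(train_r2, test_r2, const_test_r2)
```

Even the constant model gets a (slightly) negative test $R^2$, because the training mean is generally not the best constant for the test set.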

```{admonition} Exercise
Use a *loop* to fit the linear regression model to the first $k$ variables, for $k=1,\dots, 285$. For each model, save the training and test $R^2$. Finally, use *matplotlib* to make a plot of the training and test $R^2$ as a function of $k$. 

You should observe that the training $R^2$ never decreases as $k$ grows, whereas the testing $R^2$ first increases and then starts decreasing as the model begins to overfit.
```

```{admonition} Exercise
Consider the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing" target="_blank">California Housing dataset</a> available with Scikit-learn. As above, split the samples into a training and testing set. Construct different linear models to predict the price of the houses using the 8 features and measure their performance on your test set.
```

```{admonition} Assignment
---
class: warning
---
Submit your Lab 2 work (including the above two exercises) as Homework 4 on <a href="https://sites.udel.edu/canvas/" target="_blank">Canvas</a>.