# Model Validation

In [None]:
# Import necessary libraries and data
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

advert = pd.read_csv('advertising.csv')

Any predictive model needs to be validated to see how it performs on different sets of data. A model is no good to us if it fits well only on the data we used to create it and not on other data, a problem called **overfitting**.

One common method to validate a model is to split the dataset into a **training dataset** and a **testing dataset**.

### Training and testing data split

Ideally, splitting data should be done right at the onset of the modelling process.

Let's see how we can split our `advert` dataset into training and testing datasets. We will fit a new model using just the training data, then apply the model to the testing data to see whether it can accurately predict outputs for inputs it has not seen before:

In [None]:
# Split the dataset 80:20
training = advert.sample(frac=0.8, random_state=0)
testing = advert.drop(training.index)

print(f'Number of training rows: {len(training)}.')
print(f'Number of testing rows: {len(testing)}.')

We have split the `advert` dataset 80:20, i.e. 80% of the rows (160/200) are in `training` and 20% of the rows (40/200) are in `testing`.
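As a sanity check on this pattern, the sketch below confirms that `sample` followed by `drop` yields two parts that share no rows and together cover the whole dataset. It uses a synthetic 200-row DataFrame as a stand-in for the advertising data; the names `demo`, `demo_train` and `demo_test` and the random values are illustrative assumptions, not part of the lesson's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the advertising data (hypothetical values, 200 rows)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, size=200),
    'Radio': rng.uniform(0, 50, size=200),
})

# Same split pattern as above: sample 80% of rows, drop them to get the rest
demo_train = demo.sample(frac=0.8, random_state=0)
demo_test = demo.drop(demo_train.index)

# The two parts should not share any row labels...
overlap = demo_train.index.intersection(demo_test.index)
# ...and together they should account for every row of the original
covered = len(demo_train) + len(demo_test) == len(demo)

print(f'Overlapping rows: {len(overlap)}')
print(f'All rows covered: {covered}')
```

Note that dropping by `index` only behaves this way when the index labels are unique, which holds for the default index created by `pd.read_csv`.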

> There is no hard rule for the ratio you choose to split your data. A small training set can lead to greater variance in parameter estimates, while a small testing set can lead to greater variance in the measured performance. So you should choose a ratio that keeps both variances low. Note that for large datasets, an 80:20 split will not be much different from a 90:10 or 70:30 split.

Using the combination of predictors we’ve already found to be the most efficient (`TV` + `Radio`), let's create a model using only the training data and test the model performance on the testing data:

In [None]:
# Fit model with TV and Radio as predictors to training data
model5 = smf.ols('Sales ~ TV + Radio', data=training).fit()
print(model5.summary())

Most of the model parameters, such as the intercept, coefficient estimates, and *R<sup>2</sup>*, are very similar to those of Model 3, which used the same predictors on the full dataset.

The lower F-statistic can be attributed to the smaller dataset. For a similar *R<sup>2</sup>*, the F-statistic scales with the ($n-p-1$) term in its formula, and that term drops from 197 to 157 when the model is fitted on 160 rows instead of 200.
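The dependence on sample size can be made explicit. The following is a sketch under the usual definitions, where $SST$ denotes the total sum of squares (a symbol assumed here, not used elsewhere in this step) and $R^2 = 1 - SSD/SST$:

$$
F = \frac{(SST - SSD)/p}{SSD/(n-p-1)} = \frac{R^2}{1-R^2}\cdot\frac{n-p-1}{p}
$$

With $p = 2$, the last factor is $197/2$ for the full 200 rows and $157/2$ for the 160 training rows, so a similar $R^2$ produces a smaller F-statistic on the smaller dataset.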

The model can be written as:

![](https://latex.codecogs.com/gif.latex?%5Ctext%7BSales%7D%20%3D%202.882%20+%200.047*%5Ctext%7BTV%7D%20+%200.181*%5Ctext%7BRadio%7D)

Now let’s predict the sales values for the training dataset:

In [None]:
# predict() with no arguments returns the fitted values for the training data
training['sales_pred'] = model5.predict()
training.head(20)

We’ve created a new column to store the predictions, and displayed the first 20 rows of the training data so you can get a sense of how close our predictions were. 

Let’s calculate the RSE to get a statistical measurement of accuracy. We can get the RSE for both the training set and the testing set to see how well the model fits each of them:

In [None]:
# Store parameter values (looked up by name rather than by position)
alpha = model5.params['Intercept']
beta1 = model5.params['TV']
beta2 = model5.params['Radio']

# Calculate RSE for training set
training['SSD'] = (training['Sales'] - training['sales_pred'])**2
SSD = training['SSD'].sum()
RSE = np.sqrt(SSD / 157)   # n = 160, p = 2
salesmean = np.mean(training['Sales'])
error = RSE / salesmean

print(f'RSE = {RSE}\nMean sale = {salesmean}\nError = {error:.2%}')

In [None]:
# Calculate RSE for testing set
testing['sales_pred'] = alpha + (beta1 * testing['TV']) + (beta2 * testing['Radio'])
testing['SSD'] = (testing['Sales'] - testing['sales_pred'])**2
SSD = testing['SSD'].sum()
RSE = np.sqrt(SSD / 37)   # n = 40, p = 2
salesmean = np.mean(testing['Sales'])
error = RSE / salesmean

print(f'RSE = {RSE}\nMean sale = {salesmean}\nError = {error:.2%}')

A model that is **underfit** will have high training error and high testing error, while a model that is **overfit** will have extremely low training error but high testing error.

The slightly higher RSE for `testing` suggests only a very small, negligible degree of overfitting.
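Because the same RSE calculation is applied to both sets, it can be wrapped in a small helper. The sketch below is one way to do that; the function name `residual_standard_error` and the toy numbers are illustrative assumptions and do not come from the advertising data.

```python
import numpy as np

def residual_standard_error(actual, predicted, p):
    """Return RSE = sqrt(SSD / (n - p - 1)) for a model with p predictors."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = len(actual)
    ssd = np.sum((actual - predicted) ** 2)  # sum of squared differences
    return np.sqrt(ssd / (n - p - 1))

# Toy check with made-up numbers: every prediction is off by exactly 1
actual = [10.0, 12.0, 9.0, 15.0, 11.0]
predicted = [11.0, 11.0, 10.0, 14.0, 12.0]
toy_rse = residual_standard_error(actual, predicted, p=2)
print(toy_rse)
```

In this notebook, a call such as `residual_standard_error(testing['Sales'], model5.predict(testing), p=2)` should give the same testing RSE as the manual calculation, since `predict` on a formula-based model accepts a DataFrame containing the predictor columns.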

### Summary

We created several models in the previous 4 steps. Here’s a summary for your reference:

| Name    | Type                                                 | Definition                     | R<sup>2</sup> / Adj. R<sup>2</sup> | F-statistic | F-statistic (p-value) | RSE (% of mean sales) |
| ------- | ---------------------------------------------------- | ------------------------------ | ----------- | ----------- | --------------------- | -------------- |
| Model 1 | Simple linear regression                             | Sales ~ TV                     | 0.612/0.610 | 312.1       | 1.47e-42              | 3.259 (23.24%) |
| Model 2 | Multiple linear regression (2 predictors)            | Sales ~ TV + Newspaper         | 0.646/0.642 | 179.6       | 3.95e-45              | 3.121 (22.26%) |
| Model 3 | Multiple linear regression (2 predictors)            | Sales ~ TV + Radio             | 0.897/0.896 | 859.6       | 4.83e-98              | 1.681 (11.99%) |
| Model 4 | Multiple linear regression (3 predictors)            | Sales ~ TV + Radio + Newspaper | 0.897/0.896 | 570.3       | 1.58e-96              | 1.686 (12.02%) |
| Model 5 | Multiple linear regression (2 predictors) with split | Sales ~ TV + Radio             | 0.908/0.901 | 722.8       | 6.27e-80              | 1.848 (13.16%) |

In this step, we had to manually split our data into training and testing sets. In our final step for this lesson, we will explore the `scikit-learn` library which has a built-in method to split our data for us!

Return to the notebook directory in Jupyter by pressing `File` > `Open…` in the toolbar at the top, then open the notebook called `2.5 Linear regression with scikit-learn.ipynb`.