## Phase 2.19.2

# Model Validation

## Objectives
- What is validation?
- Why do we validate?
- Linear model performance metrics
- Methods of validation
    - Train-Test Split
    - K-Fold Cross Validation

# Model Validation
When building any model, the hope is that **you can train it on the data you have and then use it to make predictions about new data that comes in.** 

You want your model to generalize well to this incoming data: not so complex that it has learned the details and noise of the training data, but not so simple that it is useless. 

<img src='images/new_overfit_underfit.png'>

Building a generalizable model that is also useful in its predictions is possible because of **model validation**.
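As a rough illustration of the overfit/underfit trade-off, the sketch below fits polynomials of increasing degree to noisy synthetic data and scores each fit on points it did not see. The data, the polynomial degrees, and every variable name here are assumptions chosen for the demo, not part of the lesson's dataset.

```python
import numpy as np

# Hypothetical noisy data: a smooth curve plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)

# Hold out every other point to stand in for "new" data.
x_seen, y_seen = x[::2], y[::2]
x_new, y_new = x[1::2], y[1::2]


def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))


results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit using the seen points only.
    coefs = np.polyfit(x_seen, y_seen, deg=degree)
    results[degree] = {
        'seen_mse': mse(y_seen, np.polyval(coefs, x_seen)),
        'new_mse': mse(y_new, np.polyval(coefs, x_new)),
    }

for degree, scores in results.items():
    print(degree, scores)
```

Error on the seen points can only shrink as the degree grows, while error on the held-out points need not. Comparing the two is the core idea behind validation.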

# Linear Regression Metrics
<img src='images/residuals.png'>

## Mean Absolute Error
- **MAE** is defined as the mean of the absolute values of the residuals.
    - MAE is useful because it speaks in concrete figures about the average error in easily understandable units.
        - "*On average, how wrong is my model on each prediction that it makes?*"
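A small worked example may help here. MAE averages the *absolute* residuals; taking the absolute value after averaging would let positive and negative errors cancel. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 6.0, 6.0, 10.0])

residuals = y_true - y_pred                 # [ 1., -1.,  1., -1.]
mae = np.mean(np.abs(residuals))            # average size of the errors
abs_of_mean = np.abs(np.mean(residuals))    # signs cancel before abs()

print(mae)          # 1.0
print(abs_of_mean)  # 0.0
```

Every prediction is off by one unit, which the MAE of 1.0 reports, while the mean residual of 0.0 hides the errors entirely.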

## Mean Squared Error
- **MSE** is defined as the mean of the squared residuals. 
    - MSE is useful because it penalizes large residuals more than small ones.
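To see how squaring penalizes large residuals, the sketch below compares two made-up sets of residuals that share the same MAE. The helper names `mae` and `mse` are local to this example.

```python
import numpy as np

# Two hypothetical sets of residuals with the same total absolute error.
even_errors = np.array([2.0, 2.0, 2.0, 2.0])
one_big_error = np.array([0.0, 0.0, 0.0, 8.0])


def mae(residuals):
    """Mean of the absolute residuals."""
    return float(np.mean(np.abs(residuals)))


def mse(residuals):
    """Mean of the squared residuals."""
    return float(np.mean(residuals ** 2))


print(mae(even_errors), mae(one_big_error))  # 2.0 2.0
print(mse(even_errors), mse(one_big_error))  # 4.0 16.0
```

MAE treats both cases alike, whereas MSE is four times larger when the error is concentrated in a single prediction.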
    
## Root Mean Squared Error
- **RMSE** is defined as the square root of the MSE.
    - RMSE is useful because it penalizes large errors, but returns the result to the original units of the dependent variable.
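A short sketch of the MSE-to-RMSE relationship, using `sklearn.metrics.mean_squared_error` and made-up values that stand in for miles per gallon:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, in miles per gallon.
mpg_true = np.array([20.0, 25.0, 30.0, 35.0])
mpg_pred = np.array([22.0, 24.0, 27.0, 39.0])

mse_value = mean_squared_error(mpg_true, mpg_pred)  # units: mpg squared
rmse_value = np.sqrt(mse_value)                     # units: mpg

print(mse_value)   # 7.5
print(rmse_value)
```

The residuals are -2, 1, 3 and -4, so the squared errors average to 7.5; the square root brings that figure back to the scale of the target.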

# Train-Test Split

The idea behind a train-test split is to create a "holdout" dataset - one that your model does not see when training - that you can test against after you've trained.

<img src='images/traintestsplit.png'>

***Steps to a logical train-test split***
1. Use `sklearn`'s `train_test_split()` function to return `(X_train, X_test, y_train, y_test)`
    - Include a `random_state` for reproducibility.
    - *For a classification problem, use the `stratify` parameter so both splits keep similar class proportions.*
2. Train your model with `X_train` and `y_train`.
3. Measure your model's performance on the **train-set** with one or more metrics.
4. Measure your model's performance on the **test-set** with the same metrics.
5. Compare the training metrics with the test metrics.
    - If the metrics are similar, ***overfitting*** is unlikely to be a problem.
    - If the metrics are much better for your training set than your test set, that implies that the model has overfit to the training data and was not able to generalize effectively.
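The five steps above can be sketched end to end as follows. This is a minimal outline on synthetic data from `make_regression`, not the mpg walkthrough that follows; the variable names and data settings are assumptions for the demo.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical stand-in for a real dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# 1. Split, fixing `random_state` for reproducibility.
#    (For a classifier, `stratify=y_demo` would also be passed here.)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=51)

# 2. Train on the training split only.
model = LinearRegression().fit(X_tr, y_tr)

# 3-4. Measure the same metric on both splits.
train_mae = mean_absolute_error(y_tr, model.predict(X_tr))
test_mae = mean_absolute_error(y_te, model.predict(X_te))

# 5. Compare.
print(f'train MAE: {train_mae:.2f}, test MAE: {test_mae:.2f}')
```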

### *Brief processing*

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

In [2]:
# Load in data.
df = sns.load_dataset('mpg')
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


In [4]:
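# Remove missing values. (Quick and dirty for demo.)
df.dropna(inplace=True)
# NOTE: `reset_index` returns a new DataFrame. The result is not assigned here,
# so `df` keeps its original index labels (as the output below shows).
df.reset_index(drop=True)
df.info()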

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   model_year    392 non-null    int64  
 7   origin        392 non-null    object 
 8   name          392 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 30.6+ KB


In [5]:
# Remove `name` column.
if 'name' in df.columns:
    df.drop('name', axis=1, inplace=True)
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,18.0,8,307.0,130.0,3504,12.0,70,usa
1,15.0,8,350.0,165.0,3693,11.5,70,usa
2,18.0,8,318.0,150.0,3436,11.0,70,usa
3,16.0,8,304.0,150.0,3433,12.0,70,usa
4,17.0,8,302.0,140.0,3449,10.5,70,usa


In [6]:
# Separate independent variables from dependent variable.
X = df.drop('mpg', axis=1)
y = df['mpg']
X.shape, y.shape

((392, 7), (392,))

## 1. Use `sklearn.model_selection.train_test_split`

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=51)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((294, 7), (98, 7), (294,), (98,))
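The 294/98 split comes from the default `test_size` of 0.25 (a quarter of 392 rows is 98). The sketch below reproduces those counts on a stand-in array and shows how `test_size` changes them; the array `rows` is a hypothetical placeholder for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 392 rows of the cleaned dataset.
rows = np.arange(392)

# Default: 25% of the rows are held out for testing.
train_default, test_default = train_test_split(rows, random_state=51)

# Explicit 80/20 split.
train_80, test_20 = train_test_split(rows, test_size=0.2, random_state=51)

print(len(train_default), len(test_default))  # 294 98
print(len(train_80), len(test_20))
```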

In [8]:
X_train.head()

,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
262,8,305.0,145.0,3425,13.2,78,usa
225,6,250.0,110.0,3520,16.4,77,usa
196,4,98.0,60.0,2164,22.1,76,usa
46,4,140.0,72.0,2408,19.0,71,usa
291,8,267.0,125.0,3605,15.0,79,usa


In [9]:
y_train.head()

262    19.2
225    17.5
196    24.5
46     22.0
291    19.2
Name: mpg, dtype: float64

### 1.a OHE / StandardScaler at THIS step

In [10]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [11]:
# Splitting out numeric / categorical columns.
# (Eventually we will learn to do this type of processing in a Pipeline.)
num_cols = X_train.select_dtypes('number').columns
cat_cols = X_train.select_dtypes('object').columns
num_cols, cat_cols

(Index(['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
        'model_year'],
       dtype='object'),
 Index(['origin'], dtype='object'))

In [12]:
X_train.select_dtypes('number').columns

Index(['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
       'model_year'],
      dtype='object')

#### OHE

In [13]:
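# One-hot-encode `origin`, dropping the first category as the baseline.
# NOTE: in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`.
ohe = OneHotEncoder(drop='first', sparse=False)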

In [14]:
# Fit and transform the train-data.
X_train_cat_processed = ohe.fit_transform(X_train[cat_cols])

In [15]:
# ONLY *transform* the test data.
X_test_cat_processed = ohe.transform(X_test[cat_cols])

In [16]:
X_train_cat_processed[:5]

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [17]:
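# Get names of features.
# NOTE: `get_feature_names` was removed in scikit-learn 1.2;
# on scikit-learn >= 1.0 use `ohe.get_feature_names_out(cat_cols)` instead.
ohe_feature_names = ohe.get_feature_names(cat_cols)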
ohe_feature_names

array(['origin_japan', 'origin_usa'], dtype=object)
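The fit-on-train, transform-on-test pattern can be seen on a tiny example. The sketch below uses made-up `origin` values and leaves the encoder's output sparse (the default), converting with `.toarray()` so it does not depend on the name of the sparse-output argument. The names `train_cats`, `test_cats`, and `encoder` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column for the train and test splits.
train_cats = pd.DataFrame({'origin': ['usa', 'japan', 'europe', 'usa']})
test_cats = pd.DataFrame({'origin': ['japan', 'usa']})

encoder = OneHotEncoder(drop='first')

# The categories are learned from the training data only...
train_encoded = encoder.fit_transform(train_cats).toarray()
# ...and the test data is encoded with those same categories.
test_encoded = encoder.transform(test_cats).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(test_encoded)            # columns: japan, usa ('europe' is dropped)
```

Because categories are sorted, `drop='first'` removes `europe`, which is why only `origin_japan` and `origin_usa` appear as columns; a row of all zeros then means `europe`.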

In [18]:
# Transform to dataframes.
X_train_cat_processed = pd.DataFrame(
    X_train_cat_processed, 
    columns=ohe_feature_names
    )
X_test_cat_processed = pd.DataFrame(
    X_test_cat_processed, 
    columns=ohe_feature_names
    )

X_train_cat_processed.head()

,origin_japan,origin_usa
0,0.0,1.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,0.0,1.0


#### StandardScaler

In [19]:
# Scale numeric features.
scaler = StandardScaler()
X_train_num_processed = scaler.fit_transform(X_train[num_cols])
X_test_num_processed = scaler.transform(X_test[num_cols])

# Transform to dataframes.
X_train_num_processed = pd.DataFrame(X_train_num_processed, columns=num_cols)
X_test_num_processed = pd.DataFrame(X_test_num_processed, columns=num_cols)

# Sanity check.
X_train_num_processed.head()

,cylinders,displacement,horsepower,weight,acceleration,model_year
0,1.471,1.062604,1.045298,0.506526,-0.855822,0.575584
1,0.295799,0.534105,0.126286,0.617699,0.328407,0.303085
2,-0.879402,-0.926475,-1.186588,-0.969156,2.437814,0.030587
3,-0.879402,-0.522894,-0.871499,-0.683615,1.290593,-1.331908
4,1.471,0.697459,0.520148,0.71717,-0.189693,0.848083
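What `StandardScaler` does to the test data can be checked by hand: it subtracts the *training* mean and divides by the *training* standard deviation. A minimal sketch with made-up numbers (the names `train_vals`, `test_vals`, and `demo_scaler` are assumptions for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test data.
train_vals = np.array([[1.0], [2.0], [3.0]])
test_vals = np.array([[4.0]])

# Fit on the training data only.
demo_scaler = StandardScaler().fit(train_vals)
scaled = demo_scaler.transform(test_vals)

# Same calculation by hand, using training statistics
# (NumPy's default population standard deviation, ddof=0).
manual = (test_vals - train_vals.mean(axis=0)) / train_vals.std(axis=0)

print(demo_scaler.mean_, demo_scaler.scale_)
print(scaled, manual)
```

Fitting the scaler on the test set as well would let information about the holdout data leak into preprocessing.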


In [20]:
# Combine processed data.
X_train_processed = pd.concat(
    [X_train_num_processed, X_train_cat_processed], 
    axis=1
    )
X_test_processed = pd.concat(
    [X_test_num_processed, X_test_cat_processed], 
    axis=1
    )

X_train_processed.shape, X_test_processed.shape

((294, 8), (98, 8))
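One detail worth knowing: `pd.concat(..., axis=1)` lines rows up by index label. In the cells above both frames were rebuilt from NumPy arrays, so each has a fresh `0..n-1` index and the rows match. The sketch below, on made-up frames, shows what can happen when the labels differ.

```python
import pandas as pd

# Hypothetical frames: one keeps old index labels, one has a fresh index.
left = pd.DataFrame({'x': [1.0, 2.0]}, index=[10, 20])
right = pd.DataFrame({'y': [3.0, 4.0]})  # index is 0, 1

# Labels do not match, so pandas pads with NaN instead of pairing rows.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first pairs the rows by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)  # (4, 2) (2, 2)
```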

In [21]:
X_train_processed.head()

,cylinders,displacement,horsepower,weight,acceleration,model_year,origin_japan,origin_usa
0,1.471,1.062604,1.045298,0.506526,-0.855822,0.575584,0.0,1.0
1,0.295799,0.534105,0.126286,0.617699,0.328407,0.303085,0.0,1.0
2,-0.879402,-0.926475,-1.186588,-0.969156,2.437814,0.030587,0.0,1.0
3,-0.879402,-0.522894,-0.871499,-0.683615,1.290593,-1.331908,0.0,1.0
4,1.471,0.697459,0.520148,0.71717,-0.189693,0.848083,0.0,1.0
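An earlier code comment mentions that this processing will eventually be done in a Pipeline. For the curious, here is one possible sketch using `ColumnTransformer` and `Pipeline` on a small synthetic frame. The data, column choices, and names (`toy`, `preprocess`, `pipe`) are assumptions for illustration and not the course's eventual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data loosely shaped like the mpg features.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'weight': rng.normal(3000, 500, size=60),
    'horsepower': rng.normal(100, 20, size=60),
    'origin': np.tile(['usa', 'japan', 'europe'], 20),
})
toy_target = 45 - 0.008 * toy['weight'] + rng.normal(0, 1, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_target, random_state=51)

# Scale numeric columns and one-hot-encode the categorical column.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['weight', 'horsepower']),
    ('cat', OneHotEncoder(drop='first'), ['origin']),
])

# `fit` learns the preprocessing from the training data only;
# `predict` and `score` reuse it on whatever data they are given.
pipe = Pipeline([('preprocess', preprocess), ('model', LinearRegression())])
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)
print(test_r2)
```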


## 2. Train the model with the training data.

In [22]:
from sklearn.linear_model import LinearRegression

In [23]:
# Fit model with training data.
linreg = LinearRegression()
linreg.fit(X_train_processed, y_train)

LinearRegression()

## 3-4. Measure the metrics.

In [24]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [25]:
# Single operation example.
mean_absolute_error(y_train, linreg.predict(X_train_processed))

2.471271144554307

In [26]:
# For regressors, `score` returns the R-squared value (here on the train set).
linreg.score(X_train_processed, y_train)

0.8328560918186806

In [27]:
# R-squared on the test set.
linreg.score(X_test_processed, y_test)

0.7923853265201815
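For a regressor, `score` reports $R^2$: one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal hand calculation on made-up values, checked against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2)                 # 0.9625
print(r2_score(y_true, y_pred))
```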

In [28]:
# Get predictions.
y_train_pred = linreg.predict(X_train_processed)
y_test_pred = linreg.predict(X_test_processed)

# Get metrics. Store in a dictionary.
metrics_dct = {}

# Iterate over training and test predictions.
for split, y_pred in [('train', y_train_pred), ('test', y_test_pred)]:
    # Get appropriate y-values to compare to.
    y_true = y_train.copy() if split == 'train' else y_test.copy()
    
    # Get metrics by comparing the true values to the predicted values.
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    split_dct = {
        'mse': mse,
        'mae': mae,
        'rmse': np.sqrt(mse)
    }
    metrics_dct[split] = split_dct
    
# Show results.
metrics_dct

{'train': {'mse': 10.322534571868092,
  'mae': 2.471271144554307,
  'rmse': 3.212870145503564},
 'test': {'mse': 11.881078708555085,
  'mae': 2.67803664857758,
  'rmse': 3.4468940669180834}}

## 5. Compare metrics.

In [29]:
pd.DataFrame(metrics_dct)

,train,test
mse,10.322535,11.881079
mae,2.471271,2.678037
rmse,3.21287,3.446894
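One way to read this table is to express the gap between test and train as a fraction of the train value. The sketch below re-enters the rounded numbers from the table; the `relative_gap` column is a convenience added for this example, and what counts as "too large" a gap remains a judgment call.

```python
import pandas as pd

# Values copied from the comparison table above.
metrics = pd.DataFrame(
    {
        'train': [10.322535, 2.471271, 3.212870],
        'test': [11.881079, 2.678037, 3.446894],
    },
    index=['mse', 'mae', 'rmse'],
)

# How much worse is the test metric, relative to the train metric?
metrics['relative_gap'] = (metrics['test'] - metrics['train']) / metrics['train']

print(metrics.round(3))
```

Here the test MAE and RMSE are roughly 7 to 8 percent higher than their training counterparts, so the model is somewhat worse on unseen data but in the same general range.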


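# K-Fold Cross Validation
<img src='images/kfold.png'>

K-fold cross-validation builds on the same idea as the train-test split, but repeats it $k$ times: the data is divided into $k$ folds, and each fold takes one turn as the validation set while the model is trained on the remaining $k-1$ folds. Averaging the $k$ scores gives a more robust estimate of how well the model generalizes, which makes overfitting easier to detect.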
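To make the fold mechanics concrete, the sketch below uses `sklearn.model_selection.KFold` to print which rows are held out in each round for ten stand-in samples. The array `samples` is a hypothetical placeholder.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten stand-in samples, identified by position.
samples = np.arange(10)

# Five folds, no shuffling: each fold is a contiguous block of two rows.
kf = KFold(n_splits=5)
folds = list(kf.split(samples))

for i, (train_idx, val_idx) in enumerate(folds):
    print(f'fold {i}: train={train_idx}, validate={val_idx}')
```

Every sample lands in the validation set exactly once and in the training set the other four times.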

## Practical Use

In [30]:
from sklearn.model_selection import cross_validate

In [31]:
# Create model.
linreg = LinearRegression()

# Get CV scores.
scores = cross_validate(
    linreg, 
    X_train_processed, 
    y_train, 
    cv=5,
    scoring=('r2', 'neg_mean_squared_error')
    )
scores

{'fit_time': array([0.00318694, 0.00286889, 0.00279188, 0.00276303, 0.00346184]),
 'score_time': array([0.00268412, 0.00238013, 0.00229692, 0.00224614, 0.00349903]),
 'test_r2': array([0.80268185, 0.87307597, 0.80070871, 0.7810437 , 0.7938275 ]),
 'test_neg_mean_squared_error': array([-11.69724911,  -7.80836967, -14.34231814, -13.60523721,
        -10.30720635])}

In [32]:
test_r2_score = scores['test_r2']
test_r2_score

array([0.80268185, 0.87307597, 0.80070871, 0.7810437 , 0.7938275 ])

In [33]:
# Isolate scores. (scikit-learn negates MSE so that higher is always better.)
test_scores = scores['test_neg_mean_squared_error']
test_scores

array([-11.69724911,  -7.80836967, -14.34231814, -13.60523721,
       -10.30720635])
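The per-fold scores are usually summarized before being reported. A minimal sketch, re-entering the arrays printed above: flip the sign of the negated MSE, take the square root for RMSE, and look at the mean and spread across folds. The variable names are assumptions for this example.

```python
import numpy as np

# Per-fold scores copied from the `cross_validate` output above.
neg_mse = np.array([-11.69724911, -7.80836967, -14.34231814,
                    -13.60523721, -10.30720635])
r2 = np.array([0.80268185, 0.87307597, 0.80070871, 0.7810437, 0.7938275])

fold_mse = -neg_mse           # undo scikit-learn's sign flip
fold_rmse = np.sqrt(fold_mse)  # back to the units of the target

print(f'R^2:  mean={r2.mean():.3f}, std={r2.std():.3f}')
print(f'MSE:  mean={fold_mse.mean():.3f}, std={fold_mse.std():.3f}')
print(f'RMSE: mean={fold_rmse.mean():.3f}, std={fold_rmse.std():.3f}')
```

The mean gives a single headline figure, and the standard deviation shows how much that figure depends on which rows happened to be held out.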