# CPSC 4970 AI + ML: Module 3 Cross validation demos

The Module 3 notebook became cluttered as the hyperparameter selection concepts
were developed incrementally.  In this notebook I show

- pipelines
- cross-validation and hyperparameter optimization, one parameter at a time
- cross-validation and hyperparameter optimization through grid search

In [1]:
import sklearn.datasets
from IPython.display import display  # IPython.core.display.display is deprecated
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import validation_curve
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

from math import sqrt

plt.style.use('dark_background')

## Load the data

In [2]:
db = sklearn.datasets.load_diabetes(as_frame=True)['frame']
train, test = train_test_split(db, test_size=0.33, random_state=0)
X_train = train.iloc[:, :-1]
X_test = test.iloc[:, :-1]
y_train = train['target']
y_test = test['target']

## Create the model development [pipeline](https://scikit-learn.org/stable/modules/compose.html):
1. Create polynomial features
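2. Standardize the features (zero mean, unit variance)
3. Fit a linear regressor on standardized targets; the grid search below chooses among Lasso, Ridge, and ordinary least squares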
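
The three steps above map onto named pipeline stages, and each stage's settings can be reached with the `<step>__<parameter>` naming convention. The following is a minimal sketch of that convention, not the notebook's own cell; the name `demo_pl` is hypothetical, and a `Lasso` regressor is plugged in only for illustration.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical demo pipeline with the same three stages as the list above
demo_pl = Pipeline([
    ('poly', PolynomialFeatures()),   # 1. polynomial features
    ('norm', StandardScaler()),       # 2. standardized features
    ('regr', TransformedTargetRegressor(regressor=Lasso(),
                                        transformer=StandardScaler())),  # 3. regression on standardized targets
])

# Nested parameters are addressed by joining step and parameter names with '__'
tunable = sorted(name for name in demo_pl.get_params()
                 if name.startswith(('poly__', 'regr__regressor__')))
print(tunable)

# The same names work for setting values, which is what a parameter grid relies on
demo_pl.set_params(poly__degree=2, regr__regressor__alpha=0.1)
print(demo_pl.named_steps['poly'].degree)           # 2
print(demo_pl.named_steps['regr'].regressor.alpha)  # 0.1
```

Because the regressor sits inside `TransformedTargetRegressor`, its `alpha` needs two levels of nesting: `regr__regressor__alpha`.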

As mentioned in the video, see the scikit-learn guide on [validation curves with error bars](https://scikit-learn.org/stable/modules/learning_curve.html).
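
A validation curve tunes one hyperparameter at a time: it cross-validates the pipeline at each candidate value and reports the per-fold scores, whose spread gives the error bars. Below is a rough sketch for the Lasso `alpha`, assuming a fixed polynomial degree of 2 and 5 folds. Those choices, and the names `lasso_pl`, `X_tr`, and `y_tr`, are illustrative and not taken from the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# Assumed setup: degree fixed at 2 so that only alpha varies
lasso_pl = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('norm', StandardScaler()),
    ('regr', TransformedTargetRegressor(regressor=Lasso(max_iter=100000),
                                        transformer=StandardScaler())),
])

alphas = np.array([0.001, 0.01, 0.1, 1.0])
train_scores, val_scores = validation_curve(
    lasso_pl, X_tr, y_tr,
    param_name='regr__regressor__alpha', param_range=alphas,
    cv=5, scoring='neg_mean_squared_error')

# Scores are negated MSE with shape (n_alphas, n_folds); convert to RMSE per fold
train_rmse = np.sqrt(-train_scores)
val_rmse = np.sqrt(-val_scores)

# Mean across folds with one standard deviation as the error bar
plt.errorbar(alphas, train_rmse.mean(axis=1), yerr=train_rmse.std(axis=1),
             capsize=3, label='train')
plt.errorbar(alphas, val_rmse.mean(axis=1), yerr=val_rmse.std(axis=1),
             capsize=3, label='validation')
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('RMSE')
plt.legend()
```

In a notebook the figure renders at the end of the cell; in a script, a call to `plt.show()` would be needed.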

In [3]:
pl = Pipeline([('poly', PolynomialFeatures()),
               ('norm', StandardScaler()),
               ('regr', TransformedTargetRegressor(transformer=StandardScaler()))])
degrees = [1, 2, 3, 4, 5, 6]
# One sub-grid per regressor, because LinearRegression has no alpha to tune
param_grid = [
    {'poly__degree': degrees,
     'regr__regressor': [Lasso(max_iter=100000)],
     'regr__regressor__alpha': [0.001, 0.01, 0.1, 1]},
    {'poly__degree': degrees,
     'regr__regressor': [Ridge()],
     'regr__regressor__alpha': [0.0001, 0.001, 0.01, 0.1, 1]},
    {'poly__degree': degrees,
     'regr__regressor': [LinearRegression()]},
]
model = GridSearchCV(pl, param_grid=param_grid, scoring='neg_mean_squared_error', n_jobs=-1)
model.fit(X_train, y_train)
print(model.best_params_)
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

{'poly__degree': 1, 'regr__regressor': Lasso(alpha=0.01, max_iter=100000), 'regr__regressor__alpha': 0.01}
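
`best_params_` reports only the winning combination. To see how the other candidates compared, the fitted search also exposes `cv_results_`. The sketch below shows one way to tabulate it, using a deliberately small Ridge-only grid so it runs quickly. The names `small_pl`, `small_grid`, and `search` are hypothetical, and the grid is not the one used in this notebook.

```python
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

small_pl = Pipeline([
    ('poly', PolynomialFeatures()),
    ('norm', StandardScaler()),
    ('regr', TransformedTargetRegressor(regressor=Ridge(),
                                        transformer=StandardScaler())),
])

# Assumed reduced grid: 2 degrees x 3 alphas = 6 candidates
small_grid = {'poly__degree': [1, 2],
              'regr__regressor__alpha': [0.01, 0.1, 1]}

search = GridSearchCV(small_pl, param_grid=small_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(X_tr, y_tr)

# cv_results_ is a dict of arrays with one entry per candidate
results = pd.DataFrame(search.cv_results_)
cols = ['param_poly__degree', 'param_regr__regressor__alpha',
        'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[cols].sort_values('rank_test_score')
print(summary.to_string(index=False))
```

Here `mean_test_score` is the negated MSE averaged over the folds, so values closer to zero are better, and `std_test_score` indicates how much a candidate's score varies from fold to fold.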


In [4]:
print("Training RMSE: ", sqrt(mean_squared_error(y_train, pred_train)))
print("Training R2: ", r2_score(y_train, pred_train))
print("Test RMSE: ", sqrt(mean_squared_error(y_test, pred_test)))
print("Test R2: ", r2_score(y_test, pred_test))

Training RMSE:  52.65967800679518
Training R2:  0.5565921845635585
Test RMSE:  56.21374121526503
Test R2:  0.4001855368866206
