# Model Selection

##### *In which we choose the best model to predict the age of a crab.*

###### [GitHub Repository](https://github.com/ahester57/ai_workshop/tree/master/notebooks/time_for_crab/1-models)

###### [Notebook Viewer](https://nbviewer.jupyter.org/github/ahester57/ai_workshop/blob/master/notebooks/time_for_crab/1-models/models.ipynb)

###### [Kaggle Dataset](https://www.kaggle.com/sidhus/crab-age-prediction)


### Define Constants


In [1]:
%%time
CACHE_FILE = '../cache/crabs.feather'
NEXT_CACHE_FILE = '../cache/splitcrabs.feather'
NEXT_NOTEBOOK = '../2-features/features.ipynb'
MODEL_CHECKPOINT_FILE = '../cache/best_model.weights.h5'

PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex_F','Sex_M','Sex_I','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]

NUM_EPOCHS = 100
VALIDATION_SPLIT = 0.2


CPU times: total: 0 ns
Wall time: 0 ns


### Import Libraries


In [2]:
%%time
from notebooks.time_for_crab.mlutils import display_df, plot_training_loss, score_combine, score_comparator, score_model

import keras

keras_backend = keras.backend.backend()
print(f'Keras version: {keras.__version__}')
print(f'Keras backend: {keras_backend}')
if keras_backend == 'tensorflow':
    import tensorflow as tf
    print(f'TensorFlow version: {tf.__version__}')
    print(f'TensorFlow devices: {tf.config.list_physical_devices()}')
elif keras_backend == 'torch':
    import torch
    print(f'Torch version: {torch.__version__}')
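    # query the CUDA device name only when CUDA is actually available
    torch_device = torch.cuda.get_device_name(torch.cuda.current_device()) if torch.cuda.is_available() else 'cpu'
    print(f'Torch devices: {torch_device}')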
    # torch supports windows-native cuda, but CPU was faster for this task

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

pd.set_option('mode.copy_on_write', True)


Keras version: 3.3.3
Keras backend: tensorflow
TensorFlow version: 2.16.1
TensorFlow devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
CPU times: total: 1.41 s
Wall time: 2.2 s


### Load Data from Cache

In the [previous section](../0-eda/overfit.ipynb), we saved the cleaned data to a cache file. Let's load it back.


In [3]:
%%time
crabs = pd.read_feather(CACHE_FILE)
display_df(crabs, show_distinct=True)


DataFrame shape: (3790, 11)
First 5 rows:
     Length  Diameter    Height     Weight  Shucked Weight  Viscera Weight  \
0  1.437500  1.174805  0.412598  24.640625       12.335938        5.585938   
1  0.887695  0.649902  0.212524   5.402344        2.296875        1.375000   
2  1.037109  0.774902  0.250000   7.953125        3.232422        1.601562   
3  1.174805  0.887695  0.250000  13.476562        4.750000        2.281250   
4  0.887695  0.662598  0.212524   6.902344        3.458984        1.488281   

   Shell Weight  Age  Sex_F  Sex_I  Sex_M  
0      6.746094    9   True  False  False  
1      1.559570    6  False  False   True  
2      2.763672    6  False   True  False  
3      5.246094   10   True  False  False  
4      1.701172    6  False   True  False  
<class 'pandas.core.frame.DataFrame'>
Index: 3790 entries, 0 to 3892
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Length          3790 non-nul

### Split the Data

Let's split the data into training and testing sets.

It is important to split the data before any data augmentation or normalization to avoid data leakage.  
Data leakage lets information from the testing data reach the model during training, which can hide overfitting and make the test scores look better than they really are.

In more general terms, *data leakage* is the phenomenon where some form of the label "leaks" into the training feature set.
An example of this occurred in 2021 in models for diagnosing Covid patients: patients lying down on a bed were more likely to be "diagnosed" with Covid.
This is because patients confirmed to have Covid were more inclined to bed rest (Huyen, 2022).
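
A minimal sketch of what avoiding leakage looks like for normalization, in plain Python with made-up `Length` values. The helper names `fit_scaler` and `transform` are hypothetical and not part of this notebook. The scaling statistics are computed from the training rows only and then reused for the test rows, which is the same idea as adapting a normalizer to `X_train` alone.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Compute the mean and population standard deviation of `values`."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize `values` with statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]

# made-up feature values, for illustration only
train_lengths = [1.0, 1.2, 0.9, 1.1]
test_lengths = [1.6, 0.4]

# leak-free: the statistics come from the training rows only
mu, sigma = fit_scaler(train_lengths)
train_scaled = transform(train_lengths, mu, sigma)
test_scaled = transform(test_lengths, mu, sigma)

# leaky: the test rows influence the statistics used during training
leaky_mu, leaky_sigma = fit_scaler(train_lengths + test_lengths)

print(f'train-only mean: {mu:.4f}, leaky mean: {leaky_mu:.4f}')
```

The two means differ, so a model trained on the leaky version would have already "seen" something about the test rows.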

#### Importance of Data Shuffling

Shuffling guards against ordering bias.
If the rows happen to be sorted or grouped in some way, an unshuffled split can leave the training and testing sets with different distributions.

Shuffling should occur before the train-test split to be most effective.

We don't have to worry about time-series data right now
(although we could reverse order by 'Age' and call it time-series by new feature 'Crab Birthdate'),
but shuffling can have a big impact on the model's performance.
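
To illustrate why the order matters, here is a small sketch in plain Python with a made-up, sorted list of ages. The function `shuffled_split` is a hypothetical helper, not the notebook's own split code. Without shuffling, the test set contains only the oldest crabs; with a seeded shuffle, the test rows are drawn from the whole range.

```python
import random

def shuffled_split(rows, test_fraction, seed):
    """Shuffle row positions with a fixed seed, then split into train/test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    train_size = int((1.0 - test_fraction) * len(rows))
    train = [rows[i] for i in indices[:train_size]]
    test = [rows[i] for i in indices[train_size:]]
    return train, test

# made-up ages, sorted on purpose to mimic an ordered dataset
ages = list(range(1, 21))

# no shuffling: the last 20% of a sorted list is only the oldest crabs
unshuffled_test = ages[16:]
print(f'unshuffled test ages: {unshuffled_test}')

# seeded shuffle: reproducible, and the test set is a random sample
train, test = shuffled_split(ages, test_fraction=0.2, seed=42)
print(f'shuffled test ages:   {sorted(test)}')
```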


In [4]:
%%time
# split features from target
X = crabs.drop([PREDICTION_TARGET], axis=1)
y = crabs[PREDICTION_TARGET]

# 80% training, 20% testing
train_size = int((1. - VALIDATION_SPLIT) * len(X))

# shuffle the data
random_indices = np.random.default_rng(42).permutation(np.arange(len(X)))

# split into train/test sets
# note: the slice starts at 1, so the first shuffled index falls into the test set
X_train = X.iloc[random_indices[1:train_size]]
X_test = X.drop(X_train.index)
# split the prediction target the same way
y_train = y.iloc[random_indices[1:train_size]]
y_test = y.drop(y_train.index)

assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]

print(f'X_train: {X_train.shape}')
print(f'X_test: {X_test.shape}')


X_train: (3031, 10)
X_test: (759, 10)
CPU times: total: 0 ns
Wall time: 2 ms


## Metrics Used

Throughout this notebook, we will use the following metrics to evaluate the regression model:

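#### Mean Squared Error
 
- The average of the squared errors.
- The best score is 0.0.
- Lower is better.

#### Mean Absolute Error

- The average of the absolute errors.
- The best score is 0.0.
- Lower is better.
- Less sensitive to outliers than Mean Squared Error.

#### Explained Variance Score

- The best score is 1.0.
- Higher is better.

#### R2 Score

- The best score is 1.0.
- Higher is better; a negative score means the model does worse than always predicting the mean.

#### Max Error

- The largest absolute error over all predictions, i.e. the single worst prediction.
- The best score is 0.0; lower is better.
- How much is too much is domain-specific: 10 years is a lot for a crab.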
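
As a rough reference, the five metrics can be written out in a few lines of plain Python. This is a sketch using the common textbook definitions, with made-up ages. The function `regression_scores` is hypothetical, and how `score_model` from `mlutils` computes its values is not shown in this notebook.

```python
from statistics import mean, pvariance

def regression_scores(y_true, y_pred):
    """Compute the five regression metrics from true and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = mean([e * e for e in errors])
    var_true = pvariance(y_true)
    return {
        'mean_squared_error': mse,
        'mean_absolute_error': mean([abs(e) for e in errors]),
        # 1 - Var(errors) / Var(y_true): ignores a constant offset in the errors
        'explained_variance_score': 1.0 - pvariance(errors) / var_true,
        # 1 - MSE / Var(y_true): 0.0 means "no better than predicting the mean"
        'r2_score': 1.0 - mse / var_true,
        'max_error': max(abs(e) for e in errors),
    }

# made-up ages, for illustration only
y_true = [6.0, 9.0, 10.0, 6.0]
y_pred = [7.0, 9.0, 8.0, 6.0]

scores = regression_scores(y_true, y_pred)
swapped = regression_scores(y_pred, y_true)
print(scores)
print(swapped)
```

Mean squared error, mean absolute error, and max error are the same whichever argument comes first. Explained variance and R2 are not, because they divide by the variance of whatever is passed as the true values, so the argument order matters for those two.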


## Model Selection

So far, we have not done any feature engineering, which can often be the most important part of the process.
Some new features could be constructed from our dataset which would call for a different model.
Nonetheless, we can start by using all features to set a baseline.
 
We will start with a few simple models to get a baseline score.

We will use the following models:
- Naive Random Baseline
- Linear Regression
- Neural Networks
    - (64-32-16-8-1)
    - (32-16-8-1)
    - (16-8-1)
    - (8-1)
    - (4-1)
    - (2-1) 

### Naive Linear Regression

The simplest model is a naive linear regression model. Before training, its weights are randomly initialized, so its predictions amount to random guesses.
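
To see why an untrained linear layer guesses randomly, here is a sketch in plain Python. It assumes the Keras defaults for a `Dense` layer (Glorot uniform kernel, zero bias); the helper names are hypothetical. With normalized inputs centered on zero, the untrained output also sits near zero, regardless of the crab's true age.

```python
import math
import random

def glorot_uniform_vector(fan_in, fan_out, rng):
    """Draw `fan_in` weights uniformly from [-limit, limit] (Glorot uniform)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [rng.uniform(-limit, limit) for _ in range(fan_in)], limit

def linear_predict(features, weights, bias=0.0):
    """A single linear unit: the weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

rng = random.Random(0)
num_features = 10  # the feature count of X_train
weights, limit = glorot_uniform_vector(num_features, 1, rng)

# after normalization, an exactly average crab has all-zero features
average_crab = [0.0] * num_features
print(linear_predict(average_crab, weights))  # prints 0.0
```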


In [5]:
%%time
# layer: input
layer_feature_input = keras.layers.Input(shape=(len(X_train.columns),))

# layer: normalizer
layer_feature_normalizer = keras.layers.Normalization(axis=-1)
layer_feature_normalizer.adapt(np.array(X_train))

# layer: output (linear regression)
layer_feature_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> linear
linear_model = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    layer_feature_output
])

linear_model.summary()


CPU times: total: 15.6 ms
Wall time: 50.1 ms


#### Configure the Linear Model

- **Optimizer**
    - Adam: Adaptive Moment Estimation [(Kingma & Ba, 2014)](https://arxiv.org/abs/1412.6980)
- **Loss Function**
    - Mean Squared Error (MSE)
        - This penalizes larger errors more than smaller errors.
        - We took out outliers in the data cleaning step, so MSE should not be dominated by a few extreme values.
- **Callbacks**
    - Model Checkpoint
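
A small worked example of how MSE penalizes larger errors more than smaller ones, using made-up error values. Both error lists have the same mean absolute error, but the one with a single large miss has four times the mean squared error.

```python
def mean_squared_error(errors):
    """Average of the squared errors."""
    return sum(e * e for e in errors) / len(errors)

def mean_absolute_error(errors):
    """Average of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

# made-up prediction errors, for illustration only
spread_out = [1.0, 1.0, 1.0, 1.0]    # four small misses
one_big_miss = [0.0, 0.0, 0.0, 4.0]  # three perfect guesses, one large miss

print(mean_absolute_error(spread_out), mean_absolute_error(one_big_miss))  # 1.0 1.0
print(mean_squared_error(spread_out), mean_squared_error(one_big_miss))    # 1.0 4.0
```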


In [6]:
%%time
linear_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='mean_squared_error'
)

linear_checkpoint = keras.callbacks.ModelCheckpoint(
    # the path ends in '.weights.h5', so replacing '.keras' would leave it unchanged
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear.weights.h5'),
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=True,
    mode='min'
)


CPU times: total: 0 ns
Wall time: 3 ms
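
The checkpoint settings above (`monitor='val_loss'`, `save_best_only=True`, `mode='min'`) can be pictured with a short plain-Python sketch. The function `best_epochs` and the loss values are made up for illustration; this is not how Keras implements the callback, only the bookkeeping idea behind it.

```python
def best_epochs(val_losses):
    """Return the (epoch, loss) pairs at which a save-best-only checkpoint would write."""
    best = float('inf')
    saves = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # mode='min': only a strictly lower loss counts as better
            best = loss
            saves.append((epoch, loss))
    return saves

# made-up validation losses for five epochs
val_losses = [9.0, 5.0, 5.5, 4.2, 4.3]
print(best_epochs(val_losses))  # [(1, 9.0), (2, 5.0), (4, 4.2)]
```

Note that the scoring cells in this notebook call `predict` on the in-memory model, which holds the weights from the last epoch. Using the best checkpoint instead would require loading the saved weights first.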


#### Score the Linear Model (Before Training)


In [7]:
%%time
untrained_linear_preds = linear_model.predict(X_test).flatten()
# Utility functions imported from mlutils.py
untrained_linear_scores_df = score_model(untrained_linear_preds, np.array(y_test), index='untrained_linear')
# Add it to the leaderboard
leaderboard_df = score_combine(pd.DataFrame(), untrained_linear_scores_df)
leaderboard_df.head()


24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
CPU times: total: 15.6 ms
Wall time: 98.8 ms


| Model | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | max_error |
| --- | --- | --- | --- | --- | --- |
| untrained_linear | 102.009781 | 9.765734 | -2.710177 | -55.997111 | 20.289024 |


#### Train the Linear Model


In [8]:
%%time
feature_rich_history = linear_model.fit(
    x=X_train,
    y=y_train,
    epochs=NUM_EPOCHS,
    verbose=0,
    validation_split=VALIDATION_SPLIT,
    callbacks=[linear_checkpoint]
)


CPU times: total: 1.38 s
Wall time: 7.56 s
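
For a sense of how many rows each training run actually sees, here is a back-of-the-envelope sketch. It assumes that Keras holds out the last `validation_split` fraction of the rows passed to `fit` (before any shuffling), that the split point is rounded down, and that the default batch size of 32 is used. The helper names are hypothetical.

```python
import math

def validation_split_sizes(num_rows, validation_split):
    """Split `num_rows` into (rows used for fitting, rows held out for validation)."""
    split_at = int(math.floor(num_rows * (1.0 - validation_split)))
    return split_at, num_rows - split_at

def num_batches(num_rows, batch_size=32):
    """Number of batches needed to cover `num_rows` once."""
    return math.ceil(num_rows / batch_size)

# 3031 training rows, as printed for X_train
fit_rows, val_rows = validation_split_sizes(3031, 0.2)
print(fit_rows, val_rows)     # 2424 607
print(num_batches(fit_rows))  # 76 batches per epoch
print(num_batches(759))       # 24 batches to predict on the 759 test rows
```

The last number matches the `24/24` shown in the prediction progress bars.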


#### Score the Linear Model


In [9]:
%%time
linear_preds = linear_model.predict(X_test).flatten()
linear_scores_df = score_model(linear_preds, np.array(y_test), index='linear')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, linear_scores_df)
leaderboard_df.head()


24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 457us/step
CPU times: total: 15.6 ms
Wall time: 50.6 ms


| Model | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | max_error |
| --- | --- | --- | --- | --- | --- |
| untrained_linear | 102.009781 | 9.765734 | -2.710177 | -55.997111 | 20.289024 |
| linear | 13.761063 | 3.103028 | -0.216341 | -2.931955 | 13.052973 |


### Neural Network Model

#### Neural Network Architecture

We will start with a deep (64-32-16-8-1) neural network, then gradually reduce the complexity from there, working down from our overfit model.

- **Input Layer**
    - All of the features, please.
- **Normalizer Layer**
    - Adapted to all features in the training data. 
- **Hidden Layers**
    - Four dense layers with ReLU activation; layer `i` has `64 >> i` units, giving 64, 32, 16, and 8.
- **Output Layer**
    - Layer with one output.
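
The right-shift trick and the resulting model size can be checked with a few lines of plain Python. The helpers `hidden_units` and `dense_param_count` are hypothetical; the count covers only the weights and biases of the dense layers, assuming the 10 input features of `X_train`.

```python
def hidden_units(first_units, num_hidden_layers):
    """Halve the unit count at each layer by shifting right one bit per layer."""
    return [first_units >> i for i in range(num_hidden_layers)]

def dense_param_count(num_inputs, layer_units):
    """Total weights and biases for a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += num_inputs * units + units  # weight matrix plus bias vector
        num_inputs = units
    return total

units = hidden_units(64, 4)
print(units)                                # [64, 32, 16, 8]
print(dense_param_count(10, units + [1]))   # 3457 for (64-32-16-8-1)
print(dense_param_count(10, [1]))           # 11 for the linear model
```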


In [10]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 64, 32, 16, 8
num_hidden_layers = 4
layer_deepest_hidden_relu_list = \
    [keras.layers.Dense(units=64 >> i, activation='relu') for i in range(num_hidden_layers)]

# layer: output (linear regression)
layer_deepest_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
deepest_model = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_deepest_hidden_relu_list,
    layer_deepest_output
])

deepest_model.summary()


CPU times: total: 0 ns
Wall time: 25 ms


#### Configure the Neural Network Model

- **Optimizer**
    - Adam: Adaptive Moment Estimation [(Kingma & Ba, 2014)](https://arxiv.org/abs/1412.6980)
- **Loss Function**
    - Mean Squared Error (MSE)
        - This penalizes larger errors more than smaller errors.
        - We took out outliers in the data cleaning step, so MSE should not be dominated by a few extreme values.
- **Callbacks**
    - Model Checkpoint
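
For readers curious about what the Adam optimizer does on each step, here is a single-weight sketch in plain Python following the update rule from Kingma & Ba. The hyperparameter defaults are taken from the paper and may differ slightly from a given Keras version; `adam_step` and the toy objective are made up for illustration.

```python
import math

def adam_step(weight, grad, state, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """Apply one Adam update to a single scalar weight."""
    state['t'] += 1
    # exponential moving averages of the gradient and the squared gradient
    state['m'] = beta_1 * state['m'] + (1.0 - beta_1) * grad
    state['v'] = beta_2 * state['v'] + (1.0 - beta_2) * grad * grad
    # bias correction for the zero-initialized averages
    m_hat = state['m'] / (1.0 - beta_1 ** state['t'])
    v_hat = state['v'] / (1.0 - beta_2 ** state['t'])
    return weight - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)

# toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3)
state = {'t': 0, 'm': 0.0, 'v': 0.0}
w = 0.0
first_w = adam_step(w, 2.0 * (w - 3.0), state)
print(first_w)  # the first step moves by roughly the learning rate

w = first_w
for _ in range(99):
    w = adam_step(w, 2.0 * (w - 3.0), state)
print(w)  # still moving toward 3.0
```

The first step has a size close to the learning rate whatever the gradient's magnitude, which is one reason Adam is fairly forgiving about feature and target scales.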


In [11]:
%%time
deepest_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='mean_squared_error'
)

deepest_checkpoint = keras.callbacks.ModelCheckpoint(
    # the path ends in '.weights.h5', so replacing '.keras' would leave it unchanged
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_deep.weights.h5'),
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=True,
    mode='min'
)


CPU times: total: 0 ns
Wall time: 2 ms


#### Train the Neural Network Model

*We're not going to predict with the untrained model, as we already have a random baseline on the leaderboard.*


In [12]:
%%time
deepest_history = deepest_model.fit(
    x=X_train,
    y=y_train,
    epochs=250,  # hard-coded: longer than NUM_EPOCHS
    verbose=0,
    validation_split=VALIDATION_SPLIT,
    callbacks=[deepest_checkpoint]
)


CPU times: total: 5.33 s
Wall time: 21.2 s


#### Score the Neural Network Model


In [13]:
%%time
deepest_preds = deepest_model.predict(X_test).flatten()
deepest_scores_df = score_model(deepest_preds, np.array(y_test), index='64_32_16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deepest_scores_df)
leaderboard_df.head()


24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
CPU times: total: 46.9 ms
Wall time: 113 ms


| Model | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | max_error |
| --- | --- | --- | --- | --- | --- |
| untrained_linear | 102.009781 | 9.765734 | -2.710177 | -55.997111 | 20.289024 |
| linear | 13.761063 | 3.103028 | -0.216341 | -2.931955 | 13.052973 |
| 64_32_16_8_1 | 3.972226 | 1.469996 | 0.289869 | 0.281731 | 8.788752 |


### Neural Network Model (32-16-8-1)

Let's cut the first layer out and see if it still has what it takes.


In [17]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 32, 16, 8
num_hidden_layers = 3
layer_32_16_8_hidden_relu_list = \
    [keras.layers.Dense(units=32 >> i, activation='relu') for i in range(num_hidden_layers)]

# layer: output (linear regression)
layer_32_16_8_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
deep_32_16_8_model = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_32_16_8_hidden_relu_list,
    layer_32_16_8_output
])

deep_32_16_8_model.summary()


CPU times: total: 0 ns
Wall time: 19.5 ms


#### Configure the (32-16-8-1) Neural Network Model


In [18]:
%%time
deep_32_16_8_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='mean_squared_error'
)

deep_32_16_8_checkpoint = keras.callbacks.ModelCheckpoint(
    # the path ends in '.weights.h5', so replacing '.keras' would leave it unchanged
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_32_16_8.weights.h5'),
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=True,
    mode='min'
)


CPU times: total: 0 ns
Wall time: 2.01 ms


#### Train the (32-16-8-1) Neural Network Model

In [19]:
%%time
deep_32_16_8_history = deep_32_16_8_model.fit(
    x=X_train,
    y=y_train,
    epochs=NUM_EPOCHS,
    verbose=0,
    validation_split=VALIDATION_SPLIT,
    callbacks=[deep_32_16_8_checkpoint]
)


CPU times: total: 2.45 s
Wall time: 8.86 s


#### Score the (32-16-8-1) Neural Network Model


In [20]:
%%time
deep_32_16_8_preds = deep_32_16_8_model.predict(X_test).flatten()
deep_32_16_8_scores_df = score_model(deep_32_16_8_preds, np.array(y_test), index='32_16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_32_16_8_scores_df)
leaderboard_df.head()


24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
CPU times: total: 46.9 ms
Wall time: 101 ms


| Model | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | max_error |
| --- | --- | --- | --- | --- | --- |
| untrained_linear | 102.009781 | 9.765734 | -2.710177 | -55.997111 | 20.289024 |
| linear | 13.761063 | 3.103028 | -0.216341 | -2.931955 | 13.052973 |
| 64_32_16_8_1 | 3.972226 | 1.469996 | 0.289869 | 0.281731 | 8.788752 |
| 32_16_8_1 | 3.787848 | 1.434941 | 0.108063 | 0.105999 | 9.024017 |


### Neural Network Model (16-8-1)

The last one held up, so let's reduce it even more.


In [21]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 16, 8
num_hidden_layers = 2
layer_16_8_hidden_relu_list = \
    [keras.layers.Dense(units=16 >> i, activation='relu') for i in range(num_hidden_layers)]

# layer: output (linear regression)
layer_16_8_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
deep_16_8_model = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_16_8_hidden_relu_list,
    layer_16_8_output
])


CPU times: total: 0 ns
Wall time: 11.5 ms


#### Configure the (16-8-1) Neural Network Model


In [22]:
%%time
deep_16_8_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='mean_squared_error'
)

deep_16_8_checkpoint = keras.callbacks.ModelCheckpoint(
    # the path ends in '.weights.h5', so replacing '.keras' would leave it unchanged
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_16_8.weights.h5'),
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=True,
    mode='min'
)


CPU times: total: 0 ns
Wall time: 2 ms


#### Train the (16-8-1) Neural Network Model


In [23]:
%%time
deep_16_8_history = deep_16_8_model.fit(
    x=X_train,
    y=y_train,
    epochs=NUM_EPOCHS,
    verbose=0,
    validation_split=VALIDATION_SPLIT,
    callbacks=[deep_16_8_checkpoint]
)


CPU times: total: 2.44 s
Wall time: 8.36 s


#### Score the (16-8-1) Neural Network Model


In [24]:
%%time
deep_16_8_preds = deep_16_8_model.predict(X_test).flatten()
deep_16_8_scores_df = score_model(deep_16_8_preds, np.array(y_test), index='16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_16_8_scores_df)
leaderboard_df.head()


24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
CPU times: total: 46.9 ms
Wall time: 91.2 ms


| Model | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | max_error |
| --- | --- | --- | --- | --- | --- |
| untrained_linear | 102.009781 | 9.765734 | -2.710177 | -55.997111 | 20.289024 |
| linear | 13.761063 | 3.103028 | -0.216341 | -2.931955 | 13.052973 |
| 64_32_16_8_1 | 3.972226 | 1.469996 | 0.289869 | 0.281731 | 8.788752 |
| 32_16_8_1 | 3.787848 | 1.434941 | 0.108063 | 0.105999 | 9.024017 |
| 16_8_1 | 3.892199 | 1.42842 | 0.222129 | 0.221435 | 9.011749 |


### Neural Network Model (8-1)

Down to the last two layers. Let's see who still has what it takes!


In [25]:
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model

# layer(s): hidden (relu) - 8
num_hidden_layers = 1
layer_8_hidden_relu_list = \
    [keras.layers.Dense(units=8 >> i, activation='relu') for i in range(num_hidden_layers)]

# layer: output (linear regression)
layer_8_output = keras.layers.Dense(units=1)

# architecture:
#   input -> normalizer -> hidden(s) -> dense
deep_8_model = keras.Sequential([
    layer_feature_input,
    layer_feature_normalizer,
    *layer_8_hidden_relu_list,
    layer_8_output
])


CPU times: total: 0 ns
Wall time: 8.51 ms


#### Configure the (8-1) Neural Network Model


In [26]:
%%time
deep_8_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='mean_squared_error'
)

deep_8_checkpoint = keras.callbacks.ModelCheckpoint(
    # the path ends in '.weights.h5', so replacing '.keras' would leave it unchanged
    MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8.weights.h5'),
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=True,
    mode='min'
)


CPU times: total: 0 ns
Wall time: 2 ms


#### Train the (8-1) Neural Network Model


In [27]:
%%time
deep_8_history = deep_8_model.fit(
    x=X_train,
    y=y_train,
    epochs=NUM_EPOCHS,
    verbose=0,
    validation_split=VALIDATION_SPLIT,
    callbacks=[deep_8_checkpoint]
)


CPU times: total: 2.53 s
Wall time: 7.96 s


#### Score the (8-1) Neural Network Model


In [29]:
%%time
deep_8_preds = deep_8_model.predict(X_test).flatten()
deep_8_scores_df = score_model(deep_8_preds, np.array(y_test), index='8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_8_scores_df)
leaderboard_df[:]


24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 501us/step
CPU times: total: 31.2 ms
Wall time: 52.8 ms


| Model | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | max_error |
| --- | --- | --- | --- | --- | --- |
| untrained_linear | 102.009781 | 9.765734 | -2.710177 | -55.997111 | 20.289024 |
| linear | 13.761063 | 3.103028 | -0.216341 | -2.931955 | 13.052973 |
| 64_32_16_8_1 | 3.972226 | 1.469996 | 0.289869 | 0.281731 | 8.788752 |
| 32_16_8_1 | 3.787848 | 1.434941 | 0.108063 | 0.105999 | 9.024017 |
| 16_8_1 | 3.892199 | 1.42842 | 0.222129 | 0.221435 | 9.011749 |
| 8_1 | 3.985609 | 1.463435 | -0.004798 | -0.00701 | 9.684846 |
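
One simple way to read the leaderboard is to pick the best model per error metric. The sketch below copies the three "lower is better" columns from the final leaderboard into a plain dictionary; the helper `best_model` is hypothetical and is not one of the `mlutils` functions.

```python
# values copied from the final leaderboard above
leaderboard = {
    'untrained_linear': {'mean_squared_error': 102.009781, 'mean_absolute_error': 9.765734, 'max_error': 20.289024},
    'linear':           {'mean_squared_error': 13.761063,  'mean_absolute_error': 3.103028, 'max_error': 13.052973},
    '64_32_16_8_1':     {'mean_squared_error': 3.972226,   'mean_absolute_error': 1.469996, 'max_error': 8.788752},
    '32_16_8_1':        {'mean_squared_error': 3.787848,   'mean_absolute_error': 1.434941, 'max_error': 9.024017},
    '16_8_1':           {'mean_squared_error': 3.892199,   'mean_absolute_error': 1.42842,  'max_error': 9.011749},
    '8_1':              {'mean_squared_error': 3.985609,   'mean_absolute_error': 1.463435, 'max_error': 9.684846},
}

def best_model(leaderboard, metric):
    """Return the model name with the lowest value for a lower-is-better metric."""
    return min(leaderboard, key=lambda name: leaderboard[name][metric])

for metric in ('mean_squared_error', 'mean_absolute_error', 'max_error'):
    print(f'{metric}: {best_model(leaderboard, metric)}')
```

No single network wins on every metric, and the four networks sit within about 0.2 of each other on mean squared error, so the smaller architectures give up little on this run.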


## Model Leaderboard

### Reminder of Our Metrics

*TODO: compare each model's score on the training set against its score on the validation set.*

| Model | Score on Training Set | Score on Validation Set |
| --- | --- | --- |
| Random baseline (untrained linear model) | TODO | TODO |
| Linear regression model | TODO | TODO |
| Neural network model (64-32-16-8-1) | TODO | TODO |
| Neural network model (32-16-8-1) | TODO | TODO |
| Neural network model (16-8-1) | TODO | TODO |
| Neural network model (8-1) | TODO | TODO |
| Neural network model (4-1) | TODO | TODO |
| Neural network model (2-1) | TODO | TODO |


## Save the Data

So we can pick this back up on the [next step](../2-features/features.ipynb).


In [14]:
%%time
# save the training and test data separately
pd.concat([X_train, y_train], axis=1, join='outer').to_feather(NEXT_CACHE_FILE)
pd.concat([X_test, y_test], axis=1, join='outer').to_feather(NEXT_CACHE_FILE.replace('.feather', '_test.feather'))


CPU times: total: 0 ns
Wall time: 5 ms


## Onwards to Feature Engineering

See the [next section](../2-features/features.ipynb) for feature engineering and feature reduction.

It is also available on [Notebook Viewer](https://nbviewer.org/github/ahester57/ai_workshop/blob/master/notebooks/time_for_crab/2-features/features.ipynb).
