# Model Selection

*In which we choose the best model to predict the age of a crab.*


### Define Constants


In [56]:
CACHE_FILE = '../cache/crabs.json'
NEXT_CACHE_FILE = '../cache/normlcrabs.json'
NEXT_NOTEBOOK = '../2-features/features.ipynb'

PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]


### Importing Libraries


In [57]:
from notebooks.time_for_crab.mlutils import data_downcasting, display_df

import numpy as np
import pandas as pd

#from sklearn.svm import SVC
#from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
#from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

try:
    # for visual mode. `pip install -e .[visual]`
    import matplotlib.pyplot as plt
    import matplotlib
    %matplotlib inline
    import seaborn as sns
except ModuleNotFoundError:
    plt = None
    sns = None

pd.set_option('mode.copy_on_write', True)


### Load Data from Cache

In the [previous section](../0-eda/overfit.ipynb), we saved the cleaned data to a cache file. Let's load it back.


In [58]:
crabs = pd.read_json(CACHE_FILE)
crabs.info()


<class 'pandas.core.frame.DataFrame'>
Index: 3893 entries, 0 to 3892
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Length          3893 non-null   float64
 1   Diameter        3893 non-null   float64
 2   Height          3893 non-null   float64
 3   Weight          3893 non-null   float64
 4   Shucked Weight  3893 non-null   float64
 5   Viscera Weight  3893 non-null   float64
 6   Shell Weight    3893 non-null   float64
 7   Age             3893 non-null   int64  
 8   Sex_F           3893 non-null   bool   
 9   Sex_I           3893 non-null   bool   
 10  Sex_M           3893 non-null   bool   
dtypes: bool(3), float64(7), int64(1)
memory usage: 285.1 KB


### Memory Reduction

Crabs were never known for their memory. Let's minimize the memory footprint of our DataFrame by using the smallest data types that still fit the data.

This saves computational resources and time: the smaller the data, the faster the processing.
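The implementation of `data_downcasting` lives in `mlutils` and is not shown here. As a rough illustration of the idea only, the sketch below uses `pd.to_numeric` with its `downcast` option. The helper name `downcast_sketch` and the toy DataFrame are hypothetical, and this is not necessarily how the real helper works (for one thing, `pd.to_numeric` never goes below `float32`).

```python
import numpy as np
import pandas as pd


def downcast_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a downcasting helper (not the mlutils version)."""
    out = df.copy()
    # shrink integer columns to the smallest integer type that holds their values
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    # shrink float columns; pandas stops at float32 for this option
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast='float')
    return out


# toy data with the same kinds of columns as the crab dataset
demo = pd.DataFrame({
    'Length': [1.4375, 0.875, 1.0],
    'Age': [9, 6, 6],
    'Sex_F': [True, False, False],
})
small = downcast_sketch(demo)
print(small.dtypes)
print(demo.memory_usage(deep=True).sum(), '->', small.memory_usage(deep=True).sum())
```

Boolean columns are left untouched because they already occupy one byte per value.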


In [59]:
crabs = data_downcasting(crabs)
display_df(crabs, show_info=True)


Memory usage of dataframe is 0.2784 MB (before)
Memory usage of dataframe is 0.0965 MB (after)
Reduced 65.3%
DataFrame shape: (3893, 11)
First 5 rows:
     Length  Diameter    Height     Weight  Shucked Weight  Viscera Weight  \
0  1.437500  1.174805  0.412598  24.640625       12.335938        5.585938   
1  0.887695  0.649902  0.212524   5.402344        2.296875        1.375000   
2  1.037109  0.774902  0.250000   7.953125        3.232422        1.601562   
3  1.174805  0.887695  0.250000  13.476562        4.750000        2.281250   
4  0.887695  0.662598  0.212524   6.902344        3.458984        1.488281   

   Shell Weight  Age  Sex_F  Sex_I  Sex_M  
0      6.746094    9   True  False  False  
1      1.559570    6  False  False   True  
2      2.763672    6  False   True  False  
3      5.246094   10   True  False  False  
4      1.701172    6  False   True  False  
<class 'pandas.core.frame.DataFrame'>
Index: 3893 entries, 0 to 3892
Data columns (total 11 columns):
 #   Column   

### Split the Data

Let's split the data into training and testing sets.

It is important to split the data before any data augmentation or normalization to avoid data leakage.  
Data leakage lets information from the testing data reach the model during training, which inflates test scores and can hide overfitting.

In more general terms, *data leakage* occurs when some form of the label "leaks" into the training feature set.
An example of this occurred in 2021 in models for diagnosing Covid patients. Patients lying down on a bed were more likely to be "diagnosed" with Covid.
This is because patients confirmed to have Covid were more inclined to bed rest (Huyen, 2022).


In [60]:
# split features from target
X = crabs.drop([PREDICTION_TARGET], axis=1)
y = crabs[PREDICTION_TARGET]
# 80% training, 20% testing
train_size = int(0.8 * len(crabs))
# shuffle the index labels with a fixed seed so the split is reproducible
random_indices = np.random.default_rng(42).permutation(crabs.index)
# split into train/test sets
# the shuffled values are index labels, so select with .loc rather than .iloc
X_train = X.loc[random_indices[0:train_size]]
X_test = X.drop(X_train.index)
y_train = y.loc[random_indices[0:train_size]]
y_test = y.drop(y_train.index)
print(f'X_train: {X_train.shape}')
print(f'X_test: {X_test.shape}')


X_train: (3114, 10)
X_test: (779, 10)
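A quick sanity check on a split like this is to confirm that the two sets are disjoint and together cover every row. The sketch below repeats the same shuffle-and-drop approach on a small toy DataFrame (the `toy` data is illustrative, not the crab dataset) and asserts both properties.

```python
import numpy as np
import pandas as pd

# toy stand-in for the crab data
toy = pd.DataFrame({'x': np.arange(10.0)})

# same recipe as above: shuffle the index labels, take the first 80%
train_size = int(0.8 * len(toy))
order = np.random.default_rng(42).permutation(toy.index.to_numpy())
train = toy.loc[order[:train_size]]
test = toy.drop(train.index)

# every row lands in exactly one of the two sets
assert len(train) + len(test) == len(toy)
assert train.index.intersection(test.index).empty
print(f'train: {train.shape}, test: {test.shape}')
```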


### Data Normalization

Crabs come in all shapes and sizes. Let's normalize the data to help our model make better sense of it.

![Tiny crab](https://www.popsci.com/uploads/2022/02/09/fiddler-crab.jpg?auto=webp&optimize=high&width=1440)

The book *Designing Machine Learning Systems* (Huyen, 2022) suggests that normalizing features to the range [-1, 1] helps in practice.

Because we split first, the scaling statistics come from the training set alone, which keeps information about the test set from leaking into the features.

### Check Data Variance

In [61]:
X_train.select_dtypes(include=[np.number]) \
        .describe().transpose()[['mean','std']]


  return umr_sum(a, axis, dtype, out, keepdims, initial, where)


| | mean | std |
|---|---|---|
| Length | 1.310547 | 0.304443 |
| Diameter | 1.019531 | 0.251221 |
| Height | 0.349365 | 0.108032 |
| Weight | inf | 14.0625 |
| Shucked Weight | 10.203125 | 6.3125 |
| Viscera Weight | 5.144531 | 3.146484 |
| Shell Weight | 6.792969 | 3.994141 |
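The `inf` mean for Weight, together with the warning from `umr_sum`, is consistent with an overflow in a half-precision sum. This assumes the downcasting step stored the column as `float16`, which the rounded values in the preview suggest but the truncated `info()` output does not confirm. A `float16` cannot represent values above 65504, so summing a few thousand weights overflows even though each value is small. The sketch below reproduces the effect with made-up values and shows that accumulating in `float64` avoids it.

```python
import numpy as np

# 4000 made-up weights of 23.0 each; the true sum is 92000, above the float16 maximum
weights = np.full(4000, 23.0, dtype=np.float16)

# summing in float16 overflows to inf (NumPy would normally warn here)
with np.errstate(over='ignore'):
    half_sum = weights.sum()

# accumulating in float64 gives the expected result
wide_mean = weights.sum(dtype=np.float64) / len(weights)

print(np.finfo(np.float16).max, half_sum, wide_mean)
```

For the summary statistics above, casting to a wider type first (for example `X_train.astype('float64')`) before calling `describe()` is one way to sidestep the problem.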


In [62]:
def data_normalization(df:pd.DataFrame, a:float=-1., b:float=1.) -> pd.DataFrame:
    """Normalize the DataFrame from a to b.
    
    :param df: The data.
    :param a: The minimum value.
    :param b: The maximum value.
    :return: The normalized data.
    """
    # scale the data to a range of [a, b]
    df = a + ((df - df.min()) * (b - a)) / (df.max() - df.min())
    return df
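Written out, the scaling applied by `data_normalization` to each value $x$ of a column is the usual min-max formula, with $x_{\min}$ and $x_{\max}$ taken from that column:

$$
x' = a + \frac{(x - x_{\min})\,(b - a)}{x_{\max} - x_{\min}}
$$

As an illustrative example (the numbers are made up, not taken from the crab data), a column with $x_{\min} = 0.5$ and $x_{\max} = 2.0$ scaled to $[a, b] = [-1, 1]$ maps $x = 1.0$ to

$$
x' = -1 + \frac{(1.0 - 0.5)\cdot 2}{2.0 - 0.5} = -1 + \frac{1}{1.5} \approx -0.333
$$

The formula is undefined when $x_{\max} = x_{\min}$, so a constant column would need special handling.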


In [63]:
for col in X_train.select_dtypes(include=[np.number]).columns:
    X_train[col] = data_normalization(X_train[col])
X_train.describe().transpose()[['mean','std']]


| | mean | std |
|---|---|---|
| Length | 0.213989 | 0.329102 |
| Diameter | 0.185913 | 0.337891 |
| Height | -0.75293 | 0.076477 |
| Weight | -0.412354 | 0.351318 |
| Shucked Weight | -0.467285 | 0.330566 |
| Viscera Weight | -0.523438 | 0.292236 |
| Shell Weight | -0.467773 | 0.314941 |
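The cell above scales only `X_train`. When the test set is scaled later, it should reuse the minimum and maximum learned from the training set; recomputing them on the test data would reintroduce the leakage the split was meant to prevent. A minimal sketch of that pattern follows. The helpers `fit_min_max` and `apply_min_max` and the toy values are hypothetical, not part of this notebook or `mlutils`.

```python
import pandas as pd


def fit_min_max(train: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Learn per-column minimum and maximum from the training data only."""
    return train.min(), train.max()


def apply_min_max(df: pd.DataFrame, lo: pd.Series, hi: pd.Series,
                  a: float = -1.0, b: float = 1.0) -> pd.DataFrame:
    """Scale df to [a, b] using previously learned statistics."""
    # replace a zero range with 1 so constant columns do not divide by zero
    span = (hi - lo).replace(0, 1)
    return a + (df - lo) * (b - a) / span


# toy data standing in for X_train and X_test
train = pd.DataFrame({'Length': [0.5, 1.0, 2.0], 'Weight': [5.0, 10.0, 25.0]})
test = pd.DataFrame({'Length': [1.25], 'Weight': [30.0]})

lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
print(test_scaled)
```

Test values that fall outside the training range, like the toy `Weight` of 30.0 here, end up outside [-1, 1]. That is expected, since the scaler has only seen the training data.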


### Save the Data

Save the data so we can pick this back up in the [next step](../2-features/features.ipynb).


In [64]:
crabs.to_json(NEXT_CACHE_FILE)


### Onwards to Feature Engineering

See the [next section](../2-features/features.ipynb) for feature engineering.
