# Overfitting

*In which we practice one thing so much that we get worse at everything else.*


### Define Constants


In [9]:
CACHE_FILE = '../cache/crabs.json'
NEXT_NOTEBOOK = '../1-models/models.ipynb'

PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]


### Importing Libraries


In [10]:
from notebooks.time_for_crab.mlutils import *

import numpy as np
import pandas as pd

#from sklearn.svm import SVC
#from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
#from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

try:
    # for visual mode. `pip install -e .[visual]`
    import matplotlib.pyplot as plt
    import matplotlib
    %matplotlib inline
    import seaborn as sns
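except ModuleNotFoundError:
    # plotting extras are optional; leave placeholders so later checks can test for None
    matplotlib = None
    plt = None
    sns = None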

pd.set_option('mode.copy_on_write', True)


### Load Data from Cache

In the [previous section](../0-eda/eda.ipynb), we saved the cleaned data to a cache file. Let's load it back.


In [11]:
crabs = pd.read_json(CACHE_FILE)
crabs.info()


<class 'pandas.core.frame.DataFrame'>
Index: 3893 entries, 0 to 3892
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Length          3893 non-null   float64
 1   Diameter        3893 non-null   float64
 2   Height          3893 non-null   float64
 3   Weight          3893 non-null   float64
 4   Shucked Weight  3893 non-null   float64
 5   Viscera Weight  3893 non-null   float64
 6   Shell Weight    3893 non-null   float64
 7   Age             3893 non-null   int64  
 8   Sex_F           3893 non-null   bool   
 9   Sex_I           3893 non-null   bool   
 10  Sex_M           3893 non-null   bool   
dtypes: bool(3), float64(7), int64(1)
memory usage: 285.1 KB
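
`REQUIRED_COLUMNS` is defined at the top of the notebook but is not used in the cells shown here. One possible use is a quick sanity check right after loading. The sketch below is only an illustration: `missing_columns` is a hypothetical helper name introduced here, and the small `demo` frame stands in for `crabs`.

```python
import pandas as pd

def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Tiny stand-in frame; in the notebook this would be `crabs`.
demo = pd.DataFrame({"Length": [1.4, 0.9], "Age": [9, 6]})

absent = missing_columns(demo, ["Age"])
if absent:
    # Fail early rather than discovering the gap during training.
    raise KeyError(f"Missing required columns: {absent}")
```

In the notebook, the call would presumably look like `missing_columns(crabs, REQUIRED_COLUMNS)`.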


### Memory Reduction

Crabs were never known for their memory. Let's shrink our DataFrame by casting each column to the smallest data type that can still hold its values.

Smaller data takes less memory and is cheaper to copy and pass around, which usually means faster processing.


In [12]:
crabs = data_downcasting(crabs)
crabs.info()


Memory usage of dataframe is 0.2784 MB (before)
Memory usage of dataframe is 0.0965 MB (after)
Reduced 65.3%
<class 'pandas.core.frame.DataFrame'>
Index: 3893 entries, 0 to 3892
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Length          3893 non-null   float16
 1   Diameter        3893 non-null   float16
 2   Height          3893 non-null   float16
 3   Weight          3893 non-null   float16
 4   Shucked Weight  3893 non-null   float16
 5   Viscera Weight  3893 non-null   float16
 6   Shell Weight    3893 non-null   float16
 7   Age             3893 non-null   int8   
 8   Sex_F           3893 non-null   bool   
 9   Sex_I           3893 non-null   bool   
 10  Sex_M           3893 non-null   bool   
dtypes: bool(3), float16(7), int8(1)
memory usage: 98.8 KB
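
`data_downcasting` comes from the project's `mlutils` module, and its source is not shown in this section. As a rough idea of what such a helper might do, here is a minimal sketch built on `pd.to_numeric`. The name `downcast_sketch` and the `demo` frame are assumptions made for illustration, not the project's actual code.

```python
import pandas as pd

def downcast_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for data_downcasting: shrink numeric columns."""
    out = df.copy()
    for col in out.columns:
        kind = out[col].dtype.kind  # 'i'/'u' = integer, 'f' = float, 'b' = bool
        if kind in "iu":
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif kind == "f":
            # pd.to_numeric stops at float32; it never goes down to float16.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Made-up rows shaped like the crab data, for illustration only.
demo = pd.DataFrame({
    "Length": [1.4375, 0.8875, 1.0375],
    "Age": [9, 6, 6],
    "Sex_F": [True, False, False],
})
small = downcast_sketch(demo)
print(small.dtypes)
print(demo.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Getting to `float16`, as in the output above, would need an explicit cast such as `astype("float16")`. That dtype keeps only about three significant decimal digits and tops out at 65504, so it is worth confirming that the measurements survive the round trip before relying on it.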


### Don't Save this Data

We don't want our over-trained model to leak into the [next step](../1-models/models.ipynb).
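
No model is trained in the cells of this section, so as an illustration of what "over-trained" means, here is a minimal sketch on synthetic data (a noisy straight line, not the crab dataset). It fits a simple and a very flexible polynomial to the same 12 training points and compares their errors on the training set and on held-out points. All names and numbers here are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def noisy_line(x):
    """Synthetic target: a straight line plus Gaussian noise (made-up data)."""
    return 2.0 * x + rng.normal(0.0, 0.2, size=x.shape)

x_train = np.linspace(0.0, 1.0, 12)       # few, evenly spaced training points
y_train = noisy_line(x_train)
x_test = rng.uniform(0.0, 1.0, size=200)  # held-out points from the same range
y_test = noisy_line(x_test)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 8):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The degree-8 fit matches the training points at least as closely as the straight line, because it has more freedom to chase the noise. Its error on the held-out points is typically worse, and that gap between training and test error is the signature of overfitting.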


### Onwards to Model Selection

See the [next section](../1-models/models.ipynb) for model selection.
