# Overfitting

*In which we practice one thing so much that we get worse at everything else.*


### Define Constants


In [6]:
CACHE_FILE = '../cache/crabs.json'
NEXT_NOTEBOOK = '../1-models/models.ipynb'

PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex', 'Length', 'Diameter', 'Height', 'Weight', 'Shucked Weight', 'Viscera Weight', 'Shell Weight', PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]


### Importing Libraries


In [7]:
from notebooks.time_for_crab.mlutils import data_downcasting, display_df

import keras
import numpy as np
import pandas as pd

#from sklearn.svm import SVC
#from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
#from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

try:
    # for visual mode. `pip install -e .[visual]`
    import matplotlib.pyplot as plt
    import matplotlib
    %matplotlib inline
    import seaborn as sns
except ModuleNotFoundError:
    plt = None
    sns = None

pd.set_option('mode.copy_on_write', True)


ModuleNotFoundError: No module named 'tensorflow'

### Load Data from Cache

In the [previous section](../0-eda/eda.ipynb), we saved the cleaned data to a cache file. Let's load it back.


In [None]:
crabs = pd.read_json(CACHE_FILE)
display_df(crabs, show_info=True)
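
For readers who have not run the previous notebook, here is a minimal sketch of the kind of JSON round trip the cache relies on. It is an illustration only: the temporary path and the three sample rows are made up, and the EDA notebook may write its cache differently.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows; the real cache holds the cleaned crab dataset.
sample = pd.DataFrame({
    'Sex': ['F', 'M', 'I'],
    'Length': [1.4375, 0.8875, 1.0375],
    'Age': [9, 6, 6],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical cache location, standing in for CACHE_FILE.
    cache_path = os.path.join(tmp_dir, 'crabs.json')
    sample.to_json(cache_path)            # what the previous step would do
    restored = pd.read_json(cache_path)   # what this notebook does

print(restored)
```

JSON stores no dtype information, so numeric columns come back as pandas' default 64-bit types. This is one reason to downcast again after loading.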


### Memory Reduction

Crabs were never known for their memory. Let's shrink the memory footprint of our DataFrame by downcasting each column to the smallest data type that still fits its values.

Smaller data uses fewer computational resources and is faster to process.
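
The implementation of `data_downcasting` lives in `mlutils` and is not shown here. As a rough idea of what such a helper might do, the sketch below downcasts numeric columns with `pd.to_numeric`. The function name `downcast_numeric` and the sample values are hypothetical, and the real helper may behave differently (for example, it may also convert strings to categoricals).

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for data_downcasting: shrink numeric columns only."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Picks the smallest signed integer type that holds every value.
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # Goes no smaller than float32, which trades away some precision.
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out


# Made-up rows that mimic two of the crab columns.
demo = pd.DataFrame({
    'Age': np.array([9, 6, 11], dtype='int64'),
    'Length': np.array([1.4375, 0.8875, 1.0375], dtype='float64'),
})

small = downcast_numeric(demo)
bytes_before = int(demo.memory_usage(deep=True).sum())
bytes_after = int(small.memory_usage(deep=True).sum())

print(small.dtypes)
print(f'{bytes_before} bytes -> {bytes_after} bytes')
```

Downcasting floats to 32 bits rounds the stored values slightly. That is usually acceptable for measurements like these, but it is worth keeping in mind.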


In [None]:
crabs = data_downcasting(crabs)
display_df(crabs, show_info=True)


## Overfitting Crab Age

![Large mud crab measure](https://upload.wikimedia.org/wikipedia/commons/thumb/6/65/CSIRO_ScienceImage_10696_Mud_crabs_are_caught_measured_tagged_and_released_as_part_of_the_research_into_the_effectiveness_of_green_zones_in_Moreton_Bay.jpg/1920px-CSIRO_ScienceImage_10696_Mud_crabs_are_caught_measured_tagged_and_released_as_part_of_the_research_into_the_effectiveness_of_green_zones_in_Moreton_Bay.jpg)


### Overfitting Methods

The goal here is to show how complex a model must be before it overfits the data.

We will use the following methods to overfit the data:

1. **Linear Regression**: Gradually increase the complexity of a linear regression model.
2. **Model Stacking**: Stack models until the ensemble overfits the data.
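
To make the first method concrete, here is a minimal sketch of gradually increasing the complexity of a linear regression. It is not the notebook's Keras model: it swaps in NumPy polynomial fits on synthetic data, and every name in it (`mse`, `results`, the chosen degrees) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the crab data: a noisy linear relationship.
x = rng.uniform(-1.0, 1.0, size=30)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)

# Hold out half of the points to see how well each fit generalizes.
x_train, y_train = x[:15], y[:15]
x_val, y_val = x[15:], y[15:]


def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# Raising the polynomial degree adds features (x^2, x^3, ...) to a model
# that is still linear in its coefficients.
results = {}
for degree in (1, 3, 6, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))

for degree, (train_err, val_err) in results.items():
    print(f'degree={degree}  train MSE={train_err:.4f}  val MSE={val_err:.4f}')
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Overfitting shows up when the validation error stops following it down and starts to climb instead.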


In [None]:
layers = keras.layers
model = keras.models.Sequential()  # assumption: the truncated `keras.models.S` was meant to be `Sequential`

### Don't Save this Data

We don't want our over-trained model to leak into the [next step](../1-models/models.ipynb).


### Onwards to Model Selection

See the [next section](../1-models/models.ipynb) for model selection.
