# Time for Crab

## Introduction

Crabs are here, and they're mighty tasty.

Knowing how old they are helps identify full-sized crabs that are ready for the pot.
 
![Crab](https://storage.googleapis.com/kaggle-datasets-images/1734027/2834512/a0e345e63b3e426ddb489d07cc0090cd/dataset-cover.jpg?t=2021-11-21-06-55-37)

The goal is to predict the age of mud crabs from their physical features, which makes this a regression task.



## Reasons for Choosing This Dataset

A good dataset is the foundation of a good model.

#### My reasons:

- Highly-rated tabular data with a natural prediction target (Age).
- Regression task since I like a challenge.
- Features easy to conceptualize for feature engineering.
- On the smaller side to quickly iterate on.
- Crabs are cool.

#### Reasons given by the [dataset on Kaggle](https://www.kaggle.com/datasets/sidhus/crab-age-prediction):

> Its a great starting point for classical regression analysis and feature engineering and understand the impact of feature engineering in Data Science domain.
> For a commercial crab farmer knowing the right age of the crab helps them decide if and when to harvest the crabs. Beyond a certain age, there is negligible growth in crab's physical characteristics and hence, it is important to time the harvesting to reduce cost and increase profit.


## Dataset Columns

The dataset contains the following columns:

---

| Column Name    | Description                                                                            |
|----------------|----------------------------------------------------------------------------------------|
| Sex            | Sex of the crab: Male, Female, or Indeterminate.                                       |
| Length         | Length of the crab (in feet; 1 foot = 30.48 cm)                                        |
| Diameter       | Diameter of the crab (in feet; 1 foot = 30.48 cm)                                      |
| Height         | Height of the crab (in feet; 1 foot = 30.48 cm)                                        |
| Weight         | Weight of the crab (in ounces; 1 pound = 16 ounces)                                    |
| Shucked Weight | Weight without the shell (in ounces; 1 pound = 16 ounces)                              |
| Viscera Weight | Weight of the viscera, the internal abdominal organs (in ounces; 1 pound = 16 ounces)  |
| Shell Weight   | Weight of the shell (in ounces; 1 pound = 16 ounces)                                   |
| Age            | Age of the crab (in months)                                                            |
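
Since the columns mix feet and ounces, it can help to view a few rows in metric units. The following is a minimal sketch, not part of the original notebook: `to_metric` is a hypothetical helper, the foot-to-centimetre factor comes from the table, and the ounce-to-gram factor is an approximate value assumed here (the dataset does not state it).

```python
import pandas as pd

FEET_TO_CM = 30.48          # from the column table
OUNCES_TO_GRAMS = 28.3495   # assumption: approximate avoirdupois ounce

LENGTH_COLUMNS = ['Length', 'Diameter', 'Height']
WEIGHT_COLUMNS = ['Weight', 'Shucked Weight', 'Viscera Weight', 'Shell Weight']


def to_metric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lengths in centimetres and weights in grams."""
    metric = df.copy()
    metric[LENGTH_COLUMNS] = metric[LENGTH_COLUMNS] * FEET_TO_CM
    metric[WEIGHT_COLUMNS] = metric[WEIGHT_COLUMNS] * OUNCES_TO_GRAMS
    return metric


# a made-up crab, used only to exercise the helper
sample = pd.DataFrame({
    'Length': [1.0], 'Diameter': [0.5], 'Height': [0.25],
    'Weight': [16.0], 'Shucked Weight': [8.0],
    'Viscera Weight': [2.0], 'Shell Weight': [4.0],
})
metric_sample = to_metric(sample)
print(metric_sample)
```

The conversion is only for reading the data; the model itself does not care which unit system is used as long as it is consistent.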


## Import Libraries

Let's get the dependencies out of the way.

In [41]:
import numpy as np
import pandas as pd

#from sklearn.svm import SVC
#from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
#from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

try:
    # for visual mode. `pip install -e .[visual]`
    import matplotlib.pyplot as plt
    import matplotlib
    %matplotlib inline
    import seaborn as sns
except ModuleNotFoundError:
    plt = None
    sns = None

pd.set_option('mode.copy_on_write', True)
pd.set_option('future.no_silent_downcasting', True)


## Define Constants

The `PREDICTION_TARGET` constant names the dataset column that we will predict.

In [42]:
DATASET_FILE = './datasets/CrabAgePrediction.csv' # 'https://www.kaggle.com/sidhus/crab-age-prediction/download' or './data/CrabAgePrediction.csv'
PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]
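
One possible use for these constants is a quick check that a loaded file really has the expected columns. This is a sketch under assumptions, not something the notebook does: `missing_columns` is a hypothetical helper, and the constants are repeated here only so the block runs on its own.

```python
import pandas as pd

# repeated from the constants cell so this sketch is self-contained
PREDICTION_TARGET = 'Age'
DATASET_COLUMNS = ['Sex', 'Length', 'Diameter', 'Height', 'Weight',
                   'Shucked Weight', 'Viscera Weight', 'Shell Weight',
                   PREDICTION_TARGET]


def missing_columns(df: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]


# a deliberately incomplete toy frame
toy = pd.DataFrame({'Sex': ['F'], 'Length': [1.4375], 'Age': [9]})
absent = missing_columns(toy, DATASET_COLUMNS)
print(absent)
```

An empty list means every expected column is present.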


## Define Helper Functions

These functions will help us clean, augment, and normalize the data.

### DataFrame Display Function

This will be used throughout the notebook to display the DataFrame.

![Mud crab intro](https://dfzljdn9uc3pi.cloudfront.net/2021/10936/1/fig-1-2x.jpg)



In [43]:
def display_df(df:pd.DataFrame, show_info:bool=True, show_missing:bool=False, show_distinct:bool=False) -> None:
    """Display the DataFrame.

    :param df: The data.
    :param show_info: Whether to show info on the data.
    :param show_missing: Whether to show missing data counts.
    :param show_distinct: Whether to show distinct values.
    """
    print(f'DataFrame shape: {df.shape}')
    print(f'First 5 rows:\n{df.head()}') # preview the first 5 rows
    if show_info:
        print('Info:')
        df.info() # prints its summary directly and returns None
    if show_missing:
        print(f'Missing values:\n{df.isna().sum()}')
    if show_distinct:
        for col in df:
            print(f'{col} distinct values:\n{df[col].unique()[0:10]}')
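
A note on `DataFrame.info()`: it writes its summary to standard output and returns `None`, so it does not embed cleanly in an f-string. If the summary is wanted as text, one option is to capture it in a buffer through the `buf` parameter. A small sketch with a made-up two-row frame:

```python
import io

import pandas as pd

toy = pd.DataFrame({'Sex': ['F', 'M'], 'Age': [9, 6]})

buffer = io.StringIO()
result = toy.info(buf=buffer)    # the summary goes into the buffer, not stdout
info_text = buffer.getvalue()

print(f'Info:\n{info_text}')
```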


### Dataset Cleanup

Dirty data is no good for crabs. Let's clean it up.

![How to clean a crab](https://www.recipetineats.com/wp-content/uploads/2021/07/Cleaning-and-preparing-crab-template-2.jpg)

#### Crab Cleaning Steps

- Drop rows missing required columns.
- Drop rows missing too many values.
- Convert natural booleans
    - E.g., `Y/N` or `Positive/Negative` to `0/1`.
- Fill nulls for typically-binary variables with the median.
- Fill nulls for typically-continuous variables with the median.
- Fill nulls for typically-categorical variables with default values.
    - E.g., `Unknown`, but domain knowledge is required here.
- Fill nulls for typically-text variables with empty strings.
- One-hot encode categorical variables.


In [44]:
def data_cleanup(df:pd.DataFrame) -> pd.DataFrame:
    """Clean up the DataFrame for crabs.

    Update values:
        - Drop rows missing required columns.
        - Drop rows missing too many values.
        - Convert natural booleans
            - E.g., `Y/N` or `Positive/Negative` to `0/1`.
        - Fill nulls for typically-binary variables with the median.
        - Fill nulls for typically-continuous variables with the median.
        - Fill nulls for typically-categorical variables with default values.
            - E.g., `Unknown`
        - Fill nulls for typically-text variables with empty strings.
        - One-hot encode categorical variables.

    :param df: The data.
    :return: The cleaned data.
    """
    # remove rows missing too many values (keep rows with at least 3 non-null values)
    df = df.dropna(thresh=3)

    # remove rows missing required columns
    df = df.dropna(subset=REQUIRED_COLUMNS)

    # convert natural booleans
    df = df.replace(to_replace={
        False: 0, True: 1,
        'negative': 0, 'positive': 1,
        'No': 0, 'Yes': 1,
    })

    # fill nulls for typically-binary variables with the median
    # df['  '] = df['  '].fillna(df['  '].median())

    # fill nulls for typically-continuous variables with the median
    # df['  '] = df['  '].fillna(df['  '].median())

    # fill nulls for typically-categorical variables with default values
    # df['  '] = df['  '].fillna('Unknown')

    # fill nulls for typically-text variables with empty strings
    # df['  '] = df['  '].fillna('')

    # one-hot encode categorical variables
    df = pd.get_dummies(df, columns=['Sex'])

    return df
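
To make the two `dropna` calls concrete, here is a small sketch on made-up rows. `thresh=3` keeps rows that have at least three non-null values, and `subset=['Age']` then drops rows whose target is missing. The toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Length': [1.4, np.nan, 0.9, np.nan],
    'Weight': [24.6, np.nan, 5.4, 7.9],
    'Sex':    ['F', None, 'M', 'I'],
    'Age':    [9, 6, np.nan, 6],
})

# row 1 has a single non-null value, so it is dropped here
enough_values = toy.dropna(thresh=3)

# row 2 has three non-null values but no Age, so it is dropped here
cleaned = enough_values.dropna(subset=['Age'])

print(cleaned.index.tolist())
```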


### Data Augmentation

Crabs are complex creatures. Let's engineer some features to help our model find the best crabs for harvest.

We'll need to use domain knowledge to extract more features from our dataset's columns.

![This kills the crab.](https://i.kym-cdn.com/photos/images/newsfeed/000/112/843/killcrab.jpg)

For example, we can find the edible weight of the crab by subtracting the viscera weight from the shucked weight.


In [45]:
def data_augmentation(df:pd.DataFrame) -> pd.DataFrame:
    """Add new features to the DataFrame.

    Driven by domain knowledge.

    :param df: The data.
    :return: The data with new features.
    """
    # add new features by combining existing features
    df['Edible Weight'] = df['Shucked Weight'] - df['Viscera Weight']
    return df
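
As a quick check of the edible-weight idea, the sketch below applies the same subtraction to the first two rows shown in the Exploratory Data Analysis output. It uses `DataFrame.assign`, which returns a new frame instead of modifying the input; that is a design choice made for this example, not how `data_augmentation` is written.

```python
import pandas as pd

# first two rows of the relevant columns, copied from the dataset preview
toy = pd.DataFrame({
    'Shucked Weight': [12.332033, 2.296310],
    'Viscera Weight': [5.584852, 1.374951],
})

# assign returns a new DataFrame and leaves `toy` untouched
augmented = toy.assign(
    **{'Edible Weight': toy['Shucked Weight'] - toy['Viscera Weight']}
)

print(augmented)
```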


### Data Normalization

Crabs come in all shapes and sizes. Let's normalize the data to help our model.

![Tiny crab](https://www.popsci.com/uploads/2022/02/09/fiddler-crab.jpg?auto=webp&optimize=high&width=1440)

The book *Designing Machine Learning Systems* (Huyen, 2022) suggests that scaling features to the range [-1, 1] works well in practice.
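
Written out, min-max scaling maps a value $x$ from a column with minimum $x_{\min}$ and maximum $x_{\max}$ into a target range $[a, b]$:

$$
x' = a + \frac{(x - x_{\min})\,(b - a)}{x_{\max} - x_{\min}}
$$

As a worked example with hypothetical numbers (not taken from the dataset), take $a = -1$, $b = 1$, $x_{\min} = 0$, $x_{\max} = 10$, and $x = 2.5$:

$$
x' = -1 + \frac{(2.5 - 0)\,(1 - (-1))}{10 - 0} = -1 + \frac{5}{10} = -0.5
$$

The minimum of a column always lands on $a$ and the maximum on $b$.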


In [46]:
def data_normalization(df:pd.DataFrame, a:float=-1., b:float=1.) -> pd.DataFrame:
    """Min-max scale every column of the DataFrame to the range [a, b].

    :param df: The data.
    :param a: The lower bound of the target range.
    :param b: The upper bound of the target range.
    :return: The normalized data.
    """
    # scale the data to a range of [a, b]
    df = a + ((df - df.min()) * (b - a)) / (df.max() - df.min())
    return df
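
Two edge cases are worth keeping in mind with min-max scaling. A column with a single distinct value has `max - min == 0`, so the division yields `NaN`. Boolean columns, such as the one-hot `Sex_*` columns, may not support subtraction directly in pandas. The following is a hedged sketch of one way to handle both, assuming every column is numeric or boolean; `normalize_numeric` is a hypothetical variant, not the notebook's function.

```python
import pandas as pd


def normalize_numeric(df: pd.DataFrame, a: float = -1.0, b: float = 1.0) -> pd.DataFrame:
    """Min-max scale each column to [a, b]; constant columns map to the midpoint."""
    values = df.astype(float)               # booleans become 0.0 / 1.0
    span = values.max() - values.min()
    safe_span = span.replace(0.0, 1.0)      # avoid dividing by zero
    scaled = a + (values - values.min()) * (b - a) / safe_span
    midpoint = (a + b) / 2.0
    for col in span[span == 0.0].index:     # columns with one distinct value
        scaled[col] = midpoint
    return scaled


# made-up values to exercise the three cases
toy = pd.DataFrame({
    'Length': [0.5, 1.0, 1.5],
    'Sex_F': [True, False, True],
    'Constant': [3.0, 3.0, 3.0],
})
normalized = normalize_numeric(toy)
print(normalized)
```

Mapping constant columns to the midpoint is an arbitrary choice; dropping them would be equally reasonable since they carry no information.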


## Exploratory Data Analysis

Get that pot of water ready. It's crab cookin' time.

![Crab pot](https://chefscornerstore.com/product_images/uploaded_images/steaming-crabs.jpg)

#### Open the file and display using our helper function

Analyzing the output here will help us revise our data cleanup and augmentation functions.


In [47]:
crabs = pd.read_csv(DATASET_FILE)  # load the data
display_df(crabs, show_info=True, show_missing=True, show_distinct=True)


DataFrame shape: (3893, 9)
First 5 rows:
  Sex  Length  Diameter  Height     Weight  Shucked Weight  Viscera Weight  \
0   F  1.4375    1.1750  0.4125  24.635715       12.332033        5.584852   
1   M  0.8875    0.6500  0.2125   5.400580        2.296310        1.374951   
2   I  1.0375    0.7750  0.2500   7.952035        3.231843        1.601747   
3   F  1.1750    0.8875  0.2500  13.480187        4.748541        2.282135   
4   I  0.8875    0.6625  0.2125   6.903103        3.458639        1.488349   

   Shell Weight  Age  
0      6.747181    9  
1      1.559222    6  
2      2.764076    6  
3      5.244657   10  
4      1.700970    6  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3893 entries, 0 to 3892
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             3893 non-null   object 
 1   Length          3893 non-null   float64
 2   Diameter        3893 non-null   float64
 3   Height          

### Clean Up the Data

Let's clean up the data and display it again.

#### Missing Values

No missing values! We're off to a good start with this dataset. Will crab be on the menu tonight?

#### Non-numeric Data

It looks like 'Sex' is a categorical variable. We'll need to convert this to a numeric value. Let's use **one-hot encoding**.
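
For readers new to one-hot encoding, here is a tiny sketch on made-up rows. `pd.get_dummies` replaces the `Sex` column with one indicator column per category, and exactly one indicator is set in each row.

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['F', 'M', 'I'], 'Age': [9, 6, 6]})

# one indicator column per distinct value of 'Sex'
encoded = pd.get_dummies(toy, columns=['Sex'])

print(encoded.columns.tolist())
```

The untouched columns come first, followed by the new `Sex_F`, `Sex_I`, and `Sex_M` columns.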


In [48]:
crabs = data_cleanup(crabs)
display_df(crabs, show_info=True, show_missing=False, show_distinct=True)


DataFrame shape: (3893, 11)
First 5 rows:
   Length  Diameter  Height     Weight  Shucked Weight  Viscera Weight  \
0  1.4375    1.1750  0.4125  24.635715       12.332033        5.584852   
1  0.8875    0.6500  0.2125   5.400580        2.296310        1.374951   
2  1.0375    0.7750  0.2500   7.952035        3.231843        1.601747   
3  1.1750    0.8875  0.2500  13.480187        4.748541        2.282135   
4  0.8875    0.6625  0.2125   6.903103        3.458639        1.488349   

   Shell Weight  Age  Sex_F  Sex_I  Sex_M  
0      6.747181    9   True  False  False  
1      1.559222    6  False  False   True  
2      2.764076    6  False   True  False  
3      5.244657   10   True  False  False  
4      1.700970    6  False   True  False  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3893 entries, 0 to 3892
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Length          3893 non-null   float64
 1   Di

### Save the Data

Save the cleaned data so we can pick this back up in the next step.


In [49]:
crabs.to_json('./cache/crabs.json')