# Time for Crab

Crabs are here, and they're mighty tasty.

Knowing how old they are helps identify full-sized crabs that are ready for the pot.
 
![Crab](https://storage.googleapis.com/kaggle-datasets-images/1734027/2834512/a0e345e63b3e426ddb489d07cc0090cd/dataset-cover.jpg?t=2021-11-21-06-55-37)

This notebook predicts the age of mud crabs from their physical features (a regression task).



## Reasons for Choosing This Dataset

A good dataset is the foundation of a good model.

#### My reasons:

- Highly rated tabular data with a natural prediction target (Age).
- A regression task, since I like a challenge.
- Features that are easy to conceptualize, which helps with feature engineering.
- Small enough to iterate on quickly.
- Crabs are cool.

#### Reasons given by the [dataset on Kaggle](https://www.kaggle.com/datasets/sidhus/crab-age-prediction):

> Its a great starting point for classical regression analysis and feature engineering and understand the impact of feature engineering in Data Science domain.
> For a commercial crab farmer knowing the right age of the crab helps them decide if and when to harvest the crabs. Beyond a certain age, there is negligible growth in crab's physical characteristics and hence, it is important to time the harvesting to reduce cost and increase profit.


## Dataset Columns

The dataset contains the following columns:

---

| Column Name    | Description                                                                                         |
|----------------|-----------------------------------------------------------------------------------------------------|
| Sex            | Gender of the Crab - Male, Female and Indeterminate.                                                |
| Length         | Length of the Crab (in Feet; 1 foot = 30.48 cms)                                                    |
| Diameter       | Diameter of the Crab (in Feet; 1 foot = 30.48 cms)                                                  |
| Height         | Height of the Crab (in Feet; 1 foot = 30.48 cms)                                                    |
| Weight         | Weight of the Crab (in ounces; 1 Pound = 16 ounces)                                                 |
| Shucked Weight | Weight without the shell (in ounces; 1 Pound = 16 ounces)                                           |
| Viscera Weight | Weight of the internal organs (viscera) deep inside the body (in ounces; 1 Pound = 16 ounces)       |
| Shell Weight   | Weight of the Shell (in ounces; 1 Pound = 16 ounces)                                                |
| Age            | Age of the Crab (in months)                                                                         |
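
Since lengths are given in feet and weights in ounces, a metric view can make the measurements easier to sanity-check. The following is a minimal sketch on made-up rows; the helper name `to_metric` and the toy values are illustrative and not part of the dataset or the notebook.

```python
import pandas as pd

FEET_TO_CM = 30.48               # from the column table
OUNCES_TO_GRAMS = 28.349523125   # one avoirdupois ounce in grams

LENGTH_COLUMNS = ['Length', 'Diameter', 'Height']
WEIGHT_COLUMNS = ['Weight', 'Shucked Weight', 'Viscera Weight', 'Shell Weight']

def to_metric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lengths in centimeters and weights in grams (hypothetical helper)."""
    out = df.copy()
    for col in LENGTH_COLUMNS:
        if col in out.columns:
            out[col] = out[col] * FEET_TO_CM
    for col in WEIGHT_COLUMNS:
        if col in out.columns:
            out[col] = out[col] * OUNCES_TO_GRAMS
    return out

# made-up rows, only to show the conversion
toy = pd.DataFrame({'Length': [1.0, 1.5], 'Weight': [16.0, 32.0]})
metric = to_metric(toy)
print(metric)
```

The helper returns a copy, so the original frame keeps its units for the rest of the pipeline.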


## Import Libraries

Let's get the dependencies out of the way.

In [14]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

try:
    # for visual mode. `pip install -e .[visual]`
    import matplotlib.pyplot as plt
    import matplotlib
    %matplotlib inline
    import seaborn as sns
except ModuleNotFoundError:
    plt = None
    sns = None


## Define Constants

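The `PREDICTION_TARGET` constant names the dataset column we will predict. `DATASET_COLUMNS` lists every column in the dataset, and `REQUIRED_COLUMNS` lists the columns a row must have to be kept.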

In [12]:
PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]
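
As a rough illustration of how these constants might be used later, the sketch below drops rows without a target and splits a toy frame into features and target. The toy rows are made up, and the names `X` and `y` are illustrative assumptions rather than names taken from the notebook.

```python
import pandas as pd

PREDICTION_TARGET = 'Age'
REQUIRED_COLUMNS = [PREDICTION_TARGET]

# made-up rows; the second one is missing the target
toy = pd.DataFrame({
    'Sex': ['M', 'F', 'I'],
    'Length': [1.4, 1.2, 0.9],
    'Age': [9, None, 6],
})

# rows without the target cannot be used for supervised training
toy = toy.dropna(subset=REQUIRED_COLUMNS)

X = toy.drop(columns=[PREDICTION_TARGET])  # features
y = toy[PREDICTION_TARGET]                 # regression target
print(X.shape, y.tolist())
```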


## Define Helper Functions

These functions will help us clean, augment, and normalize the data.

### DataFrame Display Function

This will be used throughout the notebook to display the DataFrame.


In [11]:
def display_df(df:pd.DataFrame, show_missing:bool=True, show_info:bool=False, show_distinct:bool=False) -> None:
    """Display the DataFrame.
    
    :param df: The data.
    :type df: pd.DataFrame
    :param show_missing: Whether to show missing data counts.
    :type show_missing: bool
    :param show_info: Whether to show info on the data.
    :type show_info: bool
    :param show_distinct: Whether to show distinct values.
    :type show_distinct: bool
    """
    print(f'DataFrame shape: {df.shape}')
    print(f'First 5 rows:\n{df.head()}') # preview the first 5 rows
    if show_missing:
        # count missing values per column
        print(f'Missing values:\n{df.isna().sum()}')
    if show_info:
        # df.info() prints directly and returns None, so call it on its own
        print('Info:')
        df.info()
    if show_distinct:
        for col in df:
            print(f'{col} distinct values:\n{df[col].unique()}')


### Dataset Cleanup

Dirty data is no good for crabs. Let's clean it up.

![How to clean a crab](https://www.recipetineats.com/wp-content/uploads/2021/07/Cleaning-and-preparing-crab-template-2.jpg)

##### Crab Cleaning Steps

- Drop rows missing required columns.
- Drop rows missing too many values.
- Convert natural booleans.
    - E.g., `No/Yes` or `Negative/Positive` to `0/1`.
- Fill nulls for typically-binary variables with the median.
- Fill nulls for typically-continuous variables with the median.
- Fill nulls for typically-categorical variables with default values.
    - E.g., `Unknown`, but domain knowledge is required here.
- Fill nulls for typically-text variables with empty strings.
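
The fill steps above can be sketched on a toy frame. The column names match the dataset, but the rows are made up, and using `'I'` (Indeterminate) as the categorical default is an assumption rather than the notebook's choice.

```python
import numpy as np
import pandas as pd

# made-up rows with one missing value per column type
toy = pd.DataFrame({
    'Sex': ['M', None, 'F'],
    'Length': [1.0, np.nan, 3.0],
    'Age': [8, 10, 12],
})

# continuous: fill with the column median (less sensitive to outliers than the mean)
toy['Length'] = toy['Length'].fillna(toy['Length'].median())

# categorical: fill with a default value (assumed here: 'I' for Indeterminate)
toy['Sex'] = toy['Sex'].fillna('I')

print(toy)
```

Picking the default for a categorical column is where domain knowledge comes in: a placeholder such as `Unknown` adds a new category, while reusing an existing one changes that category's distribution.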


In [21]:
def data_cleanup(df:pd.DataFrame) -> pd.DataFrame:
    """Clean up the DataFrame for crabs.

    Update values:
        - Drop rows missing required columns.
        - Drop rows missing too many values.
        - Convert natural booleans.
            - E.g., `No/Yes` or `negative/positive` to `0/1`.
        - Fill nulls for typically-binary variables with the median.
        - Fill nulls for typically-continuous variables with the median.
        - Fill nulls for typically-categorical variables with default values.
            - E.g., `Unknown`
        - Fill nulls for typically-text variables with empty strings.

    :param df: The data.
    :type df: pd.DataFrame
    :return: The cleaned data.
    :rtype: pd.DataFrame
    """
    # remove rows missing too many values (keep rows with at least 3 non-null values)
    df = df.dropna(thresh=3)

    # remove rows missing required columns
    df = df.dropna(subset=REQUIRED_COLUMNS)

    # convert natural booleans
    df = df.replace(to_replace={
        False: 0, True: 1,
        'negative': 0, 'positive': 1,
        'No': 0, 'Yes': 1,
    })

    # fill nulls for typically-binary variables with the median
    # df['  '] = df['  '].fillna(df['  '].median())

    # fill nulls for typically-continuous variables with the median
    # df['  '] = df['  '].fillna(df['  '].median())

    # fill nulls for typically-categorical variables with default values
    # for example, fill nulls in '  ' with 'Unknown'
    # df['  '] = df['  '].fillna('Unknown')

    # fill nulls for typically-text variables with empty strings
    # for example, fill nulls in '  ' with ''
    # df['  '] = df['  '].fillna('')

    return df


### Data Augmentation

Crabs are complex creatures. Let's engineer some features to help our model find the best crabs for harvest.

We'll need to use domain knowledge to extract features from raw data which can be useful in training our model.

An example of this would be combining `Shucked Weight` and `Viscera Weight` into a single `Edible Weight` feature.


In [20]:
def data_augmentation(df:pd.DataFrame) -> pd.DataFrame:
    """Add new features to the DataFrame.

    Driven by domain knowledge.

    :param df: The data.
    :type df: pd.DataFrame
    :return: The data with new features.
    :rtype: pd.DataFrame
    """
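    # work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # add new features by combining existing features
    # edible weight: the shucked (shell-free) weight minus the viscera
    df['Edible Weight'] = df['Shucked Weight'] - df['Viscera Weight']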
    return df


### Data Normalization

Crabs come in all shapes and sizes. Let's normalize the data to help our model.


In [17]:
def data_normalization(df:pd.DataFrame, a:float=-1., b:float=1.) -> pd.DataFrame:
    """Min-max scale each column of the DataFrame to the range [a, b].
    
    :param df: The data.
    :type df: pd.DataFrame
    :param a: The lower bound of the scaled range.
    :type a: float
    :param b: The upper bound of the scaled range.
    :type b: float
    :return: The normalized data.
    :rtype: pd.DataFrame
    """
    # scale each column to [a, b]; assumes numeric columns where max > min
    df = a + ((df - df.min()) * (b - a)) / (df.max() - df.min())
    return df
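
The sketch below applies the same min-max formula to a toy frame. Restricting the scaling to numeric columns and leaving constant columns untouched are assumptions added here, since the formula cannot be applied to text columns such as `Sex` and divides by zero when a column holds a single value. The name `min_max_scale` and the toy rows are illustrative.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame, a: float = -1., b: float = 1.) -> pd.DataFrame:
    """Scale numeric columns to [a, b]; leave other columns as-is (illustrative sketch)."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include='number').columns
    for col in numeric_cols:
        col_min, col_max = out[col].min(), out[col].max()
        if col_max == col_min:
            continue  # constant column: skip to avoid dividing by zero
        out[col] = a + (out[col] - col_min) * (b - a) / (col_max - col_min)
    return out

# made-up rows: one text column, one numeric column
toy = pd.DataFrame({'Sex': ['M', 'F', 'I'], 'Length': [1.0, 2.0, 3.0]})
scaled = min_max_scale(toy)
print(scaled)
```

With the default range, the smallest `Length` maps to `-1`, the largest to `1`, and the midpoint to `0`, while `Sex` passes through unchanged.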
