# Time for Crab

Crabs are here, and they're mighty tasty.

This notebook predicts the age of mud crabs from their physical features (a regression task).



### Reasons for Choosing This Dataset

##### Reasons given by the [dataset on Kaggle](https://www.kaggle.com/datasets/sidhus/crab-age-prediction):

> Its a great starting point for classical regression analysis and feature engineering and understand the impact of feature engineering in Data Science domain.
> For a commercial crab farmer knowing the right age of the crab helps them decide if and when to harvest the crabs. Beyond a certain age, there is negligible growth in crab's physical characteristics and hence, it is important to time the harvesting to reduce cost and increase profit.

##### My reasons:

- Highly-rated tabular data with a natural prediction target (Age).
- Regression task since I like a challenge.
- Features are easy to conceptualize, which makes feature engineering approachable.

## Dataset Columns

| Column Name    | Description                                                                                         |
|----------------|-----------------------------------------------------------------------------------------------------|
| Sex            | Gender of the Crab - Male, Female and Indeterminate.                                                |
| Length         | Length of the Crab (in Feet; 1 foot = 30.48 cms)                                                    |
| Diameter       | Diameter of the Crab (in Feet; 1 foot = 30.48 cms)                                                  |
| Height         | Height of the Crab (in Feet; 1 foot = 30.48 cms)                                                    |
| Weight         | Weight of the Crab (in ounces; 1 Pound = 16 ounces)                                                 |
| Shucked Weight | Weight without the shell (in ounces; 1 Pound = 16 ounces)                                           |
| Viscera Weight | Weight of the viscera, the internal organs in the abdomen (in ounces; 1 Pound = 16 ounces)          |
| Shell Weight   | Weight of the Shell (in ounces; 1 Pound = 16 ounces)                                                |
| Age            | Age of the Crab (in months)                                                                         |
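
The units in the table are imperial, so converting them is a common first step. The snippet below is a minimal sketch of that conversion using the factors stated in the table. The sample rows and the `Length_cm` and `Weight_lb` column names are invented for illustration and are not part of the dataset.

```python
import pandas as pd

FEET_TO_CM = 30.48      # 1 foot = 30.48 cm (from the table above)
OUNCES_PER_POUND = 16   # 1 pound = 16 ounces (from the table above)

# Hypothetical sample rows, invented for illustration only.
sample = pd.DataFrame({
    'Length': [1.5, 0.5],   # feet
    'Weight': [24.0, 8.0],  # ounces
})

# Add converted columns next to the originals (new column names are assumptions).
converted = sample.assign(
    Length_cm=sample['Length'] * FEET_TO_CM,
    Weight_lb=sample['Weight'] / OUNCES_PER_POUND,
)
print(converted)
```

Keeping the original columns alongside the converted ones makes it easy to check the conversion by eye.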


## Import Libraries

In [3]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

try:
    # for visual mode. `pip install -e .[visual]`
    import matplotlib.pyplot as plt
    import matplotlib
    %matplotlib inline
    import seaborn as sns
except ModuleNotFoundError:
    plt = None
    sns = None
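
Because `plt` and `sns` fall back to `None` when the visual extras are not installed, any plotting code has to check for that first. The following is a rough sketch of one way to do so. The helper name `plot_histogram` is hypothetical, and the `Agg` backend is an assumption made so the sketch also runs outside a notebook.

```python
try:
    import matplotlib
    matplotlib.use('Agg')  # assumption: non-interactive backend, so no display is needed
    import matplotlib.pyplot as plt
except ModuleNotFoundError:
    plt = None


def plot_histogram(values, title):
    """Draw a histogram if matplotlib is available (hypothetical helper).

    Returns True if a plot was drawn, False if plotting was skipped.
    """
    if plt is None:
        print(f'Skipping plot "{title}": matplotlib is not installed.')
        return False
    fig, ax = plt.subplots()
    ax.hist(values, bins=5)
    ax.set_title(title)
    plt.close(fig)  # release the figure; in a notebook you would show it instead
    return True


# Made-up ages, for illustration only.
drawn = plot_histogram([6, 9, 6, 10, 11], 'Age (months)')
```

Returning a flag instead of raising keeps the non-visual mode usable without any plotting dependencies.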

## Define Constants

`TARGET_LABEL` names the dataset column that we will predict.

In [6]:
#CONSTANT = 'value'
TARGET_LABEL = 'Age'    # 'Age' is predicted
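
As a rough illustration of how such a constant is typically used, the sketch below separates a toy frame into features and target. The rows are invented, and the variable names `X` and `y` are conventions assumed here rather than names taken from this notebook.

```python
import pandas as pd

TARGET_LABEL = 'Age'  # same constant as defined above

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    'Length': [1.4, 0.9, 1.1],
    'Weight': [24.6, 5.4, 7.7],
    'Age': [9, 6, 6],
})

X = toy.drop(columns=[TARGET_LABEL])  # features: every column except the target
y = toy[TARGET_LABEL]                 # target: the column to predict

print(X.columns.tolist())
print(y.tolist())
```

Referring to the target through one constant means the column name only has to change in a single place.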

## DataFrame Display Function


In [7]:
def display_df(df:pd.DataFrame, show_missing:bool=True, show_info:bool=False) -> None:
    """Display the DataFrame.
    
    :param df: The data.
    :type df: pd.DataFrame
    :param show_missing: Whether to show missing data counts.
    :type show_missing: bool
    :param show_info: Whether to show info on the data.
    :type show_info: bool
    """
    print(f'DataFrame shape: {df.shape}')
    print(f'First 5 rows:\n{df.head()}') # preview the first 5 rows
    if show_missing:
        # count the missing (NaN) values in each column
        print(f'Missing values:\n{df.isna().sum()}')
    if show_info:
        print('Info:')
        df.info()  # info() prints its own report and returns None
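
To make the missing-value report concrete, here is a small sketch of what `df.isna().sum()` produces. The toy frame is invented for illustration and only borrows column names from the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with a couple of gaps, invented for illustration only.
toy = pd.DataFrame({
    'Sex': ['M', 'F', None],
    'Length': [1.4, np.nan, 1.1],
    'Age': [9, 6, 6],
})

# isna() marks each missing cell as True; sum() then counts them per column.
missing = toy.isna().sum()
print(missing)
```

The result is a Series indexed by column name, so a single column's count can be read with `missing['Length']`.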

## Dataset Cleanup


In [8]:
def data_disposal(df:pd.DataFrame) -> pd.DataFrame:
    """Clean up the DataFrame for crabs.
    
    :param df: The data.
    :type df: pd.DataFrame
    :return: The data without disposals.
    :rtype: pd.DataFrame
    """
    # remove rows missing any required value
    # assumption: every column listed under "Dataset Columns" is required
    required_fields = [TARGET_LABEL, 'Sex', 'Length', 'Diameter', 'Height', 'Weight',
                       'Shucked Weight', 'Viscera Weight', 'Shell Weight']
    print(f'Dropping the rows missing required columns: {required_fields}')
    df = df.dropna(subset=required_fields)
    return df
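
The effect of `dropna(subset=...)` can be seen on a toy frame. This is a minimal sketch with invented rows. It shows that only gaps in the listed columns cause a row to be dropped, while gaps elsewhere are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    'Length': [1.4, np.nan, 1.1, 0.9],
    'Shell Weight': [6.7, 1.5, np.nan, 2.0],
    'Age': [9, 6, 6, np.nan],
})

# Drop a row only if 'Age' or 'Length' is missing in it.
kept = toy.dropna(subset=['Age', 'Length'])
print(kept)
```

Row 1 (missing `Length`) and row 3 (missing `Age`) are removed, while row 2 survives because `Shell Weight` is not in the subset.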