# Exploratory Data Analysis

*In which we learn more about crabs and overfit on purpose.*


## Introduction

Crabs are here, and they're mighty tasty.

Knowing how old they are helps identify full-sized crabs that are ready for the pot.
 
![Crab](https://storage.googleapis.com/kaggle-datasets-images/1734027/2834512/a0e345e63b3e426ddb489d07cc0090cd/dataset-cover.jpg?t=2021-11-21-06-55-37)

The task is regression: predict the age of a mud crab from its physical features.

## Reasons for Choosing This Dataset

A good dataset is the foundation of a good model.

#### My reasons:

- Highly-rated tabular data with a natural prediction target (Age).
- Regression task since I like a challenge.
- Features that are easy to reason about, which helps with feature engineering.
- Small enough to iterate on quickly.
- Crabs are cool.

#### Reasons given by the [dataset on Kaggle](https://www.kaggle.com/datasets/sidhus/crab-age-prediction):

> Its a great starting point for classical regression analysis and feature engineering and understand the impact of feature engineering in Data Science domain.
> For a commercial crab farmer knowing the right age of the crab helps them decide if and when to harvest the crabs. Beyond a certain age, there is negligible growth in crab's physical characteristics and hence, it is important to time the harvesting to reduce cost and increase profit.

### Dataset Columns

The dataset contains the following columns:

---

| Column Name    | Description                                                                                         |
|----------------|-----------------------------------------------------------------------------------------------------|
| Sex            | Gender of the Crab - Male, Female and Indeterminate.                                                |
| Length         | Length of the Crab (in Feet; 1 foot = 30.48 cm)                                                     |
| Diameter       | Diameter of the Crab (in Feet; 1 foot = 30.48 cm)                                                   |
| Height         | Height of the Crab (in Feet; 1 foot = 30.48 cm)                                                     |
| Weight         | Weight of the Crab (in ounces; 1 Pound = 16 ounces)                                                 |
| Shucked Weight | Weight without the shell (in ounces; 1 Pound = 16 ounces)                                           |
| Viscera Weight | Weight of the viscera, the internal organs inside the body (in ounces; 1 Pound = 16 ounces)         |
| Shell Weight   | Weight of the Shell (in ounces; 1 Pound = 16 ounces)                                                |
| Age            | Age of the Crab (in months)                                                                         |
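
The conversion factors in the table can be applied directly if metric or pound values are easier to reason about. The sketch below is illustrative only: the measurements are made up (not rows from the real dataset), and the `length_cm` and `weight_lb` column names are assumptions.

```python
import pandas as pd

FEET_TO_CM = 30.48      # from the table: 1 foot = 30.48 cm
OUNCES_PER_POUND = 16   # from the table: 1 pound = 16 ounces

# Hypothetical measurements, not rows from the real dataset.
sample = pd.DataFrame({'Length': [1.5, 0.75], 'Weight': [24.0, 8.0]})

# assign() returns a new frame, leaving the original columns untouched.
converted = sample.assign(
    length_cm=sample['Length'] * FEET_TO_CM,
    weight_lb=sample['Weight'] / OUNCES_PER_POUND,
)
print(converted)
```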


### Define Constants

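`PREDICTION_TARGET` names the dataset column we will predict. `DATASET_COLUMNS` lists every column we expect in the file, and `REQUIRED_COLUMNS` lists the columns a row must have to be kept during cleanup.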


In [8]:
DATASET_FILE = '../datasets/CrabAgePrediction.csv' # 'https://www.kaggle.com/sidhus/crab-age-prediction/download' or './data/CrabAgePrediction.csv'
PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]
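
These constants can also serve as a quick sanity check once the data is loaded. The sketch below assumes a small helper, `missing_columns`, which is hypothetical and not part of the notebook, and runs it on an empty toy frame rather than the real CSV.

```python
import pandas as pd

# Mirrors the constants defined in the cell above.
PREDICTION_TARGET = 'Age'
DATASET_COLUMNS = ['Sex', 'Length', 'Diameter', 'Height', 'Weight',
                   'Shucked Weight', 'Viscera Weight', 'Shell Weight',
                   PREDICTION_TARGET]


def missing_columns(df: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]


# Toy frame with only three of the expected columns (hypothetical).
toy = pd.DataFrame(columns=['Sex', 'Length', 'Age'])
print(missing_columns(toy, DATASET_COLUMNS))
```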


### Importing Libraries

In [9]:
import numpy as np
import pandas as pd

#from sklearn.svm import SVC
#from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
#from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

try:
    # for visual mode. `pip install -e .[visual]`
    import matplotlib.pyplot as plt
    import matplotlib
    %matplotlib inline
    import seaborn as sns
except ModuleNotFoundError:
    plt = None
    sns = None

pd.set_option('mode.copy_on_write', True)
pd.set_option('future.no_silent_downcasting', True)


### DataFrame Display Function

This will be used throughout the notebook to display the DataFrame.


In [10]:
def display_df(df:pd.DataFrame, show_info:bool=True, show_missing:bool=False, show_distinct:bool=False) -> None:
    """Display the DataFrame.

    :param df: The data.
    :param show_info: Whether to show info on the data.
    :param show_missing: Whether to show missing data counts.
    :param show_distinct: Whether to show distinct values.
    """
    print(f'DataFrame shape: {df.shape}')
    print(f'First 5 rows:\n{df.head()}') # preview the first 5 rows
    if show_info:
        print('Info:')
        df.info()  # info() prints its own report and returns None
    if show_missing:
        print(f'Missing values:\n{df.isna().sum()}')
    if show_distinct:
        for col in df:
            print(f'{col} distinct values:\n{df[col].unique()[0:10]}')
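
One detail worth knowing when printing `DataFrame.info()`: it writes its report straight to stdout and returns `None`, so interpolating the call into an f-string shows `None` after the report. If the text is needed as a string, a minimal sketch (on a made-up toy frame) is to redirect the report into a buffer with the `buf` parameter:

```python
import io

import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({'Sex': ['M', 'F'], 'Age': [9, 11]})

buffer = io.StringIO()
result = toy.info(buf=buffer)    # the report goes into the buffer, not stdout
info_text = buffer.getvalue()    # the report as a plain string

print(f'Info:\n{info_text}')
```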


### Dataset Cleanup

Dirty data is no good for crabs. Let's clean it up.

![How to clean a crab](https://www.recipetineats.com/wp-content/uploads/2021/07/Cleaning-and-preparing-crab-template-2.jpg)

#### Crab Cleaning Steps

- Drop rows missing required columns.
- Drop rows missing too many values.
- Convert natural booleans to `0/1`.
    - E.g., `No/Yes` or `Negative/Positive` become `0/1`.
- Fill nulls for typically-binary variables with the median.
- Fill nulls for typically-continuous variables with the median.
- Fill nulls for typically-categorical variables with default values.
    - E.g., `Unknown`, but domain knowledge is required here.
- Fill nulls for typically-text variables with empty strings.
- One-hot encode categorical variables.


In [11]:
def data_cleanup(df:pd.DataFrame) -> pd.DataFrame:
    """Clean up the DataFrame for crabs.

    Update values:
        - Drop rows missing required columns.
        - Drop rows missing too many values.
        - Convert natural booleans to `0/1`.
            - E.g., `No/Yes` or `Negative/Positive` become `0/1`.
        - Fill nulls for typically-binary variables with the median.
        - Fill nulls for typically-continuous variables with the median.
        - Fill nulls for typically-categorical variables with default values.
            - E.g., `Unknown`
        - Fill nulls for typically-text variables with empty strings.
        - One-hot encode categorical variables.

    :param df: The data.
    :return: The cleaned data.
    """
    # remove rows missing too many values (keep rows with at least 3 non-null values)
    df = df.dropna(thresh=3)

    # remove rows missing required columns
    df = df.dropna(subset=REQUIRED_COLUMNS)

    # convert natural booleans
    df = df.replace(to_replace={
        False: 0, True: 1,
        'negative': 0, 'positive': 1,
        'No': 0, 'Yes': 1,
    })

    # fill nulls for typically-binary variables with the median
    # df['  '] = df['  '].fillna(df['  '].median())

    # fill nulls for typically-continuous variables with the median
    # df['  '] = df['  '].fillna(df['  '].median())

    # fill nulls for typically-categorical variables with default values
    # df['  '] = df['  '].fillna('Unknown')

    # fill nulls for typically-text variables with empty strings
    # df['  '] = df['  '].fillna('')

    # one-hot encode categorical variables
    df = pd.get_dummies(df, columns=['Sex'])

    # return the cleaned, one-hot encoded data
    return df
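
The fill steps are commented out in `data_cleanup` because this dataset has no columns that need them. To show what they would do, here is a minimal sketch on a made-up toy frame. The `Habitat` and `Notes` columns and the `thresh=2` cutoff are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'Habitat' and 'Notes' do not exist in the crab dataset.
toy = pd.DataFrame({
    'Weight':  [10.0, np.nan, 30.0, np.nan, 5.0],
    'Habitat': ['Mud', None, 'Sand', None, 'Mud'],
    'Notes':   ['ok', 'soft shell', None, None, 'tiny'],
    'Age':     [9.0, 11.0, 12.0, np.nan, np.nan],
})

# Keep rows with at least 2 non-null values, then drop rows missing the target.
toy = toy.dropna(thresh=2).dropna(subset=['Age'])

# Fill the remaining nulls by variable type.
toy = toy.assign(
    Weight=toy['Weight'].fillna(toy['Weight'].median()),  # continuous: median
    Habitat=toy['Habitat'].fillna('Unknown'),             # categorical: default
    Notes=toy['Notes'].fillna(''),                        # text: empty string
)
print(toy)
```

The fully empty row is removed by `thresh`, and the row with measurements but no `Age` is removed by `subset`, so three rows remain before filling.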


## Exploratory Data Analysis

Get that pot of water ready. It's crab cookin' time.

![Crab pot](https://chefscornerstore.com/product_images/uploaded_images/steaming-crabs.jpg)

### Load the Data from CSV

Analyzing the output here will help us revise our data cleanup and augmentation functions.

The initial data is in CSV format. We will load it into a pandas DataFrame.  
We will ultimately save it to a JSON file for easier loading in the next steps.
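
The CSV-to-JSON hand-off mentioned above might look roughly like the sketch below. It uses a made-up toy frame and a temporary directory; the file name `crabs_toy.json` and the `orient='records'` layout are assumptions, not choices confirmed by this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for the loaded crabs.
toy = pd.DataFrame({'Sex': ['M', 'F'], 'Length': [1.4, 0.9], 'Age': [9, 11]})

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / 'crabs_toy.json'   # hypothetical file name
    toy.to_json(json_path, orient='records')       # one JSON object per row
    restored = pd.read_json(json_path, orient='records')

print(restored)
```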


In [12]:
crabs = pd.read_csv(DATASET_FILE)  # load the data
display_df(crabs, show_info=True, show_missing=True, show_distinct=True)


FileNotFoundError: [Errno 2] No such file or directory: './datasets/CrabAgePrediction.csv'
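
A relative path like this resolves against the notebook's current working directory, so the same cell can succeed or fail depending on where the kernel was started. One way to make the failure easier to diagnose is to check a few candidate locations before calling `pd.read_csv`. This is a sketch only: the `first_existing` helper is hypothetical, and the candidate list simply reuses the paths that appear in this notebook.

```python
from pathlib import Path

# Candidate locations taken from the paths mentioned in this notebook.
CANDIDATE_PATHS = [
    '../datasets/CrabAgePrediction.csv',
    './datasets/CrabAgePrediction.csv',
    './data/CrabAgePrediction.csv',
]


def first_existing(paths):
    """Return the first candidate that exists as a file, or None."""
    for candidate in paths:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


dataset_path = first_existing(CANDIDATE_PATHS)
if dataset_path is None:
    print(f'Dataset not found relative to {Path.cwd()}; tried {CANDIDATE_PATHS}')
else:
    print(f'Loading {dataset_path}')
```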

### Clean Up the Data

Let's clean up the data and display it again.

#### Missing Values

No missing values! We're off to a good start with this dataset. Will crab be on the menu tonight?

#### Non-numeric Data

It looks like 'Sex' is a categorical variable. We'll need to convert this to a numeric value. Let's use **one-hot encoding**.
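
As a minimal sketch of what one-hot encoding does, the toy frame below assumes the sexes are coded `M`, `F`, and `I`, which may not match the real file. `pd.get_dummies` replaces the `Sex` column with one indicator column per category. Passing `dtype=int` is a choice made here to get `0/1` values; the default in recent pandas versions is boolean.

```python
import pandas as pd

# Hypothetical toy data; the 'M'/'F'/'I' codes are an assumption.
toy = pd.DataFrame({'Sex': ['M', 'F', 'I', 'M'], 'Age': [9, 11, 6, 10]})

# Each category becomes its own 0/1 column; the original 'Sex' column is dropped.
encoded = pd.get_dummies(toy, columns=['Sex'], dtype=int)
print(encoded)
```

Each row has exactly one of `Sex_F`, `Sex_I`, and `Sex_M` set to 1.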
