# Creating an ML-Ready Dataset

The penguins dataset is a popular dataset for classification tasks. It contains information about different species of penguins, including their physical characteristics and the island they were found on. In this notebook, we will create a machine learning-ready dataset from the penguins dataset.

I have made some changes to the original penguins dataset to make our preprocessing more interesting, but the overall structure matches the version available online:

- bill_length_mm: Bill length (mm) of the penguin.
- bill_depth_mm: Bill depth (mm) of the penguin.
- flipper_length_mm: Flipper length (mm) of the penguin.
- body_mass_g: Body mass (g) of the penguin.
- species: Species of the penguin (Adelie, Chinstrap, Gentoo).
- island: Island where the penguin was found (Biscoe, Dream).

NOTE: This is a cute example, but the principles of data preprocessing are the same for any dataset or domain. The goal is to make the data ready for machine learning algorithms, which require numerical input and clean data.

For H.03, we are introducing Python scripts, which are a great way to organize your code. You will write your functions in the `preprocessing.py` file and then import them into this notebook. This is a good practice for larger projects, as it keeps your code organized and makes it easier to maintain. When you submit your code, you will only need to submit the `preprocessing.py` file.

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
from IPython.display import display, Markdown
import sys
sys.path.append("..")

df = pd.read_csv("https://storage.googleapis.com/mbai-data/train_dataset.csv")
NUMERICAL_COLUMNS = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

## EDA

The first step in any ML project is to perform Exploratory Data Analysis (EDA). This involves examining the dataset to understand its structure, identify any missing values, and visualize the data. In this case, we will use the `pandas` library to load and explore the dataset.

`df.head()` will display the first 5 rows of the dataset, allowing us to see the column names and the first few entries. This is useful for getting a quick overview of the data.

`df.describe()` will provide a summary of the dataset, including count, mean, standard deviation, min, max, and quartiles for each numerical column. This is useful for understanding the distribution and range of values in the dataset.

In [None]:
display(df.head())
display(df.describe())

Looks like we have a dataset with 4 numerical columns and 2 categorical columns. Let's plot histograms of the numerical columns, since the shape of their distributions will influence the methods we choose later.

In [None]:
from plotting import plot_2x2_histograms

fig = plot_2x2_histograms(df, NUMERICAL_COLUMNS)
fig.update_layout(title_text="Histograms of Penguin Measurements", showlegend=False, template="plotly_white")
fig.show()

These look approximately normally distributed, so we can use methods that assume normality.

## Identify and Fill Missing Values

Looking at the output from `df.head()`, we can immediately see that there are some missing values in the dataset. `df.info()` provides a summary of the dataset, including the number of non-null values in each column. Comparing those counts with the total number of rows shows which columns have missing values, and how many.
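
As a complement to reading the non-null counts, the missing values per column can also be counted directly. The snippet below is a minimal sketch on a made-up toy frame (the values are not from the course dataset); it only assumes `pandas` and `numpy` are available.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the penguins data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
        "body_mass_g": [3750.0, 3800.0, np.nan, np.nan],
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    }
)

# isna() marks missing cells as True; summing the booleans counts them per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running the same `isna().sum()` call on `df` would give the per-column counts for the real dataset.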

In [None]:
display(df.head())
display(Markdown("------"))
df.info()  # prints its summary directly and returns None, so it needs no display()
display(Markdown("------"))

We can see that all of the numerical columns have **35 missing values**. Let's use random forest imputation to fill in these missing values. In `preprocessing.py`, you will see a function called `impute_numerical_values` that you should fill out.
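
To make the idea concrete, here is one possible sketch of random forest imputation. It is an assumption on my part that scikit-learn is available and that `IterativeImputer` with a `RandomForestRegressor` estimator is an acceptable approach; the helper name `impute_with_random_forest` and the toy data are hypothetical, and this is not necessarily how `impute_numerical_values` is expected to be written.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer


def impute_with_random_forest(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: predict each column's NaNs from the other columns."""
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=0),
        max_iter=5,
        random_state=0,
    )
    return imputer.fit_transform(values)


# Made-up measurements with a few entries knocked out at random.
rng = np.random.default_rng(0)
toy = rng.normal(loc=[45.0, 17.0, 200.0, 4200.0], scale=[5.0, 2.0, 14.0, 800.0], size=(60, 4))
toy_missing = toy.copy()
toy_missing[rng.integers(0, 60, size=10), rng.integers(0, 4, size=10)] = np.nan

filled = impute_with_random_forest(toy_missing)
print(np.isnan(toy_missing).sum(), "->", np.isnan(filled).sum())
```

The imputer leaves observed values untouched and only replaces the `NaN` entries, which is why the result can be assigned straight back to the numerical columns.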

In [None]:
from preprocessing import impute_numerical_values

df[NUMERICAL_COLUMNS] = impute_numerical_values(df[NUMERICAL_COLUMNS].to_numpy())
display(df.head())

That's much better! Now we can see that the missing values have been replaced with estimates.

## Scale Numerical Data

There are several ways to scale data. In class, we covered standardization and min-max scaling, as well as why scaling matters for your machine learning models. In this case, we will implement both using NumPy. We will ultimately use standard scaling for all numerical columns.

Please note: This is a good opportunity to learn some basic NumPy functionality (`mean()`, `std()`, `min()`, and `max()`). There are packages that will do scaling for you, but it is important to understand how they work.
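
For reference, standardization computes `(x - mean) / std` and min-max scaling computes `(x - min) / (max - min)`. The sketch below shows both on made-up values; the helper names `standardize_sketch` and `minmax_sketch` are hypothetical and are not the functions in `preprocessing.py`.

```python
import numpy as np


def standardize_sketch(values):
    """Hypothetical helper: rescale to mean 0 and standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def minmax_sketch(values):
    """Hypothetical helper: rescale linearly onto the interval [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


flipper_lengths = np.array([181.0, 186.0, 195.0, 210.0, 230.0])  # made-up values
z_scores = standardize_sketch(flipper_lengths)
unit_range = minmax_sketch(flipper_lengths)
print(z_scores.round(2))
print(unit_range.round(2))
```

Two details are worth keeping in mind: NumPy's `std()` uses the population formula (`ddof=0`) by default, and both helpers divide by zero if a column is constant, so a real implementation may want to guard against that case.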

In [None]:
from preprocessing import standard_scale_with_numpy, minmax_scale_with_numpy

standard_numpy = df.copy()
minmax_numpy = df.copy()

for feature in NUMERICAL_COLUMNS:
    standard_numpy[feature] = standard_scale_with_numpy(df[feature])
    minmax_numpy[feature] = minmax_scale_with_numpy(df[feature])

display(Markdown("#### Standard Scaled Data"))
display(standard_numpy.head())
display(Markdown("#### MinMax Scaled Data"))
display(minmax_numpy.head())


## Encode Species Variable

The species variable is categorical, so we need to encode it as a numerical variable. We will use one-hot encoding to create binary columns for each species. This is a common technique for handling categorical variables in machine learning.

Pandas has a built-in function for one-hot encoding, `pd.get_dummies()`, which creates a new binary column for each unique value in the species column. We will also drop the original species column after encoding.
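
Here is a minimal sketch of `pd.get_dummies()` on a made-up toy frame. Passing `columns=["species"]` and `dtype=int` is one choice among several, and whether `generate_one_hot_encoding` should use the same arguments is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
        "body_mass_g": [3750.0, 5000.0, 3700.0, 3800.0],
    }
)

# columns=["species"] replaces that column with one indicator column per value;
# dtype=int yields 0/1 instead of the boolean default in recent pandas versions.
encoded = pd.get_dummies(toy, columns=["species"], dtype=int)
print(encoded.columns.tolist())
```

When the `columns` argument is used, the original `species` column is removed automatically, so no separate drop is needed in this variant.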

In [None]:
from preprocessing import generate_one_hot_encoding

df = standard_numpy.copy()
df = generate_one_hot_encoding(df)

display(Markdown("#### One Hot Encoded Data"))
display(df.head())

## Create Target Variable

Creating a target variable is an important step in preparing your dataset for machine learning. In this case, we will turn the existing `island` column into a binary target that indicates whether the penguin was found on `Biscoe` or `Dream`. This represents a binary classification problem, where we want to predict the island based on the other features in the dataset.
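
A binary target can be built with a simple comparison. In the sketch below, mapping `Dream` to 1 and `Biscoe` to 0 is an assumption made for illustration; the assignment may expect the opposite convention.

```python
import pandas as pd

# Made-up island labels for illustration only.
islands = pd.Series(["Biscoe", "Dream", "Dream", "Biscoe"], name="island")

# Assumed mapping: Dream -> 1, Biscoe -> 0.
island_binary = (islands == "Dream").astype(int)
print(island_binary.tolist())
```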

In [None]:
from preprocessing import binarize_islands

df['island'] = binarize_islands(df['island'])

display(Markdown("#### Binarized Islands"))
display(df.head())

## Reorder Columns for Convention

In [None]:
from preprocessing import reorder_columns

df = reorder_columns(df)
display(Markdown("#### Reordered Columns"))
display(df.head())

## Submit

In [None]:
from submit import send_notebook

response = send_notebook("./preprocessing.py")
print(response["response"])