# H.03 | Penguins

We'll revisit our penguin friends for H.03. In this homework assignment, you will train classification and regression models to predict a penguin's species and body mass. We will use NumPy and pandas throughout, Plotly for visualization, and scikit-learn only to report evaluation metrics.

In [9]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import plotly.express as px

# Read in data and drop rows with missing values.
DATASET_URL = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv"
df = pd.read_csv(DATASET_URL)
df = df.dropna()

# Select columns that we want to use.
df = df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "species"]]

# Drop rows with species "Gentoo" to make the dataset binary (Adelie vs Chinstrap).
df = df.query("species != 'Gentoo'")

# Display the first few rows of the dataset.
df.head()



| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | species |
|---|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 | Adelie |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 | Adelie |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 | Adelie |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 | Adelie |
| 5 | 39.3 | 20.6 | 190.0 | 3650.0 | Adelie |


## Create `X` and `y`

Recall that in supervised learning, we have a dataset consisting of both input features and output labels. The goal is to learn a model that can predict the output labels (y) from the input features (X).

In [10]:
# Create a feature matrix.
feature_df = df.drop(columns=["species"]).values

# Create a label vector.
species_labels = df["species"].values

## Binarize the Target

We will begin by binarizing the target variable. The targets include a "Chinstrap" and "Adelie" class. We will create a new target variable that is 1 if the penguin is a "Chinstrap" and 0 if the penguin is an "Adelie". Please write a function `binarize` in machine_learning.py that takes in a list of species and returns a list of 1s and 0s.

In [11]:
from machine_learning import binarize

# Binarize the labels.
binarized_labels = binarize(species_labels)

## Split the Data

Next, we will split the dataset into a training set and a testing set. We will not use a validation set in this homework assignment because we do not need to tune any hyperparameters. We will use the training set to train our models and the testing set to evaluate them.

Please write a function `split_data` using the instructions in machine_learning.py.
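
The exact requirements live in machine_learning.py, which is not shown here, so the following is only a sketch of one common approach: shuffle the row indices and hold out a fraction for testing. The `test_size` and `seed` parameters and their defaults are assumptions, not part of the assignment.

```python
import numpy as np

def split_data(x, y, test_size=0.2, seed=42):
    """Shuffle rows and split into train/test parts.

    test_size and seed are assumed defaults, not taken from the assignment.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))          # shuffled row indices
    n_test = int(round(test_size * len(x)))    # number of held-out rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2
x_tr, x_te, y_tr, y_te = split_data(x_demo, y_demo)
print(x_tr.shape, x_te.shape)  # (8, 2) (2, 2)
```

Shuffling before splitting matters here because the rows of the penguin dataset are grouped by species, so an unshuffled split could leave one class out of the test set.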

In [12]:
from machine_learning import split_data

x_train, x_test, y_train, y_test = split_data(feature_df, binarized_labels)

## Standardize x_train and x_test.

Please implement standard scaling to standardize the feature dataframe. Recall that standard scaling is defined as:

$$ x_{\text{standardized}} = \frac{x - \mu}{\sigma} $$

where $\mu$ is the mean of the feature and $\sigma$ is the standard deviation of the feature. Please write a function `standardize` that takes in a training set and a testing set and returns the standardized training set and testing set.
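
A minimal sketch of the formula above follows. It assumes, as is common practice, that $\mu$ and $\sigma$ are computed from the training set only and then reused for the testing set; the guard against a zero standard deviation is also an assumption rather than a stated requirement.

```python
import numpy as np

def standardize(x_train, x_test):
    """Scale each column using the mean and std of the training set."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    # Assumed safeguard: avoid dividing by zero for constant columns.
    std = np.where(std == 0, 1.0, std)
    return (x_train - mean) / std, (x_test - mean) / std

train_demo = np.array([[1.0, 10.0], [3.0, 30.0]])
test_demo = np.array([[2.0, 20.0]])
train_scaled, test_scaled = standardize(train_demo, test_demo)
# Training rows become [-1, -1] and [1, 1]; the test row becomes [0, 0].
print(train_scaled)
print(test_scaled)
```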


In [13]:
from machine_learning import standardize

x_train, x_test = standardize(x_train, x_test)

## KNN

Now that we have a standardized `X` and binarized `y`, let's implement our first model. We will use the K-Nearest Neighbors algorithm to predict the species of a penguin. We will use the training set to train the model and the testing set to evaluate the model.

You will be asked to implement the following common distance metrics:

1. `euclidean_distance`
2. `cosine_distance`

And implement a brute-force K-Nearest Neighbors algorithm:

3. `knn`

Please see more details in machine_learning.py. You may only use the numpy library. You **may not** use the scikit-learn library.
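
For orientation, here is one possible NumPy-only sketch of the three functions. The `knn` signature mirrors the call in the cell below. The rest is assumed: cosine distance is taken to be one minus cosine similarity (for nonzero vectors), and ties in the vote go to the smaller label.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn(x, y, sample, distance_method, k):
    """Brute-force KNN: majority vote among the k closest training rows."""
    distances = np.array([distance_method(row, sample) for row in x])
    nearest = np.argsort(distances)[:k]              # indices of the k closest rows
    votes = np.bincount(np.asarray(y)[nearest].astype(int))
    return int(np.argmax(votes))                     # ties resolve to the smaller label

x_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_demo = np.array([0, 0, 1, 1])
print(knn(x=x_demo, y=y_demo, sample=np.array([5.0, 6.0]),
          distance_method=euclidean_distance, k=3))  # 1
```

With an even `k`, such as `k = 2`, a one-to-one vote is possible, so the tie-breaking rule affects the predictions.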

In [14]:
from sklearn.metrics import classification_report
from machine_learning import knn, euclidean_distance, cosine_distance

euclidean_y_pred = [knn(x = x_train, y = y_train, sample = x_test_sample, distance_method = euclidean_distance, k = 3) for x_test_sample in x_test]
cosine_y_pred = [knn(x = x_train, y = y_train, sample = x_test_sample, distance_method = cosine_distance, k = 2) for x_test_sample in x_test]

print("Euclidean Distance Classification Report")
print(classification_report(y_test, euclidean_y_pred, target_names=["Adelie", "Chinstrap"]))

print("Cosine Distance Classification Report")
print(classification_report(y_test, cosine_y_pred, target_names=["Adelie", "Chinstrap"]))


Euclidean Distance Classification Report
              precision    recall  f1-score   support

      Adelie       0.93      1.00      0.96        25
   Chinstrap       1.00      0.89      0.94        18

    accuracy                           0.95        43
   macro avg       0.96      0.94      0.95        43
weighted avg       0.96      0.95      0.95        43

Cosine Distance Classification Report
              precision    recall  f1-score   support

      Adelie       0.89      1.00      0.94        25
   Chinstrap       1.00      0.83      0.91        18

    accuracy                           0.93        43
   macro avg       0.95      0.92      0.93        43
weighted avg       0.94      0.93      0.93        43



## Linear Regression

Let's implement our second model. We will use linear regression to predict the body_mass_g of a given penguin.

You will be asked to implement the following functions:

1. `linear_regression`

Recall the normal equation:

$$ \theta = (X^T X)^{-1} X^T y $$

2. `linear_regression_predict`

Recall the equation for linear regression:

$$ \hat{y} = X \theta $$

3. `mean_squared_error`

Recall the equation for the mean squared error:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Please see more details in machine_learning.py. You may only use the numpy library. You **may not** use the scikit-learn library.
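
The three equations above can be sketched as follows. This is an illustration, not the reference solution. It assumes that a column of ones is prepended so that the first weight is the intercept, which matches how the visualization cell later reads `linear_regression_weights[0]` and `linear_regression_weights[1]`. Instead of forming $(X^T X)^{-1}$ explicitly, the sketch solves the equivalent linear system with `np.linalg.solve`; the `_add_bias` helper is hypothetical.

```python
import numpy as np

def _add_bias(x):
    """Hypothetical helper: prepend a column of ones for the intercept."""
    return np.column_stack([np.ones(len(x)), x])

def linear_regression(x, y):
    """Solve the normal equation (X^T X) theta = X^T y for theta."""
    xb = _add_bias(x)
    return np.linalg.solve(xb.T @ xb, xb.T @ y)

def linear_regression_predict(x, weights):
    """Compute y_hat = X theta, with the bias column added to X."""
    return _add_bias(x) @ weights

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * x_demo[:, 0] + 1.0            # exact line: y = 2x + 1
theta = linear_regression(x_demo, y_demo)    # approximately [1, 2]
print(theta, mean_squared_error(y_demo, linear_regression_predict(x_demo, theta)))
```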

### First Pass

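In our first pass, let's quantify the model's performance when using only the flipper_length_mm feature. Note that `standardize` was applied to every column of the feature matrix, including body_mass_g, so the mean squared error below is in standardized units, not grams.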

In [15]:
from machine_learning import linear_regression, linear_regression_predict, mean_squared_error

flipper_length_train = x_train[:, -2] # flipper_length_mm
body_mass_train = x_train[:, -1] # body_mass_g

flipper_length_test = x_test[:, -2] # flipper_length_mm
body_mass_test = x_test[:, -1] # body_mass_g

linear_regression_weights = linear_regression(flipper_length_train.reshape(-1, 1), body_mass_train)
body_mass_pred = linear_regression_predict(flipper_length_test.reshape(-1, 1), linear_regression_weights)

mse = mean_squared_error(body_mass_test, body_mass_pred)

print("Mean Squared Error (only using flipper length):", round(mse, 2))

Mean Squared Error (only using flipper length): 0.87


### Visualization of First Pass

It is always a good idea to visualize our results. You don't have to do anything here. You can see that our model is doing a fine job of finding the best possible slope when only using one feature, but there is a lot of unaccounted-for variance! In the second pass, we'll use all the features to predict the body mass of a penguin.

In [16]:
# plot demonstrating the linear regression model.
fig = px.scatter(x=flipper_length_train, y=body_mass_train, template = "plotly_white", title = "Regression")

# plot the regression line.
x_line = np.linspace(flipper_length_train.min(), flipper_length_train.max(), 100)
y_line = linear_regression_weights[1] * x_line + linear_regression_weights[0]
fig.add_scatter(x=x_line, y=y_line, mode="lines", name="Regression Line", line=dict(color="red", dash="dash"))

# plot each test sample.
fig.add_scatter(x=flipper_length_test, y=body_mass_test, mode="markers", name="Test Samples", marker=dict(size=10, color="grey"))

# plot each predicted test sample.
fig.add_scatter(x=flipper_length_test, y=body_mass_pred, mode="markers", name="Predicted Test Samples", marker=dict(size=10, color="red"))


### Second Pass

Our first pass wasn't so bad! But we can probably do better. Now, we'll include all of the remaining features (bill_length_mm, bill_depth_mm) in our model. You will see a lower mean squared error when using all the features. Pause for a moment and consider what that means!

In [17]:
from machine_learning import linear_regression, linear_regression_predict, mean_squared_error

features_train = x_train[:, :-1] # bill_length_mm, bill_depth_mm, flipper_length_mm
body_mass_train = x_train[:, -1] # body_mass_g

features_test = x_test[:, :-1] # bill_length_mm, bill_depth_mm, flipper_length_mm
body_mass_test = x_test[:, -1] # body_mass_g

linear_regression_weights = linear_regression(features_train, body_mass_train)
body_mass_pred = linear_regression_predict(features_test, linear_regression_weights)
mse = mean_squared_error(body_mass_test, body_mass_pred)

print("Mean Squared Error (using flipper length, bill length, and bill depth):", round(mse, 2))

Mean Squared Error (using flipper length, bill length, and bill depth): 0.61


## Logistic Regression

Let's implement our third model. We will use logistic regression to predict the species of a penguin. We will use the training set to train the model and the testing set to evaluate the model.

You will be asked to implement the following functions:

1. `logistic_regression_gradient_descent`

Recall the equation for the gradient of the cost function:

$$ \nabla J(\theta) = \frac{1}{m} X^T (h_{\theta}(X) - y) $$

where $h_{\theta}(X)$ is the sigmoid function:

$$ h_{\theta}(X) = \sigma(X \theta) $$

2. `logistic_regression_predict`

Recall the equation for logistic regression:

$$ \hat{y} = \sigma(X \theta) $$
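
The gradient and prediction equations above can be sketched as shown below. This is one possible implementation under stated assumptions, not the reference solution: the `learning_rate` and `n_iterations` parameters and their defaults are hypothetical, the weights start at zero, and a bias column of ones is prepended to `X`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gradient_descent(x, y, learning_rate=0.1, n_iterations=1000):
    """Batch gradient descent; learning_rate and n_iterations are assumed defaults."""
    xb = np.column_stack([np.ones(len(x)), x])   # assumed bias column
    y = np.asarray(y, dtype=float)
    theta = np.zeros(xb.shape[1])
    m = len(y)
    for _ in range(n_iterations):
        # Gradient of the cost: (1/m) X^T (sigma(X theta) - y)
        gradient = xb.T @ (sigmoid(xb @ theta) - y) / m
        theta -= learning_rate * gradient
    return theta

def logistic_regression_predict(x, weights):
    """Return predicted probabilities sigma(X theta)."""
    xb = np.column_stack([np.ones(len(x)), x])
    return sigmoid(xb @ weights)

x_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([0, 0, 1, 1])
theta = logistic_regression_gradient_descent(x_demo, y_demo)
probabilities = logistic_regression_predict(x_demo, theta)
print(np.round(probabilities).tolist())  # [0.0, 0.0, 1.0, 1.0]
```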

In [18]:
from machine_learning import logistic_regression_gradient_descent, logistic_regression_predict

weights = logistic_regression_gradient_descent(x_train, y_train)
y_pred_probabilities = logistic_regression_predict(x_test, weights)
y_pred = np.round(y_pred_probabilities)

print(classification_report(y_test, y_pred, target_names=["Adelie", "Chinstrap"]))

              precision    recall  f1-score   support

      Adelie       0.96      1.00      0.98        25
   Chinstrap       1.00      0.94      0.97        18

    accuracy                           0.98        43
   macro avg       0.98      0.97      0.98        43
weighted avg       0.98      0.98      0.98        43



In [19]:
# sort the predictions for plotting purposes.
sorted_indices = np.argsort(y_pred_probabilities)
predictions = y_pred_probabilities[sorted_indices]
labels = y_test[sorted_indices]

fig = px.scatter(y = predictions, x = list(range(len(predictions))), title = "Predicted Probabilities (Sorted)", labels = {"y": "Probability"}, template = "plotly_white")
fig.add_scatter(y = labels, x = list(range(len(labels))), mode = "markers", name = "True Labels")