# Machine Learning Fundamentals - Lecture 03

This is the Jupyter notebook for Lecture 03 of the Machine Learning Fundamentals
course.

In [1]:
# Import the required libraries using the commonly used short names (pd, sns, ...)
import numpy as np
import pandas as pd
import seaborn as sns

# The Path object from pathlib allows us to easily build paths in an
# OS-independent fashion
from pathlib import Path

# Load the required scikit-learn classes and functions
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_absolute_percentage_error

# Set a nicer style for Seaborn plots
sns.set_style("darkgrid")

## Part 1: load and clean the Pokémon dataset

Here we just repeat the steps already done in the previous lectures, but in a
more succinct way.

In [2]:
# Load the dataset (note the use of the Path object)
df = pd.read_csv(Path("..", "datasets", "Pokemon.csv"))

# It's not good practice to have column names with spaces and other non-standard
# characters, so let's fix this by renaming the columns to standard names
df.rename(columns={
    "Type 1" : "Type1",
    "Type 2" : "Type2",
    "Sp. Atk" : "SpAtk",
    "Sp. Def" : "SpDef",
}, inplace=True)

# Replace missing values in the "Type2" column with the string "None"
df["Type2"] = df["Type2"].fillna("None")

# Since primary and secondary types are essentially categories (and not just
# strings / objects), we can convert these columns to the category type
df["Type1"] = df["Type1"].astype("category")
df["Type2"] = df["Type2"].astype("category")

Before we proceed to the interesting part, we'll perform our data scaling and
train/test data splitting.

In [3]:
# Let's use all features except the Total, which can be considered redundant
# since it's the total of the other features
features = ["HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]

# Get only the specified features
df_X = df[features]

# Standardize them
ss = StandardScaler()
X = ss.fit_transform(df_X)

# Our labels will be the legendary status
y_leg = df["Legendary"].to_numpy()

# Let's split our data into training (80%) and test (20%) sets
# Change the random_state parameter to split data in different ways
X_train, X_test, y_train, y_test = train_test_split(X, y_leg, test_size=0.2, random_state=42)
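
To make the scaling step above more concrete, here is a minimal sketch of what standardization amounts to: subtracting each column's mean and dividing by its (population) standard deviation. The toy array `X_toy` is a made-up assumption and is not taken from the Pokémon dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values, NOT the Pokémon data)
X_toy = np.array([[45.0, 49.0],
                  [60.0, 62.0],
                  [80.0, 82.0],
                  [39.0, 52.0]])

# Standardize by hand: subtract the column mean and divide by the column
# standard deviation (NumPy's default ddof=0, i.e. the population version)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Same operation using scikit-learn
X_sklearn = StandardScaler().fit_transform(X_toy)

# Both approaches should agree up to floating-point error
same_scaling = np.allclose(X_manual, X_sklearn)
print(same_scaling)
```

After this transformation every column has mean 0 and standard deviation 1, so no single feature dominates the distance computation merely because of its units.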

## Part 2: Implement our own $k$-Nearest Neighbors classifier and regressor

In [4]:
# Change this variable to change k for all the tests in this section
k_for_all = 5

### 2.1. A $k$-Nearest Neighbors classifier

Let's start with the classifier. We'll use our implementation to classify
legendary and non-legendary Pokémon, and compare our results with the actual
$k$-NN classifier provided with `scikit-learn`.

Our approach relies on NumPy vectorization, both to practice your NumPy skills
and to avoid explicit for loops where possible, which improves performance.
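
As an illustration of what vectorization buys us, the sketch below computes pairwise Euclidean distances twice: once with nested loops and once with NumPy broadcasting. The arrays `A` and `B` are random placeholders (an assumption for the example), and the result is compared against `euclidean_distances` from `scikit-learn`.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Random placeholder data (hypothetical, just for illustration)
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))  # 4 "test" points with 3 features each
B = rng.normal(size=(6, 3))  # 6 "training" points with 3 features each

# Loop version: one distance at a time
d_loop = np.zeros((A.shape[0], B.shape[0]))
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        d_loop[i, j] = np.sqrt(np.sum((A[i] - B[j]) ** 2))

# Vectorized version: A[:, None, :] has shape (4, 1, 3) and B[None, :, :]
# has shape (1, 6, 3), so their difference broadcasts to shape (4, 6, 3)
d_vec = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

# Reference result from scikit-learn
d_ref = euclidean_distances(A, B)

print(np.allclose(d_loop, d_vec), np.allclose(d_vec, d_ref))
```

Row `i` of the resulting matrix holds the distances from test point `i` to every training point, which is the layout the classifier below works with.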

**Note**:

The 
[`np.partition()`](https://numpy.org/doc/stable/reference/generated/numpy.partition.html)
function rearranges an array so that the element at a chosen position ($k$) is
in the place it would occupy if the array were sorted. All elements smaller than
that $k$-th element go to its left, and all larger elements go to its right. The
left and right parts are not fully sorted, just partitioned.

The
[`np.argpartition()`](https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html)
function does almost the same, but instead of returning the values, it returns
the indices that would rearrange the array into that partition.
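
A small sketch of both functions on a made-up array (the values and the choice `k = 3` are assumptions for illustration). Since the order inside the left and right parts is not guaranteed, the left part is sorted before printing.

```python
import numpy as np

# Made-up example array; its sorted version is [1, 2, 3, 4, 7, 8, 9]
a = np.array([7, 2, 9, 4, 1, 8, 3])
k = 3

# Partition the values around position k
part = np.partition(a, k)

# Get the indices that would produce such a partition
idx = np.argpartition(a, k)

# The element at position k is the one a full sort would put there
print(part[k])              # 4

# The k elements to its left are the k smallest, in no particular order
print(np.sort(part[:k]))    # [1 2 3]

# The first k indices point to those same k smallest values
print(np.sort(a[idx[:k]]))  # [1 2 3]
```

This is why taking the first `k` indices returned by `np.argpartition()` is enough to find the `k` nearest neighbors without fully sorting the distances.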

In [5]:
# Our classifier will be a simple function that does the training (fit) and
# classification (predict) in one go
def knn_classify(X_train, y_train, X_test, k=5):

    # First, we'll calculate the Euclidean distance between the test examples
    # and the training examples
    dists = euclidean_distances(X_test, X_train)

    # Now let's get the indices of the k neighbors closest to each test example
    # See above the explanation of np.argpartition()
    idx_k_min = np.argpartition(dists, k, axis=1)[:, :k]

    # For each index, get its corresponding label from the training data
    labels_k_min = y_train[idx_k_min]

    # For each test example, get the majority label

    # Instead of adding majority labels to a list as we go, let's pre-allocate a
    # numpy array for this purpose, which is much more efficient
    maj_labels = np.zeros(labels_k_min.shape[0], dtype=y_train.dtype)

    # Loop through each row of the labels matrix and determine the majority label
    for i, row in enumerate(labels_k_min):

        # Get the unique labels in the current row as well as their counts
        values, counts = np.unique(row, return_counts=True)

        # Get the label with highest count and put it in our pre-allocated
        # numpy array
        maj_labels[i] = values[np.argmax(counts)]

    # Return our predictions
    return maj_labels
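
The majority vote above still loops over rows. For the special case of boolean labels (such as the legendary status), the loop can be avoided altogether. The sketch below is one possible alternative, not the approach used in this notebook; the helper names `majority_vote_loop` and `majority_vote_binary` and the random labels are hypothetical.

```python
import numpy as np

def majority_vote_loop(labels):
    # Same row-by-row logic as in the classifier above
    out = np.zeros(labels.shape[0], dtype=labels.dtype)
    for i, row in enumerate(labels):
        values, counts = np.unique(row, return_counts=True)
        out[i] = values[np.argmax(counts)]
    return out

def majority_vote_binary(labels):
    # Assumes boolean labels: True wins only if it appears in strictly more
    # than half of the k neighbors (ties go to False, as in the loop version)
    return labels.sum(axis=1) > labels.shape[1] / 2

# Random boolean neighbor labels: 20 test examples, k = 5 neighbors each
rng = np.random.default_rng(0)
labels = rng.random((20, 5)) < 0.3

votes_loop = majority_vote_loop(labels)
votes_vec = majority_vote_binary(labels)
print(np.array_equal(votes_loop, votes_vec))
```

This shortcut only applies to two-class boolean labels; with more than two classes, the general counting approach is still needed.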

In [6]:
# Let's classify some test data with our classifier
y_pred_ours = knn_classify(X_train, y_train, X_test, k=k_for_all)

# And check the accuracy of our prediction
acc_ours = accuracy_score(y_test, y_pred_ours)

# Print it
print(f"Accuracy of our kNN classifier: {acc_ours}")

Accuracy of our kNN classifier: 0.925


In [7]:
# What if we use the kNN classifier from scikit-learn? What will the accuracy be?
knnClf = KNeighborsClassifier(n_neighbors=k_for_all)

# First train the classifier (fit)
knnClf.fit(X_train, y_train)

# Then perform prediction
y_pred_scl = knnClf.predict(X_test)

# And finally get the accuracy of the prediction
acc_scl = accuracy_score(y_test, y_pred_scl)

# Print it
print(f"Accuracy of scikit-learn's kNN classifier: {acc_scl}")

Accuracy of scikit-learn's kNN classifier: 0.925


In [8]:
# Just out of curiosity, how well do our classifier's predictions agree with
# the ones from scikit-learn's classifier?
accuracy_score(y_pred_ours, y_pred_scl)

1.0

Perfect agreement, which means the two classifiers yield the same result! We've
just reimplemented a $k$-Nearest Neighbors classifier by "hand".

Note, however, that `scikit-learn`'s implementation is much more performant, and
has several additional options, such as a configurable distance metric or giving
more weight to closer neighbors (which can be quite important, as we'll discuss
in our lecture).
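
As a rough sketch of these options, the example below compares the default classifier with one that uses distance-based weighting and the Manhattan metric, via the `weights` and `metric` parameters of `KNeighborsClassifier`. The data comes from `make_classification` and is purely synthetic (an assumption for the example), so the accuracies say nothing about the Pokémon dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (hypothetical, NOT the Pokémon dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Default settings: all k neighbors count equally, Euclidean distance
clf_uniform = KNeighborsClassifier(n_neighbors=5)

# Closer neighbors get more weight, and distances use the Manhattan metric
clf_weighted = KNeighborsClassifier(
    n_neighbors=5, weights="distance", metric="manhattan"
)

# Fit both classifiers and collect their test accuracies
accuracies = {}
for name, clf in [("uniform", clf_uniform), ("weighted", clf_weighted)]:
    clf.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: {accuracies[name]:.3f}")
```

Which configuration works better depends on the data, so these options are usually treated as hyperparameters to tune.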

### 2.2. A $k$-Nearest Neighbors regressor

Now, for the regressor. We'll use the regressor to predict the "Total" column
in the Pokémon dataset.

We'll also compare our results with the respective regressor in `scikit-learn`.

In [9]:
# Our regressor will also be a simple function that does the training (fit) and
# regression (predict) in one go
def knn_regress(X_train, y_train, X_test, k=5):

    # We similarly obtain the Euclidean distances and the indices of the k
    # neighbors closest to each test example
    dists = euclidean_distances(X_test, X_train)
    idx_mink = np.argpartition(dists, k, axis=1)[:, :k]

    # Now simply return the mean of the k closest neighbors
    return y_train[idx_mink].mean(axis=1)
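
To see the regressor's logic on numbers that can be checked by hand, here is a minimal sketch with a made-up one-dimensional training set (all values are assumptions for illustration). With a single feature, the Euclidean distance reduces to the absolute difference.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D training data and a single query point
X_tr = np.array([[0.0], [1.0], [2.0], [10.0]])
y_tr = np.array([0.0, 10.0, 20.0, 100.0])
X_te = np.array([[1.2]])
k = 3

# Distances from the query to each training point: 1.2, 0.2, 0.8, 8.8
dists = np.abs(X_te - X_tr.T)  # shape (1, 4)

# Indices of the k closest training points (here: 0, 1 and 2, in some order)
idx = np.argpartition(dists, k, axis=1)[:, :k]

# The prediction is the mean of their targets: (0 + 10 + 20) / 3 = 10
pred_manual = y_tr[idx].mean(axis=1)

# scikit-learn's regressor should give the same prediction
pred_sklearn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr).predict(X_te)

print(pred_manual, pred_sklearn)
```

The distant point at 10.0 is ignored entirely, which shows how $k$-NN regression only averages over the local neighborhood of each query.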

In [10]:
# Since we want to predict the total, our label data (y) is the total, so let's
# get it (note that we don't need to scale the target label)
y_total = df["Total"].to_numpy()

# And now resplit the X and y into training and test data
# We need to do this again since the y labels are different (now it's the
# "Total", before it was the "Legendary" status)
X_train, X_test, y_train, y_test = train_test_split(X, y_total, test_size=0.2, random_state=42)

In [11]:
# First, let's apply our regressor and obtain the predicted totals
y_regr_ours = knn_regress(X_train, y_train, X_test, k=k_for_all)

# What's the mean absolute error of our predictions?
mean_absolute_error(y_test, y_regr_ours)

13.832500000000005

In [12]:
# The absolute error on its own is hard to interpret without knowing the typical
# magnitude of the "Total" values. Therefore, we can use the percentage variant
# of the mean absolute error:
mean_absolute_percentage_error(y_test, y_regr_ours)

0.031841346483808264
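
Note that `mean_absolute_percentage_error()` returns a fraction rather than a value in the 0 to 100 range. The sketch below computes the metric by hand on made-up numbers (the arrays `y_true` and `y_hat` are assumptions for illustration) and compares it with the `scikit-learn` function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up true and predicted values (NOT taken from the Pokémon dataset)
y_true = np.array([300.0, 400.0, 500.0, 600.0])
y_hat = np.array([310.0, 380.0, 500.0, 630.0])

# By hand: the mean of |error| / |true value|
# Per-example ratios: 10/300, 20/400, 0/500, 30/600
mape_manual = np.mean(np.abs(y_true - y_hat) / np.abs(y_true))

# With scikit-learn
mape_sklearn = mean_absolute_percentage_error(y_true, y_hat)

# Format the fraction as a percentage for readability
print(f"{mape_manual:.1%}")  # 3.3%
print(np.isclose(mape_manual, mape_sklearn))
```

Because each error is divided by the true value, the metric is only meaningful when the targets are not close to zero.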

A mean absolute percentage error of about 3%. Not bad. Let's check out how
`scikit-learn`'s $k$-NN regressor fares in this task.

In [13]:
# Second, try out scikit-learn's kNN regressor
knnRegr = KNeighborsRegressor(n_neighbors=k_for_all)

# Train it (fit)
knnRegr.fit(X_train, y_train)

# Get prediction (perform regression)
y_regr_scl = knnRegr.predict(X_test)

# And finally, what's the mean absolute error of this regressor?
mean_absolute_error(y_test, y_regr_scl)

13.832500000000005

In [14]:
# Let's look at that error in terms of percentage
mean_absolute_percentage_error(y_test, y_regr_scl)

0.031841346483808264

In [15]:
# Just out of curiosity, what's the difference between our regressor's
# predictions and the ones from scikit-learn's regressor?
mean_absolute_error(y_regr_ours, y_regr_scl)

0.0

No difference between our regressor and `scikit-learn`'s own, which confirms
that our code is working.