<img src="data/images/lecture-notebook-header.png" />

# Classification & Regression I: KNN

KNN, which stands for k-nearest neighbors, is a machine learning algorithm that can be used for both classification and regression tasks. The algorithm is a type of instance-based learning, where new data points are classified or predicted based on their similarity to known data points in the training set. Compared to most other classification or regression models, there is no real training involved -- KNN only "remembers" all the training data -- and all the heavy lifting (i.e., finding the k-nearest neighbors) is done during prediction. In practice, this is often a problem, since training is usually a one-time process whereas predictions are made very frequently.

In the case of classification, the KNN algorithm works as follows:

* **Training Phase:** The algorithm stores the feature vectors and corresponding class labels of the training data, and sets the user-specified value of $k$.

* **Prediction Phase:** When a new data point is provided, the algorithm identifies the $k$ nearest neighbors in the training data based on a distance metric (usually Euclidean distance). It counts the occurrences of each class label among the $k$ neighbors. Lastly, the majority class label among the $k$ neighbors is assigned to the new data point.

In regression tasks, the KNN algorithm is adapted slightly to predict continuous values rather than discrete class labels:

* **Training Phase:** Similar to classification, the algorithm stores the feature vectors and corresponding target values (continuous) of the training data; and it sets the value of $k$.

* **Prediction Phase:** When a new data point is provided, the algorithm identifies the $k$ nearest neighbors. It calculates the average (or weighted average) of the target values of these $k$ neighbors. The resulting average is assigned as the prediction for the new data point.
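
The two prediction phases described above can be sketched in a few lines of NumPy. This is only a minimal brute-force illustration, not how scikit-learn implements KNN internally; the function name `knn_predict`, the tie-breaking rule, and the tiny dataset are made up for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k, task="classification"):
    """Predict for a single query point using brute-force k-nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k smallest distances (stable sort keeps ties in training order)
    nearest = np.argsort(distances, kind="stable")[:k]
    if task == "classification":
        # Majority vote; np.unique sorts the labels, so ties go to the smallest label
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]
    # Regression: unweighted mean of the neighbors' target values
    return float(np.mean(y_train[nearest]))

# Tiny made-up dataset with two features per point
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_class = np.array([0, 0, 0, 1, 1])               # class labels
y_value = np.array([1.0, 2.0, 3.0, 10.0, 12.0])   # continuous targets

label = knn_predict(X_demo, y_class, [0.5, 0.5], k=3)
value = knn_predict(X_demo, y_value, [0.5, 0.5], k=3, task="regression")
print(label, value)
```

Note that "training" here amounts to nothing more than keeping `X_demo` and the targets around; all the work happens inside `knn_predict`.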

Key considerations when using the KNN algorithm:

* The choice of $k$ is important and can impact the performance of the algorithm. A small $k$ may lead to overfitting, while a large $k$ may introduce more bias.
* The distance metric used to calculate similarity between data points can vary based on the problem at hand.
* It is often helpful to normalize or scale the feature values to ensure that no single feature dominates the distance calculation.
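
To illustrate the point about distance metrics, the following sketch compares the Euclidean and the Manhattan distance for two made-up points. The coordinates are chosen purely for illustration, so that the two metrics disagree about which point is the nearest neighbor of the query.

```python
import numpy as np

query = np.array([0.0, 0.0])
points = np.array([[3.0, 3.0],    # point A
                   [0.0, 5.0]])   # point B

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(((points - query) ** 2).sum(axis=1))
# Manhattan distance: sum of the absolute differences
manhattan = np.abs(points - query).sum(axis=1)

nearest_euclidean = int(np.argmin(euclidean))   # index of the nearest point
nearest_manhattan = int(np.argmin(manhattan))
print(euclidean, manhattan)
print(nearest_euclidean, nearest_manhattan)
```

Point A is closer under the Euclidean metric, while point B is closer under the Manhattan metric, so even a 1-nearest-neighbor prediction can change with the metric.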

Overall, the KNN algorithm is a simple yet powerful technique that can be used for classification and regression tasks, but it may not be suitable for large datasets due to its computational requirements during prediction.

In this notebook, we go through a few examples for training a KNN classifier or regressor using simple datasets.

## Setting up the Notebook

### Specify how Plots Get Rendered

In [None]:
%matplotlib inline

### Make all Required Imports

Many of the packages are for fancy visualization.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

from tqdm import tqdm

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn import preprocessing

from scipy.spatial import Voronoi, voronoi_plot_2d

from matplotlib.patches import Rectangle, Circle
from matplotlib.colors import ListedColormap

from sklearn.metrics import classification_report, f1_score, mean_squared_error

---

## Working with Toy Data

To understand the underlying idea and inner workings of KNN, it is easiest to visualize and understand the algorithm and its results using very simple data. In the following, we use the initial examples given in the lecture for classification and regression.

### KNN for Classification

Let's start with the classification example based on 17 data points (2-dimensional) and 2 class labels. The absolute coordinates do not really matter. However, the points are placed in such a way that the unknown data point for which we want to predict the class label is located at `(0, 0)`.

#### Create and Visualize Data

In [None]:
# Define features
X = np.array([
    [-1.0, 0.0], [-2.2, 0.0], [-2.5, 1.0], [-1.3, 1.5], [-0.1, 0.5], [0.2, 1.0], [-0.8, 1.7], [0.5, 1.9],
    [0.0, -0.5], [0.4, -0.25], [-1.2, -2.5], [0.5, -2.3], [1.2, -2.1], [2.5, -1.8], [0.2, -3.0], [0.9, -2.8], [2.3, -3.0]
])

# Define labels 
y = np.array([
    1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 
])

num_samples, num_features = X.shape

print('The dataset consists of {} data points, each with {} features.'.format(num_samples, num_features))

As our data points are 2-dimensional, we can easily plot them; the color reflects the 2 different class labels. The plot also contains the unknown data point at `(0, 0)` in gray, and 2 circles reflecting the areas containing the 3 and 5-nearest neighbors of the unknown data point.

**Important:** The circles imply that the Euclidean distance metric is used. Other metrics, e.g., the Manhattan distance metric, would result in different shapes for the respective regions.

In [None]:
plt.figure()
plt.gca().set_aspect('equal')
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.scatter(X[:,0], X[:,1], c=y, s=200, cmap='rainbow')
plt.scatter([0.0], [0.0], c='gray', s=200)
plt.scatter([0.0], [0.0], s=5000, facecolors='none', edgecolors='black', linestyle='--')
plt.scatter([0.0], [0.0], s=17000, facecolors='none', edgecolors='black', linestyle='--')
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
plt.tight_layout()
plt.show() 

The plot above already makes it obvious that the prediction for the class label of the unknown data point differs between $k=3$ and $k=5$. This is, of course, a general observation: the prediction depends on the choice of $k$. The plot also shows why setting $k$ to an odd number is the common approach: it reduces the number of ties where the labels of the k-nearest neighbors are split evenly between the classes.
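
The role of odd values of $k$ can be made concrete with a small majority-vote helper. This is a sketch under our own assumptions: the function `majority_vote` and its tie-breaking rule (pick the smallest label) are hypothetical choices for illustration and not necessarily what a library does.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return (winning label, is_tie) for the labels of the k nearest neighbors."""
    counts = Counter(neighbor_labels).most_common()
    top_count = counts[0][1]
    winners = [label for label, count in counts if count == top_count]
    # With a tie, some rule must pick a winner; here: the smallest label (arbitrary)
    return min(winners), len(winners) > 1

print(majority_vote([0, 0, 1]))         # k=3, two classes: clear winner
print(majority_vote([0, 0, 1, 1]))      # k=4, two classes: tie
print(majority_vote([0, 0, 1, 1, 1]))   # k=5, two classes: clear winner
print(majority_vote([0, 1, 2]))         # k=3, three classes: tie despite odd k
```

With 2 classes an odd $k$ rules out ties entirely; with 3 or more classes it only makes them less likely, as the last call shows.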

#### Train the Classifier

Now that we have convinced ourselves visually that $k=3$ and $k=5$ will yield different predictions, we can actually train a KNN classifier for the two values and predict the class label for our unknown data point at `(0, 0)`.

In [None]:
# Create and train classifier with k=3
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Predict for grey data point (0, 0)
y_pred = clf.predict([[0.0, 0.0]])

print('Predicted class label: {}'.format(y_pred[0]))

In [None]:
# Create and train classifier with k=5
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Predict for grey data point (0, 0)
y_pred = clf.predict([[0.0, 0.0]])

print('Predicted class label: {}'.format(y_pred[0]))

Unsurprisingly, the predicted class labels differ depending on the value of $k$. Feel free to try other values of $k$ and see if the results agree with your expectations based on the plot above.
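
To see where the different predictions come from, we can ask the classifier for the neighbors themselves. The sketch below re-creates the toy data under the new names `X_toy` and `y_toy` (so it does not overwrite the notebook's variables) and uses the `kneighbors` and `predict_proba` methods of `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same 17 toy points and labels as above, stored under new names
X_toy = np.array([
    [-1.0, 0.0], [-2.2, 0.0], [-2.5, 1.0], [-1.3, 1.5], [-0.1, 0.5], [0.2, 1.0], [-0.8, 1.7], [0.5, 1.9],
    [0.0, -0.5], [0.4, -0.25], [-1.2, -2.5], [0.5, -2.3], [1.2, -2.1], [2.5, -1.8], [0.2, -3.0], [0.9, -2.8], [2.3, -3.0]
])
y_toy = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

clf_k5 = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)

# Distances to, and training-set indices of, the 5 nearest neighbors of (0, 0)
distances, indices = clf_k5.kneighbors([[0.0, 0.0]])
neighbor_labels = y_toy[indices[0]]

# Fraction of neighbors per class; columns follow clf_k5.classes_
proba = clf_k5.predict_proba([[0.0, 0.0]])[0]

print(np.round(distances[0], 3))
print(neighbor_labels)
print(proba)
```

The neighbors are returned in order of increasing distance, so the first 3 entries of `neighbor_labels` are the votes for $k=3$ and all 5 entries are the votes for $k=5$.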

#### Voronoi Tessellation for k=1

As we saw in the lecture, setting $k=1$ results in a [Voronoi Tessellation](https://en.wikipedia.org/wiki/Voronoi_diagram) of the data space. Voronoi tessellation, also known as Voronoi diagram or Voronoi partition, is a mathematical concept that divides a space into regions based on the proximity to a set of predefined points called seed points or generators. Each region represents the area closest to a specific seed point compared to any other seed point. In the context of a KNN classifier with $k=1$, each region containing a data point $x$ represents the subspace in which all unknown data points will get the same class labels as $x$.

Let's plot this for our toy dataset.

In [None]:
#plt.figure()   # If the plot is empty, try uncommenting this line
vor = Voronoi(X)
voronoi_plot_2d(vor, show_vertices=False, show_points=False)
plt.scatter(X[:,0], X[:,1], c=y, s=200, cmap='rainbow')
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
plt.tight_layout()
plt.show()

### KNN for Regression

Let's now look at the example for regression using KNN as shown in the lecture.

In [None]:
# Define data points
data = np.array([
    [2.0, 11.0], [18.0, 9.0], [10.0, 4.0], [2.5, 9], [4, 9], [4.5, 8.5],
    [9.5, 4.5], [8.5, 5], [5.5, 5.5], [4.5, 6.5], [3.8, 6], [7.5, 6.5], [7.7, 7.3],
    [11.5, 6], [12.5, 4.5], [13.5, 4.5], [13, 3.5], [14, 6.2], [14.7, 3.7],
    [14.7, 3.7], [15.2, 6], [16.5, 7]
])

# Define feature + label
X = data[:,0].reshape(-1, 1)
y = data[:,1]

We can plot the data with some auxiliary visualization to indicate the ranges for different values of $k$. Again, the exact coordinates are not of interest here, but the unknown point for which we want to predict the $y$ value is at $x=9$.

**Note:** The dashed rectangles might be a bit misleading, as only the left and right borders are relevant, not the top and bottom ones. The input is only 1-dimensional, so the distances between the data points are only calculated along the $x$ axis.

In [None]:
plt.figure()
plt.xlim([0.0, 20.0])
plt.ylim([0.0, 14.0])
plt.scatter(X, y, marker='x', s=50)
plt.gca().add_patch(Rectangle((7.8,3.5), 2.4, 2.2, facecolor='none', edgecolor='gray', linestyle='--'))
plt.gca().add_patch(Rectangle((7.2,3), 3.6, 4.5, facecolor='none', edgecolor='gray', linestyle='--'))
plt.axvline(x=9, c='black', linestyle='--')
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
plt.tight_layout()
plt.show() 

Again, just by looking at the plot it's easy to see that the predicted value for $k=3$ will be lower than for $k=5$ since the $y$ values of the 2 additional points are both larger than the ones for the 3 points for $k=3$. The 2 additional points will therefore raise the average.
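
The same reasoning can be checked with a few lines of arithmetic before training anything. The sketch below copies the data points from above into a new array `data_reg`, sorts them by their distance to $x=9$ along the $x$ axis, and averages the $y$ values of the 3 and 5 nearest points.

```python
import numpy as np

# Same (x, y) pairs as in the regression example above
data_reg = np.array([
    [2.0, 11.0], [18.0, 9.0], [10.0, 4.0], [2.5, 9], [4, 9], [4.5, 8.5],
    [9.5, 4.5], [8.5, 5], [5.5, 5.5], [4.5, 6.5], [3.8, 6], [7.5, 6.5], [7.7, 7.3],
    [11.5, 6], [12.5, 4.5], [13.5, 4.5], [13, 3.5], [14, 6.2], [14.7, 3.7],
    [14.7, 3.7], [15.2, 6], [16.5, 7]
])
x_reg, y_reg = data_reg[:, 0], data_reg[:, 1]

# Order all points by their distance to x=9 (only the x axis matters)
order = np.argsort(np.abs(x_reg - 9.0), kind="stable")

mean_k3 = y_reg[order[:3]].mean()   # average over the 3 nearest points
mean_k5 = y_reg[order[:5]].mean()   # average over the 5 nearest points
print(x_reg[order[:5]], y_reg[order[:5]])
print(mean_k3, mean_k5)
```

The 2 additional neighbors for $k=5$ lie at $x=7.7$ and $x=7.5$, and their $y$ values of 7.3 and 6.5 pull the average up from 4.5 to about 5.46.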

#### Train the Regressors

Training both regressors for $k=3$ and $k=5$, and then predicting the value for the unknown point $x=9$ confirms our observation.

In [None]:
# Create and train regressor with k=3
regr = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Predict for data point reflecting the vertical dashed line in the plot
y_pred_k3 = regr.predict([[9.0]])[0]

print('Predicted value: {}'.format(y_pred_k3))

In [None]:
# Create and train regressor with k=5
regr = KNeighborsRegressor(n_neighbors=5).fit(X, y)

# Predict for data point reflecting the vertical dashed line in the plot
y_pred_k5 = regr.predict([[9.0]])[0]

print('Predicted value: {}'.format(y_pred_k5))

...and we can also plot the results. The blue dot represents the predicted value for $k=3$, and the red dot represents the predicted value for $k=5$.

In [None]:
plt.figure()
plt.xlim([0.0, 20.0])
plt.ylim([0.0, 14.0])
plt.scatter(X, y, marker='x', s=50)
plt.gca().add_patch(Rectangle((7.8,3.5), 2.4, 2.2, facecolor='none', edgecolor='gray', linestyle='--'))
plt.gca().add_patch(Rectangle((7.2,3), 3.6, 4.5, facecolor='none', edgecolor='gray', linestyle='--'))
plt.axvline(x=9, c='black', linestyle='--')
plt.scatter([9], [y_pred_k3], s=100, c='blue')
plt.scatter([9], [y_pred_k5], s=100, c='red')
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
plt.tight_layout()
plt.show() 

---

## KNN Classification of IRIS Dataset

We use the [IRIS dataset](https://archive.ics.uci.edu/ml/datasets/iris) as a small but real-world dataset to illustrate the use of a KNN classifier. The Iris dataset is one of the most well-known and commonly used datasets in machine learning for classification tasks. It is named after the iris flower and was introduced by the statistician and biologist Ronald Fisher in 1936. The dataset is often used as a beginner's dataset for learning classification algorithms.

The Iris dataset consists of measurements of four features (attributes) of iris flowers, namely:

* Sepal Length (in centimeters)
* Sepal Width (in centimeters)
* Petal Length (in centimeters)
* Petal Width (in centimeters)

Based on these features, the dataset assigns each instance (row) to one of three classes of iris flowers: *Setosa*, *Versicolor*, and *Virginica*. The dataset contains 150 instances in total, with 50 instances per class. It is a balanced dataset, meaning that each class has an equal number of instances.

### Load Data

Using `pandas`, we first load the dataset from the comma-separated file into a DataFrame. We also perform 2 additional steps:

* Convert the string class labels *Setosa*, *Versicolor*, and *Virginica* to numeric class labels 0, 1, and 2

* Shuffle the records to ensure that both the training set and the test set feature a similar distribution (see below)

In [None]:
df = pd.read_csv('data/datasets/iris/iris.csv')

# Convert the species name to numerical categories 0, 1, 2
df['species'] = pd.factorize(df['species'])[0]

# The rows are sorted, so let's shuffle them
df = df.sample(frac=1, random_state=5).reset_index(drop=True)

# Show the first 5 rows
df.head()

### Create Training and Test Data

To allow us to visualize things more easily, we consider only 2 input features (here: sepal length and sepal width). In other words, just for the visualization, we pretend the dataset has only 2 features. We then split the dataset into 80% training data and 20% test data.

In [None]:
# Convert data to numpy arrays
X = df[['sepal_length', 'sepal_width']].to_numpy()
y = df[['species']].to_numpy().squeeze()

# Let's go for an 80%/20% split -- you can change the value and see its effects
train_test_ratio = 0.80

# Calculate the size of the training data (the size of the test data is also implicitly given)
train_set_size = int(train_test_ratio * len(X))

# Split data and labels into training and test data with respect to the size of the training data
X_train, X_test = X[:train_set_size], X[train_set_size:]
y_train, y_test = y[:train_set_size], y[train_set_size:]

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))
print(len(X_test), len(y_test))

With 2-dimensional data points, we can plot the data in a straightforward manner.

In [None]:
# We re-use the color maps later for visualizing the decision boundaries
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

plt.figure()
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap=cmap_bold)
plt.xlabel('Sepal Length ', fontsize=16)
plt.ylabel('Sepal Width', fontsize=16)
plt.tight_layout()
plt.show()

Just from looking at the plot above we can see that the "red" class is well separated while the "green" and "blue" classes show quite some overlap. Based on this we can expect that predicting the "red" class correctly will be easier than for the "green" and "blue" class.

**Important:** This overlap between the "green" and "blue" class is only so pronounced because we have ignored 2 features. With respect to all 4 features, all 3 classes are quite separated and most classification models have no problem with that simple dataset.

### Train and Test KNN Classifier

Let's set $k=5$ and train a KNN classifier to see how well it predicts the class labels for the test data.

In [None]:
iris_classifier = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

y_pred = iris_classifier.predict(X_test)

print(classification_report(y_test, y_pred))

Precision, recall and f1-score for Class 0 are very high, which corresponds to the classification of "red" and "non-red" data points. In contrast, for Classes 1 and 2 the results are not as good, as the "green" and "blue" data points show quite some overlap.
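
As a reminder of what the numbers in the report mean, the sketch below computes precision, recall and f1-score per class by hand, treating each class as "this class" versus "all other classes", and then takes the unweighted mean of the per-class f1-scores (the macro average). The labels in `y_true_demo` and `y_pred_demo` are made up for illustration and are not the notebook's results.

```python
import numpy as np

# Made-up true and predicted labels for 10 samples and 3 classes
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 2, 1, 2, 1, 2, 2])

def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class in a one-vs-rest view."""
    tp = np.sum((y_pred == label) & (y_true == label))   # true positives
    fp = np.sum((y_pred == label) & (y_true != label))   # false positives
    fn = np.sum((y_pred != label) & (y_true == label))   # false negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

scores = {label: per_class_scores(y_true_demo, y_pred_demo, label) for label in (0, 1, 2)}
# Macro average: every class counts equally, regardless of its size
macro_f1 = np.mean([f1 for _, _, f1 in scores.values()])

for label, (p, r, f1) in scores.items():
    print('Class {}: precision={:.3f}, recall={:.3f}, f1={:.3f}'.format(label, p, r, f1))
print('Macro f1: {:.3f}'.format(macro_f1))
```

In this made-up example, Class 0 is predicted perfectly while Classes 1 and 2 are confused with each other, which mirrors the pattern described above.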

Of course, these are the results for just one setting of $k$. We can perform the same evaluation for a wide range of $k$ values to see which one yields the best average f1-score; we use `macro` in this case.

In [None]:
x = []
f1_scores_c1 = []
f1_scores_c2 = []
f1_scores_c3 = []
f1_scores_avg = []

for k in tqdm(range(1, 110)):
    x.append(k)
    iris_classifier = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_pred = iris_classifier.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
    f1_scores_avg.append(report['macro avg']['f1-score'])
    f1_scores_c1.append(report['0']['f1-score'])
    f1_scores_c2.append(report['1']['f1-score'])
    f1_scores_c3.append(report['2']['f1-score'])


...and the corresponding plot:

In [None]:
plt.figure()
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('K', fontsize=16)
plt.ylabel('F1 Score', fontsize=16)
#plt.plot(x, f1_scores_micro, label='micro')
plt.plot(x, f1_scores_c1, label='Class Red', c='#FF0000', lw=2)
plt.plot(x, f1_scores_c2, label='Class Green', c='#00FF00', lw=2)
plt.plot(x, f1_scores_c3, label='Class Blue', c='#0000FF', lw=2)
plt.plot(x, f1_scores_avg, '--', label='macro', c="black", lw=3)
plt.legend(loc="lower left", prop={'size': 14})
plt.tight_layout()
plt.show()

Again, we see that the "red" class is easier to predict than the two other ones. In this case, the f1-scores (particularly the average over all 3 classes; black dashed line) are rather stable for a wide range of $k$ values. This indicates that the training data points are rather balanced and equally distributed. However, feel free to change the `random_state` in the step `df = df.sample(frac=1, random_state=5).reset_index(drop=True)` to generate a different split into training and test data.

In general, the results are typically much more sensitive to the value of $k$.

### Plot Decision Boundaries

Plotting the decision boundaries can be done by plotting the regions with respect to the predicted class labels. Without going into much detail (as this is purely about visualization), the code below generates a fine grid/mesh of data points and predicts the class label for each data point.

In [None]:
# Defines the "resolution" of the visualization of the decision boundaries
# Smaller values yield better-looking boundaries but require more runtime
h = 0.01

# calculate min, max and limits
margin = 0.2
x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))


X_mesh = np.c_[xx.ravel(), yy.ravel()]

# predict class using data and kNN classifier
k = 1
iris_classifier = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
Z = iris_classifier.predict(X_mesh)

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
# Plot also the training points
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("IRIS 3-Class classification (k = %i)" % (k))
plt.xlabel('Sepal Length ', fontsize=16)
plt.ylabel('Sepal Width', fontsize=16)
plt.tight_layout()
plt.show()

You may have noticed that, for $k=1$, there is a green data point at `(6.0, 2.2)` inside a blue region (along with a few other such cases). The reason is that not all data points are distinct. In this case, there are 2 data points with the same input `(6.0, 2.2)` but with different class labels. While the blue one "wins" the prediction, the green one "wins" the plotting.
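
Such conflicting duplicates are easy to find programmatically. The following sketch groups the labels by feature vector and reports every input that occurs with more than one label. The small lists `X_dup` and `y_dup` are made up for illustration; for the NumPy arrays used in this notebook, each row would first have to be converted with `tuple(row)` so it can serve as a dictionary key.

```python
from collections import defaultdict

# Made-up (sepal_length, sepal_width) pairs and labels; (6.0, 2.2) appears twice
X_dup = [(5.1, 3.5), (6.0, 2.2), (6.3, 2.5), (6.0, 2.2)]
y_dup = [0, 1, 2, 2]

# Collect the set of labels observed for each distinct input
labels_by_point = defaultdict(set)
for point, label in zip(X_dup, y_dup):
    labels_by_point[point].add(label)

# Keep only the inputs that come with more than one label
conflicts = {point: sorted(labels) for point, labels in labels_by_point.items() if len(labels) > 1}
print(conflicts)
```

No classifier that only sees these 2 features can get both of the conflicting points right, so such cases put an upper bound on the achievable accuracy.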

---

## KNN Regression of Howell Dataset

The [Howell Dataset](https://github.com/rmcelreath/rethinking/blob/master/data/Howell1.csv) is a dataset often used for teaching regression. It contains partial census data for the Dobe area !Kung San, compiled from interviews conducted by the anthropologist Nancy Howell in the late 1960s, and it is distributed with Richard McElreath's `rethinking` package.

The dataset consists of measurements of the height and weight of individuals, along with additional information such as their age and sex. It is notable because it includes data from both children and adults, providing a wide range of age groups. The primary purpose of the Howell dataset is to explore and analyze the relationship between height and weight across different age groups. It allows researchers and machine learning practitioners to build regression models to predict weight based on height or vice versa. Moreover, it can be used to examine how these relationships vary between different age groups or genders.

It is worth noting that while the Howell dataset provides valuable insights into the relationship between height and weight, it is limited in scope and may not capture all the factors influencing these variables in different populations. Therefore, it is essential to interpret and generalize the findings with caution.

### Load Data

The dataset is again given as a file -- only this time as a semicolon-separated file -- so we first use `pandas` to load it into a DataFrame. For the purpose of the subsequent examples, we consider only the data from males.

In [None]:
df = pd.read_csv('data/datasets/howell/Howell1.csv', sep=';')

# We consider only males to keep it simple
df = df[df['male'] == 1]

# The rows are sorted, so let's shuffle them
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Show the first 5 rows
df.head()

### 1d Case: Predict Weight Based on Age

We first consider only 1 input feature (age) to predict the weight of a male. The reason is simply to allow for some visualization which would be tricky for more features.

#### Create Training and Test Data

Similar to above, we use an 80/20 split for generating the training and test data.

In [None]:
# Convert data to numpy arrays
X = df[['age']].to_numpy()
y = df[['weight']].to_numpy().squeeze()

# Let's go for an 80%/20% split -- you can change the value and see its effects
train_test_ratio = 0.8

# Calculate the size of the training data (the size of the test data is also implicitly given)
train_set_size = int(train_test_ratio * len(X))

# Split data and labels into training and test data with respect to the size of the training data
X_train, X_test = X[:train_set_size], X[train_set_size:]
y_train, y_test = y[:train_set_size], y[train_set_size:]

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))
print(len(X_test), len(y_test))

In the case of such simple data, it's always good to first have a look at it.

In [None]:
plt.figure()
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Age', fontsize=16)
plt.ylabel('weight', fontsize=16)
plt.scatter(X, y)
plt.tight_layout()
plt.show()

The data looks reasonably intuitive. On average, the weight goes up with increasing age and then stays rather stable (but with a larger variance) for adult males.

#### Train and Test KNN Regressor

Training and testing a KNN regressor is just as straightforward as for the classifier. Of course, the evaluation metric is now different. In the following, we use the Root Mean Squared Error (RMSE). You can try different $k$ values and see how the RMSE changes accordingly.

In [None]:
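# Keep k in a variable so the model and the printed message always agree
k = 3

howell_regressor = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)

y_pred = howell_regressor.predict(X_test)

# RMSE = square root of the MSE; this works whether or not the installed
# scikit-learn version supports the `squared=False` argument
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print('The Root Mean Squared Error (RMSE) for k={}: {:.3f}'.format(k, rmse))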

#### Evaluate Different Values for k

As the predictions on this small dataset are very fast, we can easily evaluate the regressor for a wide range of different $k$ values.

In [None]:
ks, mses = [], []

best_k, best_mse = None, np.inf

for k in tqdm(range(1, 100)):
    ks.append(k)
    howell_regressor = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    y_pred = howell_regressor.predict(X_test)
    mse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE, despite the variable name
    mses.append(mse)
    if mse < best_mse:
        best_mse = mse
        best_k = k
        
print('The lowest RMSE was achieved with k={}'.format(best_k))

We can plot the choice of $k$ against the resulting RMSE value.

In [None]:
plt.figure()
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('K', fontsize=16)
plt.ylabel('Root Mean Squared Error (RMSE)', fontsize=16)
plt.plot(ks, mses, lw=3)
plt.tight_layout()
plt.show()

#### Visualize Regression Line

The following steps visualize the regression line. For this we simply predict the output for many points along the x axis.

In [None]:
k = best_k

howell_regressor = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)

x_val, y_val = [], []

for h in tqdm(np.arange(np.min(X), np.max(X), 0.01)):
    y_pred = howell_regressor.predict(np.array([[h]]))
    x_val.append(h)
    y_val.append(y_pred[0])

...and plot the result:

In [None]:
plt.figure()
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('age', fontsize=16)
plt.ylabel('weight', fontsize=16)
plt.scatter(X_train, y_train, s=5)
plt.plot(x_val, y_val, c='red', lw=3, label='Predictions (k={})'.format(k))
plt.legend(loc="lower right",  fontsize=16)
plt.tight_layout()
plt.show()

Again, you can plot the results for different values of $k$ and see how it affects the regression line.
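
The two extreme choices of $k$ give a feeling for what to expect. The sketch below uses made-up 1d data (a noisy upward trend, not the Howell data) and a small hypothetical helper `knn_regress` to show that $k=1$ reproduces every training target exactly, while setting $k$ to the number of training points predicts the global mean everywhere.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k):
    """Unweighted k-nearest-neighbor regression for 1-dimensional inputs."""
    distances = np.abs(x_train - x_query)
    nearest = np.argsort(distances, kind="stable")[:k]
    return y_train[nearest].mean()

# Made-up 1d data: a noisy upward trend with distinct x values
rng = np.random.default_rng(0)
x_syn = np.linspace(0.0, 10.0, 21)
y_syn = 2.0 * x_syn + rng.normal(scale=1.0, size=x_syn.size)

# Predict at the training inputs themselves for the two extreme values of k
pred_k1 = np.array([knn_regress(x_syn, y_syn, x, k=1) for x in x_syn])
pred_kn = np.array([knn_regress(x_syn, y_syn, x, k=x_syn.size) for x in x_syn])

print(np.allclose(pred_k1, y_syn))          # k=1 follows every training point
print(np.allclose(pred_kn, y_syn.mean()))   # k=n is a flat line at the mean
```

Values of $k$ in between trade off these two behaviors: small values give a jagged line that follows the noise, large values give a smoother line that may miss the trend.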

### 2d Case: Predict Weight Based on Age and Height

We can do the same regression task but now for two input features (age + height). The regressor, of course, does not care. The only difference is that we no longer can simply plot the respective regression line.

#### Create Training and Test Data

In [None]:
# Convert data to numpy arrays
X = df[['height', 'age']].to_numpy()
y = df[['weight']].to_numpy().squeeze()

# Let's go for an 80%/20% split -- you can change the value and see its effects
train_test_ratio = 0.80

# Calculate the size of the training data (the size of the test data is also implicitly given)
train_set_size = int(train_test_ratio * len(X))

# Split data and labels into training and test data with respect to the size of the training data
X_train, X_test = X[:train_set_size], X[train_set_size:]
y_train, y_test = y[:train_set_size], y[train_set_size:]

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))
print(len(X_test), len(y_test))

Now that we have 2 input features, we need a 3d plot for visualization.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('height')
ax.set_ylabel('age')
ax.set_zlabel('weight')
ax.scatter(X_train[:,0], X_train[:,1], y_train)
ax.view_init(20, 130)
plt.tight_layout()
plt.show() 

#### Train KNN Regressor

We can again train a KNN regressor for different $k$ values and check the resulting RMSE. First, let's run a single training with $k=3$.

In [None]:
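# Keep k in a variable so the model and the printed message always agree
k = 3

howell_regressor = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)

y_pred = howell_regressor.predict(X_test)

# RMSE = square root of the MSE; this works whether or not the installed
# scikit-learn version supports the `squared=False` argument
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print('The Root Mean Squared Error (RMSE) for k={}: {:.3f}'.format(k, rmse))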

#### Evaluate Different Values for k

As for the 1d case, we can evaluate the KNN regressors for different $k$ values and plot the RMSE values.

In [None]:
ks, mses = [], []

best_k, best_mse = None, np.inf

for k in tqdm(range(1, 100)):
    ks.append(k)
    howell_regressor = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    y_pred = howell_regressor.predict(X_test)
    mse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE, despite the variable name
    mses.append(mse)
    if mse < best_mse:
        best_mse = mse
        best_k = k
        
print('The lowest RMSE was achieved with k={}'.format(best_k))

Like before, we can also plot the results.

In [None]:
plt.figure()
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('K', fontsize=16)
plt.ylabel('Root Mean Squared Error (RMSE)', fontsize=16)
plt.plot(ks, mses, lw=3)
plt.tight_layout()
plt.show()

Overall, the RMSE has gone down a bit compared to the 1d case. This shouldn't be surprising, as we use an additional input feature, which generally allows for better predictions.

### Effects of Normalization/Standardization

As the last part of this notebook, let's look at the effect of normalization/standardization of the data on the classification results. Note that this also affects the regression results, but the differences are more intuitive in the case of classification.

We use the IRIS dataset again.

#### Load Data

In [None]:
df_raw = pd.read_csv('data/datasets/iris/iris.csv')

# Convert the species name to numerical categories 0, 1, 2
df_raw['species'] = pd.factorize(df_raw['species'])[0]

# The rows are sorted, so let's shuffle them
df_raw = df_raw.sample(frac=1, random_state=5).reset_index(drop=True)

# Show the first 5 rows
df_raw.head()

The original dataset is as good as it gets: all 4 features are lengths measured in $cm$ and they are of similar magnitude. So let's create a modified dataset where we assume that both `petal_length` and `petal_width` have been measured in $m$.

In [None]:
# Create a copy of the original data
df_mod = df_raw.copy()

# Convert cm to m for petal length and petal width
df_mod['petal_length'] = df_mod['petal_length'] / 100
df_mod['petal_width'] = df_mod['petal_width'] / 100

df_mod.head()

#### Create Training and Test Data

We create the training and test data for both the raw and the modified input; the class labels naturally remain the same.

In [None]:
# Convert data to numpy arrays
X_raw = df_raw[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].to_numpy()
X_mod = df_mod[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].to_numpy()
# The class labels are of course still the same for both datasets
y = df_mod[['species']].to_numpy().squeeze()


## OPTIONAL:
## The classification task over the original IRIS dataset is too easy, most classifiers perform perfectly
## We therefore add noise in terms of additional 20 features with random values
#random_state = np.random.RandomState(0)
#n_samples, n_features = X_raw.shape
#R = random_state.randn(n_samples, 20 * n_features)
#X_raw = np.concatenate((X_raw, R), axis=1)
#X_mod = np.concatenate((X_mod, R), axis=1)

# Let's go for an 80%/20% split -- you can change the value and see its effects
train_test_ratio = 0.80

# Calculate the size of the training data (the size of the test data is also implicitly given)
train_set_size = int(train_test_ratio * len(X_raw))

# Split data and labels into training and test data with respect to the size of the training data
X_raw_train, X_raw_test = X_raw[:train_set_size], X_raw[train_set_size:]
X_mod_train, X_mod_test = X_mod[:train_set_size], X_mod[train_set_size:]
y_train, y_test = y[:train_set_size], y[train_set_size:]

print("Size of training set: {}".format(len(X_raw_train)))
print("Size of test: {}".format(len(X_raw_test)))

#### Train and Test KNN Classifiers

We can now train and test two KNN classifiers, one for the original and one for the modified datasets. Try different $k$ values and check the differences.

In [None]:
k = 17

knn_raw = KNeighborsClassifier(n_neighbors=k).fit(X_raw_train, y_train)
knn_mod = KNeighborsClassifier(n_neighbors=k).fit(X_mod_train, y_train)

y_raw_pred = knn_raw.predict(X_raw_test)
y_mod_pred = knn_mod.predict(X_mod_test)

f1_raw = f1_score(y_test, y_raw_pred, average='macro')
f1_mod = f1_score(y_test, y_mod_pred, average='macro')

print('F1 score (raw): {:.3f}'.format(f1_raw))
print('F1 score (mod): {:.3f}'.format(f1_mod))

Overall, the classifier trained over the original dataset will yield better f1-scores. In a more systematic fashion, we can compare the f1-scores of both classifiers for different values of $k$ and plot the results.

In [None]:
ks, f1_raw, f1_mod = [], [], []

for k in tqdm(range(1, 120)):
    ks.append(k)
    
    knn_raw = KNeighborsClassifier(n_neighbors=k).fit(X_raw_train, y_train)
    knn_mod = KNeighborsClassifier(n_neighbors=k).fit(X_mod_train, y_train)

    y_raw_pred = knn_raw.predict(X_raw_test)
    y_mod_pred = knn_mod.predict(X_mod_test)

    f1_raw.append(f1_score(y_test, y_raw_pred, average='macro'))
    f1_mod.append(f1_score(y_test, y_mod_pred, average='macro'))

In [None]:
plt.figure()
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('K', fontsize=16)
plt.ylabel('F1 Score', fontsize=16)
#plt.plot(x, f1_scores_micro, label='micro')
plt.plot(ks, f1_raw, label='Original Data', c='#FF0000', lw=2)
plt.plot(ks, f1_mod, label='Modified Data', c='#00FF00', lw=2)
plt.legend(loc="upper right", prop={'size': 14})
plt.tight_layout()
plt.show()

For any meaningful value of $k$, the classifier trained on the original dataset performs better.

#### Scale Modified Data

We now assume that the modified data is the only dataset we have. Inspecting it, we would see that the input features have (very) different magnitudes. In this case, it is typically a good approach to normalize or standardize the input features.

In the following, we use out-of-the-box methods for standardization. We covered this in more detail in the Data Preprocessing notebook.

In [None]:
scaler = preprocessing.StandardScaler().fit(X_mod_train)

X_scaled_train = scaler.transform(X_mod_train)
X_scaled_test = scaler.transform(X_mod_test)
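
To make the transformation above more concrete, here is a minimal sketch of what standardization computes: each feature is shifted by its mean and divided by its standard deviation, $z = (x - \mu) / \sigma$, with both statistics estimated on the training data only. The toy arrays `X_train_toy` and `X_test_toy` are made-up values for illustration and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made-up values); the second feature is much larger in magnitude
X_train_toy = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
X_test_toy = np.array([[2.5, 2000.0]])

# Mean and standard deviation are estimated on the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# The same statistics are applied to both the training and the test data
X_train_manual = (X_train_toy - mean) / std
X_test_manual = (X_test_toy - mean) / std

# Compare with scikit-learn's StandardScaler fitted on the same toy data
toy_scaler = StandardScaler().fit(X_train_toy)
same_train = np.allclose(X_train_manual, toy_scaler.transform(X_train_toy))
same_test = np.allclose(X_test_manual, toy_scaler.transform(X_test_toy))
print(same_train, same_test)
```

Note that the test data is transformed with the training statistics rather than its own, which mirrors fitting the scaler on `X_mod_train` only.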

Now we perform the same test, this time comparing a classifier trained on the original data (which we know is in good shape) with a classifier trained on the modified data *after* having applied standardization.

In [None]:
ks, f1_raw, f1_scaled = [], [], []

for k in tqdm(range(1, 120)):
    ks.append(k)
    
    knn_raw = KNeighborsClassifier(n_neighbors=k).fit(X_raw_train, y_train)
    knn_scaled = KNeighborsClassifier(n_neighbors=k).fit(X_scaled_train, y_train)

    y_raw_pred = knn_raw.predict(X_raw_test)
    y_scaled_pred = knn_scaled.predict(X_scaled_test)

    f1_raw.append(f1_score(y_test, y_raw_pred, average='macro'))
    f1_scaled.append(f1_score(y_test, y_scaled_pred, average='macro'))

...and plotting the results:

In [None]:
plt.figure()
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('K', fontsize=16)
plt.ylabel('F1 Score', fontsize=16)
plt.plot(ks, f1_raw, label='Original Data', c='#FF0000', lw=2)
plt.plot(ks, f1_scaled, label='Scaled Data', c='#00FF00', lw=2)
plt.legend(loc="upper right", prop={'size': 14})
plt.tight_layout()
plt.show()

As you can see, the results are now very comparable, confirming the benefits of standardization in this use case.

## Summary

The KNN (k-nearest neighbors) model is a machine learning algorithm that can be applied to both classification and regression tasks. It is a simple yet powerful method that makes predictions based on the similarity between a new data point and its k nearest neighbors in the training data.

One of the notable advantages of the KNN model is its simplicity. It is easy to understand and implement, making it an ideal choice for beginners in machine learning. Additionally, KNN does not make any assumptions about the underlying data distribution, which gives it the ability to handle a wide range of data types and structures. Moreover, KNN can handle multi-class classification problems without requiring any additional modifications.

However, the KNN model has its limitations. One drawback is its computational cost during prediction, especially for large datasets. Since the algorithm needs to calculate the distances between the new data point and all existing training points, the time complexity can become prohibitive as the dataset grows. Another consideration is the choice of the number of neighbors (k). An inappropriate value of k can lead to underfitting or overfitting, affecting the model's performance. Furthermore, KNN is sensitive to the presence of irrelevant or noisy features, as these can introduce biases in the distance calculations.
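
The prediction cost described above can be illustrated with a minimal brute-force sketch. The function `knn_predict_brute_force` and the toy data are hypothetical and only serve as an illustration; this is not how `KNeighborsClassifier` is implemented internally. The point is that a single prediction touches every training point.

```python
import numpy as np
from collections import Counter

def knn_predict_brute_force(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    # One Euclidean distance per training point: the cost grows with the training set
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of the k nearest neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made-up values): two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict_brute_force(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict_brute_force(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```

In practice, scikit-learn's `KNeighborsClassifier` exposes an `algorithm` parameter (e.g., `'kd_tree'`, `'ball_tree'`, `'brute'`) that can reduce the number of distance computations, although the benefit depends on the data and its dimensionality.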

Similar to clustering, KNN relies on an explicit notion of similarity/distance between data points. As such, KNN-based classification and regression are affected by the scale and range of the different features of a dataset. Therefore, normalization/standardization becomes crucial to ensure meaningful results.
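
A small worked example may help to see why feature scale matters for distance-based methods. The points and the `scale` vector below are made-up values chosen for illustration: one feature lies roughly in $[0, 1]$, the other is in the thousands. Without rescaling, the large feature dominates the Euclidean distance; after rescaling, the order of the two neighbors flips.

```python
import numpy as np

# Hypothetical points: feature 0 is small, feature 1 is in the thousands
query = np.array([0.1, 5000.0])
a = np.array([0.9, 5010.0])  # very different in feature 0, close in feature 1
b = np.array([0.1, 5100.0])  # identical in feature 0, farther away in feature 1

# Raw Euclidean distances: feature 1 dominates, so a appears closer than b
dist_a = np.linalg.norm(query - a)
dist_b = np.linalg.norm(query - b)

# Divide each feature by an assumed typical spread (illustrative values)
scale = np.array([0.5, 1000.0])
dist_a_scaled = np.linalg.norm((query - a) / scale)
dist_b_scaled = np.linalg.norm((query - b) / scale)

print(dist_a < dist_b)                # a is the nearer point on the raw data
print(dist_a_scaled > dist_b_scaled)  # b is the nearer point after rescaling
```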

In summary, the KNN model is a simple and flexible algorithm suitable for classification and regression tasks. Its strengths lie in its simplicity, ability to handle multi-class classification, and lack of assumptions about the data distribution. However, it also has limitations, such as its computational cost, sensitivity to the choice of k, and vulnerability to irrelevant features. Understanding these pros and cons can guide practitioners in effectively applying the KNN model and addressing its limitations in various machine learning scenarios.