Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# CSE204 - Introduction to Machine Learning - Lab Session 3: k-Nearest Neighbors & Logistic Regression

<img src="https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/logo.jpg" style="float: left; width: 15%" />

[CSE204-2021](https://moodle.polytechnique.fr/course/view.php?id=12838) Lab session #03

Jérémie DECOCK - Adrien EHRHARDT

## Objectives

- Implement the *(k)-Nearest Neighbor(s)* algorithm
- Use it to solve classification and regression problems
- Define the decision boundaries
- Explain the weaknesses of this algorithm
- Implement Logistic Regression

## Imports and tool functions

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import pandas as pd
import sklearn.linear_model
import sklearn.neighbors
from sklearn.utils import shuffle
from scipy.spatial import Voronoi, voronoi_plot_2d

In [None]:
def gen_2d_classification_samples(n_samples: int = 20, nclass: int = 3) -> pd.DataFrame:
    """
    Generates 2-dimensional samples which belong to either 2 or 3 classes

    :param int n_samples: number of samples to draw per class
    :param int nclass: number of classes the samples belong to (either 2 or 3)
    :returns: dataframe containing X (2 coordinates x1, x2) and y (as int!)
    """
    cov = np.diag([2., 2.])

    x1 = np.random.multivariate_normal(mean=[0., 0.], cov=cov, size=n_samples)
    y1 = np.full(n_samples, 1, dtype=int)

    x2 = np.random.multivariate_normal(mean=[4., 0.], cov=cov, size=n_samples)
    y2 = np.full(n_samples, 2, dtype=int)

    x3 = np.random.multivariate_normal(mean=[2., 4.], cov=cov, size=n_samples)
    y3 = np.full(n_samples, 3, dtype=int)

    if nclass == 3:
        X = np.concatenate([x1, x2, x3])
        y = np.concatenate([y1, y2, y3])
    elif nclass == 2:
        X = np.concatenate([x1, x2])
        y = np.concatenate([y1, y2])
    else:
        raise ValueError("Only 2 or 3 classes")

    df = pd.DataFrame(X, columns=['x1', 'x2'])
    df['y'] = y

    df = shuffle(df).reset_index(drop=True)
    
    return df

In [None]:
def gen_and_plot_1d_regression_samples(n_samples : int = 40):
    """
    Generates 1-dimensional regression samples

    :param int n_samples: number of samples to draw
    :returns: dataframe containing X (1 coordinate x) and y
    """
    x = np.random.uniform(low=-10., high=10., size=n_samples)
    # This is y = 2x + 3 + epsilon, similar to lab_session_02
    y = 2. * x + 3. + np.random.normal(scale=3., size=x.shape)

    df = pd.DataFrame(np.array([x, y]).T, columns=['x', 'y'])
    df.plot.scatter(x='x', y='y');
    return df

In [None]:
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

def plot_2d_classification_samples(dataframe: pd.DataFrame, model=None, voronoi: bool = False):
    """
    Plots the 2D classification problem, possibly with the results of a given model and the Voronoi cells.
    """
    plt.figure(figsize=(8, 8))
    df = dataframe  # short alias; the dataframe is only read, never modified
    
    ERROR_MSG1 = "The `dataframe` parameter should be a Pandas DataFrame having the following columns: ['x1', 'x2', 'y']"
    assert df.columns.values.tolist() == ['x1', 'x2', 'y'], ERROR_MSG1
    
    ERROR_MSG2 = "The `dataframe` parameter should be a Pandas DataFrame having the following labels (in column 'y'): [1, 2, 3]"
    labels = pd.unique(df.y).tolist()
    labels.sort()
    assert labels == [1, 2, 3] or labels == [1, 3] or labels == [1, 2], ERROR_MSG2

    if model is not None:
        if voronoi:
            # Compute the Voronoi cells            
            vor = Voronoi(df[['x1', 'x2']])

            # Plot the Voronoi diagram
            fig = voronoi_plot_2d(vor, show_vertices=False, show_points=False);
            fig.set_size_inches(8, 8);
        
        # Compute the model's decision boundaries
        h = .02  # step size in the mesh
        x_min, x_max = df.x1.min() - 1, df.x1.max() + 1
        y_min, y_max = df.x2.min() - 1, df.x2.max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        
        # Plot the model's decision boundaries
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=0.5)

    # Plot also the training points
    plt.scatter(df.x1, df.x2, c=df.y, cmap=cmap_bold, edgecolor='k', s=30)
    plt.xlabel(r"$x_1$", fontsize=16)
    plt.ylabel(r"$x_2$", fontsize=16)

## Nearest Neighbor algorithm

Today you will implement one of the simplest (but quite powerful) machine learning algorithms: the *Nearest Neighbor* algorithm and its extension, the *k-Nearest Neighbors* algorithm (or *kNN*). Both can be used for classification and regression tasks.

(We'll also cover Logistic Regression in the last parts of the lab - exactly the reverse of the lectures.)

Considering a dataset $\mathcal{D}=\{(\boldsymbol{x}_i, y_i)_{i=1,\dots,n}\}$ of $n$ labeled examples, the *Nearest Neighbor* model assigns an input vector $\boldsymbol{x}$ (of dimension $p$) to the label $y_{{\arg\!\min}_{i=1,\dots, n}d(\boldsymbol{x}, \boldsymbol{x}_i)}$ of its closest neighbor in $\mathcal{D}$.

The closest neighbor is defined w.r.t. a distance function $d$. This can be any metric, but the *Minkowski distance* (especially the classical Euclidean distance $d_2$) is the most common choice. It is defined as follows:

$$d_q: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$$

$$d_q(\boldsymbol{u}, \boldsymbol{v}) = ||\boldsymbol{u} - \boldsymbol{v}||_q = \left( \sum_{j=1}^p |u_j - v_j|^q \right)^{1/q}$$

When $q=2$, $d_q$ is the *Euclidean distance*

$$d_2(\boldsymbol{u}, \boldsymbol{v}) = \sqrt{\sum_{j=1}^{p} (u_j - v_j)^2}$$

When $q=1$, $d_q$ is the *Manhattan distance*

$$d_1(\boldsymbol{u}, \boldsymbol{v}) = \sum_{j=1}^{p} |u_j - v_j|$$

When $q=\infty$, $d_q$ is the *Chebyshev distance*

$$d_{\infty}(\boldsymbol{u}, \boldsymbol{v}) = \max_{j=1,\dots,p} |u_j - v_j|$$
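
As a quick sanity check of the three special cases above, the following sketch evaluates $d_1$, $d_2$ and $d_\infty$ on two arbitrary vectors. The helper name `minkowski_distance` and the vectors `u` and `v` are illustrative and not part of the lab's code; `np.linalg.norm` with its `ord` argument should give the same values.

```python
import numpy as np


def minkowski_distance(u, v, q):
    """Minkowski distance d_q between two vectors (hypothetical helper, q >= 1 or np.inf)."""
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    if np.isinf(q):
        # Limit case q -> infinity: largest coordinate-wise difference
        return diff.max()
    return (diff ** q).sum() ** (1.0 / q)


# Two arbitrary example vectors in R^2
u = [0., 0.]
v = [3., 4.]

d1 = minkowski_distance(u, v, 1)         # Manhattan: 3 + 4
d2 = minkowski_distance(u, v, 2)         # Euclidean: sqrt(9 + 16)
dinf = minkowski_distance(u, v, np.inf)  # Chebyshev: max(3, 4)
print(d1, d2, dinf)
```

On the same pair of points the three distances differ, which is why the choice of $q$ can change which neighbor is the closest one.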

### Exercise 1

We consider the following dataset (where `x1` and `x2` are examples' features and where `y` represents the examples' labels):

In [None]:
data = [[0, 0, 1],
        [0, 1, 1],
        [1, 1, 2],
        [1, 0, 3]]

df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
df

In [None]:
plot_2d_classification_samples(df)

#### Question 1

Which label (red/1, green/2, blue/3) will be predicted by the Nearest Neighbor algorithm for the point $x = \pmatrix{0 \\ 0.5}$ ?

YOUR ANSWER HERE

#### Question 2

If you have $n$ examples in $p$ dimensions in the dataset, what is the training error of the Nearest Neighbor algorithm (in classification)? Why?

YOUR ANSWER HERE

### Exercise 2

Consider this new dataset (where `volume (mL)` and `caffeine (g)` are the examples' features and where `drink` is their label):

In [None]:
data = [[250, 0.025, 'tea'],
        [100, 0.01,  'tea'],
        [125, 0.05,  'coffee'],
        [250, 0.1,   'coffee']]

df = pd.DataFrame(data, columns=['volume (mL)', 'caffeine (g)', 'drink'])
df

#### Question 1

Use the Nearest Neighbor method to predict the label of a 125mL drink having 0.015g of caffeine, by intuition, calculation, and / or with some code (up to you, with some justification).

YOUR ANSWER HERE

In [None]:
# Optional: provide some code to prove your answer.
# YOUR CODE HERE
raise NotImplementedError()

#### Question 2

What is wrong with this prediction? How can this problem be solved?

YOUR ANSWER HERE

## Nearest Neighbor method with Scikit Learn

Let's play with the Scikit Learn implementation of the Nearest Neighbor algorithm.
The official documentation is there: https://scikit-learn.org/stable/modules/neighbors.html

### Classification

We begin with a "toy" **classification problem**.

Use the `gen_2d_classification_samples()` function (defined above) to generate a dataset.

In [None]:
df = gen_2d_classification_samples(n_samples=20)
df.head()

Here, examples are defined in $\mathbb{R}^2$ (features are stored in columns `x1` and `x2`).
Examples' labels are defined in the `y` column. This is similar to Exercise 1.

The `y` column contains three possible labels: `1`, `2` and `3` respectively represented by the red, green and blue colors in the following figure.

In [None]:
plot_2d_classification_samples(df)

Thus this toy problem is a multiclass classification problem.

Once the dataset is ready, let's make the classifier and train it with the following code:

In [None]:
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)  # 1-NN as a special case of k-NN

In [None]:
model.fit(X=df[['x1', 'x2']], y=df['y'])

### Exercise 3

#### Question 1

Use the `model.predict()` function to guess the class of the following points:

$$x_{p1} = \pmatrix{-2 \\ 2}, x_{p2} = \pmatrix{2 \\ 6}, x_{p3} = \pmatrix{6 \\ 0}$$

Store the result in `model_predictions`.

In [None]:
# model_predictions = ...
# YOUR CODE HERE
raise NotImplementedError()

#### Question 2

Is the training step (`model.fit()` function) longer to execute than the prediction step (`model.predict()` function)? Why?

*An intuitive answer is expected; you don't have to time any code execution.*

YOUR ANSWER HERE

#### Question 3

The next cell shows the decision boundary of the model. Explain what a decision boundary is in classification.

In [None]:
plot_2d_classification_samples(df, model=model)

YOUR ANSWER HERE

#### Question 4

The next cell generates the *Voronoï diagram* of the dataset. The Voronoï diagram makes a partition of the feature space $\mathcal{X}$.
Each partition is a *cell*. What do cells represent?
What does this figure illustrate about the Nearest Neighbor method?

In [None]:
plot_2d_classification_samples(df, model=model, voronoi=True);

YOUR ANSWER HERE

### Regression

After the "toy" classification problem, let's work on a toy **regression problem**.

The next cell generates a dataset (where 'x' is the feature and 'y' the label to predict).

In [None]:
df = gen_and_plot_1d_regression_samples()

Once the dataset is ready, let's make the regressor and train it with the following code:

In [None]:
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=1)

In [None]:
model.fit(df[['x']], df['y'])

### Exercise 4

Use the `model.predict()` function to predict the value of $y$ at the following points:

$$x_{p1} = \pmatrix{-2}, x_{p2} = \pmatrix{2}, x_{p3} = \pmatrix{6}$$

Store it in `model_predictions_bis`.

In [None]:
# model_predictions_bis = ...
# YOUR CODE HERE
raise NotImplementedError()

### Plot the model's decision function

In [None]:
x_pred = np.arange(-10, 10, 0.1).reshape(-1, 1)
y_pred = model.predict(x_pred)

df_pred = pd.DataFrame(np.array([x_pred.flatten(), y_pred.flatten()]).T, columns=['x', 'y'])

ax = df.plot.scatter(x='x', y='y')
df_pred.plot(x='x', y='y', style='r--', ax=ax);

### Exercise 5

Do you think this model *generalizes* well (by generalization, we mean performance on unseen - test - examples drawn from the same distribution as the seen - training - examples; recall what happened with polynomial regression with a high degree)? Why?

YOUR ANSWER HERE

## k-Nearest Neighbors algorithm

The *Nearest Neighbor* method is very sensitive to noise: if an example in $\mathcal{D}$ is wrongly labeled or positioned, all points in its Voronoï cell will be mispredicted too. The *k Nearest Neighbors* algorithm fixes this weakness by considering, for each prediction, the labels of several neighbors instead of just one.

Considering a dataset $\mathcal{D}=\{(\boldsymbol{x}_i, y_i)_{i=1,\dots,n}\}$ of $n$ labeled examples and a hyperparameter $k \in \mathbb{N}^*$, the *$k$ Nearest Neighbors* model assigns to an input vector $\boldsymbol{x}$ a label $y$ (defined below) computed from its $k$ closest neighbors in $\mathcal{D}$.
Let's write $\mathcal{N}_k(\boldsymbol{x})$ the set of the $k$ nearest neighbors of $\boldsymbol{x}$ in $\mathcal{D}$.

- For classification problems, the label assigned to $\boldsymbol{x}$ is the **most represented label** among the nearest neighbors (majority vote)
$$f(\boldsymbol{x}) = {\arg\!\max}_c \sum_{i: x_i \in \mathcal{N}_k(\boldsymbol{x})} \delta(y_i, c)$$

- For regression problems, the label assigned to $\boldsymbol{x}$ is computed based on the **mean** of the labels of its nearest neighbors $\mathcal{N}_k(\boldsymbol{x})$
$$f(\boldsymbol{x}) = \frac{1}{k} \sum_{i: x_i \in \mathcal{N}_k(\boldsymbol{x})} y_i$$
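
To make the two aggregation rules concrete, here is a small sketch that applies them to the labels of an already-selected neighborhood $\mathcal{N}_k(\boldsymbol{x})$ with $k=5$. The label arrays are made-up values for illustration, and finding the neighbors themselves is deliberately left out.

```python
import numpy as np

# Made-up labels of the k = 5 nearest neighbors of some point x
neighbor_classes = np.array([1, 3, 1, 2, 1])            # classification labels
neighbor_targets = np.array([2.0, 4.0, 3.0, 5.0, 6.0])  # regression targets

# Classification: majority vote, i.e. argmax over c of the number of neighbors with label c
classes, counts = np.unique(neighbor_classes, return_counts=True)
predicted_class = classes[np.argmax(counts)]

# Regression: mean of the neighbors' targets
predicted_value = neighbor_targets.mean()

print(predicted_class, predicted_value)
```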

### Exercise 6

We consider the following dataset (where `x1` and `x2` are the example features and where `y` is the example label):

In [None]:
data = [[1, 2, '+'],
        [2, 1, '+'],
        [2, 2, '-'],
        [2, 3, '+'],
        [3, 1, '-'],
        [3, 2, '+']]

df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
df

#### Question 1

Draw this dataset (it is OK to draw it on a sheet of paper: empty the code cell, add a Markdown cell and upload your picture; you can also make use of `df.plot`).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Question 2

Draw the decision boundary of a Nearest Neighbor model (i.e. 1NN - also OK on a sheet of paper).

You may need to convert `y` to an integer type and make good use of `KNeighborsClassifier` (see Exercise 3).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Question 3

Draw the decision boundary of a 3 Nearest Neighbor model (i.e. 3NN), either with code or on a sheet of paper.

*Hint:* The `n_neighbors` parameter provided to the model's constructor `KNeighborsClassifier` sets the number of neighbors to consider for each prediction (i.e. `n_neighbors` is the '$k$' of kNN).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Question 4

How many errors do these two classifiers make on the training dataset (i.e. on the provided points)?

YOUR ANSWER HERE

YOUR ANSWER HERE

#### Question 5

Which label is predicted by these two classifiers for the point $x = \pmatrix{4 \\ 0.5}$?

YOUR ANSWER HERE

## k-Nearest Neighbor (kNN) with Scikit Learn

### Classification

First we regenerate the dataset used throughout Exercise 3.

In [None]:
df = gen_2d_classification_samples()

In [None]:
plot_2d_classification_samples(df)

Then we instantiate the classifier, train it and plot the decision boundaries:

In [None]:
def learn_knn_and_plot(**kwargs):
    """
    Learns a kNN model and plots the points, their class and the decision boundaries

    :param kwargs: keyword arguments passed to KNeighborsClassifier
    """
    model = sklearn.neighbors.KNeighborsClassifier(**kwargs)
    model.fit(df[['x1', 'x2']], df['y'])
    plot_2d_classification_samples(df, model=model)

In [None]:
learn_knn_and_plot(n_neighbors=5)

### Exercise 7

#### Question 1

Change the value of the hyperparameter $k$ in the cell above, and observe what happens, i.e. plot the resulting boundaries with the subsequent cell.

What is the influence of the number of neighbors on the boundaries? (Bonus points for the two extreme cases!)

In [None]:
# learn_knn_and_plot(...)
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

#### Question 2

When you face a very noisy dataset (wrong labels, misplaced points, ...), should you increase or decrease $k$ ?

YOUR ANSWER HERE

#### Question 3

Is the Voronoi diagram useful for the kNN case (i.e. when $k>1$) ?

YOUR ANSWER HERE

#### Question 4

Plot the decision boundary with $k=2$ and describe what happens in case of equal vote (copy-paste the 3 previous lines of code).

In [None]:
# learn_knn_and_plot(...)
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

#### Question 5

Add the `weights = "distance"` parameter in `KNeighborsClassifier`'s constructor. What changes can you observe on the decision boundary? Explain how labels are computed with this new parameter.

In [None]:
# learn_knn_and_plot(...)
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

### Regression

First we regenerate the dataset (from Exercise 4 - Regression with 1-NN).

In [None]:
df = gen_and_plot_1d_regression_samples()

Then we instantiate the regressor, train it and plot its predictions:

In [None]:
def train_and_plot_knn_regressor(**kwargs):
    """
    Instantiates and fits a kNN regressor, then plots the training points as well as the predictions

    :param kwargs: keyword arguments passed to KNeighborsRegressor constructor
    """
    model = sklearn.neighbors.KNeighborsRegressor(**kwargs)
    model.fit(df[['x']], df['y'])

    x_pred = np.arange(-10, 10, 1).reshape(-1, 1)
    y_pred = model.predict(x_pred)

    df_pred = pd.DataFrame(np.array([x_pred.flatten(), y_pred.flatten()]).T, columns=['x', 'y'])

    ax = df.plot.scatter(x='x', y='y')
    df_pred.plot(x='x', y='y', style='r--', ax=ax);

In [None]:
train_and_plot_knn_regressor(n_neighbors=10)

### Exercise 8

*Recall*: The `n_neighbors` parameter provided to the model's constructor (here `KNeighborsRegressor`) sets the number of neighbors to consider for each prediction (i.e. `n_neighbors` is the '$k$' of kNN).

#### Question 1

Change the value of this parameter and observe what happens.

What is the influence of the number of neighbors on the decision function (again, bonus points for extreme cases)?

In [None]:
# train_and_plot_knn_regressor(...)
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

#### Question 2

When you face a very noisy dataset (wrong labels, misplaced points, ...), should you increase or decrease $k$ ?

YOUR ANSWER HERE

## Logistic Regression (Gradient Descent)

$k$-NN is a **non-parametric**, **classification** (with arbitrary number of classes) and **regression** algorithm.

"Natively", logistic regression is a **parametric binary classification algorithm**, so we generate a dataframe `df` whose label `y` takes only 2 values (`1` and `2`, which are shifted to `0` and `1` in Exercise 9).

In [None]:
df = gen_2d_classification_samples(n_samples = 100, nclass = 2)

In [None]:
plot_2d_classification_samples(df)

Logistic regression is very similar to linear regression insofar as it is a **parametric model** which aims at finding a parameter $\theta^\star$ such that $f_{\boldsymbol{\theta}}: \boldsymbol{x} \mapsto y$ provides a good "link" between input vectors $\boldsymbol{x} \in \mathbb{R}^p$ and output values $y \in \{0,1\}$ in a *learning set* $\mathcal{D} = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{1 \leq i \leq n}$ of $n$ examples:

$$
\theta^\star = \arg\!\min E(\theta, \mathcal{D}),
$$

where $E(\theta, \mathcal{D}) = - \sum_{i=1}^n \ln p_{\theta}(y^{(i)} | \boldsymbol{x}^{(i)})$.

This is called the log loss (machine learning community), or the negative log-likelihood (statistics community).

Fortunately, logistic regression is a rather simple model which states:

$$
p_{\theta}(1 | \boldsymbol{x}^{(i)}) = \frac{1}{1+\exp{(-\theta^T \boldsymbol{x}^{(i)})}}
$$

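The loss function $E(\theta, \mathcal{D})$ is convex, so any local minimum is a global minimum. When the two classes are not linearly separable (and the features are not collinear), $\theta^\star$ exists, is unique, and can be obtained by minimizing $E$, e.g. with gradient descent.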
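
The model and loss above can be evaluated numerically as in the following sketch. The data, the parameter vectors and the helper names `sigmoid` and `log_loss` are assumptions made for illustration (here `X` has one row per example with a leading column of ones for the intercept, which is not necessarily the layout used in the exercise below); the gradient is left out on purpose.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(theta, X, y):
    """E(theta, D) = - sum_i ln p_theta(y_i | x_i) for labels y_i in {0, 1}."""
    p1 = sigmoid(X @ theta)                 # p_theta(1 | x_i) for every example
    p_obs = np.where(y == 1, p1, 1.0 - p1)  # probability of the observed label
    return -np.log(p_obs).sum()


# Tiny made-up dataset: intercept column, then two features
X = np.array([[1., 0., 0.],
              [1., 4., 0.],
              [1., 1., 1.],
              [1., 3., -1.]])
y = np.array([0, 1, 0, 1])

theta_zero = np.zeros(3)               # predicts p = 0.5 everywhere
theta_guess = np.array([-2., 1., 0.])  # decision boundary at x1 = 2

print(log_loss(theta_zero, X, y))   # equals n * ln(2) since every p is 0.5
print(log_loss(theta_guess, X, y))
```

A parameter vector that puts the decision boundary between the two groups of points gives a lower loss than the all-zeros vector, which is the behavior the minimization exploits.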

### Exercise 9

Complete the following `gradient_descent` function implemented in the last lab to work for logistic regression. 

**Hint**: only the gradient needs to be changed.

In [None]:
X = np.array([np.ones((df.shape[0])), df.x1, df.x2])
y = np.array(df.y) - 1

In [None]:
def gradient_descent(X, y, eta=0.001, max_iteration=10000, initial_theta=None):

    if initial_theta is None:
        # The initial solution is selected randomly
        theta = np.random.normal(loc=0, scale=10, size=[3, 1])
    else:
        theta = initial_theta

    grad_list = []      # Keep the gradient of all iterations
    theta_list = []     # Keep the solution of all iterations

    for i in range(max_iteration):
        # Perform the gradient descent here
        # YOUR CODE HERE
        raise NotImplementedError()

    return grad_list, theta_list

(You don't have to fill in anything in the following cell; it's a general comment about the solution of the previous cell which will be provided in the solutions of this lab session.)

YOUR ANSWER HERE

In [None]:
grad_list, theta_list = gradient_descent(X, y, eta = 0.1)

Let's see if it has converged by plotting the parameters w.r.t. the iteration number:

In [None]:
plt.plot([theta[0] for theta in theta_list]);
plt.plot([theta[1] for theta in theta_list]);
plt.plot([theta[2] for theta in theta_list]);

Let's see the decision boundary, i.e. the half-spaces where each label is predicted.

In [None]:
x_min, x_max = np.array((df.x1, df.x2))[0, :].min() - .5, np.array((df.x1, df.x2))[0, :].max() + .5
y_min, y_max = np.array((df.x1, df.x2))[1, :].min() - .5, np.array((df.x1, df.x2))[1, :].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = 1/(1+np.exp(-np.dot(np.c_[np.ones((len(xx.ravel()))), xx.ravel(), yy.ravel()], theta_list[-1])))  # theta_list[-1]: last iterate
plt.pcolormesh(xx, yy, (Z.reshape(xx.shape) > 0.5)*1, cmap=plt.cm.Paired)
plt.scatter(df.x1, df.x2, c=y);

## Logistic Regression (scikit-learn)

### Exercise 10

Let's do the same using scikit-learn!

Similar to the previous lab, logistic regression belongs to the linear model module, and provides among others the `fit` and `predict` methods. Use the `fit` method to compare the coefficients obtained with scikit-learn and your gradient descent.

**Beware of the `C` parameter**, which is the inverse of the L2 regularization strength (see [docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)) - regularization will be the subject of subsequent lectures; set `C` to an arbitrarily high value, e.g. `1e9`, to effectively disable it.

In [None]:
# model = sklearn.linear_model.LogisticRegression(...)
# model.fit(...)

# YOUR CODE HERE
raise NotImplementedError()

Let's see if we came up with the same estimates:

In [None]:
print("Intercept")
print("=========")
print("sklearn:\t\t", model.intercept_[0])
print("gradient descent: \t", theta_list[-1][0])  # theta_list[-1]: last iterate
print("\n")

print("theta_1")
print("=========")
print("sklearn:\t\t", model.coef_[0][0])
print("gradient descent:\t", theta_list[-1][1])
print("\n")

print("theta_2")
print("=========")
print("sklearn:\t\t", model.coef_[0][1])
print("gradient descent:\t", theta_list[-1][2])

## Bonus

### Exercise 11

Solve the Titanic problem with the k Nearest Neighbors method (see [`lab_session_01`](https://htmlpreview.github.io/?https://github.com/adimajo/CSE204-2021/blob/master/lab_session_01/lab_session_01.html)). Reuse the code of the first lab session:
* read the data with `read_csv`;
* select the columns useful for prediction;
* drop the missing values;
* map the categorical columns to numerical values;
* split into a training and a test subset;
* instantiate a k-NN classifier named `knn_sklearn`;
* fit the model;
* compute an accuracy score.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Exercise 12

Write your own implementation for the k Nearest Neighbor algorithm.
Write a `knn()` function that takes four arguments:
- `xtrain`: the observed dataset;
- `ytrain`: the observed labels;
- `xpred`: a list of examples to predict;
- `n_neighbors`: the number of nearest neighbors to use.

This function should return the sequence of predicted labels.

In [None]:
from math import sqrt


def knn(xtrain, ytrain, xpred, n_neighbors=5):
    """
    Predicts the y values of xpred given xtrain, ytrain, and n_neighbors-nn classification

    :param pandas.DataFrame xtrain: the training set's features (you can use numpy arrays as well)
    :param pandas.DataFrame ytrain: the training set's labels (you can use numpy arrays as well)
    :param pandas.DataFrame xpred: the test set's features (you can use numpy arrays as well)
    :param int n_neighbors: number of nearest neighbors to use
    """
    # So as not to mess up with the original dataframes
    xtrain_cpy = xtrain.copy()
    ytrain_cpy = ytrain.copy()
    xpred_cpy = xpred.copy()
    # Store the distances in a matrix
    distances = np.zeros((xtrain_cpy.shape[0], xpred_cpy.shape[0]))
    # Store the predictions in a vector
    ypred = np.zeros(xpred_cpy.shape[0])
    # You might want to reset the index (to have the correct row numbers)
    xtrain_cpy.reset_index(inplace=True, drop=True)
    xpred_cpy.reset_index(inplace=True, drop=True)
    
    # Compute distances of each row x in xtrain to each row x' in xpred and put it in `distances`
    # YOUR CODE HERE
    raise NotImplementedError()

    # Average the labels of the `n_neighbors` closest points in xtrain of each row x in xpred
    # YOUR CODE HERE
    raise NotImplementedError()

    return ypred

In [None]:
# It is assumed you defined X_train, Y_train, X_test in Exercise 11.
your_predictions = knn(X_train, Y_train, X_test)

In [None]:
# It is assumed you "trained" knn_sklearn in Exercise 11.
sklearn_predictions = knn_sklearn.predict(X_test)

Are the two predictions the same?

In [None]:
assert (your_predictions == sklearn_predictions).all()