# SLU14 - k-Nearest Neighbors (kNN)

In this notebook we will have exercises covering the following topics:

- k-Nearest Neighbors Algorithm
- A Primer on Distance
- Some considerations about kNN
- Using kNN

In [1]:
# Place any important imports at the top of the notebook when possible
import hashlib
import json
import math
import os

import numpy as np
import pandas as pd

from sklearn import datasets

## Distances

### Exercise 1

Define a function called `euclidian` (note the spelling, which the tests below rely on). This function should receive two arguments, `a` and `b`, which are numpy arrays with shape `(N,)`, where `N` is the number of dimensions of the inputs `a` and `b`, and calculate the Euclidean distance between them.

If the two arrays don't have the same shape, return None.

In case the arguments are valid, return the euclidean distance between them.

Of course you know about the function [numpy.linalg.norm](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html), which does exactly what we're asking here, but please take this opportunity to really understand the euclidean distance! Feel free to use it to double check your answer.
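
As a small worked example of the double check suggested above, the sketch below computes the distance step by step and compares it with `numpy.linalg.norm`. The vectors are made-up values chosen so the result is easy to verify by hand (a 3-4-5 triangle); they are not part of the exercise.

```python
import numpy as np

# Hypothetical example vectors, chosen only for illustration.
a = np.array([3.0, 0.0])
b = np.array([0.0, 4.0])

# Step by step: difference, square, sum, square root.
diff = b - a                      # [-3., 4.]
squared = diff ** 2               # [9., 16.]
manual = np.sqrt(squared.sum())   # sqrt(25) = 5.0

# Cross-check with numpy's built-in vector norm.
reference = np.linalg.norm(b - a)

print(manual, reference)
```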

In [44]:
# implement a function called euclidian

def euclidian(a, b):
    """
    Euclidean distance between two vectors.
    
    Parameters
    ----------
    a: numpy array with shape (N,)
    b: numpy array with shape (N,)
    
    Returns
    ----------
    distance: float
    """

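    # Vectors of different shapes have no defined distance here.
    if a.shape != b.shape:
        return None

    # Square root of the sum of squared component differences.
    return np.sqrt(((b - a) ** 2).sum())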

In [45]:
# Test case 1
a = np.array([1, 2, 4])
b = np.array([-1, 0, 4])

assert math.isclose(euclidian(a, b), 2.8284, rel_tol=1e-03)


# Test case 2
a = np.array([1, 2])
b = np.array([-1, 0, 4])

assert euclidian(a, b) is None
             

# Test case 3
a = np.array([1])
b = np.array([-1])

assert math.isclose(euclidian(a, b), 2.0, rel_tol=1e-03)


# Test case 4
a = np.array([0, 0])
b = np.array([2, 3])

assert math.isclose(euclidian(a, b), 3.6055, rel_tol=1e-03)


# Test case 5
a = np.array([0, 1, 2, 3, 4])
b = np.array([0, -1, -2, -3, -4])

assert math.isclose(euclidian(a, b), 10.9544, rel_tol=1e-03)

### Exercise 2

Define a function called `dot_product`. This function should receive two arguments, `a` and `b`, which are numpy arrays with shape `(N,)`, where `N` is the number of dimensions of the inputs `a` and `b`.

You can assume the two arrays have the same shape.

The function should return the dot product between the arrays.

Of course you know about the function [numpy.dot](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html), which does exactly what we're asking here, but please take this opportunity to really understand the dot product! Feel free to use it to double check your answer.
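
To make the definition concrete, here is a minimal sketch that multiplies matching components, adds the products, and compares the result with `numpy.dot`. The two vectors are illustrative values, not taken from the tests.

```python
import numpy as np

# Hypothetical example vectors, chosen only for illustration.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Multiply matching components, then add the products:
# 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
manual = sum(a_i * b_i for a_i, b_i in zip(a, b))

# Cross-check with numpy's built-in dot product.
reference = np.dot(a, b)

print(manual, reference)
```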

In [46]:
# implement a function called dot_product

def dot_product(a, b):
    """
    Dot product between two vectors.
    
    Parameters
    ----------
    a: numpy array with shape (N,)
    b: numpy array with shape (N,)
    
    Returns
    ----------
    dot_product: float
    """

    return (a*b).sum()

In [47]:
tests = [
    {
        'input': [np.array([1, 2, 4]), np.array([-1, 0, 4])],
        'output_hash': 'e629fa6598d732768f7c726b4b621285f9c3b85303900aa912017db7617d8bdb'
    },
    {
        'input': [np.array([0, 0]), np.array([2, 3])],
        'output_hash': '5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9'
    },
    {
        'input': [np.array([0, 1, 2, 3, 4]), np.array([0, -1, -2, -3, -4])],
        'output_hash': '4cbaf3fbc9b6ccc6d363e9cac9d51c6d3012fc8991a30cbe952c5e92c7927d92'
    }
]

for test in tests:
    answer = dot_product(*test['input'])
    answer_hash = hashlib.sha256(bytes(str(answer), encoding='utf8')).hexdigest()
    
    assert answer_hash == test['output_hash']

### Exercise 3

Define a function called `cosine`. This function should receive two arguments, `a` and `b`, which are numpy arrays with shape `(N,)`, where `N` is the number of dimensions of the inputs `a` and `b`.

You can assume the two arrays have the same shape.

The function should return the cosine distance between the arrays.

Of course you know about the function [scipy.spatial.distance.cosine](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html), which does exactly what we're asking here, but please take this opportunity to really understand the cosine distance! Feel free to use it to double check your answer.

After you've implemented the function, take a moment to think about which values the cosine distance function can return.
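
One way to explore that question is to try vectors at a few known angles. The sketch below uses a hypothetical helper, `cosine_distance_sketch`, which assumes the usual definition of cosine distance as one minus the cosine of the angle between the vectors. The inputs are made-up values.

```python
import numpy as np

def cosine_distance_sketch(u, v):
    # Hypothetical helper: 1 - cos(angle between u and v).
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])

# Same direction (angle 0): the length of the vectors does not matter.
same_direction = cosine_distance_sketch(a, np.array([2.0, 0.0]))
# Perpendicular vectors (angle 90 degrees).
orthogonal = cosine_distance_sketch(a, np.array([0.0, 3.0]))
# Opposite direction (angle 180 degrees).
opposite = cosine_distance_sketch(a, np.array([-1.0, 0.0]))

print(same_direction, orthogonal, opposite)
```

These three cases suggest that the result stays between 0 and 2, which matches test case 3 below.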

In [48]:
# implement a function called cosine

def cosine(a, b):
    """
    Cosine distance between two vectors.
    
    Parameters
    ----------
    a: numpy array with shape (N,)
    b: numpy array with shape (N,)
    
    Returns
    ----------
    cosine: float
    """
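    # Cosine similarity is the dot product divided by the product of the norms;
    # the cosine distance is one minus that similarity.
    similarity = (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1 - similarity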

In [49]:
# Test case 1
a = np.array([1, 2, 4])
b = np.array([-1, 0, 4])

assert math.isclose(cosine(a, b), 0.2061, rel_tol=1e-03)


# Test case 2
a = np.array([0, 1])
b = np.array([1, 0])

assert math.isclose(cosine(a, b), 1.0, rel_tol=1e-03)


# Test case 3
a = np.array([0, 1, 2, 3, 4])
b = np.array([0, -1, -2, -3, -4])

assert math.isclose(cosine(a, b), 2.0, rel_tol=1e-03)

## Implementing the kNN algorithm

By hand! Let's do this!

![lets_do_this](media/lets_do_this.gif)

### Exercise 4

The first step is to implement a function that calculates a distance between one point and each other point in a dataset.

Let's implement a function called `calculate_distances`, that:

* receives three arguments:
    * x, which is a numpy array with shape (d,)
    * dataset, which is a numpy array with shape (N, d), where N is the dataset size
    * distance_type, which can be 'euclidean', 'cosine', 'dot'
* if distance_type is not valid (is not equal to 'euclidean', 'cosine' or 'dot'), raises an Exception
* otherwise calculates the distance between x and all the points in the dataset. Depending on the distance_type, you need to use one of the functions that we implemented above.
* returns a numpy array of shape (N,) with the calculated distances
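
For the Euclidean case, the row-by-row loop described above can also be written with numpy broadcasting. This is only a sketch on made-up toy data, shown as an alternative to compare against; the exercise itself asks you to reuse the functions from the previous exercises.

```python
import numpy as np

# Hypothetical toy data: three 2-D points and one query point.
dataset = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
x = np.array([0.0, 0.0])

# Loop version: one distance per row of the dataset.
looped = np.array([np.sqrt(((row - x) ** 2).sum()) for row in dataset])

# Broadcasting version: x with shape (d,) is subtracted from every row
# of dataset with shape (N, d); summing over axis=1 gives shape (N,).
vectorised = np.sqrt(((dataset - x) ** 2).sum(axis=1))

print(looped, vectorised)
```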

In [53]:
# implement a function called calculate_distances

def calculate_distances(x, dataset, distance_type):
    """
    Calculates a distance between a point and all the other points in a dataset.
    Supported distance functions are: euclidean, dot, cosine.
    
    Parameters
    ----------
    x: numpy array with shape (d,)
    dataset: numpy array with shape (N, d)
    distance_type: string
    
    Returns
    ----------
    distances: numpy array with shape (N,)
    """
    
    # Map each supported name to the distance function implemented above.
    distance_functions = {
        'euclidean': euclidian,
        'cosine': cosine,
        'dot': dot_product,
    }

    if distance_type not in distance_functions:
        raise Exception("Please choose between 'euclidean', 'cosine' or 'dot' distance.")

    # Apply the chosen distance between x and every row of the dataset.
    distance = distance_functions[distance_type]
    return np.array([distance(x, point) for point in dataset])

dataset = datasets.load_iris().data
x = np.array([4.9, 3.0, 6.1, 2.2])

# Testing with euclidean distance
distances = calculate_distances(x, dataset, 'euclidean')
print(distances[13])
print(distances[47])
print(distances[112])

5.4561891462814955
5.120546845796842
1.9949937343259998


In [32]:
def check_function(function, *args):
    try:
        function(*args)
    except Exception:
        # The call failed, as expected.
        return
    raise AssertionError(f"The function {function.__name__} was supposed to raise an exception")
    
dataset = datasets.load_iris().data
x = np.array([4.9, 3.0, 6.1, 2.2])

# Testing with euclidean distance
distances = calculate_distances(x, dataset, 'euclidean')

assert isinstance(distances, np.ndarray), "The function should return a numpy array!"
assert distances.shape == (150,), "The returned numpy array has the wrong shape!"
assert math.isclose(distances[13], 5.456189, rel_tol=1e-03), "The returned numpy array has the wrong values!"
assert math.isclose(distances[47], 5.120546, rel_tol=1e-03), "The returned numpy array has the wrong values!"
assert math.isclose(distances[112], 1.994993, rel_tol=1e-03), "The returned numpy array has the wrong values!"

# Testing with dot product distance
distances = calculate_distances(x, dataset, 'dot')

assert isinstance(distances, np.ndarray), "The function should return a numpy array!"
assert distances.shape == (150,), "The returned numpy array has the wrong shape!"
assert math.isclose(distances[13], 37.0, rel_tol=1e-03), "The returned numpy array has the wrong values!"
assert math.isclose(distances[47], 41.12, rel_tol=1e-03), "The returned numpy array has the wrong values!"
assert math.isclose(distances[112], 80.49, rel_tol=1e-03), "The returned numpy array has the wrong values!"

# Testing with cosine distance
distances = calculate_distances(x, dataset, 'cosine')

assert isinstance(distances, np.ndarray), "The function should return a numpy array!"
assert distances.shape == (150,), "The returned numpy array has the wrong shape!"
assert math.isclose(distances[13], 0.202958, rel_tol=1e-03), "The returned numpy array has the wrong values!"
assert math.isclose(distances[47], 0.17874, rel_tol=1e-03), "The returned numpy array has the wrong values!"
assert math.isclose(distances[112], 0.02015, rel_tol=1e-03), "The returned numpy array has the wrong values!"

# Testing with a wrong distance function name
check_function(calculate_distances, x, dataset, 'unknown_distance_function')


### Exercise 5

Now that we have a function that calculates the distance between one point and all the other points in a dataset, we need to find the point's nearest neighbors, which are the points in the dataset for which the distance is the minimum.

In this exercise, you'll implement a function called `find_nearest_neighbors`, that:

* receives two arguments:
    * distances, which is a numpy array with distances (like the one returned in the previous question)
    * k, which is the number of nearest neighbors that we want to consider
* gets the indexes of the k smallest distances (in ascending order)
* returns a numpy array of shape (k,) with those indexes


Hint: check [numpy.argsort](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html).
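
As an illustration of the hint, `numpy.argsort` returns the indexes that would sort an array in ascending order, so the first `k` entries point at the `k` smallest values. The distances below are made-up numbers.

```python
import numpy as np

# Hypothetical distances from a query point to four training points.
distances = np.array([0.9, 0.1, 0.5, 0.3])

# Indexes that would sort the distances in ascending order.
order = np.argsort(distances)   # [1, 3, 2, 0]

# The first k entries are the indexes of the k nearest neighbors.
k = 2
nearest = order[:k]             # [1, 3]

print(order, nearest)
```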

In [33]:
# implement a function called find_nearest_neighbors

def find_nearest_neighbors(distances, k):
    """
    Finds the k nearest neighbors
    
    Parameters
    ----------
    distances: numpy array with shape (N,)
    k: int, the number of nearest neighbors we want to consider
    
    Returns
    ----------
    indexes: numpy array with shape (k,)
    """

    return np.array(pd.Series(distances).nsmallest(k, keep='first').index)

In [34]:
# This is to make the random predictable
np.random.seed(42)

# Test case 1
knn = find_nearest_neighbors(np.random.rand(150), 3)
assert knn.shape == (3,)
assert hashlib.sha256(bytes(knn[2])).hexdigest() == '01d448afd928065458cf670b60f5a594d735af0172c8d67f22a81680132681ca'

# Test case 2
knn = find_nearest_neighbors(np.random.rand(49), 10)
assert knn.shape == (10,)
assert hashlib.sha256(bytes(knn[5])).hexdigest() == '11e431c215c5bd334cecbd43148274edf3ffdbd6cd6479fe279577fbe5f52ce6'

### Exercise 6

Now that we have a function that gets the indexes of the k nearest neighbors, we need to get the values of those neighbors, so that afterwards we can predict the label for our point.

In this exercise, you'll implement a function called `get_neighbors_labels`, that:

* receives two arguments, in this order:
    * y_train, which is a numpy array with the targets from the training set
    * neighbor_indexes, which are the indexes of the k nearest neighbors (like the output of the last function)
* gets the values from y_train using the indexes from neighbor_indexes
* returns a numpy array of shape (k,) with those values
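
One possible way to do the lookup is numpy's integer-array indexing, which selects one element per index and accepts repeated indexes. A minimal sketch with made-up targets:

```python
import numpy as np

# Hypothetical targets and neighbor indexes, for illustration only.
y_train = np.array([10, 20, 30, 40, 50])
neighbor_indexes = np.array([3, 0, 3])

# Indexing with an integer array returns one value per index.
labels = y_train[neighbor_indexes]   # [40, 10, 40]

print(labels)
```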

In [35]:
# implement a function called get_neighbors_labels

def get_neighbors_labels(y_train, neighbor_indexes):
    """
    Selects the label values from the k nearest neighbors
    
    Parameters
    ----------
    y_train: numpy array with shape (N,)
    neighbor_indexes: numpy array with shape (k,)
    
    Returns
    ----------
    labels: numpy array with shape (k,)
    """
    # Integer-array indexing picks one label per neighbor index.
    return np.asarray(y_train)[neighbor_indexes]

np.random.seed(42) 
v = np.random.rand(10)
f = np.random.randint(0, 3, 7)
answer = get_neighbors_labels(v, f)
answer[3]
f

array([1, 0, 1, 1, 1, 1, 0])

In [36]:
np.random.seed(42) 

# Test case 1
answer = get_neighbors_labels(np.random.rand(150), np.random.randint(0, 3, 3))
assert answer.shape == (3,)
assert math.isclose(answer[0], 0.37454, rel_tol=1e-03)

# Test case 2
answer = get_neighbors_labels(np.random.rand(10), np.random.randint(0, 3, 7))
assert answer.shape == (7,)
assert math.isclose(answer[3], 0.44778, rel_tol=1e-03)

### Exercise 7

Next we need to predict a label for our point based on the labels of the nearest neighbors.

In this exercise, you'll implement a function called `predict_label_majority`, that:

* receives one argument:
    * nn_labels, which are the labels from the k nearest neighbors
* returns the most frequent label
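
The exercise does not say what to do when two labels are equally frequent. The sketch below counts labels with `numpy.unique` and, as an assumption for illustration, breaks ties by picking the smallest label. The label arrays are made up.

```python
import numpy as np

# Hypothetical neighbor labels with a clear winner.
nn_labels = np.array([2, 0, 2, 1, 2])
values, counts = np.unique(nn_labels, return_counts=True)  # values are sorted
majority = int(values[np.argmax(counts)])                  # 2 occurs three times

# A tie: labels 0 and 1 both occur twice.
tied_labels = np.array([0, 1, 1, 0])
tied_values, tied_counts = np.unique(tied_labels, return_counts=True)
# argmax returns the first maximum, so the smallest label wins the tie.
tie_winner = int(tied_values[np.argmax(tied_counts)])

print(majority, tie_winner)
```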


In [37]:
# implement a function called predict_label_majority

def predict_label_majority(nn_labels):
    """
    Selects the most frequent label in nn_labels
    
    Parameters
    ----------
    nn_labels: numpy array with shape (k,)
    
    Returns
    ----------
    label: int
    """
    
    # Count how often each label occurs among the neighbors.
    counts = pd.Series(nn_labels).value_counts()
    # Keep the most frequent labels; on a tie, pick the smallest label.
    most_frequent = counts[counts == counts.max()].sort_index()
    return int(most_frequent.index[0])

In [38]:
np.random.seed(42) 

# Test case 1
answer = predict_label_majority(np.random.randint(0, 3, 3))
assert isinstance(answer, int)
assert hashlib.sha256(bytes(answer)).hexdigest() == '96a296d224f285c67bee93c30f8a309157f0daa35dc5b87e410b78630a09cfc7'


# Test case 2
answer = predict_label_majority(np.random.randint(0, 3, 5))
assert isinstance(answer, int)
assert hashlib.sha256(bytes(answer)).hexdigest() == 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'

### Exercise 8

Finally we can put everything together and implement the knn classifier!

In this exercise, you'll implement a function called `knn_classifier`, that:

* receives five arguments:
    * x, which is a numpy array with shape (d,)
    * dataset, which is a numpy array with shape (N, d), where N is the dataset size
    * targets, which is a numpy array with shape (N,), that has the targets for each of the points in the dataset
    * k, which is the number of nearest neighbors our knn algorithm will consider
    * distance_function, which can be 'euclidean', 'cosine', 'dot'
* uses all the functions that we implemented above in order to implement a knn_classifier!
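
A useful sanity check for a hand-written classifier is to compare it with scikit-learn on a tiny dataset where the answer is obvious. The sketch below does the four steps inline with the Euclidean distance and then asks `KNeighborsClassifier` for the same prediction. The two clusters and the query point are made-up values.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one cluster near the origin (label 0)
# and one cluster near (5, 5) (label 1).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([4.5, 4.5])
k = 3

# By hand: distances, k nearest indexes, their labels, majority vote.
distances = np.sqrt(((X - x) ** 2).sum(axis=1))
nearest = np.argsort(distances)[:k]
values, counts = np.unique(y[nearest], return_counts=True)
manual_label = int(values[np.argmax(counts)])

# Reference: scikit-learn expects a 2-D array of query points.
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X, y)
sklearn_label = int(clf.predict(x.reshape(1, -1))[0])

print(manual_label, sklearn_label)
```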

In [39]:
# implement a function called knn_classifier

def knn_classifier(x, dataset, targets, k, distance_function):
    """
    Predicts the label for a single point, given a dataset, a value for k and a distance function
    
    Parameters
    ----------
    x: numpy array with shape (d,)
    dataset: numpy array with shape (N, d)
    targets: numpy array with shape (N,)
    k: int
    distance_function: string
    
    Returns
    ----------
    label: int
    """
    distances = calculate_distances(x, dataset, distance_function)
    neighbor_indexes = find_nearest_neighbors(distances, k)
    nn_labels = get_neighbors_labels(targets, neighbor_indexes)
    label = predict_label_majority(nn_labels)
    return label

In [40]:
dataset = datasets.load_iris().data
targets = datasets.load_iris().target
x = np.array([4.9, 3.0, 6.1, 2.2])

tests = [
    {
        'input': [x, dataset, targets, 3, 'euclidean'],
        'expected_value': 2
    },
    {
        'input': [x, dataset, targets, 5, 'dot'],
        'expected_value': 0
    },
    {
        'input': [x, dataset, targets, 1, 'cosine'],
        'expected_value': 2
    }
]

for test in tests:
    pred_label = knn_classifier(*test['input'])
    assert isinstance(pred_label, int), "The function should return an integer!"
    assert pred_label == test['expected_value'], "The returned int has the wrong value!"


Now that we've implemented a knn classifier, let's go a bit further and implement a knn regressor!

Luckily, we can reuse most of the functions we've already implemented!

Keep up the good work, we're almost there!

![almost_there](media/almost_there.gif)

### Exercise 9

As we explained in the learning notebook, the main difference between a knn classifier and a knn regressor is the way we choose the predicted label from the labels of the nearest neighbors.

For the classifier case we used a majority vote. In the regressor case, we want to use the average value of the neighbors' labels.

In this exercise, you'll implement a function called `labels_average`, that:

* receives one argument:
    * nn_labels, which are the labels from the k nearest neighbors
* returns the average of the nearest neighbors' labels

In [59]:
# implement a function called labels_average

def labels_average(nn_labels):
    """
    Gets the average of the labels from the nearest neighbors
    
    Parameters
    ----------
    nn_labels: numpy array with shape (k,)
    
    Returns
    ----------
    label: float
    """
    
    return float(nn_labels.mean())

In [60]:
np.random.seed(42) 

label_average = labels_average(np.random.rand(3))
assert isinstance(label_average, float)
assert math.isclose(label_average, 0.685749, rel_tol=1e-04)

label_average = labels_average(np.random.rand(5))
assert isinstance(label_average, float)
assert math.isclose(label_average, 0.3669862, rel_tol=1e-04)

### Exercise 10

And we're ready to implement the knn regressor!

In this exercise, you'll implement a function called `knn_regressor`, that:

* receives five arguments:
    * x, which is a numpy array with shape (d,)
    * dataset, which is a numpy array with shape (N, d), where N is the dataset size, and d is the number of dimensions that the points in the dataset have
    * targets, which is a numpy array with shape (N,), that has the targets for each of the points in the dataset
    * k, which is the number of nearest neighbors our knn algorithm will consider
    * distance_function, which can be 'euclidean', 'cosine', 'dot'
* uses all the functions that we implemented above in order to implement a knn_regressor!
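
To see the whole regression pipeline on numbers that can be checked by hand, here is a minimal sketch on a one-dimensional toy dataset with the Euclidean distance. All values are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: four 1-D points with numeric targets.
X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([10.0, 20.0, 30.0, 100.0])
x = np.array([2.2])
k = 3

# Distances from x to each point: about [1.2, 0.2, 0.8, 7.8].
distances = np.sqrt(((X - x) ** 2).sum(axis=1))

# Indexes of the three closest points, nearest first.
nearest = np.argsort(distances)[:k]

# The prediction is the mean of their targets: (20 + 30 + 10) / 3.
prediction = float(y[nearest].mean())

print(nearest, prediction)
```

The far-away point with target 100 is ignored because it is not among the three nearest neighbors.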

In [61]:
# implement a function called knn_regressor

def knn_regressor(x, dataset, targets, k, distance_function):
    """
    Predicts the label for a single point, given a dataset, a value for k and a distance function
    
    Parameters
    ----------
    x: numpy array with shape (d,)
    dataset: numpy array with shape (N, d)
    targets: numpy array with shape (N,)
    k: int
    distance_function: string
    
    Returns
    ----------
    label: float
    """

    distances = calculate_distances(x, dataset, distance_function)
    neighbor_indexes = find_nearest_neighbors(distances, k)
    nn_labels = get_neighbors_labels(targets, neighbor_indexes)
    label = labels_average(nn_labels)
    
    return label

In [62]:
np.random.seed(42)
dataset = datasets.load_diabetes().data
targets = datasets.load_diabetes().target
x = np.random.rand(10)

prediction = knn_regressor(x, dataset, targets, 3, 'euclidean')
assert isinstance(prediction, float)
assert math.isclose(prediction, 265.666, rel_tol=1e-04)

prediction = knn_regressor(x, dataset, targets, 5, 'dot')
assert isinstance(prediction, float)
assert math.isclose(prediction, 92.8, rel_tol=1e-04)

prediction = knn_regressor(x, dataset, targets, 1, 'cosine')
assert isinstance(prediction, float)
assert math.isclose(prediction, 264.0, rel_tol=1e-04)


**Well done!!!**

![we_did_it](media/we_did_it.gif)

Finally let's wrap this up with a couple of exercises on how to use scikit's knn models.

## Using scikit's knn models

### Exercise 11

Use a `KNeighborsClassifier` to create predictions for the [breast cancer dataset](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset).

Please read the link above in order to understand the task we're solving.

Follow the instructions in the comments in the exercise cell.

In [63]:
import numpy as np
import pandas as pd
import hashlib
import json

from scipy.spatial.distance import cosine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score

In [74]:
# We start by importing the dataset
data = datasets.load_breast_cancer()

# Now do a train test split, using the train_test_split function from scikit
# Use a test_size of 0.25 and a random_state of 42
# X_train, X_test, y_train, y_test = ...
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)


In [75]:
tests = [
    {
        'dataset_type': 'X_train',
        'dataset': X_train,
        'shape_hash': '31ffabcaf98971831a5f8ad05ba70049a86bd60bda0a971ca9691388f9f72f8b'
    },
    {
        'dataset_type': 'X_test',
        'dataset': X_test,
        'shape_hash': '747c580b9756b4741bfbe812b8ca9fd8d047a5d6f9e3ebe53d4d15117f42ec2a'
    },
    {
        'dataset_type': 'y_train',
        'dataset': y_train,
        'shape_hash': '23a4f6ee909897142105a6577ac39ff86c353b8ad0ded0bece87829bb1953a58'
    },
    {
        'dataset_type': 'y_test',
        'dataset': y_test,
        'shape_hash': '40957487610d92ca4dd2d37ec155c40d20091a504bf65270a3cd28e6863ef633'
    },
]

for test in tests:
    shape_hash = hashlib.sha256(json.dumps(test['dataset'].shape).encode()).hexdigest()

    assert isinstance(test['dataset'], np.ndarray), f"{test['dataset_type']} should be a numpy array!"
    assert shape_hash == test['shape_hash'], "The returned numpy array has the wrong shape!"

In [78]:
# Now instantiate a kNN Classifier with k=3, that uses the euclidean distance as distance function
# In scikit, the default metric is 'minkowski' with p=2, which is exactly the euclidean distance
# (the minkowski distance is a generalisation of the euclidean distance)
# clf = ...

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Get predictions for the test dataset
# y_pred = ...
y_pred = clf.predict(X_test)

# Measure the accuracy of your solution using scikit's accuracy_score function
# accuracy = ...
accuracy = accuracy_score(y_test, y_pred)

In [79]:
assert isinstance(clf, KNeighborsClassifier)
assert clf.n_neighbors == 3
assert clf.metric == 'minkowski'

assert isinstance(y_pred, np.ndarray)
assert y_pred.shape == (143,)

assert isinstance(accuracy, float)
assert math.isclose(accuracy, 0.930069, rel_tol=1e-04)
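
The comment in the exercise cell above mentions that the Minkowski distance generalises the Euclidean distance. The sketch below illustrates this with a hypothetical helper, `minkowski_sketch`, assuming the usual definition (the p-th root of the sum of absolute differences raised to the power p). With p=2 it matches the Euclidean distance, and with p=1 it gives the Manhattan distance. In `KNeighborsClassifier` the exponent is set with the `p` parameter, which defaults to 2.

```python
import numpy as np

def minkowski_sketch(u, v, p):
    # Hypothetical helper: (sum of |u_i - v_i| ** p) ** (1 / p).
    return (np.abs(u - v) ** p).sum() ** (1 / p)

# Made-up points forming a 3-4-5 triangle.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski_sketch(a, b, p=1)   # 3 + 4 = 7
euclid = minkowski_sketch(a, b, p=2)      # sqrt(9 + 16) = 5

print(manhattan, euclid, np.linalg.norm(b - a))
```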

### Exercise 12

Now we want to see the difference if we use the cosine distance instead of the euclidean distance.

Go through the same steps as the previous exercise, but use the cosine distance as the distance metric in the knn classifier.
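
One reason the two metrics can pick different neighbors is that the cosine distance only looks at direction, while the Euclidean distance also reacts to magnitude. A minimal sketch with made-up vectors, where one is simply a scaled copy of the other:

```python
import numpy as np

# Hypothetical vectors: `scaled` points in the same direction as `a`
# but is ten times longer.
a = np.array([1.0, 2.0, 3.0])
scaled = 10 * a

# Euclidean distance grows with the difference in length.
euclidean_gap = np.linalg.norm(scaled - a)

# Cosine distance stays at (numerically almost) zero.
cosine_gap = 1 - np.dot(a, scaled) / (np.linalg.norm(a) * np.linalg.norm(scaled))

print(euclidean_gap, cosine_gap)
```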

In [81]:
# Instantiate a kNN Classifier with k=3, that uses the cosine distance as distance function
# clf = ...
clf = KNeighborsClassifier(n_neighbors=3, metric=cosine)
clf.fit(X_train, y_train)


# Get predictions for the test dataset
# y_pred = ...
y_pred = clf.predict(X_test)

# Measure the accuracy of your solution using scikit's accuracy_score function
# accuracy = ...
accuracy = accuracy_score(y_test, y_pred)

In [82]:
assert isinstance(clf, KNeighborsClassifier)
assert clf.n_neighbors == 3
assert clf.metric == cosine

assert isinstance(y_pred, np.ndarray)
assert y_pred.shape == (143,)

assert isinstance(accuracy, float)
assert math.isclose(accuracy, 0.93706, rel_tol=1e-04)

### Exercise 13

And the last exercise. 

Try different combinations of n_neighbors and metric and choose the option with the highest accuracy:

1. n_neighbors = 7, metric = 'minkowski'
2. n_neighbors = 9, metric = 'cosine'
3. n_neighbors = 11, metric = 'minkowski'
4. n_neighbors = 11, metric = 'cosine'

Write the answer to a variable called `best_parameters` as an integer (1, 2, 3 or 4).

In [86]:
# Find the best combination of n_neighbors and metric
# best_parameters = ...

# Option number -> (n_neighbors, metric), as listed in the exercise
candidates = {
    1: (7, 'minkowski'),
    2: (9, 'cosine'),
    3: (11, 'minkowski'),
    4: (11, 'cosine'),
}

# Fit one classifier per option and record its accuracy on the test set
accuracies = {}
for option, (n_neighbors, metric) in candidates.items():
    clf = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracies[option] = accuracy_score(y_test, y_pred)
    print(accuracies[option])

# Pick the option number with the highest accuracy
best_parameters = max(accuracies, key=accuracies.get)

0.958041958041958
0.9440559440559441
0.9790209790209791
0.9370629370629371


In [87]:
# Test
assert isinstance(best_parameters, int)
assert hashlib.sha256(bytes(best_parameters)).hexdigest() == '709e80c88487a2411e1ee4dfb9f22a861492d20c4765150c0c794abd70f8147c'

And we're done! Nice job ;)

![were_done](media/were_done.gif)