# The One Goal For Today

To understand k-nearest neighbors as a machine learning algorithm.

# Let's talk about machine learning

The phrase "machine learning" refers to any method that approximates a solution to a problem by examining data, typically a problem for which we don't have an analytical (algorithmic) solution. The basic taxonomy of machine learning approaches is depicted below:

![ML algorithms](https://blogs.sas.com/content/subconsciousmusings/files/2017/04/machine-learning-cheet-sheet-2.png)

*Image from https://blogs.sas.com/content/subconsciousmusings/*

However, this diagram does not include a third major class of ML algorithm, reinforcement learning, which has been used (among many other applications!) to develop ChatGPT.

When we talk about machine learning, we talk about:
* *fitting* (or *training*) a *prediction function*, or *model*, to
* *training* data, experimenting with various
* *hyperparameters* related to the *model architecture* using held-out
* *development* data, so that the resulting model generalizes well, making good *predictions* on held-out
* *test* (or *evaluation*) data
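
The workflow above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data (two Gaussian blobs), the 60/20/20 split, the candidate values, and the helper names `knn_predict` and `accuracy` are all hypothetical, and the model is a tiny k-nearest-neighbors classifier (introduced later in these notes) whose hyperparameter is k.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two 2-D Gaussian blobs, 100 points per class.
x = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle once, then cut into 60% train, 20% development, 20% test.
order = rng.permutation(len(x))
x, y = x[order], y[order]
n_train = len(x) * 6 // 10
n_dev = len(x) * 2 // 10
x_train, y_train = x[:n_train], y[:n_train]
x_dev, y_dev = x[n_train:n_train + n_dev], y[n_train:n_train + n_dev]
x_test, y_test = x[n_train + n_dev:], y[n_train + n_dev:]

def knn_predict(x_fit, y_fit, x_new, k):
    """Majority vote among the k training points nearest to each new point."""
    # Pairwise Euclidean distances, shape (len(x_new), len(x_fit)).
    dists = np.sqrt(((x_new[:, None, :] - x_fit[None, :, :]) ** 2).sum(axis=2))
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_fit[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(y_true == y_pred))

# Try each candidate hyperparameter value on the development data only.
candidates = [1, 3, 5, 7]
dev_scores = {k: accuracy(y_dev, knn_predict(x_train, y_train, x_dev, k))
              for k in candidates}
best_k = max(dev_scores, key=dev_scores.get)

# Touch the test data once, at the end, with the chosen hyperparameter.
test_score = accuracy(y_test, knn_predict(x_train, y_train, x_test, best_k))
print("dev accuracy per k:", dev_scores)
print("chosen k:", best_k, "test accuracy:", test_score)
```

The point of the design is that the test data plays no role in choosing k, so the final test accuracy is a fairer estimate of how well the model generalizes.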

The goal of unsupervised learning is to uncover latent structure or patterns in the data. An example of unsupervised learning is k-means clustering. The goal of supervised learning is to learn to match the labels (or answers, or ground truth, or dependent variable) in the data. An example of supervised learning is regression. 

We will spend the rest of the semester looking at three supervised ML algorithms:
* k-nearest neighbors
* naive Bayes
* radial basis function networks (a type of neural network)

If you want to investigate ML further, here is a great Python library for ML, which you will also use for Lab 6:
* scikit-learn (sklearn): https://scikit-learn.org/stable/index.html

Sklearn follows a consistent pattern: each ML algorithm provides a `fit` method (for training), a `predict` method (for inference or testing), and a `score` method (for evaluation).
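
As a minimal sketch of that pattern, here is sklearn's `KNeighborsClassifier` on a tiny made-up dataset. The data values and variable names are assumptions chosen for illustration; for a classifier, `score` reports mean accuracy.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two well-separated classes.
x_train = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_test = np.array([[0.1], [5.1]])
y_test = np.array([0, 1])

clf = KNeighborsClassifier(n_neighbors=3)  # hyperparameters go in the constructor
clf.fit(x_train, y_train)                  # training
predictions = clf.predict(x_test)          # inference
acc = clf.score(x_test, y_test)            # evaluation (mean accuracy)

print(predictions)  # [0 1]
print(acc)          # 1.0
```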

We evaluate models / prediction functions using [any number of metrics](https://scikit-learn.org/stable/modules/model_evaluation.html). A commonly used one for supervised classification is:
* accuracy - what percent of the data points were classified correctly?

Of course, accuracy is just one number. To get a clearer understanding, we can construct a
* confusion matrix

which has the classes (the labels) along rows and columns, and in each cell indicates the number of data points classified as *row* that are truly in class *column*. 
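
A small sketch of building such a matrix by hand, using the orientation described above (rows are predicted classes, columns are true classes). The function name, its `n_classes` parameter, and the toy labels are hypothetical. Note that sklearn's `confusion_matrix` uses the transposed orientation (rows are true classes, columns are predicted), so check which convention a tool uses before reading its output.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with predicted class on rows and true class on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        cm[predicted_label, true_label] += 1
    return cm

# Hypothetical labels for six data points in three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, 3)
print(cm)
# [[1 0 1]
#  [1 2 0]
#  [0 0 1]]

# Correct predictions sit on the diagonal, so accuracy is trace / total.
print(np.trace(cm) / np.sum(cm))
```

Off-diagonal cells show which classes get confused with which: here one point of class 2 was predicted as class 0, and one point of class 0 was predicted as class 1.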

We will look more at confusion matrices later this week.

# K-nearest neighbors

K-nearest neighbors (kNN) is a very simple supervised ML algorithm.
* fit - it just stores all the training data!
* predict - it finds the k data points in the training data that are closest to the data point to be classified, and takes a majority vote of their labels

*Closest to* means it needs a distance function.
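
To make "distance function" concrete, here is a small worked example. Euclidean distance is the one used in the implementation below; Manhattan distance is shown only as an assumed alternative for comparison and is not otherwise used in these notes.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return np.sum(np.abs(a - b))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```

Different distance functions can rank the same neighbors differently, so the choice of distance affects which labels get to vote.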

Let's implement k-nearest neighbors!

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

Your tasks:
1. Complete any line that has a ??
2. Comment every line

In [4]:
# copy over euclidean distance from Friday
def distance(a, b):
    return np.sqrt(np.sum((a-b)**2))

# "fits" a model to the data
def fit_knn(data, labels, k):
    assert len(data) == len(labels)
    # "store" or return the model which is the combination of data, labels and k
    # see predict_one_knn for what it should look like
    return (data, labels, k)

# predict the label for one datapoint
def predict_one_knn(element, model):
    training_data = model[0]
    labels = model[1]
    k = model[2]
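    # argpartition returns indices arranged so that the first k entries
    # point at the k smallest distances (in no particular order)
    neighbors_by_distance = np.argpartition([distance(element, datapoint) for datapoint in training_data], k)
    print("neighbors by distance: ", neighbors_by_distance)
    # look up the labels of those k nearest neighbors
    neighbor_labels = [labels[neighbors_by_distance[x]] for x in range(k)]
    # count how many votes each label received
    vals, counts = np.unique(neighbor_labels, return_counts=True)
    print("neighbor labels by counts: ", vals, counts)
    # majority vote; on a tie, argmax picks the smallest tied label
    return vals[np.argmax(counts)]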

# predict the label for a set of data points
def predict_knn(data, model):
    return np.array([predict_one_knn(datapoint, model) for datapoint in data])
    
# calculate accuracy given actual labels y and predicted labels yhat
def accuracy(y, yhat):
    assert len(y) == len(yhat)
    # fraction of predictions that match the actual labels
    # (np.mean also handles the case where no prediction is correct)
    return np.mean(np.asarray(y) == np.asarray(yhat))

# score a model using test data
def score(model, testing_data, test_labels):
    predicted_labels = predict_knn(testing_data, model)
    return accuracy(test_labels, predicted_labels)
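
Since `predict_one_knn` relies on `np.argpartition`, here is a small standalone sketch of what that call does. The distance values are made up for illustration. `np.argpartition(a, k)` returns indices arranged so that the first k of them point at the k smallest values of `a`, without fully sorting, which is cheaper than `np.argsort` when only the nearest few are needed.

```python
import numpy as np

# Hypothetical distances from one query point to six training points.
distances = np.array([5.0, 1.0, 4.0, 0.5, 3.0, 2.0])
k = 3

# Indices of the k smallest distances; their order is not guaranteed.
partitioned = np.argpartition(distances, k)
nearest = partitioned[:k]
print(sorted(nearest.tolist()))  # [1, 3, 5]

# If the neighbors are needed in order of distance, sort just those k.
ordered = nearest[np.argsort(distances[nearest])]
print(ordered)  # [3 1 5]
```

For a majority vote the order of the k neighbors does not matter, which is why the partial ordering is enough.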

## Let's get some data!

The dataset we will be analyzing is a collection of car logos from https://github.com/GeneralBlockchain/vehicle-logos-dataset. I converted each logo to greyscale and downscaled it to a consistent size. I also converted the dependent variable (manufacturer name) to an int; it is the last column.

We last used this dataset on day 19.

Your tasks:
1. Explain why we shuffle the data before splitting into train and test.
2. Explain why we split one of the variables off from the rest.

In [5]:
# load the data
data = np.array(np.genfromtxt('data/logos.csv', delimiter=',', dtype=int))  

# shuffle the data
np.random.shuffle(data)

# split the data into train and test
(train, test) = np.split(data, [int(len(data) / 10 * 8)])
print("train, test: ", train.shape, test.shape)

# separate the dependent variable from the independent variables
y_train = train[:, -1]
x_train = train[:, 0:-1]
y_test = test[:, -1]
x_test = test[:, 0:-1]
print("train, test without labels: ",x_train.shape, x_test.shape)

train, test:  (435, 1025) (109, 1025)
train, test without labels:  (435, 1024) (109, 1024)


## Let's train and evaluate a kNN model on this data!

Your task is to fill in the lines that contain ??.

First, we train.

In [6]:
%%time

# use a k of 3
model = fit_knn(x_train, y_train, 3)

CPU times: user 8 µs, sys: 2 µs, total: 10 µs
Wall time: 14.5 µs


Then, we test.

In [7]:
%%time

# call score
score(model, x_test, y_test)

neighbors by distance:  [268 362  15 239 348 308  27 224 162 206  63 229  12 377 173 335 347 263
 108  21 334  20 244 247 209  25 208  19 415 252  30  31 255  33 261 242
 408 193 192  39  40 266 402 188 267 401   1  47 182 400 273 276 396 395
 277  55 389 280 282 418 174 236 283 420 284  65 378 422 375  69  70 287
 289  73  74 290  76  77 424 370 369 160 365 363  84 232  86 159  88  89
 155 301 354 303 305  95 430 313 350 140 231 101 432 346 319 105 321 322
   2 344 343 340 112 339 327 128 332 333  59 230 117 121 116 331 124 125
 126 127 336 328   0 131 324 107 345 135 320 137 104 139  99 318 317 316
 315  97 310 147  96 306  94 353 302  91 299 355 298 297 296 356 367 295
 371  75 374 165  71 167 286 285  64 171  62  61 384  58 176  57 390  51
 275  50  48 271 184  46 186  44  43 189  41 191 405 406 194 409 410 258
 257 199 256  32 202  29 249 205  28 207 411 412 413  22 212 243 416 241
 216  16 240  13 419 221 222 233 425 427 226   4 114 434 228 227 225 223
 234 235 220 237 238 219 21

0.5596330275229358

# Questions:

*What are some of the hyperparameters related to kNN?*

*How would you pick good values for them?*

kNN is expensive in terms of computational cost and memory. You can address each of these costs using methods *we already know*.
* *To make it faster, what can you do?*
* *To make it less memory intensive, what can you do?*