# Lesson 4.02 - k-Nearest Neighbours
1. KNN is a supervised classifier that memorises observations from the training set in order to predict class labels for new, unlabeled observations.
2. KNN makes predictions based on how similar the training observations are to the new, incoming observations.
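
The two points above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements it: the function name `knn_predict`, the toy arrays, and the choice of Euclidean distance with a simple majority vote are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_new, k=3):
    """Label each row of X_new by majority vote of its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_new = np.asarray(X_new, dtype=float)

    predictions = []
    for x in X_new:
        # Euclidean distance from the new point to every memorised training point
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of those neighbours
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Toy data (made up for illustration): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, [[1.1, 1.0], [5.1, 5.0]], k=3)
print(toy_pred)
```

Note that there is no real training step: the "model" is just the stored training data, and all the work happens at prediction time.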

In [1]:
import numpy as np
import pandas as pd
import scipy
import sklearn

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

## Importing your data

In [2]:
# load data set
address = 'data/mtcars.csv'

cars = pd.read_csv(address)
cars.columns = ['car_names','mpg','cyl','disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']

# determine X and y variables to use for model
X = cars[['mpg', 'disp', 'hp', 'wt']].values
y = cars['am'].values

# preview data set
cars.head()

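|   | car_names | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|-----------|-----|-----|------|----|------|----|------|----|----|------|------|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |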


## Train/Test split & Scale Data


In [3]:
# apply train-test split on data set to train model and test performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)

# apply scaling to X-variables to bring values to a similar scale for better model performance
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
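
To see why scaling matters for a distance-based model such as KNN, the sketch below compares Euclidean distances before and after standardising. The three rows in `X_demo` are made-up values (loosely shaped like a displacement column and a weight column) and the helper `euclidean` is defined only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: first feature in the hundreds, second in single digits
X_demo = np.array([[160.0, 2.6],
                   [165.0, 5.4],   # similar on feature 1, very different on feature 2
                   [300.0, 2.7]])  # very different on feature 1, similar on feature 2


def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(((a - b) ** 2).sum()))


# Raw distances from row 0: the large-valued feature dominates
d_raw_01 = euclidean(X_demo[0], X_demo[1])
d_raw_02 = euclidean(X_demo[0], X_demo[2])

# After standardising, each feature contributes on a comparable scale
X_demo_sc = StandardScaler().fit_transform(X_demo)
d_sc_01 = euclidean(X_demo_sc[0], X_demo_sc[1])
d_sc_02 = euclidean(X_demo_sc[0], X_demo_sc[2])

print(f"raw:    d(0,1)={d_raw_01:.2f}  d(0,2)={d_raw_02:.2f}")
print(f"scaled: d(0,1)={d_sc_01:.2f}  d(0,2)={d_sc_02:.2f}")
```

On the raw values, row 1 looks far closer to row 0 than row 2 does, purely because the first feature has much larger numbers. After scaling, the two distances are of similar size, so both features influence which neighbours are chosen.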

## Instantiate KNN

For the `KNeighborsClassifier`, there are a few important parameters to keep in mind:

1. `n_neighbors`: this is the "K" in KNN. The best K will change from problem to problem, but the default is 5.
2. `weights`: The neighbors can all have an equal vote (`uniform`), or the closer points can have a higher weighted vote (`distance`).
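
The effect of `weights` can be seen on a tiny one-dimensional example. The arrays `X_w`, `y_w` and `query` below are invented for illustration: the query has one very close neighbour of class 0 and two more distant neighbours of class 1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D training data: one class-0 point near the query, two class-1 points further away
X_w = np.array([[1.0], [3.0], [3.2], [9.0]])
y_w = np.array([0, 1, 1, 0])
query = np.array([[1.1]])

# Equal votes: the two class-1 neighbours outvote the single class-0 neighbour
knn_uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_w, y_w)
pred_uniform = knn_uniform.predict(query)

# Votes weighted by inverse distance: the very close class-0 neighbour dominates
knn_distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_w, y_w)
pred_distance = knn_distance.predict(query)

print(pred_uniform, pred_distance)
```

With the same three neighbours, the two settings disagree here, which is why `weights` is worth tuning alongside `n_neighbors`.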

In the cell below, instantiate a `knn` model. Here `n_neighbors` is set to 3 as a starting value instead of the default of 5; the other parameters keep their defaults.

In [4]:
# Instantiate KNeighborsClassifier with n_neighbors=3 as an initial value
knn = KNeighborsClassifier(n_neighbors=3)

## Cross Validation
In the cell below, use `cross_val_score` to see what accuracy we can expect from our KNN model.

In [5]:
# Derive the cross-val score for training set
cross_val_score(knn, X_train_sc, y_train, cv=5).mean()

0.9417582417582417
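
One refinement worth considering: in the cell above the scaler was fitted on the whole training set before cross-validation, so each validation fold was scaled with statistics that include its own rows. A possible alternative, sketched below, is to wrap the scaler and the classifier in a pipeline so the scaler is refitted inside every fold, and to compare a few values of K at the same time. The synthetic data from `make_classification` and the candidate list of K values are assumptions made for the example, not part of the lesson's data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 4 features (illustrative only)
X_syn, y_syn = make_classification(n_samples=200, n_features=4, n_informative=3,
                                   n_redundant=1, random_state=0)

cv_means = {}
for k in [1, 3, 5, 7, 9]:
    # The scaler is fitted on the training folds only, then applied to the held-out fold
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(pipe, X_syn, y_syn, cv=5).mean()

# K with the highest mean cross-validated accuracy
best_k = max(cv_means, key=cv_means.get)
print(cv_means)
print("best k:", best_k)
```

The same pattern should work with the unscaled `X_train` and `y_train` from this notebook in place of the synthetic arrays.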

## Model fitting and evaluation
Now that we know what we can expect from our KNN model, let's:
1. fit the model to `X_train_sc`, `y_train`
2. score it on `X_test_sc`, `y_test`

In [6]:
knn.fit(X_train_sc, y_train)
knn.score(X_train_sc, y_train)

0.9552238805970149

In [7]:
knn.score(X_test_sc, y_test)

0.7647058823529411

In [8]:
# Train accuracy (0.96) is well above test accuracy (0.76), which suggests the model is overfitting.
# As a further check, we compute a cross-validated score on the test set to see whether it can be used as a final eval metric.

# Note: cross_val_score refits fresh copies of knn on folds of the test data, so this number does not
# evaluate the model fitted above, and with a small test set it is a noisy estimate.
# Here the test cross-val score of 0.9 is close to the train cross-val score of 0.94.
# As such, we may consider presenting the cross-val scores of train and test data as final eval metrics,
# alongside the hold-out test score rather than in place of it.
cross_val_score(knn, X_test_sc, y_test, cv=5).mean()

0.9