### KNN - Choosing the number of (your) neighbors carefully

Note: We focus on classification in this video.

In [85]:
import numpy as np
import pandas as pd
import seaborn as sns

iris = sns.load_dataset("iris")

In [86]:
iris.head(5)

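|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |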


In [87]:
## Decide on features and assign X and y.
feature_columns = ['sepal_length', 'sepal_width', 'petal_length','petal_width']
X = iris[feature_columns].values
y = iris['species'].values

In [88]:
# Label encoding. Why? Because y (the output) currently holds strings, and integer class codes are easier to work with in the steps that follow.
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
y = label_enc.fit_transform(y)
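To make the encoding step concrete, here is a minimal sketch on a handful of toy labels (the `species` array and the name `encoder` are made up for illustration and are not part of the notebook). It shows that `LabelEncoder` assigns integer codes in sorted class order and that the mapping can be reversed.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only (not the full iris column)
species = np.array(["setosa", "versicolor", "virginica", "setosa"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

print(encoded)                              # integer codes, one per sample
print(encoder.classes_)                     # position in this array = integer code
print(encoder.inverse_transform(encoded))   # back to the original strings
```

Keeping the fitted encoder around is useful later, because `inverse_transform` turns predicted integers back into readable species names.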

In [89]:
## Split the data 80-20 into training and test sets, so we can evaluate the model on unseen data and compare the method to other ML techniques.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

In [90]:
# Show the first five elements, BEFORE we standardize
X_train[0:5]

array([[7.4, 2.8, 6.1, 1.9],
       [6. , 2.2, 5. , 1.5],
       [4.7, 3.2, 1.6, 0.2],
       [5.1, 3.5, 1.4, 0.2],
       [6. , 2.2, 4. , 1. ]])

In [91]:
# Scale (=standardize) the features using StandardScaler
# Based on the formula: z = (x - u) / s, where u is the mean and s the standard deviation of the training samples
# See details at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

#Import library
from sklearn.preprocessing import StandardScaler

# Instantiate the scaler; fit it on the training data only, then apply the same transformation to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [92]:
# Show the first five elements, AFTER we standardize
X_train[0:5]

array([[ 1.89107197, -0.54902995,  1.32384841,  0.91546477],
       [ 0.16162128, -1.91685543,  0.68491609,  0.37363605],
       [-1.44429721,  0.3628537 , -1.28996563, -1.38730727],
       [-0.95016844,  1.04676645, -1.40613514, -1.38730727],
       [ 0.16162128, -1.91685543,  0.10406852, -0.30364984]])
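As a sanity check on the formula z = (x - u) / s, the sketch below standardizes a small array by hand and compares the result with `StandardScaler`. It uses only the five rows printed earlier (named `X_demo` here, a hypothetical name), so the numbers will differ from the output above, which was computed from the full training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five example rows only; the notebook scales the whole training set
X_demo = np.array([[7.4, 2.8, 6.1, 1.9],
                   [6.0, 2.2, 5.0, 1.5],
                   [4.7, 3.2, 1.6, 0.2],
                   [5.1, 3.5, 1.4, 0.2],
                   [6.0, 2.2, 4.0, 1.0]])

u = X_demo.mean(axis=0)      # per-feature mean
s = X_demo.std(axis=0)       # per-feature population std (ddof=0), as StandardScaler uses
z_manual = (X_demo - u) / s

z_sklearn = StandardScaler().fit_transform(X_demo)

# Both routes should agree up to floating-point error
print(np.allclose(z_manual, z_sklearn))
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates the distance computation that KNN relies on.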

In [99]:
# Loading classifier libraries
from sklearn.neighbors import KNeighborsClassifier

# Set neighbors that vote on a "new" entity to 3
# No additional parameters, keeping it simple for now:
classifier = KNeighborsClassifier(n_neighbors=3)

# Fitting the model
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_predict = classifier.predict(X_test)
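To illustrate what "neighbors that vote" means, here is a simplified sketch of the idea behind the classifier, written with plain NumPy. It is not scikit-learn's implementation: the function `knn_predict_one` and the toy data are hypothetical, it assumes Euclidean distance with equal-weight votes (the `KNeighborsClassifier` defaults), and it may break ties differently.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Each neighbor casts one vote for its own label
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: three points of class 0 near the origin, two of class 1 near (1, 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1])

# Two of the three nearest neighbors of (0.9, 1.0) belong to class 1
print(knn_predict_one(X_toy, y_toy, np.array([0.9, 1.0]), k=3))
```

With k=5 the same query point would be outvoted by the three class-0 points, which is exactly why the choice of k matters.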

In [100]:
# Check our known results (the test portion of the data set) against the newly predicted results
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_predict)
print("Accuracy:", accuracy *100)

Accuracy: 90.0


> "*I don't want to sit here all day, guessing an integer...*"

-- Every IT person ever.

## What approaches do you know for finding the optimal value of k?
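One commonly used approach, sketched here as an illustration rather than as the method this notebook goes on to use, is to try a range of k values and score each with cross-validation. The sketch makes several assumptions: it loads iris from `sklearn.datasets.load_iris` instead of seaborn to stay self-contained, scans k from 1 to 20, uses 5-fold cross-validation, and wraps the scaler and classifier in a pipeline so that scaling is refit inside each fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of iris stands in for the seaborn data set
X_all, y_all = load_iris(return_X_y=True)

k_values = list(range(1, 21))   # candidate values for k (arbitrary range)
mean_scores = []
for k in k_values:
    # Scale inside the pipeline so each fold is standardized on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_all, y_all, cv=5)
    mean_scores.append(scores.mean())

best_k = k_values[int(np.argmax(mean_scores))]
print("Best k:", best_k, "with mean CV accuracy:", max(mean_scores))
```

Several k values often score almost identically on a small data set like this one, so it is worth looking at the whole list of `mean_scores` rather than only the single best value.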