# K Nearest Neighbors (KNN)

KNN assigns an example to the majority class among its `k` nearest neighbors.

![example](https://www.researchgate.net/publication/359786522/figure/fig3/AS:1142065312346112@1649300997168/Visualization-of-k-Nearest-Neighbors-with-two-classes-blue-circles-and-red-triangles.ppm)

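KNN learns no model parameters during training, so its `.fit` step only stores the training data. The number of neighbors `k` is a hyperparameter chosen by the user. The algorithm works as follows: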

1. Get data points and their classes.
2. Receive a new point for classification.
3. Calculate the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between that point and every point from step 1.
4. Take the classes of the closest `k` points.
5. Return the majority class.
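
The steps above can be sketched on a toy dataset as follows. The points, the labels, and the function name `knn_predict` are made up for illustration and are not part of the notebook's data.

```python
from collections import Counter

import numpy as np

# Step 1: hypothetical known points and their classes
X_known = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
y_known = np.array(['blue', 'blue', 'red', 'red', 'red'])


def knn_predict(x_new, X, y, k=3):
    # Step 3: Euclidean distance from x_new to every known point
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    # Step 4: classes of the k closest points
    nearest_labels = y[np.argsort(distances)[:k]]
    # Step 5: majority class among those k labels
    return Counter(nearest_labels).most_common(1)[0][0]


# Step 2: a new point to classify
print(knn_predict(np.array([5.2, 5.1]), X_known, y_known, k=3))
```

In this toy setup the three closest points to `(5.2, 5.1)` are all labeled `red`, so that is the predicted class.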

## Imports

In [147]:
%reset -f

In [148]:
from collections import Counter

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,ConfusionMatrixDisplay

## Load data

In [149]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/SimeonHristov99/ML_23-24/main/DATA/iris.csv')
df_iris

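| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |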


In [150]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [151]:
df_iris.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sepal_length | 150.0 | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 |
| sepal_width | 150.0 | 3.054 | 0.433594 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 |
| petal_length | 150.0 | 3.758667 | 1.76442 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| petal_width | 150.0 | 1.198667 | 0.763161 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 |


In [152]:
df_iris['species'].value_counts(dropna=False)

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

## Data preparation

In [153]:
X = df_iris.drop('species',axis=1)
y = df_iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'{X_train.shape=}')
print(f'{X_test.shape=}')
print(f'{y_train.shape=}')
print(f'{y_test.shape=}')

X_train.shape=(120, 4)
X_test.shape=(30, 4)
y_train.shape=(120,)
y_test.shape=(30,)


## The `KNN` algorithm

In [154]:
def euclidean_distance(x1, x2):
    """Return the Euclidean distance between two 1-D arrays."""
    return np.sqrt(np.sum((x1 - x2) ** 2))


class KNN:
    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # No parameters are learned; the training data is only stored.
        # np.asarray lets fit accept DataFrames/Series as well as arrays.
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)

    def predict(self, X):
        # Classify each row of X independently
        return np.array([self._predict(x) for x in np.asarray(X)])

    def _predict(self, x):
        # Compute distances between x and all examples in the training set
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]

        # Sort by distance and take the indices of the k nearest neighbors
        k_idx = np.argsort(distances)[:self.n_neighbors]

        # Extract the labels of the k nearest training samples
        k_neighbor_labels = self.y_train[k_idx]

        # Return the most common class label
        most_common = Counter(k_neighbor_labels).most_common(1)
        return most_common[0][0]
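
The per-sample Python loop in `_predict` can also be replaced by NumPy broadcasting. The sketch below is one possible vectorized variant, not the notebook's implementation; the function name `knn_predict_batch` and the toy arrays are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict_batch(X_train, y_train, X_query, n_neighbors=3):
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)

    # Pairwise differences via broadcasting: shape (n_query, n_train, n_features)
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distances: shape (n_query, n_train)
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the n_neighbors closest training points for each query row
    nearest_idx = np.argsort(distances, axis=1)[:, :n_neighbors]

    # Majority vote per query row
    return np.array([Counter(y_train[row]).most_common(1)[0][0] for row in nearest_idx])


# Hypothetical toy data: two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
print(knn_predict_batch(X_toy, y_toy, [[0.2, 0.1], [5.5, 5.5]]))
```

The intermediate `diffs` array grows with `n_query * n_train * n_features`, so this form suits small datasets such as Iris; for larger ones a tree- or chunk-based search is usually preferable.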

## Model performance

### Before scaling

In [155]:
knn_model_manual = KNN(n_neighbors=3)
knn_model_manual.fit(X_train.values,y_train.values)
print(classification_report(y_test,knn_model_manual.predict(X_test.values)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [156]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train,y_train)
print(classification_report(y_test,knn_model.predict(X_test)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



### After scaling

In [157]:
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
scaled_X_test

array([[ 0.35451684, -0.57925837,  0.5576453 ,  0.02332414],
       [-0.13307079,  1.67028869, -1.16259727, -1.17620281],
       [ 2.30486738, -1.02916778,  1.81915651,  1.48941263],
       [ 0.23261993, -0.35430366,  0.44296246,  0.42316645],
       [ 1.2077952 , -0.57925837,  0.61498672,  0.28988568],
       [-0.49876152,  0.77046987, -1.27728011, -1.04292204],
       [-0.2549677 , -0.35430366, -0.07311031,  0.15660491],
       [ 1.32969211,  0.09560575,  0.78701097,  1.48941263],
       [ 0.47641375, -1.9289866 ,  0.44296246,  0.42316645],
       [-0.01117388, -0.80421307,  0.09891395,  0.02332414],
       [ 0.84210448,  0.32056046,  0.78701097,  1.08957031],
       [-1.23014297, -0.12934896, -1.33462153, -1.44276436],
       [-0.37686461,  0.99542457, -1.39196294, -1.30948358],
       [-1.10824606,  0.09560575, -1.27728011, -1.44276436],
       [-0.86445224,  1.67028869, -1.27728011, -1.17620281],
       [ 0.59831066,  0.54551516,  0.5576453 ,  0.55644722],
       [ 0.84210448, -0.
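
For reference, `StandardScaler` standardizes each column as `z = (x - mean) / std`, where the mean and the population standard deviation come from the data passed to `fit` (here, the training split only). The sketch below checks this on a small made-up array; `X_fit` and `X_new` are hypothetical names and values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical "training" and "test" rows
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_new = np.array([[2.0, 30.0], [4.0, 0.0]])

scaler = StandardScaler()
scaler.fit(X_fit)

# Manual standardization using the statistics of X_fit only
mean = X_fit.mean(axis=0)
std = X_fit.std(axis=0)  # population standard deviation (ddof=0)
manual = (X_new - mean) / std

# Compare the manual result with the scaler's output
print(np.allclose(manual, scaler.transform(X_new)))
```

Fitting on the training split and only calling `transform` on the test split keeps test-set statistics out of the preprocessing step.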

In [158]:
knn_model_manual = KNN(n_neighbors=3)
knn_model_manual.fit(scaled_X_train,y_train.values)
print(classification_report(y_test,knn_model_manual.predict(scaled_X_test)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [159]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(scaled_X_train,y_train)
print(classification_report(y_test,knn_model.predict(scaled_X_test)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



## Apply KNN and Logistic Regression on the Titanic dataset

In [160]:
df_train = pd.read_csv('https://raw.githubusercontent.com/SimeonHristov99/ML_23-24/main/DATA/titanic_train.csv')
df_train['split'] = 'Train'
df_train

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Train |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Train |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Train |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Train |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Train |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S | Train |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | Train |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S | Train |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | Train |


In [161]:
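# Keep only the numeric columns (note: this retains PassengerId as a feature)
df_train = df_train.select_dtypes('number')
# Drop rows with missing values (e.g. a missing Age)
df_train = df_train.dropna()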

X = df_train.drop('Survived',axis=1)
y = df_train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'{X_train.shape=}')
print(f'{X_test.shape=}')
print(f'{y_train.shape=}')
print(f'{y_test.shape=}')

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
scaled_X_test

X_train.shape=(571, 6)
X_test.shape=(143, 6)
y_train.shape=(571,)
y_test.shape=(143,)


array([[-1.17937377e+00, -2.65489433e-01,  8.14289078e-01,
        -5.47785123e-01, -5.07336843e-01, -4.45739956e-01],
       [-1.78423275e-01, -2.65489433e-01, -1.83589054e+00,
         5.85494969e-01,  6.46803943e-01, -3.29654274e-01],
       [-1.55182047e+00, -2.65489433e-01, -6.91041266e-02,
         5.85494969e-01, -5.07336843e-01, -1.83285370e-01],
       [-3.25850092e-01, -1.45914665e+00, -4.08870744e-01,
        -5.47785123e-01, -5.07336843e-01,  6.90890288e-01],
       [ 1.41611414e+00,  9.28167783e-01,  8.82242401e-01,
        -5.47785123e-01, -5.07336843e-01, -5.77976690e-01],
       [ 3.72487463e-01, -2.65489433e-01, -1.49612392e+00,
         5.85494969e-01,  6.46803943e-01,  3.37443833e-02],
       [-1.42767157e+00,  9.28167783e-01,  2.02709167e-01,
         2.85205515e+00, -5.07336843e-01, -3.88201835e-01],
       [ 1.24540941e+00, -2.65489433e-01,  1.62972896e+00,
         5.85494969e-01,  2.95508551e+00, -2.43851813e-01],
       [-1.36947677e+00,  9.28167783e-01, -1.370

### Before scaling

In [162]:
log_model = LogisticRegression()
log_model.fit(X_train,y_train)
print(classification_report(y_test,log_model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.71      0.85      0.77        87
           1       0.67      0.46      0.55        56

    accuracy                           0.70       143
   macro avg       0.69      0.66      0.66       143
weighted avg       0.69      0.70      0.69       143



In [163]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train,y_train)
print(classification_report(y_test,knn_model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.66      0.77      0.71        87
           1       0.52      0.39      0.45        56

    accuracy                           0.62       143
   macro avg       0.59      0.58      0.58       143
weighted avg       0.61      0.62      0.61       143



### After scaling

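Notice that scaling raises the accuracy of `KNeighborsClassifier` from 0.62 to 0.65, while `LogisticRegression` stays at 0.70. KNN relies on distances, so without scaling the features with the largest numeric ranges dominate the neighbor search.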
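
A likely reason scaling matters for a distance-based model: when one feature spans a much larger numeric range than another, it dominates the Euclidean distance. The sketch below illustrates this with made-up values (they are not taken from the Titanic data); `nearest_index` is a hypothetical helper.

```python
import numpy as np

# Hypothetical rows: [small_range_feature, large_range_feature]
X = np.array([[1.0, 300.0], [3.0, 225.0], [2.0, 0.0], [3.0, 500.0]])
query = np.array([1.0, 220.0])


def nearest_index(points, q):
    # Index of the row in `points` with the smallest Euclidean distance to q
    distances = np.sqrt(((points - q) ** 2).sum(axis=1))
    return int(np.argmin(distances))


# Raw features: the large-range column decides which row is "closest"
raw_nearest = nearest_index(X, query)

# Standardize rows and query with the column statistics of X
mean, std = X.mean(axis=0), X.std(axis=0)
scaled_nearest = nearest_index((X - mean) / std, (query - mean) / std)

print(raw_nearest, scaled_nearest)
```

With these values the nearest neighbor changes from row 1 (close only in the large-range column) to row 0 (identical in the small-range column) once both columns are standardized.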

In [164]:
log_model = LogisticRegression()
log_model.fit(scaled_X_train,y_train)
print(classification_report(y_test,log_model.predict(scaled_X_test)))

              precision    recall  f1-score   support

           0       0.71      0.85      0.77        87
           1       0.67      0.46      0.55        56

    accuracy                           0.70       143
   macro avg       0.69      0.66      0.66       143
weighted avg       0.69      0.70      0.69       143



In [165]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(scaled_X_train,y_train)
print(classification_report(y_test,knn_model.predict(scaled_X_test)))

              precision    recall  f1-score   support

           0       0.69      0.76      0.73        87
           1       0.56      0.48      0.52        56

    accuracy                           0.65       143
   macro avg       0.63      0.62      0.62       143
weighted avg       0.64      0.65      0.64       143

