In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

**KNN for regression and classification**

Previously in class, we learned about the KNN (K-Nearest Neighbours) algorithm and how it can be used for both regression and classification.

Getting started with it in scikit-learn is as easy as with any other model:

*Regression*

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# X_train, y_train and X_test are assumed to come from an earlier train/test split

model = KNeighborsRegressor(n_neighbors=3)   # the number of neighbours is a hyperparameter that you'll have to tune

model.fit(X_train, y_train)                  # train the model on the train set

model.predict(X_test)                        # predict on the test set
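
Since the number of neighbours has to be tuned, here is a minimal sketch of one way to do it, using cross-validation over a handful of candidate values. The synthetic data from `make_regression`, the candidate list and the names `X_demo`, `y_demo` and `search` are assumptions made only so the example runs on its own; they are not part of the class material.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, used only so that this example is self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Try a few candidate values for n_neighbors with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]   # value with the best mean CV score
y_pred = search.predict(X_test)               # predictions from the refitted best model

print(best_k, y_pred.shape)
```

The grid here is deliberately small; on a real dataset you would pick the range of candidates to suit the number of samples.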

*Classification*

In [None]:
from sklearn.neighbors import KNeighborsClassifier

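# X_train, y_train and X_test are assumed to come from an earlier train/test split

model = KNeighborsClassifier(n_neighbors=3)  # same hyperparameter as for the regressor

model.fit(X_train, y_train)                  # train the model on the train set

model.predict(X_test)                        # predict class labels on the test set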
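
For a version of the classification cell that runs end to end, the sketch below uses scikit-learn's built-in iris dataset. The dataset choice, the 75/25 split and the names `X_iris`, `y_iris` and `clf` are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in toy dataset, used only so that this example is self-contained
X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=0, stratify=y_iris
)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)                 # store the training points

y_pred = clf.predict(X_test)              # majority vote among the 3 nearest neighbours
accuracy = clf.score(X_test, y_test)      # fraction of correct predictions on the test set

print(y_pred[:5], accuracy)
```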

**Using KNN to impute missing values in our dataset**

Apart from classification and regression, there is another very popular use case for KNN: imputing missing feature values in our dataset.

The class is called KNNImputer, and you can read the documentation for it [here](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html).

Let's see how this works.

In [None]:
# create a dataset of students' marks

data = {'Maths': [78, 90, 83, 95],
        'Chemistry': [60, 79, 80, 82],
        'Physics': [66, 71, 80, 78],
        'Biology': [78, 83, 75, np.nan],
        'y': ['y_1', 'y_2', 'y_3', 'y_4']}

df = pd.DataFrame(data)
df

Let's pretend that Maths, Chemistry, Physics and Biology are the features and that the column y is the target.

The target can be anything (discrete or continuous); it doesn't matter here.

In [None]:
X, y = df.drop(columns=['y']), df['y']

X

We see that we have a problem in our Biology column: it's missing a value. What should we replace it with?

Well, one simple and reasonable thing to try is to use the **other columns** as features,
and in doing so, find the nearest neighbours to the row (index 3) that is missing a value in Biology.

Once we've found our nearest neighbours - we'll impute our missing value with the average of *their* values for Biology.
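
To make the idea concrete, here is a rough sketch of that calculation done by hand. It assumes a Euclidean distance that simply skips missing coordinates, computed with scikit-learn's `nan_euclidean_distances` helper. The names `marks`, `target_row`, `candidates` and `manual_value` are made up for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same marks as in the table above; columns are Maths, Chemistry, Physics, Biology
marks = np.array([
    [78, 60, 66, 78],
    [90, 79, 71, 83],
    [83, 80, 80, 75],
    [95, 82, 78, np.nan],
])

target_row = 3            # the row with the missing Biology mark
candidates = [0, 1, 2]    # rows that do have a Biology mark

# Distance from the target row to each candidate, ignoring the missing coordinate
distances = nan_euclidean_distances(marks[[target_row]], marks[candidates])[0]

# Keep the two closest candidates and average their Biology marks
nearest = [candidates[i] for i in np.argsort(distances)[:2]]
manual_value = marks[nearest, 3].mean()

print(nearest, manual_value)   # [1, 2] 79.0
```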

Scikit-learn has a neat class that does precisely this for us automatically:

In [None]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)          # find the two nearest neighbours

X_after_imputation = imputer.fit_transform(X)

X_after_imputation

As we can see above, the algorithm seems to have identified rows 1 and 2 as the nearest neighbours, and imputed our missing value with the average of their
values for Biology.

In [None]:
(83+75)/2

**Imputing several missing values**

KNNImputer can also impute several missing values at once.

In [None]:
# create a dataset of students' marks

data = {'Maths': [78, 90, 83, 95],
        'Chemistry': [60, 79, 80, 82],
        'Physics': [66, 71, 80, 78],
        'Biology': [78, 83, np.nan, np.nan],
        'y': ['y_1', 'y_2', 'y_3', 'y_4']}

df = pd.DataFrame(data)
df

In [None]:
X, y = df.drop(columns=['y']), df['y']

X

In [None]:
imputer = KNNImputer(n_neighbors=2)          # find the two nearest neighbours

X_after_imputation = imputer.fit_transform(X)

X_after_imputation
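
One detail worth spelling out: a row can only act as a neighbour for a column if it has a value in that column. In this dataset only rows 0 and 1 have a Biology mark, so with `n_neighbors=2` both missing entries should be filled with the mean of those two marks. The sketch below checks this; the names `marks`, `imputed` and `donor_mean` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Same marks as above; columns are Maths, Chemistry, Physics, Biology
marks = np.array([
    [78, 60, 66, 78],
    [90, 79, 71, 83],
    [83, 80, 80, np.nan],
    [95, 82, 78, np.nan],
])

imputed = KNNImputer(n_neighbors=2).fit_transform(marks)

# Only rows 0 and 1 can donate a Biology value
donor_mean = np.mean([78, 83])

print(imputed[:, 3], donor_mean)   # both imputed entries equal 80.5
```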

**It can even handle missing values in multiple columns as well**

In [None]:
# create a dataset of students' marks

data = {'Maths': [np.nan, 90, 83, 95],
        'Chemistry': [60, np.nan, 80, 82],
        'Physics': [66, 71, 80, 78],
        'Biology': [78, 83, np.nan, np.nan],
        'y': ['y_1', 'y_2', 'y_3', 'y_4']}

df = pd.DataFrame(data)
df

In [None]:
X, y = df.drop(columns=['y']), df['y']

X

In [None]:
imputer = KNNImputer(n_neighbors=2)          # find the two nearest neighbours

X_after_imputation = imputer.fit_transform(X)

X_after_imputation
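
Note that `fit_transform` returns a plain NumPy array by default, so the column names are lost. A small sketch of one way to get a DataFrame back is shown below; the names `X_demo` and `X_imputed_df` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Same features as above, rebuilt here so the example is self-contained
X_demo = pd.DataFrame({
    "Maths": [np.nan, 90, 83, 95],
    "Chemistry": [60, np.nan, 80, 82],
    "Physics": [66, 71, 80, 78],
    "Biology": [78, 83, np.nan, np.nan],
})

imputer = KNNImputer(n_neighbors=2)

# Wrap the imputed array in a DataFrame, reusing the original labels
X_imputed_df = pd.DataFrame(
    imputer.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)

print(X_imputed_df)
```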

---

## Some caveats

1. Always be mindful when imputing missing values; don't just use KNNImputer mindlessly and hope for the best.
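2. In the examples above we didn't need to scale our features before using KNNImputer, since all the marks are on the same scale. Keep in mind, though, that KNNImputer finds neighbours by Euclidean distance, so a feature on a much larger scale can dominate the distance and may need scaling first.
3. Likewise, as mentioned previously, when using KNN for regression or classification we must scale our features beforehand!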
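
The scaling caveat for classification can be sketched with a pipeline that standardises the features before the KNN step. This is only an illustration: the built-in wine dataset, the 75/25 split and the names `unscaled` and `scaled` are assumptions, and the exact scores will depend on the data and the split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in dataset whose features live on very different scales
X_wine, y_wine = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_wine, y_wine, test_size=0.25, random_state=0, stratify=y_wine
)

# KNN on the raw features
unscaled = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# KNN on standardised features; the scaler is fitted on the train set only
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled.fit(X_train, y_train)

unscaled_score = unscaled.score(X_test, y_test)
scaled_score = scaled.score(X_test, y_test)

print("without scaling:", unscaled_score)
print("with scaling:   ", scaled_score)
```

Putting the scaler inside the pipeline means that the test set is transformed with statistics learned from the train set, which avoids leaking test information into training.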