# KNN Imputation

## Euclidean Distance
* This distance is based on the **Pythagorean theorem**, **AC^2 = AB^2 + BC^2**, where AC is the hypotenuse of a right triangle.
* Euclidean distance is the length of the shortest path between a source and a destination, which is a straight line, as shown in the figure.
<img src="images/Euclidean-distance-in-tensorflow.png" height=400 width=400 /> <img src="images/Manhattan Distance.gif" />
#### When to use?
* Suppose I fly from one place to another. To find the distance between the departure point and the destination, I use the Euclidean distance, because the flight path is essentially a straight line.
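
As a small illustration of the Pythagorean relation above, here is a minimal sketch in plain Python. The function name `euclidean_distance` and the sample points are made up for this example and are not part of the notebook.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square each coordinate difference, sum them, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A 3-4-5 right triangle: legs AB = 3 and BC = 4 give hypotenuse AC = 5.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The same function works for any number of coordinates, which is what matters when each row of a dataset is treated as a point.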

## Manhattan Distance
* The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors as the sum of the absolute differences of their coordinates.
#### When to use?
* Now suppose I cover the same distance by cab instead of flying. On the ground I cannot travel in a straight line because buildings and houses are in the way, so I measure the distance block by block. In this situation I use the *Manhattan* distance.
* Manhattan distance is the sum of the straight-line segments travelled along each axis between source (s) and destination (d), as shown in the figure.
<img src="images/Manhattan formula.png" />
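
The block-by-block idea can be sketched in a few lines of plain Python. The name `manhattan_distance` and the sample points are assumptions chosen for illustration only.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))

# Going 3 blocks east and 4 blocks north covers 7 blocks in total,
# while the straight-line (Euclidean) distance between the same points is 5.
print(manhattan_distance((0, 0), (3, 4)))  # 7
```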

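When some coordinates are missing, `KNNImputer` measures the distance between two rows with a NaN-aware Euclidean distance (`nan_euclidean`), which uses only the coordinates present in both rows and scales the result up:

**dist(x,y) = sqrt(weight * sq. distance from present coordinates), where weight = Total # of coordinates / # of present coordinates**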
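
To make the weighted formula above concrete, here is a hand-rolled sketch in plain Python. The helper `nan_euclidean` is written for this example and is not scikit-learn's implementation. Returning NaN when the two rows share no present coordinate is my reading of how scikit-learn behaves, so treat that branch as an assumption.

```python
import math

def nan_euclidean(x, y):
    """Euclidean distance that skips coordinates missing in either vector."""
    # Keep only coordinate pairs where both values are present.
    present = [(a, b) for a, b in zip(x, y)
               if not (math.isnan(a) or math.isnan(b))]
    if not present:
        return math.nan  # assumption: undefined when nothing can be compared
    weight = len(x) / len(present)  # total coordinates / present coordinates
    sq_dist = sum((a - b) ** 2 for a, b in present)
    return math.sqrt(weight * sq_dist)

nan = math.nan
# Only the first and last coordinates are present in both rows:
# sqrt(4/2 * ((3 - 1)^2 + (6 - 5)^2)) = sqrt(10)
d = nan_euclidean([3, nan, nan, 6], [1, nan, 4, 5])
print(round(d, 4))  # 3.1623
```

The weight compensates for the skipped coordinates, so rows with many missing values are not judged artificially close to each other.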

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('/home/saad/Downloads/tested.csv', usecols=['Age','Pclass','Fare','Survived'])
df.head()

Unnamed: 0,Survived,Pclass,Age,Fare
0,0,3,34.5,7.8292
1,1,3,47.0,7.0
2,0,2,62.0,9.6875
3,0,3,27.0,8.6625
4,1,3,22.0,12.2875


In [3]:
df.isnull().sum()

Survived     0
Pclass       0
Age         86
Fare         1
dtype: int64

In [4]:
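# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which is deprecated in recent pandas versions.
df['Fare'] = df['Fare'].fillna(df['Fare'].median())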

In [5]:
df.isnull().mean()*100

Survived     0.000000
Pclass       0.000000
Age         20.574163
Fare         0.000000
dtype: float64

In [6]:
X = df.drop('Survived', axis=1)
y = df['Survived']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [8]:
X_train.head()

Unnamed: 0,Pclass,Age,Fare
280,3,23.0,8.6625
284,3,2.0,20.2125
40,3,39.0,13.4167
17,3,21.0,7.225
362,2,31.0,21.0


In [9]:
knn = KNNImputer()

X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)
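
To show roughly what the imputer does for a single missing value, here is a simplified sketch in plain Python. With its defaults (`n_neighbors=5`, `weights='uniform'`), `KNNImputer` fills a missing entry with the plain mean of that feature over the nearest rows that have it. The helpers `nan_euclidean` and `knn_impute_value` and the three toy rows are hypothetical and are not taken from the dataset. The sketch uses `k=2` to stay small and ignores edge cases that the real class handles.

```python
import math

def nan_euclidean(x, y):
    """NaN-aware Euclidean distance over the coordinates present in both rows."""
    present = [(a, b) for a, b in zip(x, y)
               if not (math.isnan(a) or math.isnan(b))]
    if not present:
        return math.nan
    weight = len(x) / len(present)
    return math.sqrt(weight * sum((a - b) ** 2 for a, b in present))

def knn_impute_value(row, donors, col, k=2):
    """Fill row[col] with the mean of that column over the k nearest donors."""
    # Only donors that actually have the target column can contribute.
    candidates = [d for d in donors if not math.isnan(d[col])]
    candidates.sort(key=lambda d: nan_euclidean(row, d))
    nearest = candidates[:k]
    return sum(d[col] for d in nearest) / len(nearest)

# Hypothetical toy rows in the order [Pclass, Age, Fare].
donors = [
    [3, 22.0, 7.25],
    [3, 30.0, 8.05],
    [1, 50.0, 80.00],
]
row = [3, math.nan, 7.90]  # Age is missing

# The two third-class, low-fare passengers are closest, so Age = (22 + 30) / 2.
print(knn_impute_value(row, donors, col=1, k=2))  # 26.0
```

Because `Fare` spans a much wider range than `Pclass`, it dominates the distance in this toy example.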

In [10]:
pd.DataFrame(X_train_trf, columns=X_train.columns)

Unnamed: 0,Pclass,Age,Fare
0,3.0,23.0,8.6625
1,3.0,2.0,20.2125
2,3.0,39.0,13.4167
3,3.0,21.0,7.2250
4,2.0,31.0,21.0000
...,...,...,...
329,3.0,29.0,7.8542
330,1.0,29.8,31.6833
331,3.0,29.0,7.9250
332,2.0,24.0,27.7208


In [11]:
lr = LogisticRegression()

lr.fit(X_train_trf, y_train)
y_pred = lr.predict(X_test_trf)

accuracy_score(y_test, y_pred)

0.6547619047619048

In [12]:
knn = KNNImputer(n_neighbors=5, weights='distance')

X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)
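
With `weights='distance'`, closer neighbors count more than distant ones. The sketch below shows the idea using inverse-distance weights. The function `weighted_average` and the sample numbers are made up for illustration. Letting an exact match (distance 0) decide the value on its own is my understanding of the library's behaviour, so it is labeled as an assumption.

```python
def weighted_average(values, distances):
    """Average of values weighted by 1 / distance."""
    # Assumption: neighbors at distance 0 take all the weight,
    # which also avoids dividing by zero.
    exact = [v for v, d in zip(values, distances) if d == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

ages = [20.0, 30.0]   # Age values of two hypothetical neighbors
dists = [1.0, 4.0]    # their distances to the row being imputed

print(sum(ages) / len(ages))          # uniform weighting: 25.0
print(weighted_average(ages, dists))  # distance weighting: 22.0
```

The weighted result is pulled toward the nearer neighbor. When all neighbors are at similar distances, both schemes give nearly the same values, which is one possible reason a downstream score might not change.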

In [13]:
lr = LogisticRegression()

lr.fit(X_train_trf, y_train)
y_pred = lr.predict(X_test_trf)

accuracy_score(y_test, y_pred)

0.6547619047619048

In [14]:
# Comparison with SimpleImputer (default strategy: mean)

si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [15]:
lr = LogisticRegression()

lr.fit(X_train_trf2,y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.6428571428571429

## Advantage
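* It often gives more accurate results than simple mean imputation, because each missing value is estimated from similar rows. In this experiment the accuracy was about 0.655 with KNN imputation versus 0.643 with mean imputation.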

## Disadvantage
* It uses a lot of memory and is slow, because distances to the training rows must be computed for every row with missing values.
* In production, the complete training data must be available to the imputer to find neighbors for new rows, which can take a huge amount of memory.