## KNN in the Medical Industry - Case Study: Heart Disease
- In this tutorial, I use feature rescaling to improve the performance of a KNN classifier.
- Before rescaling, my KNN model achieved around 55% on all evaluation metrics, and after hyperparameter tuning its performance rose to about 75%.
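
To see why rescaling matters for KNN, it helps to look at how much each feature contributes to the distance between two patients. The sketch below takes the continuous features of the first two rows printed by `df.head()` in this tutorial and assumes the default Euclidean distance; the variable names are only illustrative.

```python
# Rows 0 and 1 of heart.csv, continuous features only (values from df.head()).
patient_a = {'age': 63, 'trestbps': 145, 'chol': 233, 'thalach': 150, 'oldpeak': 2.3}
patient_b = {'age': 37, 'trestbps': 130, 'chol': 250, 'thalach': 187, 'oldpeak': 3.5}

# Squared difference per feature: the terms summed inside the Euclidean distance.
squared_diff = {name: (patient_a[name] - patient_b[name]) ** 2 for name in patient_a}
total = sum(squared_diff.values())

for name, value in squared_diff.items():
    print(f'{name:>8}: {value:8.2f}  ({value / total:.1%} of squared distance)')
```

On the raw values, `oldpeak` accounts for well under 1% of the squared distance, so features with large numeric ranges decide which neighbors are "nearest" unless the features are rescaled.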

### Import Library

In [65]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

### Load Dataset

In [66]:
df = pd.read_csv('heart.csv')

### Sneak Peek at the Data

In [67]:
#See top 5 data
df.head()

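|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |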


In [68]:
df.shape

(303, 14)

### Modelling - Standard Scaler

In [69]:
df_model = df.copy()

In [70]:
#Rescale the continuous features: age, trestbps, chol, thalach, oldpeak.
#Note: the scaler is fit on all rows here, before the train/test split.
scaler = StandardScaler()

features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_model[features] = scaler.fit_transform(df_model[features])
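
The cell above fits the scaler on all 303 rows before the data is split, so the test rows influence the scaling statistics. A common alternative is to put the scaler and the classifier in one `Pipeline`, so the scaler is fit only on the training split. The following is a minimal sketch of that idea, not the approach used in this notebook: `make_scaled_knn` is a hypothetical helper, and the `demo` frame is synthetic data standing in for `heart.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

NUMERIC = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

def make_scaled_knn(scaler, numeric_cols=NUMERIC, **knn_params):
    """Hypothetical helper: scale numeric_cols, pass other columns through, then KNN."""
    preprocess = ColumnTransformer(
        [('scale', scaler, list(numeric_cols))],
        remainder='passthrough',
    )
    return Pipeline([('preprocess', preprocess),
                     ('knn', KNeighborsClassifier(**knn_params))])

# Synthetic stand-in for heart.csv (assumption: not the real data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'age': rng.integers(30, 78, n),
    'trestbps': rng.integers(94, 200, n),
    'chol': rng.integers(126, 400, n),
    'thalach': rng.integers(71, 202, n),
    'oldpeak': rng.uniform(0, 6, n).round(1),
    'sex': rng.integers(0, 2, n),
})
demo['target'] = (demo['thalach'] > 150).astype(int)

X = demo.drop(columns=['target'])
y = demo['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# fit() learns the scaling statistics from X_train only; X_test is merely transformed.
model = make_scaled_knn(StandardScaler())
model.fit(X_train, y_train)
demo_accuracy = model.score(X_test, y_test)
print(demo_accuracy)
```

With the real data, `demo` would be replaced by `df`, and any of the three scalers could be passed to the helper.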

In [71]:
#Create KNN Object
knn = KNeighborsClassifier()

In [72]:
#Create x and y variable
x = df_model.drop(columns=['target'])
y = df_model['target']

In [73]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

In [74]:
#Training the model
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [75]:
#Predict testing set
y_pred = knn.predict(x_test)

In [76]:
#Check performance using accuracy
print(accuracy_score(y_test, y_pred))

0.8524590163934426


In [77]:
#Check performance using ROC AUC (computed here from the hard 0/1 predictions)
roc_auc_score(y_test, y_pred)

0.8383333333333333

- With StandardScaler, the KNN model's performance rises to about 85% accuracy and about 84% ROC AUC.
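
One detail about the ROC score: `roc_auc_score` is given the hard 0/1 predictions, which yields a ROC curve with a single operating point. For binary labels that value equals the balanced accuracy, while passing class probabilities from `predict_proba` uses the full curve. The sketch below illustrates the difference on synthetic data from `make_classification` (an assumption, not the heart-disease data).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the scaled heart-disease features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

knn = KNeighborsClassifier().fit(X_train, y_train)
labels = knn.predict(X_test)

# Hard 0/1 labels: the ROC curve has one operating point.
auc_from_labels = roc_auc_score(y_test, labels)
balanced_acc = balanced_accuracy_score(y_test, labels)

# Class-1 probabilities (share of the 5 neighbors voting for class 1): full curve.
auc_from_scores = roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1])

print(auc_from_labels, balanced_acc, auc_from_scores)
```

The first two printed numbers coincide; the third is the threshold-independent ROC AUC that most reports mean by "ROC".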

### Modelling - Robust Scaler

In [78]:
df_model = df.copy()

In [79]:
#Rescale the continuous features: age, trestbps, chol, thalach, oldpeak.
#Note: the scaler is fit on all rows here, before the train/test split.
scaler = RobustScaler()

features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_model[features] = scaler.fit_transform(df_model[features])

In [80]:
#create x and y variable
x = df_model.drop(columns=['target'])
y = df_model['target']

In [81]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

In [82]:
#Training the model
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [83]:
#Predict testing set
y_pred = knn.predict(x_test)

In [84]:
#Check performance using accuracy
print(accuracy_score(y_test, y_pred))

0.8360655737704918


In [85]:
#Check performance using roc
roc_auc_score(y_test, y_pred)

0.8183333333333332

- With RobustScaler, accuracy is about 84% and ROC AUC about 82%, slightly below StandardScaler.
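
The two scalers differ in the statistics they use: StandardScaler centers on the mean and divides by the standard deviation, while RobustScaler (with default settings) centers on the median and divides by the interquartile range, which makes it less sensitive to outliers. As a small worked example, the sketch below reproduces both by hand on the five `chol` values printed by `df.head()`; it is only an illustration on five rows, not a result for the full dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# chol values of the five rows shown by df.head(); 354 lies far from the rest.
chol = np.array([233, 250, 204, 236, 354], dtype=float).reshape(-1, 1)

# StandardScaler: subtract the mean, divide by the population standard deviation.
standard_by_hand = (chol - chol.mean()) / chol.std()

# RobustScaler (defaults): subtract the median, divide by the IQR (75th - 25th percentile).
q25, q75 = np.percentile(chol, [25, 75])
robust_by_hand = (chol - np.median(chol)) / (q75 - q25)

standard = StandardScaler().fit_transform(chol)
robust = RobustScaler().fit_transform(chol)

print(np.allclose(standard, standard_by_hand), np.allclose(robust, robust_by_hand))
```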

### Modelling - MinMax Scaler

In [86]:
df_model = df.copy()

In [87]:
#Rescale the continuous features: age, trestbps, chol, thalach, oldpeak.
#Note: the scaler is fit on all rows here, before the train/test split.
scaler = MinMaxScaler()

features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_model[features] = scaler.fit_transform(df_model[features])

In [88]:
#create x and y variable
x = df_model.drop(columns=['target'])
y = df_model['target']

In [89]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

In [90]:
#Training the model
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [91]:
#Predict testing set
y_pred = knn.predict(x_test)

In [92]:
#Check performance using accuracy
print(accuracy_score(y_test, y_pred))

0.8688524590163934


In [93]:
#Check performance using roc
roc_auc_score(y_test, y_pred)

0.8705555555555555

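- With MinMaxScaler, the model scores highest of the three rescaling methods, at about 87% for both accuracy and ROC AUC. On this 61-row test set that is 53 correct predictions, against 52 with StandardScaler and 51 with RobustScaler, so the gap between scalers is only one or two patients.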
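
Because a single 80/20 split separates the scalers by only a handful of test rows, a cross-validated comparison can give a steadier picture. The sketch below is one possible way to do that, under a few assumptions: it uses synthetic data from `make_classification` in place of `heart.csv`, and it scales every column instead of only the five continuous ones.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in for the unscaled features and target (assumption).
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)
X[:, 0] *= 100  # give one feature a much larger range, as chol has in the real data

cv_accuracy = {}
for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler()):
    # The pipeline refits the scaler inside every fold, on that fold's training part only.
    model = make_pipeline(scaler, KNeighborsClassifier())
    scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')
    cv_accuracy[type(scaler).__name__] = scores.mean()

for name, mean_accuracy in cv_accuracy.items():
    print(f'{name}: {mean_accuracy:.3f}')
```

With the real data, `X` and `y` would come from `df.drop(columns=['target'])` and `df['target']`.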

### Hyperparameter Tuning

In [94]:
#List Hyperparameters
leaf_size = list(range(1,50))
n_neighbors = list(range(1,30))
p=[1,2]

In [95]:
#convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
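
As a quick check on the cost of this grid, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. This is only a sketch for sizing the search; the variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

hyperparameters = dict(leaf_size=list(range(1, 50)),
                       n_neighbors=list(range(1, 30)),
                       p=[1, 2])

n_candidates = len(ParameterGrid(hyperparameters))  # 49 * 29 * 2 combinations
n_fits = n_candidates * 10                          # one fit per fold with cv=10
print(n_candidates, n_fits)  # 2842 28420
```

Note that `leaf_size` is passed to the tree structures used for the neighbor search and affects speed and memory use, so it is usually not expected to change the predictions; dropping it would shrink the grid to 58 candidates.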

In [97]:
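#Run the grid search with 10-fold cross-validation
#Note: this fits on the full x and y, which include the rows in x_test
clf = GridSearchCV(knn, hyperparameters, cv=10)
best_model = clf.fit(x,y)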

In [99]:
#Best hyperparameter values found by the grid search
print('Best leaf_size:', best_model.best_params_['leaf_size'])
print('Best p:', best_model.best_params_['p'])
print('Best n_neighbors:', best_model.best_params_['n_neighbors'])

Best leaf_size: 1
Best p: 1
Best n_neighbors: 9


In [100]:
#Predict testing set
y_pred = best_model.predict(x_test)

In [102]:
#Check performance using accuracy
print(accuracy_score(y_test, y_pred))

0.9180327868852459


In [103]:
#Check performance using ROC
roc_auc_score(y_test, y_pred)

0.9122222222222222

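- With hyperparameter tuning plus rescaling, both accuracy and ROC AUC exceed 90%. Because the grid search was fit on the full dataset, including the rows later used for testing, these scores are likely optimistic.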
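
To get an estimate that is not influenced by the test rows, the grid search can be fit on the training split only, with the scaler inside the pipeline. The sketch below shows one way to set that up; it is not the procedure that produced the numbers above, it uses synthetic data from `make_classification` in place of `heart.csv`, and it leaves `leaf_size` out of the grid as an assumption that it does not affect predictions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the unscaled features and target (assumption).
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

pipeline = Pipeline([('scale', MinMaxScaler()), ('knn', KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {'knn__n_neighbors': list(range(1, 30)), 'knn__p': [1, 2]}

search = GridSearchCV(pipeline, param_grid, cv=10, scoring='accuracy')
search.fit(X_train, y_train)  # the test rows are never seen during tuning

held_out_accuracy = search.score(X_test, y_test)
print(search.best_params_, held_out_accuracy)
```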