## KNN in the Medical Industry - Case Study: Heart Disease
- In this tutorial, I will use rescaling methods to improve the performance of KNN.
- Before rescaling, my KNN model achieved around 55% on all evaluation metrics, and hyperparameter tuning raised its performance to about 75%.
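
KNN classifies a patient by looking at the nearest neighbours, so a feature measured on a large numeric scale can dominate the distance. The sketch below illustrates this with two hypothetical patients described only by `chol` and `oldpeak`. The patient values and the feature ranges are assumptions chosen for illustration, not statistics taken from `heart.csv`.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max(value, low, high):
    """Rescale a value to [0, 1] given an assumed feature range."""
    return (value - low) / (high - low)


# Two hypothetical patients described by (chol, oldpeak).
patient_a = (233.0, 0.5)
patient_b = (250.0, 3.5)

# Illustrative ranges (assumptions, not measured from heart.csv).
CHOL_RANGE = (100.0, 600.0)
OLDPEAK_RANGE = (0.0, 6.0)

raw_distance = euclidean(patient_a, patient_b)

scaled_a = (min_max(patient_a[0], *CHOL_RANGE), min_max(patient_a[1], *OLDPEAK_RANGE))
scaled_b = (min_max(patient_b[0], *CHOL_RANGE), min_max(patient_b[1], *OLDPEAK_RANGE))
scaled_distance = euclidean(scaled_a, scaled_b)

# Share of the squared distance contributed by chol, before and after rescaling.
raw_chol_share = (patient_a[0] - patient_b[0]) ** 2 / raw_distance ** 2
scaled_chol_share = (scaled_a[0] - scaled_b[0]) ** 2 / scaled_distance ** 2

print(f"chol share of raw distance:    {raw_chol_share:.3f}")
print(f"chol share of scaled distance: {scaled_chol_share:.3f}")
```

On the raw values, almost all of the distance comes from `chol`, even though the two patients differ far more in `oldpeak` relative to its range. After rescaling, `oldpeak` carries most of the distance.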

### Import Library

In [1]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

### Load Dataset

In [2]:
df = pd.read_csv('heart.csv')

### Sneak Peek at the Data

In [3]:
# Show the first 5 rows
df.head()

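| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |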


In [4]:
df.shape

(303, 14)

### Modelling - Standard Scaler

In [5]:
df_model = df.copy()

In [6]:
# Rescale the continuous features: age, trestbps, chol, thalach, oldpeak.
scaler = StandardScaler()

features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_model[features] = scaler.fit_transform(df_model[features])

In [7]:
#Create KNN Object
knn = KNeighborsClassifier()

In [8]:
#Create x and y variable
x = df_model.drop(columns=['target'])
y = df_model['target']

In [9]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

In [10]:
#Training the model
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [11]:
#Predict testing set
y_pred = knn.predict(x_test)

In [12]:
#Check performance using accuracy
print(accuracy_score(y_test, y_pred))

0.8524590163934426


In [13]:
#Check performance using roc
roc_auc_score(y_test, y_pred)

0.8383333333333333

- From the values above, we can see that with StandardScaler the performance of the KNN model increases to about 85.2% accuracy and about 83.8% ROC AUC!
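
In the cells above, the scaler is fit on the full dataset before the train/test split, so the test rows contribute to the scaling statistics. One common alternative is to put the scaler and the classifier in a `Pipeline`, so the scaler is fit on the training split only. The sketch below shows the idea on synthetic data from `make_classification`, used here as a stand-in for `heart.csv`. The column indices are assumptions standing in for the five continuous features.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv (assumption: 303 rows, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Scale only the first five columns, standing in for the continuous features;
# the remaining columns pass through unchanged.
continuous_columns = [0, 1, 2, 3, 4]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), continuous_columns)],
    remainder="passthrough",
)

model = Pipeline([("preprocess", preprocess), ("knn", KNeighborsClassifier())])

# fit() learns the scaling statistics from the training split only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With the real data, the column indices could be replaced by the column names and `X` by the DataFrame without `target`. The scores may differ somewhat from those reported above.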

### Modelling - Robust Scaler

In [14]:
df_model = df.copy()

In [15]:
# Rescale the continuous features: age, trestbps, chol, thalach, oldpeak.
scaler = RobustScaler()

features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_model[features] = scaler.fit_transform(df_model[features])

In [16]:
#create x and y variable
x = df_model.drop(columns=['target'])
y = df_model['target']

In [17]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

In [18]:
#Training the model
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [19]:
#Predict testing set
y_pred = knn.predict(x_test)

In [20]:
#Check performance using accuracy
print(accuracy_score(y_test, y_pred))

0.8360655737704918


In [21]:
#Check performance using roc
roc_auc_score(y_test, y_pred)

0.8183333333333332

- With RobustScaler, accuracy is about 83.6% and ROC AUC is about 81.8%. Both are below the StandardScaler results.
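
For context on how the two scalers differ: `StandardScaler` centres on the mean and divides by the standard deviation, while `RobustScaler` with default settings centres on the median and divides by the interquartile range. The sketch below compares them on a made-up feature with one extreme value. The numbers are illustrative and not taken from `heart.csv`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up single feature with one extreme outlier (not taken from heart.csv).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

# Spread of the four ordinary values after each transformation.
robust_spread = robust[:4].max() - robust[:4].min()
standard_spread = standard[:4].max() - standard[:4].min()

print("RobustScaler:  ", np.round(robust, 3))
print("StandardScaler:", np.round(standard, 3))
print(f"Inlier spread: robust={robust_spread:.3f}, standard={standard_spread:.3f}")
```

The outlier inflates the standard deviation, which squeezes the ordinary values together under `StandardScaler`. `RobustScaler` keeps them spread out but leaves the outlier far away. Which behaviour suits KNN better depends on the data, which may be one reason the two scalers score differently here.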

### Modelling - MinMax Scaler

In [22]:
df_model = df.copy()

In [23]:
# Rescale the continuous features: age, trestbps, chol, thalach, oldpeak.
scaler = MinMaxScaler()

features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_model[features] = scaler.fit_transform(df_model[features])

In [24]:
#create x and y variable
x = df_model.drop(columns=['target'])
y = df_model['target']

In [25]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

In [26]:
#Training the model
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [27]:
#Predict testing set
y_pred = knn.predict(x_test)

In [28]:
#Check performance using accuracy
print(accuracy_score(y_test, y_pred))

0.8688524590163934


In [29]:
#Check performance using roc
roc_auc_score(y_test, y_pred)

0.8705555555555555

- With MinMaxScaler, the model performs best among the three rescaling methods, with about 86.9% accuracy and about 87.1% ROC AUC!
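
A note on reading these numbers: with `test_size=0.2` the test set holds 61 of the 303 rows, so the three accuracies correspond to 52, 51 and 53 correct predictions. The scalers are therefore separated by one or two patients. Cross-validation can give a steadier comparison. The sketch below is a minimal version on synthetic data standing in for `heart.csv`. As a simplifying assumption it scales every column, whereas the notebook scales only five features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in for heart.csv (assumption: 303 rows, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=4)

scalers = {
    "StandardScaler": StandardScaler(),
    "RobustScaler": RobustScaler(),
    "MinMaxScaler": MinMaxScaler(),
}

mean_accuracy = {}
for name, scaler in scalers.items():
    # The pipeline refits the scaler inside every fold, on that fold's training part.
    pipeline = make_pipeline(scaler, KNeighborsClassifier())
    scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the gap between two scalers, the ranking from a single split should be treated with caution.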

### Hyperparameter Tuning

In [30]:
#List Hyperparameters
leaf_size = list(range(1,50))
n_neighbors = list(range(1,30))
p=[1,2]

In [31]:
#convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)

In [32]:
#Making model
clf = GridSearchCV(knn, hyperparameters, cv=10)
best_model = clf.fit(x_train,y_train)

In [33]:
#Best Hyperparameters Value
print('Best leaf_size:', best_model.best_estimator_.get_params()['leaf_size'])
print('Best p:', best_model.best_estimator_.get_params()['p'])
print('Best n_neighbors:', best_model.best_estimator_.get_params()['n_neighbors'])

Best leaf_size: 1
Best p: 1
Best n_neighbors: 9
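
A side note on `leaf_size`: according to the scikit-learn documentation, it is passed to the underlying `BallTree` or `KDTree` and affects the speed and memory use of the neighbour search, not which neighbours are found. The reported best value of 1 is therefore probably just the first of many tied candidates. The grid above has 49 x 29 x 2 = 2842 combinations, and dropping `leaf_size` would reduce it to 58. The sketch below checks the claim on synthetic data standing in for `heart.csv`. Forcing `algorithm="kd_tree"` is an assumption made here so that `leaf_size` is actually used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for heart.csv (assumption: 303 rows, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=4)
X_train, y_train, X_test = X[:242], y[:242], X[242:]

predictions = {}
for leaf_size in (1, 30):
    # Same n_neighbors and p as the best model above; only leaf_size changes.
    knn_tree = KNeighborsClassifier(
        n_neighbors=9, p=1, algorithm="kd_tree", leaf_size=leaf_size
    )
    knn_tree.fit(X_train, y_train)
    predictions[leaf_size] = knn_tree.predict(X_test)

same_predictions = np.array_equal(predictions[1], predictions[30])
print("Identical predictions:", same_predictions)
```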


In [34]:
#Predict testing set
y_pred = best_model.predict(x_test)

In [35]:
#Check performance using accuracy
print(accuracy_score(y_test, y_pred))

0.8688524590163934


In [36]:
#Check performance using ROC
roc_auc_score(y_test, y_pred)

0.8644444444444445

- With hyperparameter tuning plus rescaling, performance is about the same as MinMaxScaler without tuning: accuracy is unchanged at about 86.9%, and ROC AUC is slightly lower (about 86.4% versus 87.1%).
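
The ROC scores in this notebook are computed from hard 0/1 predictions. For a binary target, that reduces the ROC curve to a single threshold, and the resulting value equals the average of sensitivity and specificity. `roc_auc_score` is more commonly given a continuous score such as the predicted probability of the positive class. The sketch below shows both variants side by side on synthetic data standing in for `heart.csv`. The pipeline and its hyperparameters mirror the tuned model only loosely and are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for heart.csv (assumption: 303 rows, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=9, p=1))
model.fit(X_train, y_train)

# AUC from hard 0/1 labels, as computed in the cells above.
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))

# AUC from the predicted probability of the positive class (column 1).
positive_scores = model.predict_proba(X_test)[:, 1]
auc_from_scores = roc_auc_score(y_test, positive_scores)

print(f"ROC AUC from labels:        {auc_from_labels:.3f}")
print(f"ROC AUC from probabilities: {auc_from_scores:.3f}")
```

With `n_neighbors=9` and uniform weights, the probabilities take at most ten distinct values (0/9 to 9/9), so the curve is still coarse, but it uses more information than a single threshold.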