<a href="https://colab.research.google.com/github/devi777/Heart-Disease-Classification/blob/master/HD_Prediction2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contents

1. K-nearest Neighbors

# Pre-processing Data

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [0]:
df = pd.read_csv('heart.csv')

In [0]:
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/5, random_state = 0)

In [0]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
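
As a rough illustration of what the scaling step does, the sketch below standardizes one made-up feature column in plain Python. The mean and standard deviation are computed from the training values only and then reused on the test values, which mirrors calling `fit_transform` on the training set and `transform` on the test set. The numbers and the helper names `fit_scaler` and `apply_scaler` are illustrative assumptions, not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(train_values), pstdev(train_values)

def apply_scaler(values, mean, std):
    """Standardize values using statistics learned from the training data."""
    return [(v - mean) / std for v in values]

# Hypothetical feature values (not taken from heart.csv)
train_col = [50.0, 60.0, 70.0]
test_col = [55.0, 80.0]

mean, std = fit_scaler(train_col)
train_scaled = apply_scaler(train_col, mean, std)
test_scaled = apply_scaler(test_col, mean, std)

print(train_scaled)  # zero mean, unit variance
print(test_scaled)   # scaled with the training mean and std, so not necessarily zero mean
```

Reusing the training statistics on the test set keeps information about the test data out of the preprocessing step.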

# Analyzing Data using KNN

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

## 1. With standard values

In [0]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 81.97%


In [0]:
y_pred = classifier.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[21  6]
 [ 5 29]]


In [0]:
# Correct predictions sit on the diagonal of the confusion matrix
print("Accuracy:",str(round(cm.trace()/cm.sum()*100,2)), "%")

Accuracy: 81.97 %
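
The same accuracy, along with a few other common metrics, can be read directly off the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix above unpacks as TN=21, FP=6, FN=5, TP=29. The variable names are illustrative.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[21, 6],
      [5, 29]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total     # share of all predictions that are correct
recall = tp / (tp + fn)          # share of actual positives that were found
specificity = tn / (tn + fp)     # share of actual negatives that were found
precision = tp / (tp + fp)       # share of predicted positives that are correct

print(f"Accuracy:    {accuracy * 100:.2f}%")
print(f"Recall:      {recall * 100:.2f}%")
print(f"Specificity: {specificity * 100:.2f}%")
print(f"Precision:   {precision * 100:.2f}%")
```

For a medical screening task, recall is often worth watching alongside accuracy, since a false negative means a missed case.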


So Logistic Regression does better than KNN with standard parameters at predicting whether a patient has heart disease. With these parameters, each test point is assigned to whichever class (0/1) is most common among its 5 nearest neighbors.
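
The majority-vote rule can be sketched in a few lines of plain Python. This is a toy version for illustration only, not scikit-learn's implementation. The function name `knn_predict` and the 2-D points are assumptions made up for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=5):
    """Classify query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points
    nearest_labels = [label for _, label in sorted(distances)[:k]]
    # The most common label among those neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points (not taken from heart.csv)
train_points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_points, train_labels, (0.5, 0.5), k=5)
pred_b = knn_predict(train_points, train_labels, (5.5, 5.5), k=5)
print(pred_a, pred_b)
```

For the second query, the 5 nearest neighbors are three points of class 1 and two of class 0, so class 1 wins the vote.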

The distance metric chosen here was the standard minkowski (with p=2, which is the same as euclidean). Let's see which distance metric gives a better result. Here is the list of distance metrics in the sklearn library:

## Metrics intended for real-valued vector spaces:

| identifier | class name | args | distance function |
|---|---|---|---|
| `"euclidean"` | EuclideanDistance | | `sqrt(sum((x - y)^2))` |
| `"manhattan"` | ManhattanDistance | | `sum(\|x - y\|)` |
| `"chebyshev"` | ChebyshevDistance | | `max(\|x - y\|)` |
| `"minkowski"` | MinkowskiDistance | p | `sum(\|x - y\|^p)^(1/p)` |
| `"wminkowski"` | WMinkowskiDistance | p, w | `sum(\|w * (x - y)\|^p)^(1/p)` |
| `"seuclidean"` | SEuclideanDistance | V | `sqrt(sum((x - y)^2 / V))` |
| `"mahalanobis"` | MahalanobisDistance | V/VI | `sqrt((x - y)' V^-1 (x - y))` |

## Metrics intended for two-dimensional vector spaces:

| identifier | class name | args | distance function |
|---|---|---|---|
| `"haversine"` | HaversineDistance | | `2 arcsin(sqrt(sin^2(0.5*dx) + cos(x1)cos(x2)sin^2(0.5*dy)))` |

Note that the haversine distance metric requires data in the form of [latitude, longitude] and both inputs and outputs are in units of radians.
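
To make the radians convention concrete, here is a small sketch of the haversine formula using only the `math` module. The function name `haversine`, the sample points, and the Earth radius of roughly 6371 km are assumptions chosen for illustration.

```python
import math

def haversine(point_a, point_b):
    """Great-circle angle in radians between two [latitude, longitude] points in radians."""
    lat1, lon1 = point_a
    lat2, lon2 = point_b
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    inner = (math.sin(0.5 * dlat) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin(0.5 * dlon) ** 2)
    return 2 * math.asin(math.sqrt(inner))

# Two points on the equator, a quarter turn apart in longitude
a = [0.0, 0.0]
b = [0.0, math.pi / 2]

angle = haversine(a, b)
print(angle)         # angle in radians

# Multiply by a sphere radius to get a length (approximate Earth radius in km)
print(angle * 6371)
```

Coordinates stored in degrees would need `math.radians` applied first. This metric does not apply to the heart disease features, which are not latitude and longitude pairs.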

For other distance metrics, such as metrics intended for integer-valued vector spaces or user-defined distances, refer to:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
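
Several of the real-valued metrics listed above can be written out in plain Python to see how they relate. This is a sketch for intuition, not how scikit-learn computes them internally, and the sample vectors are made up.

```python
import math

def euclidean(x, y):
    """sqrt(sum((x - y)^2))"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """sum(|x - y|)"""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    """max(|x - y|)"""
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """sum(|x - y|^p)^(1/p)"""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Hypothetical feature vectors; the coordinate differences are 3, 4 and 0
x = (1, 2, 3)
y = (4, 6, 3)

print("euclidean:", euclidean(x, y))
print("manhattan:", manhattan(x, y))
print("chebyshev:", chebyshev(x, y))
print("minkowski p=1:", minkowski(x, y, 1))
print("minkowski p=2:", minkowski(x, y, 2))
```

Minkowski with p=1 matches Manhattan and with p=2 matches Euclidean, which is why passing `metric='minkowski', p=2` and `metric='euclidean'` to the classifier should give the same neighbors.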

## 2. With more n_neighbors

In [0]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=7,metric='minkowski',p=2)
classifier.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform')

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 83.61%


The accuracy has increased slightly, from 81.97% to 83.61%. Let's try increasing the value to 10.

In [0]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=10,metric='minkowski',p=2)
classifier.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 88.52%


So on this test split, accuracy improved as n_neighbors went from 5 to 7 to 10. Good, now let's try changing the p value.

## 3. With p = 1 (Manhattan distance)

Manhattan distance is the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. It belongs to taxicab geometry, in which the usual Euclidean distance function is replaced by a metric where the distance between two points is the sum of the absolute differences of their Cartesian coordinates.
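
Changing p can change which training point counts as the nearest neighbor, which is why the accuracy can move when p changes. The toy example below uses made-up points and shows the nearest neighbor of a query flipping between p=1 and p=2. The helper name `minkowski` and the points are illustrative assumptions.

```python
def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1 / p)

query = (0, 0)
candidates = {"A": (3, 3), "B": (0, 5)}  # hypothetical training points

nearest_by_p = {}
for p in (1, 2):
    # Pick the candidate with the smallest distance to the query under this p
    nearest_by_p[p] = min(
        candidates, key=lambda name: minkowski(query, candidates[name], p)
    )
    print(f"p={p}: nearest neighbor is {nearest_by_p[p]}")
```

Under Manhattan distance, A is 6 away and B is 5 away, so B is nearest. Under Euclidean distance, A is about 4.24 away and B is 5 away, so A is nearest.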

In [0]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=1)
classifier.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                     weights='uniform')

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 83.61%


So it appears that Manhattan distance suits this dataset slightly better than Euclidean at n_neighbors=5 (83.61% vs. 81.97%). Great, now let's try the different distance metrics we talked about before.

## 4. With different distance metrics

In [0]:
#let's try with euclidean

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5,metric='euclidean',p=2)
classifier.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 81.97%


Nope, no difference, which is expected: minkowski with p=2 is the euclidean distance. Let's try manhattan.

In [0]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5,metric='manhattan',p=2)
classifier.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='manhattan',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 83.61%


Yup, better, and the same result as minkowski with p=1, since that is the manhattan distance. Let's try wminkowski then.

In [0]:
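from sklearn.neighbors import KNeighborsClassifier
# wminkowski takes two arguments: p and a weight vector w (one weight per feature).
# w is passed through metric_params; the uniform weights below are only a placeholder.
# Assumes an older scikit-learn: 'wminkowski' was deprecated in 1.1 and later removed.
w = np.ones(X_train.shape[1])
classifier = KNeighborsClassifier(n_neighbors=5,metric='wminkowski',p=2,metric_params={'w': w})
classifier.fit(X_train,y_train)

The first attempt, without metric_params, didn't work because wminkowski takes two arguments (p and w) and no w was supplied. Couldn't find anywhere how to calculate meaningful weights for this dataset (the examples found used different files such as HBD), and with uniform weights the metric reduces to plain minkowski, so moving on.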
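
One way to get a weighted Minkowski distance without depending on a particular scikit-learn version is to multiply each feature by its weight and then use the ordinary Minkowski distance. With the formula from the table above, sum(|w * (x - y)|^p)^(1/p), the two are mathematically identical. The sketch below checks this in plain Python. The weights and vectors are hypothetical, and how to choose good weights for this dataset remains an open question.

```python
import math

def wminkowski(x, y, w, p):
    """Weighted Minkowski: sum(|w * (x - y)|^p)^(1/p)."""
    return sum(abs(wi * (a - b)) ** p for a, b, wi in zip(x, y, w)) ** (1 / p)

def minkowski(x, y, p):
    """Ordinary Minkowski: sum(|x - y|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def rescale(x, w):
    """Multiply each feature by its weight."""
    return [wi * a for a, wi in zip(x, w)]

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
w = [2.0, 0.5, 1.0]  # hypothetical per-feature weights

direct = wminkowski(x, y, w, p=2)
via_rescaling = minkowski(rescale(x, w), rescale(y, w), p=2)
print(direct, via_rescaling)
```

Applied to the notebook, this would mean multiplying the columns of `X_train` and `X_test` by the same weight vector before fitting a classifier with `metric='minkowski'`.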

## Best performance set

In [0]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=10,metric='minkowski',p=2)
classifier.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 88.52%


Increasing n_neighbors beyond 10 reduces the test accuracy, so 10 neighbors is the best value found for this split.
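
A common alternative to picking n_neighbors by test accuracy is to compare candidate values with cross-validation on the training data, so the test set stays untouched until the end. The sketch below is a minimal leave-one-out version in plain Python on a made-up two-cluster dataset. The helper names `knn_predict` and `loo_accuracy` and the data are assumptions for illustration, not the notebook's method.

```python
from collections import Counter
import math

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to query."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (query, truth) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, query, k) == truth
    return hits / len(points)

# Hypothetical data: two well-separated clusters of five points each
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
labels = [0] * 5 + [1] * 5

scores = {k: loo_accuracy(points, labels, k) for k in (1, 3, 5, 7, 9)}
for k, score in scores.items():
    print(f"k={k}: leave-one-out accuracy {score:.2f}")
```

On this toy data the accuracy stays perfect up to k=7 and collapses at k=9, because the 9 neighbors of a held-out point include all 5 points of the other class and only 4 of its own. It is an extreme case of a too-large neighborhood hurting performance.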