Dataset

The dataset used is the Heart Disease Data Set from the Cleveland database, created in 1988 by V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D. (UCI Machine Learning Repository, n.d.). Because Cleveland is a major city in the United States, the dataset is relevant to predicting heart disease in United States citizens. The original dataset consists of 76 attributes, but published experiments refer to a subset of 14 attributes, with 303 instances.

* age: age in years
* sex: 1 = male, 0 = female
* cp (4 values): chest pain type
  * Value 1: typical angina
  * Value 2: atypical angina
  * Value 3: non-anginal pain
  * Value 4: asymptomatic
* trestbps: resting blood pressure in mm Hg on admission to the hospital
* chol: serum cholesterol in mg/dl
* fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
* restecg: resting electrocardiographic results (values 0, 1, 2)
  * Value 0: normal
  * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach: maximum heart rate achieved
* exang: exercise induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
  * Value 1: upsloping
  * Value 2: flat
  * Value 3: downsloping
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
* target: 1 = presence, 0 = absence

The dataset consists of 5 quantitative attributes (age, trestbps, chol, thalach and oldpeak) and 9 qualitative categorical attributes (sex, cp, fbs, restecg, exang, slope, ca, thal and target). ca is treated as a qualitative categorical attribute because it takes only 4 unique values.
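The split into quantitative and categorical attributes can be made explicit in pandas. The sketch below is illustrative only: it rebuilds by hand the first five rows of heart.csv as displayed in this notebook instead of reading the file, and the names `sample`, `quantitative` and `categorical` are assumptions introduced here.

```python
import pandas as pd

columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# First five rows as displayed by df.head() in this notebook
rows = [
    [52, 1, 0, 125, 212, 0, 1, 168, 0, 1.0, 2, 2, 3, 0],
    [53, 1, 0, 140, 203, 1, 0, 155, 1, 3.1, 0, 0, 3, 0],
    [70, 1, 0, 145, 174, 0, 1, 125, 1, 2.6, 0, 0, 3, 0],
    [61, 1, 0, 148, 203, 0, 1, 161, 0, 0.0, 2, 1, 3, 0],
    [62, 0, 0, 138, 294, 1, 1, 106, 0, 1.9, 1, 3, 2, 0],
]
sample = pd.DataFrame(rows, columns=columns)

quantitative = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical = [c for c in columns if c not in quantitative]

# Cast the categorical codes so pandas treats them as labels, not magnitudes
sample = sample.astype({c: "category" for c in categorical})

print(sample[quantitative].describe().loc[["mean", "min", "max"]])
print(sample[categorical].nunique())
```

Summary statistics such as the mean are only meaningful for the quantitative group; for the categorical group, counts of distinct values are the more natural summary.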




In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

In [80]:
df = pd.read_csv("heart.csv")

In [81]:
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   52    1   0       125   212    0        1      168      0      1.0      2   2     3       0
1   53    1   0       140   203    1        0      155      1      3.1      0   0     3       0
2   70    1   0       145   174    0        1      125      1      2.6      0   0     3       0
3   61    1   0       148   203    0        1      161      0      0.0      2   1     3       0
4   62    0   0       138   294    1        1      106      0      1.9      1   3     2       0


In [82]:
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
array = df.values
X = array[:,0:13]
Y = array[:,13]

test = SelectKBest(score_func=f_classif, k=5)
fit = test.fit(X, Y)

set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)

print(features[0:5,:])

[ 56.785  86.69  238.558  20.087  10.326   1.736  18.838 222.8   242.884
 243.451 138.679 174.877 131.803]
[[  0.  168.    0.    1.    2. ]
 [  0.  155.    1.    3.1   0. ]
 [  0.  125.    1.    2.6   0. ]
 [  0.  161.    0.    0.    1. ]
 [  0.  106.    0.    1.9   3. ]]
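The printed scores are positional, so it helps to attach column names before reading off the selected features. The sketch below copies the scores from the output above and ranks them; on a fitted selector, `fit.get_support()` gives the same selection as a boolean mask. The names `feature_names`, `f_scores`, `ranked` and `top_five` are introduced here for illustration.

```python
import pandas as pd

# Feature columns in the order they appear in heart.csv (target excluded)
feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# F-scores copied from the printed output above
f_scores = [56.785, 86.69, 238.558, 20.087, 10.326, 1.736, 18.838,
            222.8, 242.884, 243.451, 138.679, 174.877, 131.803]

# Rank features from highest to lowest score and keep the first five
ranked = pd.Series(f_scores, index=feature_names).sort_values(ascending=False)
top_five = list(ranked.index[:5])

print(ranked.head(5))
print(top_five)
```

With these scores the five highest-ranked columns are oldpeak, exang, cp, thalach and ca; slope ranks sixth.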


In [83]:
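# Features kept in the cells below: cp, thalach, exang, slope, ca
# (the five highest F-scores above belong to cp, thalach, exang, oldpeak, ca)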

In [84]:
print("Number of rows in the dataset: ", df.shape[0])
print("Number of columns in the dataset: ", df.shape[1])

Number of rows in the dataset:  1025
Number of columns in the dataset:  14
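The loaded file has 1025 rows, whereas the description at the top mentions 303 instances. One possible explanation, which this notebook does not confirm, is that the CSV contains repeated rows. A minimal sketch of how that could be checked, using a small hand-made frame (`toy` is hypothetical) in place of heart.csv:

```python
import pandas as pd

# Hypothetical stand-in for the real data: rows 1 and 4 repeat earlier rows
toy = pd.DataFrame({
    "cp": [0, 0, 1, 0, 1],
    "thalach": [168, 168, 164, 125, 164],
    "target": [0, 0, 1, 0, 1],
})

# duplicated() marks every row that repeats an earlier one
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

If many rows repeat, identical rows can land in both the training and the test split, which tends to inflate scores for models such as 1-nearest-neighbour.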


In [85]:
# Keep only the selected features and the target (columns= is a keyword argument; pandas 2.x rejects a positional axis)
df.drop(columns=df.columns.difference(['cp','thalach','exang','slope','ca','target']), inplace=True)


In [86]:
df.head()

   cp  thalach  exang  slope  ca  target
0   0      168      0      2   2       0
1   0      155      1      0   0       0
2   0      125      1      0   0       0
3   0      161      0      2   1       0
4   0      106      0      1   3       0


In [87]:
df.tail()

      cp  thalach  exang  slope  ca  target
1020   1      164      1      2   0       1
1021   0      141      1      1   1       0
1022   0      118      1      1   1       0
1023   0      159      0      2   0       1
1024   0      113      0      1   1       0


In [88]:
x = df.drop(['target'],axis=1).values
y = df.iloc[:, -1].values

In [89]:
x

array([[  0, 168,   0,   2,   2],
       [  0, 155,   1,   0,   0],
       [  0, 125,   1,   0,   0],
       ...,
       [  0, 118,   1,   1,   1],
       [  0, 159,   0,   2,   0],
       [  0, 113,   0,   1,   1]], dtype=int64)

In [90]:
y

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [91]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

In [92]:
np.bincount(y_train)

array([376, 392], dtype=int64)

K-Nearest Neighbors (K-NN)

In [93]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

findingBestNeighborsKNN = KNeighborsClassifier()

param_grid = {'n_neighbors': np.arange(1, 25)}

knn_gscv = GridSearchCV(findingBestNeighborsKNN, param_grid, cv=5)

knn_gscv.fit(x, y)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])})

In [94]:
knn_gscv.best_params_

{'n_neighbors': 1}
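The search above is fitted on the full x and y, so rows that later end up in the test split also influence the choice of n_neighbors. A sketch of running the same kind of search on the training portion only is shown below. It uses synthetic data from `make_classification` as a stand-in for the five selected features, and the names `X_demo`, `y_demo` and `search` are assumptions for illustration; the value it selects says nothing about heart.csv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the five selected features
X_demo, y_demo = make_classification(n_samples=400, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# Cross-validate over k using the training rows only
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": np.arange(1, 25)}, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
# The refitted best model is scored once on rows the search never saw
test_accuracy = search.score(X_te, y_te)
print(best_k, round(test_accuracy, 3))
```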

In [95]:
knnClassifier = KNeighborsClassifier(n_neighbors = 1, metric = 'minkowski', p = 2)
knnClassifier.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=1)

In [96]:
x_test

array([[  2, 179,   1,   2,   0],
       [  1, 152,   0,   2,   2],
       [  0, 144,   1,   2,   2],
       ...,
       [  3, 131,   0,   1,   1],
       [  2, 146,   0,   1,   3],
       [  3, 144,   1,   1,   0]], dtype=int64)

In [97]:
y_pred = knnClassifier.predict(x_test)
y_pred

array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1], dtype=int64)

In [98]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
cm

array([[122,   1],
       [  1, 133]], dtype=int64)

In [100]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print("Accuracy of KNN: ", accuracy_score(y_test, y_pred) )
print("F1 Score of KNN: ", f1_score(y_test, y_pred) )
knncm = confusion_matrix(y_test, y_pred)

Accuracy of KNN:  0.9922178988326849
F1 Score of KNN:  0.9925373134328358
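Both scores above follow directly from the KNN confusion matrix. As a worked check, the sketch below recomputes them by hand, assuming sklearn's layout of true classes on rows and predicted classes on columns.

```python
import numpy as np

# KNN confusion matrix from above: [[TN, FP], [FN, TP]]
cm = np.array([[122, 1],
               [1, 133]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # correct predictions over all test rows
precision = tp / (tp + fp)               # predicted positives that are correct
recall = tp / (tp + fn)                  # actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, f1)
```

Here accuracy is 255/257 and F1 is 133/134, which agree with the values printed by `accuracy_score` and `f1_score`.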


In [101]:
data = [[0,168,0,2,2],[0,161,0,2,1],[0,164,1,2,0],[0,159,0,2,0]]

burak_test = pd.DataFrame(data, columns=['cp','thalach','exang','slope','ca'])

burak_test

   cp  thalach  exang  slope  ca
0   0      168      0      2   2
1   0      161      0      2   1
2   0      164      1      2   0
3   0      159      0      2   0


In [102]:
# Pass the raw values: the model was fitted on a NumPy array without column names
burak_pred = knnClassifier.predict(burak_test.values)
#expected result = [0,0,1,1]
burak_pred

array([0, 0, 1, 1], dtype=int64)

In [103]:
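# Save the fitted model; the with-block closes the file handle
with open('knnModel.pkl', 'wb') as f:
    pickle.dump(knnClassifier, f)

In [104]:
# Reload the model and predict one sample as a round-trip check
with open('knnModel.pkl', 'rb') as f:
    model = pickle.load(f)
print(model.predict([[0,168,0,2,2]]))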

[0]


Support Vector Machine (SVM)

In [105]:
from sklearn.svm import SVC
svmClassifier = SVC(kernel = 'rbf', random_state = 0)
svmClassifier.fit(x_train, y_train)

SVC(random_state=0)

In [106]:
y_pred= svmClassifier.predict(x_test)
y_pred

array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0], dtype=int64)

In [107]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[ 88,  35],
       [ 31, 103]], dtype=int64)

In [109]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print("Accuracy of SVM: ", accuracy_score(y_test, y_pred) )
print("F1 Score of SVM: ", f1_score(y_test, y_pred) )
svmcm = confusion_matrix(y_test, y_pred)

Accuracy of SVM:  0.7431906614785992
F1 Score of SVM:  0.7573529411764706
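An RBF kernel depends on distances between rows, so a feature with a wide numeric range such as thalach can dominate small integer codes such as cp or exang. Whether that explains the lower SVM scores here is not established by this notebook. The sketch below only shows, on synthetic data, how a scaler could be placed in front of the classifier; `X_demo`, `y_demo` and `scaled_svm` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
y_demo = rng.integers(0, 2, size=n)

# One small-range informative feature next to one large-range noisy feature,
# loosely mimicking an integer code sitting beside a heart-rate column
small = y_demo + rng.normal(scale=0.5, size=n)
large = rng.normal(loc=150, scale=25, size=n)
X_demo = np.column_stack([small, large])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# The scaler is fitted on the training rows only, then applied before the SVM
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0))
scaled_svm.fit(X_tr, y_tr)
scaled_accuracy = scaled_svm.score(X_te, y_te)
print(round(scaled_accuracy, 3))
```

Wrapping both steps in one pipeline means the same scaling is reused at prediction time, including after the model is pickled.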


In [110]:
burak_pred = svmClassifier.predict(burak_test.values)

burak_pred

array([1, 1, 1, 1], dtype=int64)

In [111]:
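# Save the fitted model; the with-block closes the file handle
with open('svmModel.pkl', 'wb') as f:
    pickle.dump(svmClassifier, f)

In [112]:
# Reload the model and predict one sample as a round-trip check
with open('svmModel.pkl', 'rb') as f:
    model = pickle.load(f)
print(model.predict([[0,168,0,2,2]]))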

[1]


Random Forest 

In [113]:
from sklearn.ensemble import RandomForestClassifier
rfClassifier = RandomForestClassifier(max_depth = 30, n_estimators = 500)
rfClassifier.fit(x_train, y_train)

RandomForestClassifier(max_depth=30, n_estimators=500)
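The forest above is built without a random_state, so its bootstrap samples, and possibly its scores, can differ between runs. A minimal sketch on synthetic data of how fixing the seed makes two fits agree; `X_demo`, `y_demo`, `forest_a` and `forest_b` are hypothetical names, and the smaller n_estimators is chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the five selected features
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Two forests with the same seed draw the same bootstrap samples and splits
forest_a = RandomForestClassifier(n_estimators=50, max_depth=30,
                                  random_state=0).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=50, max_depth=30,
                                  random_state=0).fit(X_demo, y_demo)

same_predictions = bool((forest_a.predict(X_demo) == forest_b.predict(X_demo)).all())
print(same_predictions)
```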

In [114]:
y_pred= rfClassifier.predict(x_test)
y_pred

array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1], dtype=int64)

In [115]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[121,   2],
       [  1, 133]], dtype=int64)

In [117]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print("Accuracy of Random Forest Classifier: ", accuracy_score(y_test, y_pred) )
print("F1 Score of Random Forest Classifier: ", f1_score(y_test, y_pred) )

Accuracy of Random Forest Classifier:  0.9883268482490273
F1 Score of Random Forest Classifier:  0.9888475836431226


In [118]:
burak_pred = rfClassifier.predict(burak_test.values)

burak_pred


array([0, 0, 1, 1], dtype=int64)

In [119]:
# Save the fitted model; the with-block closes the file handle
with open('rfModel.pkl', 'wb') as f:
    pickle.dump(rfClassifier, f)

In [120]:
# Load the random forest saved above (rfModel.pkl, not dtModel.pkl)
with open('rfModel.pkl', 'rb') as f:
    model = pickle.load(f)
print(model.predict([[0,168,0,2,2]]))

[0]


Decision Tree

In [121]:
from sklearn.tree import DecisionTreeClassifier
dtClassifier = DecisionTreeClassifier(criterion = 'entropy',random_state=0,max_depth = 15)
dtClassifier.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=15, random_state=0)

In [122]:
y_pred= dtClassifier.predict(x_test)
y_pred

array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1], dtype=int64)

In [123]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[122,   1],
       [  2, 132]], dtype=int64)

In [125]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print("Accuracy of Decision Tree Classifier: ", accuracy_score(y_test, y_pred) )
print("F1 Score of Decision Tree Classifier: ", f1_score(y_test, y_pred) )

Accuracy of Decision Tree Classifier:  0.9883268482490273
F1 Score of Decision Tree Classifier:  0.9887640449438201


In [126]:
burak_pred = dtClassifier.predict(burak_test.values)

burak_pred

array([0, 0, 1, 1], dtype=int64)

In [127]:
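# Save the fitted model; the with-block closes the file handle
with open('dtModel.pkl', 'wb') as f:
    pickle.dump(dtClassifier, f)

In [128]:
# Reload the model and predict one sample as a round-trip check
with open('dtModel.pkl', 'rb') as f:
    model = pickle.load(f)
print(model.predict([[0,168,0,2,2]]))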

[0]
