## Heart Risk Model

Prediction of Heart Risk failure with Classification Algorithms; 
Logistic Regression,
Support Vector Machine and
K Nearest Neighbor. Inorder to determine the algorithm with the Best erformance rate.

In [1]:
# Import necessary libraries
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC 
from sklearn import neighbors

In [2]:
# import dataset
df = pd.read_csv(r"../data/heart.csv")

In [3]:
df.head(10)

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


In [4]:
# View the data dimension
df.shape

(303, 14)

In [5]:
# Inspecting the types of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [None]:
We can deduce that, the features are all numeric, no missing values, 13 features and 303 samples

In [7]:
# check for class imbalance
df.output.value_counts()

1    165
0    138
Name: output, dtype: int64

In [None]:
We can see that the class balance is close

In [8]:
df.columns

Index(['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output'],
      dtype='object')

In [9]:
# inspect for anomalies in feature behavior

df.describe()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [3]:
# Split data into X and y

dfArr = df.values # converting our dataframe into an array
X = dfArr[:, 0:13]
y = dfArr[:, 13]
Xtrain,  Xtest, yTrain, yTest = train_test_split(X,y, test_size=0.2)

In [32]:
dfArr

array([[63.,  1.,  3., ...,  0.,  1.,  1.],
       [37.,  1.,  2., ...,  0.,  2.,  1.],
       [41.,  0.,  1., ...,  0.,  2.,  1.],
       ...,
       [68.,  1.,  0., ...,  2.,  3.,  0.],
       [57.,  1.,  0., ...,  1.,  3.,  0.],
       [57.,  0.,  1., ...,  1.,  2.,  0.]])

### Logistic Regression

In [4]:
# Seeding, Building and Training model

seedSearch= [0,2,4,6,8,12]

for seed1 in seedSearch:
    Xtrain, Xtest, yTrain, yTest = train_test_split(X,y, test_size=0.2, random_state=seed1)
    
    model = LogisticRegression()
    model.fit(Xtrain,yTrain)
    print(seed1, ":", model.score(Xtest,yTest))

0 : 0.8524590163934426
2 : 0.9016393442622951
4 : 0.9016393442622951
6 : 0.8360655737704918
8 : 0.8688524590163934
12 : 0.7868852459016393


In [5]:
LR = LogisticRegression(random_state=2)
LR = model.fit(Xtrain,yTrain)

y_pred = LR.predict(Xtest)

print(classification_report(yTest, y_pred))
print(confusion_matrix(yTest, y_pred))
LRAcc = accuracy_score(y_pred,yTest)
print('LR accuracy: {:.2f}%'.format(LRAcc*100))

              precision    recall  f1-score   support

         0.0       0.85      0.71      0.77        31
         1.0       0.74      0.87      0.80        30

    accuracy                           0.79        61
   macro avg       0.79      0.79      0.79        61
weighted avg       0.80      0.79      0.79        61

[[22  9]
 [ 4 26]]
LR accuracy: 78.69%


In [None]:
We can infer that LogisticRegression has an optimum performance of 90% with seed size of 2 or 4 respectively.

### Support Vector Machine

In [6]:
# Seeding, Building and training model
seedSearch= [0,2,4,6,8,12]

for seed2 in seedSearch:
    Xtrain, Xtest, yTrain, yTest = train_test_split(X,y, test_size=0.2, random_state=seed2)
    
    svc = SVC(kernel='linear')
    svc.fit(Xtrain, yTrain)
    print(seed2, ":", svc.score(Xtest, yTest))

0 : 0.819672131147541
2 : 0.8688524590163934
4 : 0.9180327868852459
6 : 0.819672131147541
8 : 0.8688524590163934
12 : 0.7868852459016393


In [7]:
Xtrain, Xtest, yTrain, yTest = train_test_split(X,y, test_size=0.2, random_state=4)
    
svc = SVC(kernel='linear')
svc.fit(Xtrain, yTrain)

y_pred = svc.predict(Xtest)

print(classification_report(yTest, y_pred))
print(confusion_matrix(yTest, y_pred))
SVCAcc = accuracy_score(y_pred,yTest)
print('SVC accuracy: {:.2f}%'.format(SVCAcc*100))

              precision    recall  f1-score   support

         0.0       0.92      0.88      0.90        25
         1.0       0.92      0.94      0.93        36

    accuracy                           0.92        61
   macro avg       0.92      0.91      0.91        61
weighted avg       0.92      0.92      0.92        61

[[22  3]
 [ 2 34]]
SVC accuracy: 91.80%


In [None]:
We can infer that Support Vector Machine had an optimum performance of 91% with seed size of 4

### K Nearest Neighbor (KNN)

In [39]:
# Seeding, Building and training model
seedSearch= [0,2,4,6,8,12,14,16]

for seed3 in seedSearch:
    Xtrain, Xtest, yTrain, yTest = train_test_split(X,y, test_size=0.2, random_state=seed3)
    
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(Xtrain, yTrain)
    print(seed3, ":", knn.score(Xtest, yTest))

0 : 0.639344262295082
2 : 0.7049180327868853
4 : 0.5737704918032787
6 : 0.6721311475409836
8 : 0.6885245901639344
12 : 0.6557377049180327
14 : 0.5901639344262295
16 : 0.6065573770491803


In [41]:
Xtrain, Xtest, yTrain, yTest = train_test_split(X,y, test_size=0.2, random_state=2)
knn = neighbors.KNeighborsClassifier(n_neighbors=5,)
knn.fit(Xtrain, yTrain)

y_pred = knn.predict(Xtest)
knnAcc = accuracy_score(y_pred,yTest)
print('KNN accuracy: {:.2f}%'.format(knnAcc*100))

KNN accuracy: 70.49%


In [None]:
We can infer that KNN had an optimum performance of 70% with seed size of 2.

#### Model Comparison

In [42]:
compAlg = pd.DataFrame({'Model': ['Logistic Regression',
                                  'Support Vector Machine', 'K Nearest Neighbor'],
                        'Accuracy': [LRAcc*100, SVCAcc*100, knnAcc*100]})
compAlg.sort_values(by='Accuracy', ascending=False)

Unnamed: 0,Model,Accuracy
1,Support Vector Machine,91.803279
0,Logistic Regression,90.163934
2,K Nearest Neighbor,70.491803


In [None]:
REPORT

We can infer that Logistic Regression has an optimum performance of 90% with seed size of 2 or 4 respectively.
Support Vector Machine had an optimum performance of 91% with seed size of 4.
KNN had an optimum performance of 70% with seed size of 2.

The Result shows that Support Vector Machine outperformed the other algorithms with a sample size of 303.