## BUILDING A HEART DISEASE CLASSIFIER CONTINUED

## MODELLING

We have seen that our dataset is fairly well distributed, apart from the imbalance in the sex of the patients that we noted in our hypothesis.
It is time to start building machine learning models to find out which one might reach 95% accuracy at predicting whether a person has heart disease, given his or her attributes.

I am going to use classification models as I am trying to solve a classification problem.
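
Before comparing classifiers, it helps to know what accuracy a trivial predictor would reach, since an accuracy target only means something relative to that floor. The block below is a hedged sketch rather than part of the original notebook: it scores scikit-learn's `DummyClassifier` on a synthetic stand-in dataset. The names `X_demo`, `y_demo` and `baseline_scores` are hypothetical, and `make_classification` replaces `heart-disease.csv` only so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Assumption: synthetic stand-in for the heart-disease data
# (13 features, binary target); not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Always predicts the most frequent class seen in the training fold.
# Any useful model should beat this score.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=10)
print(f"Majority-class baseline accuracy: {baseline_scores.mean():.3f}")
```

On the real data, the same call with `X` and `y` would give the floor against which the cross-validation scores can be read.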

In [2]:
import pandas as pd
import numpy as np


df = pd.read_csv("heart-disease.csv")
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### SPLITTING THE DATASET INTO X AND Y (features and outcomes)

In [3]:

feature_names = ['age', 'sex', 'cp', 'trestbps',
                 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# Features (X) and outcome (y) as NumPy arrays
X = df[feature_names].values
y = df['target'].values

X

array([[63.,  1.,  3., ...,  0.,  0.,  1.],
       [37.,  1.,  2., ...,  0.,  0.,  2.],
       [41.,  0.,  1., ...,  2.,  0.,  2.],
       ...,
       [68.,  1.,  0., ...,  1.,  2.,  3.],
       [57.,  1.,  0., ...,  1.,  1.,  3.],
       [57.,  0.,  1., ...,  1.,  1.,  2.]])

In [4]:
# Standardize the features (zero mean, unit variance) for the models that are sensitive to feature scale
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
X_scaled = scale.fit_transform(X)
X_scaled

array([[ 0.9521966 ,  0.68100522,  1.97312292, ..., -2.27457861,
        -0.71442887, -2.14887271],
       [-1.91531289,  0.68100522,  1.00257707, ..., -2.27457861,
        -0.71442887, -0.51292188],
       [-1.47415758, -1.46841752,  0.03203122, ...,  0.97635214,
        -0.71442887, -0.51292188],
       ...,
       [ 1.50364073,  0.68100522, -0.93851463, ..., -0.64911323,
         1.24459328,  1.12302895],
       [ 0.29046364,  0.68100522, -0.93851463, ..., -0.64911323,
         0.26508221,  1.12302895],
       [ 0.29046364, -1.46841752,  0.03203122, ..., -0.64911323,
         0.26508221, -0.51292188]])
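
One caveat about scaling before cross-validation: calling `fit_transform` on all of `X` lets the rows that later serve as validation folds influence the mean and variance used for scaling. The effect is often small, but it can be avoided by placing the scaler in a pipeline so that it is re-fitted on each training fold. The block below is a minimal sketch of that idea, assuming a synthetic stand-in dataset (`X_demo`, `y_demo` and `pipe_scores` are hypothetical names, not from the notebook).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the heart-disease data
# (13 features, binary target); not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# The scaler is fitted inside each training fold only, so the
# validation fold never contributes to the scaling statistics.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
print(f"Mean accuracy with in-fold scaling: {pipe_scores.mean():.3f}")
```

With the real data, the unscaled `X` would be passed to `cross_val_score` in place of `X_demo`.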

### Using cross-validation to pick out the best model
### Decision Tree Model

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

np.random.seed(42)
clf = DecisionTreeClassifier()
cv_score = cross_val_score(clf, X_scaled, y, cv=10)
cv_score.mean()

0.7649462365591397

### Random Forest Classifier

In [6]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
cv_score = cross_val_score(clf, X_scaled, y, cv=10)
cv_score.mean()

0.8249462365591398

### KNeighborsClassifier

In [7]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
cv_score = cross_val_score(clf, X_scaled, y, cv=10)
cv_score.mean()

0.8183870967741935

### LogisticRegression Model

In [8]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
cv_score = cross_val_score(clf, X_scaled, y, cv=10)
cv_score.mean()

0.8116129032258066

### XGBoost Classifier

In [9]:
from xgboost import XGBClassifier

clf = XGBClassifier(eval_metric="logloss", use_label_encoder=False)
cv_score = cross_val_score(clf, X_scaled, y, cv=10)
cv_score.mean()

0.7981720430107527

### Support Vector Machines

In [10]:
# Linear Kernel
from sklearn.svm import SVC

C=1.0
svc = SVC(kernel='linear', C=C)
cv_score = cross_val_score(svc, X_scaled, y, cv=10)
cv_score.mean()

0.8281720430107526

In [11]:
# RBF Kernel

svc = SVC(kernel='rbf', C=C)
cv_score = cross_val_score(svc, X_scaled, y, cv=10)
cv_score.mean()

0.8282795698924732

In [12]:
# Sigmoid Kernel

svc = SVC(kernel='sigmoid', C=C)
cv_score = cross_val_score(svc, X_scaled, y, cv=10)
cv_score.mean()

0.8347311827956989

In [13]:
# Poly Kernel
svc = SVC(kernel="poly", C=C)
cv_score = cross_val_score(svc, X_scaled, y, cv=10)
cv_score.mean()

0.8086021505376344

### NaiveBayes Model

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
minmax_features =  scaler.fit_transform(X)


clf = MultinomialNB()
cv_score = cross_val_score(clf, minmax_features, y, cv=10)
cv_score.mean()

0.7853763440860215
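
`MultinomialNB` is designed for count-like, non-negative features, which is why the features are rescaled with `MinMaxScaler` first. For continuous columns such as `age` or `chol`, `GaussianNB` may be a more natural choice, because it models each feature with a per-class normal distribution and needs no rescaling. The block below is only a sketch on a synthetic stand-in dataset (`X_demo`, `y_demo` and `gnb_scores` are hypothetical names), so its score says nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: synthetic stand-in for the heart-disease data
# (13 features, binary target); not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# GaussianNB accepts negative and unscaled values directly.
gnb = GaussianNB()
gnb_scores = cross_val_score(gnb, X_demo, y_demo, cv=10)
print(f"GaussianNB mean accuracy: {gnb_scores.mean():.3f}")
```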

Support vector machines are leading the race, with the linear, RBF and sigmoid kernels each reaching a mean cross-validated accuracy of roughly 83 percent (the random forest is close behind at about 82 percent). For the purpose of learning, however, I will tune the hyperparameters of all the models except the simple DecisionTree model and the NaiveBayes model.
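
Several of the mean scores above differ by less than one percentage point, so it is worth checking how much each score varies across folds before declaring a winner. The block below is a hedged sketch of such a comparison, not the notebook's own code: it runs on a synthetic stand-in dataset, leaves out XGBoost to avoid the extra dependency, and the names `models`, `summary`, `X_demo` and `y_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the heart-disease data
# (13 features, binary target); not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "k-nearest neighbors": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(),
    "SVC (rbf)": SVC(kernel="rbf", C=1.0),
}

# Mean and standard deviation of accuracy over the 10 folds.
summary = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
    summary[name] = (scores.mean(), scores.std())

# Print the models from best to worst mean accuracy.
ranked = sorted(summary.items(), key=lambda item: item[1][0], reverse=True)
for name, (mean, std) in ranked:
    print(f"{name:<20} {mean:.3f} +/- {std:.3f}")
```

If the standard deviations are larger than the gaps between the means, the ranking of the top models should be treated as tentative.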


The code for hyperparameter tuning can be found in `hyperparameter-tunning-heart-disease-classificication.ipynb`.
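
For readers who do not have that notebook at hand, the block below gives a rough idea of what a tuning step can look like. It is a sketch under stated assumptions and does not reproduce the notebook's contents: it uses scikit-learn's `GridSearchCV` on a synthetic stand-in dataset, and the parameter grid, `X_demo`, `y_demo` and `search` are hypothetical choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the heart-disease data
# (13 features, binary target); not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Scaling sits inside the pipeline so it is re-fitted on every training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Hypothetical grid; the values are illustrative, not tuned recommendations.
param_grid = {
    "svc__C": [0.1, 1.0, 10.0],
    "svc__kernel": ["linear", "rbf"],
}

# Tries every combination in the grid with 5-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The step names in the pipeline (`scale`, `svc`) determine the `svc__` prefix used in the grid keys.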