# **Heart Disease Detection System Using Decision Tree Classifier**


> This is a post-workshop assignment for Google Developer Student Club ITB <br>
Rizky Ramadhana P. K.<br>
16520285<br>
Institut Teknologi Bandung<br>
8 November 2020<br>
Dataset Source : https://archive.ics.uci.edu/ml/datasets/heart+Disease

Every year, 17.9 million people die from heart disease, an estimated 31% of all deaths worldwide. Furthermore, 85% of these deaths are caused by heart attacks and strokes, which are often called 'silent killers'*. Hence, an early detection system for heart disease is important for reducing the number of deaths. <br>
Here, we will take data from heart disease patients and process it with machine learning algorithms. The main purpose is to create a model that can predict whether someone is at risk of heart disease based on several features. Then, we deploy our trained machine learning model to a website so that everybody can use it.<br><br>
*cited from [WHO's article on cardiovascular disease](https://www.who.int/health-topics/cardiovascular-diseases/) <br><br>
## Data Exploration And Data Preprocessing
We will use a dataset from [University of California Irvine's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/heart+Disease), which is often used by researchers across the globe. Without further ado, let's go to the first step: exploring the data. 



In [None]:
import pandas as pd
data = pd.read_csv('processed.cleveland.csv', header = None)
data.columns = ['age', 'is_male', 'chestpain', 'restbps', 'chol', 'fbs', 'restecg', 'mhr', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
#Column names are listed on the dataset's page. The features we use are explained in detail below
data

When we observe the data, we find that some rows contain invalid values: they use '?' instead of a number. I chose to remove those rows, since the dataset is large enough that dropping them would not distort it. Then, we convert all values in the data to numbers.

In [None]:
for n in data.columns:            #Remove rows with '?'
    data = data[~(data[n]=='?')]

data = data.apply(pd.to_numeric)  #Convert to numbers
data['target'].value_counts()
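
An alternative to filtering '?' column by column is to let pandas treat it as a missing value while reading the file. The sketch below assumes that approach. The three sample rows are made up for illustration and only stand in for the real CSV file.

```python
import io
import pandas as pd

columns = ['age', 'is_male', 'chestpain', 'restbps', 'chol', 'fbs', 'restecg',
           'mhr', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# Made-up sample in the same 14-column layout; two rows contain a '?'
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
)

# na_values='?' turns every '?' into NaN, so the columns are parsed as numbers
raw = pd.read_csv(sample, header=None, names=columns, na_values='?')

# dropna() removes every row that still has a missing value
clean = raw.dropna()
print(len(raw), len(clean))
```

With this variant no separate `pd.to_numeric` pass is needed, because the columns never contain strings.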

The column 'target' holds an integer from 0 (no heart disease) to 4. The output shows that 'target' is imbalanced: the values do not occur in equal numbers.


    Quantity for each value in 'target'
    0    160
    1     54
    3     35
    2     35
    4     13

So, I decided to group the data into only two targets: zero for absence of heart disease and one for presence.<br>
I also decided to use only some of the columns, since I want this model to work for everyone without requiring fancy blood test results. Here are the columns that we will use to build the model:


1. Age
2. Sex
3. Presence of Angina (chest pain)*. Here are the values :

  * 1 for typical angina
  * 2 for atypical angina
  * 3 for non-anginal pain
  * 4 for asymptomatic <br>
  A pain is categorized as typical angina if it meets these three criteria :
  * Substernal chest discomfort of characteristic quality and duration
  * Provoked by exertion or emotional stress
  * Relieved by rest and/or nitroglycerine <br>
  An atypical angina meets two of the three criteria. A non-anginal pain meets only one criterion. Asymptomatic means the user experiences none of the criteria.

4. Resting Blood Pressure (in mmHg)
5. Serum Cholesterol Level (in mg/dL)
6. Maximum Heart Rate (the 'mhr' column, which the code below also keeps)
7. Presence of Exercise Induced Angina <br>
Has a value of one if the user experiences chest pain during strenuous activity. <br><br>
*cited from [TextBookofCardiology.org](https://www.textbookofcardiology.org/wiki/Chest_Pain_/_Angina_Pectoris)
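
The chest pain rule above can be written as a small helper, which may be handy when the website asks a user three yes/no questions instead of asking for a code. This is only a sketch: the function name `classify_chest_pain` and its parameter names are hypothetical and are not part of the dataset or of any library.

```python
def classify_chest_pain(substernal: bool, provoked_by_exertion: bool,
                        relieved_by_rest: bool) -> int:
    """Map the three angina criteria to the chest pain codes 1 to 4 listed above."""
    criteria_met = sum([substernal, provoked_by_exertion, relieved_by_rest])
    # 3 criteria -> 1 (typical), 2 -> 2 (atypical),
    # 1 -> 3 (non-anginal), 0 -> 4 (asymptomatic)
    return 4 - criteria_met


print(classify_chest_pain(True, True, True))     # typical angina
print(classify_chest_pain(True, False, True))    # atypical angina
print(classify_chest_pain(False, False, False))  # asymptomatic
```

The subtraction works because the codes decrease by one for every additional criterion that is met.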





In [None]:
data.loc[data['target'] > 0, 'target'] = 1      #Group the data into only two targets: presence (1) and absence (0) of heart disease
data.drop(columns = ['fbs', 'restecg', 'oldpeak', 'slope', 'ca', 'thal'], inplace = True)      #Remove the columns that we will not use

data['is_male'].value_counts()
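
The effect of the grouping step can be checked against the class counts reported earlier. The sketch below rebuilds a stand-in 'target' column from those counts (it does not read the real file) and applies the same "anything above zero becomes one" rule.

```python
import numpy as np
import pandas as pd

# Class counts reported above for the original 'target' column
counts = {0: 160, 1: 54, 2: 35, 3: 35, 4: 13}
target = pd.Series(np.repeat(list(counts.keys()), list(counts.values())),
                   name='target')

# Any value above zero becomes 1 (presence); zero stays 0 (absence)
binary = (target > 0).astype(int)
print(binary.value_counts().sort_index())
```

After grouping there are 160 rows without heart disease and 54 + 35 + 35 + 13 = 137 rows with it, so the two classes are much closer in size than the original five.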

Next, we observe the column 'is_male', which has a value of one if the patient is male and zero if the patient is female. If we look closely, the numbers of male and female patients are not balanced. <br>





    Numbers of male and female
    1.0    201
    0.0     96

Hence, I decided to upsample the female data to give our machine learning model a balanced dataset.

In [None]:
from sklearn.utils import resample                  #Upsample the female data
female = data[data['is_male']==0]
male = data[data['is_male']==1]
upsampled = resample(female, replace=True, n_samples = len(male), random_state = 77)   #Draw female rows with replacement until both groups have 201 rows
data = pd.concat([upsampled, male])

x = data.drop(columns = 'target')   #Features of each patient (the seven remaining columns)
y = data['target']                  #Whether each patient has heart disease or not (target)
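
To see what the upsampling does, here is a minimal sketch on a toy frame that has the same sex counts as reported above (201 male, 96 female) but no other columns. The toy frame is an assumption made only so the example runs on its own.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame with the sex counts reported above: 201 male rows, 96 female rows
toy = pd.DataFrame({'is_male': [1.0] * 201 + [0.0] * 96})
female = toy[toy['is_male'] == 0]
male = toy[toy['is_male'] == 1]

# Sampling with replacement draws 201 rows from the 96 originals, so some rows repeat
upsampled = resample(female, replace=True, n_samples=len(male), random_state=77)
balanced = pd.concat([upsampled, male])

print(balanced['is_male'].value_counts())
print(upsampled.index.nunique())   # number of distinct female rows that were drawn
```

Both groups end up with 201 rows, but no new information is created: every added female row is a copy of an existing one.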

Here, we have finished exploring and preprocessing our data. We hope that this clean data will train our model well. Below is how our data looks after being cleaned.


          age  is_male  chestpain  restbps   chol    mhr  exang  target
    0    63.0      1.0        1.0    145.0  233.0  150.0    0.0       0
    1    67.0      1.0        4.0    160.0  286.0  108.0    1.0       1
    2    67.0      1.0        4.0    120.0  229.0  129.0    1.0       1
    3    37.0      1.0        3.0    130.0  250.0  187.0    0.0       0
    4    41.0      0.0        2.0    130.0  204.0  172.0    0.0       0
    ..    ...      ...        ...      ...    ...    ...    ...     ...
    297  57.0      0.0        4.0    140.0  241.0  123.0    1.0       1
    298  45.0      1.0        1.0    110.0  264.0  132.0    0.0       1
    299  68.0      1.0        4.0    144.0  193.0  141.0    0.0       1
    300  57.0      1.0        4.0    130.0  131.0  115.0    1.0       1
    301  57.0      0.0        2.0    130.0  236.0  174.0    0.0       1    



## Choosing The Right Model
We will try Decision Tree, K-Nearest Neighbors, Random Forest, Naive Bayes, and Logistic Regression, then choose the model with the highest accuracy to deploy later. We use 5-fold cross-validation to calculate accuracy. For Decision Tree, K-Nearest Neighbors, and Random Forest, we also tune a hyperparameter using GridSearchCV. The hyperparameters to be tuned are stored in the variable 'parameter'.

In [None]:
#Defining the models
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
dtree = DecisionTreeClassifier(max_depth = 2)
knn = KNeighborsClassifier(n_neighbors = 2)
gnb = GaussianNB()
randomforest = RandomForestClassifier(max_depth = 2)
log = LogisticRegression(max_iter = 1000)

parameter = {                                              #Hyperparameter that will be tuned
    dtree : {'max_depth' : range(1,31)},
    knn : {'n_neighbors' : range(1,31)},
    randomforest : {'max_depth' : range(1,31)}
}

for i in [dtree, knn, randomforest]:                      #Finding the best hyperparameter for each model
    gridsearch = GridSearchCV(i, param_grid = parameter[i], scoring = 'accuracy', cv = 5)
    gridsearch.fit(x,y)
    print(gridsearch.best_score_)                       #Print the best accuracy for each model
    print(gridsearch.best_params_)                      #Print hyperparameter that yields best accuracy

print(cross_val_score(gnb, x, y, scoring = 'accuracy', cv = 5).mean())   #Accuracy for Naive Bayes model (printed, since a cell only displays its last expression)
print(cross_val_score(log, x, y, scoring = 'accuracy', cv = 5).mean())   #Accuracy for Logistic Regression model
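
One caveat may be worth checking here. Because the female rows were duplicated before cross-validation, a copy of the same patient can land in both the training fold and the validation fold, which may make the accuracy look better than it is. The sketch below shows one possible way to avoid that, by upsampling only inside each training fold. It runs on random stand-in data, and the helper name `upsample_minority_sex` is hypothetical, so the printed score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Random stand-in data; column names mirror the notebook, values are synthetic
rng = np.random.default_rng(0)
n = 300
frame = pd.DataFrame({
    'age': rng.integers(30, 78, n),
    'is_male': (rng.random(n) < 0.68).astype(int),
    'restbps': rng.integers(95, 200, n),
})
frame['target'] = (frame['age'] + rng.normal(0, 10, n) > 55).astype(int)


def upsample_minority_sex(train):
    """Upsample the smaller sex group of one training fold to the larger group's size."""
    female = train[train['is_male'] == 0]
    male = train[train['is_male'] == 1]
    small, large = (female, male) if len(female) < len(male) else (male, female)
    boosted = resample(small, replace=True, n_samples=len(large), random_state=77)
    return pd.concat([boosted, large])


scores = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(frame, frame['target']):
    train = upsample_minority_sex(frame.iloc[train_idx])
    test = frame.iloc[test_idx]   # left untouched, so no duplicated row is scored
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(train.drop(columns='target'), train['target'])
    scores.append(model.score(test.drop(columns='target'), test['target']))

print(np.mean(scores))
```

Comparing a score computed this way with the one from the cell above would show how much the duplicated rows contribute to the reported accuracy.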


When I ran the code, the best model was the Decision Tree Classifier with hyperparameter 'max_depth' = 11, which yielded an accuracy of 81 %. Therefore, we will use this model and save it to a .pkl file to be deployed on the website later.

In [None]:
import pickle
model = DecisionTreeClassifier(max_depth = 11)
model.fit(x,y)
with open('model.pkl', 'wb') as f:      #The with block closes the file once the model is written
    pickle.dump(model, f)
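
On the website side, the saved file has to be loaded back and given the seven features in the same column order as during training. Here is a minimal sketch of that round trip. It trains on four rows copied from the cleaned table above only so that it runs on its own, writes to a temporary folder instead of the real 'model.pkl', and uses a made-up patient.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

features = ['age', 'is_male', 'chestpain', 'restbps', 'chol', 'mhr', 'exang']

# Four rows taken from the cleaned table above (rows 0, 1, 4 and 300)
train = pd.DataFrame(
    [[63, 1, 1, 145, 233, 150, 0],
     [67, 1, 4, 160, 286, 108, 1],
     [41, 0, 2, 130, 204, 172, 0],
     [57, 1, 4, 130, 131, 115, 1]],
    columns=features,
)
labels = [0, 1, 0, 1]
model = DecisionTreeClassifier(max_depth=11, random_state=0).fit(train, labels)

# Save to a temporary location, then load the model back as the website would
path = Path(tempfile.mkdtemp()) / 'model.pkl'
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# Made-up patient, columns in the same order as during training
patient = pd.DataFrame([[54, 0, 3, 135, 240, 160, 0]], columns=features)
print(loaded.predict(patient)[0])   # 0 = absence, 1 = presence
```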

## Conclusion


1.   An accuracy above 80 % suggests that there really is a relationship between the features we chose and the presence of heart disease.
2.   Since the accuracy is less than 90 %, I suggest that this model not be used for making medical decisions.
3.   However, we can still use this model for early detection of heart disease, with further examination done by a doctor.

