<a href="https://colab.research.google.com/github/abuwildanm/Heart-Disease/blob/master/Heart_Disease_Copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Heart Disease Diagnosis

## Introduction
---



## Import Libraries and Tools

In [1]:
# Import standard library
import numpy as np
import pandas as pd

# Import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.neural_network import MLPClassifier

# Import visualization library
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("ggplot")
%matplotlib inline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import cufflinks as cf
cf.go_offline()

import IPython
# Plotly Configuration
# Ref : https://stackoverflow.com/questions/47230817/plotly-notebook-mode-with-google-colaboratory
def configure_plotly_browser_state():
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
IPython.get_ipython().events.register('pre_run_cell', configure_plotly_browser_state)

## Dataset Description
Taken from Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci

---

### Attribute Information:

1. **age**: the patient's age in years
2. **sex**: the patient's gender (1=male; 0=female)
3. **cp**: chest pain type (4 values)
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
4. **trestbps**: the patient's resting blood pressure (mm Hg on admission to the hospital)
5. **chol**: the patient's cholesterol measurement in mg/dl
6. **fbs**: the patient's fasting blood sugar (> 120 mg/dl, 1=true; 0=false)
7. **restecg**: resting electrocardiographic results (3 values)
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. **thalach**: the patient's maximum heart rate achieved
9. **exang**: exercise induced angina (1=yes; 0=no)
10. **oldpeak**: ST depression induced by exercise relative to rest
11. **slope**: the slope of the peak exercise ST segment (3 values)
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
12. **ca**: number of major vessels (0-3) colored by fluoroscopy
13. **thal**: a blood disorder called thalassemia (3=normal; 6=fixed defect; 7=reversible defect)
14. **target**: heart disease (0=no; 1=yes)
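
The codes documented above do not necessarily match the codes stored in the CSV: in the rows printed in this notebook, `cp` and `slope` appear to start at 0 and `thal` takes small integers rather than 3/6/7. A minimal sketch of how such mismatches could be listed is shown below. The helper name `undocumented_codes` and the three-column sample are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Code sets as documented in the attribute list (taken as given).
documented = {
    "cp": {1, 2, 3, 4},
    "slope": {1, 2, 3},
    "thal": {3, 6, 7},
}

# Five rows copied from the data.head() output of this notebook.
sample = pd.DataFrame({
    "cp": [3, 2, 1, 1, 0],
    "slope": [0, 0, 2, 2, 2],
    "thal": [1, 2, 2, 2, 2],
})

def undocumented_codes(frame, documented_codes):
    """Return, per column, the observed values the documentation does not list."""
    return {
        col: sorted(int(v) for v in set(frame[col].unique()) - allowed)
        for col, allowed in documented_codes.items()
    }

mismatches = undocumented_codes(sample, documented)
print(mismatches)
```

Running the same helper on the full `data` frame would show whether the relabelling done later in the notebook covers every observed code.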

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/abuwildanm/Heart-Disease/master/heart.csv')
df = data.copy()
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
feature_discrete = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
feature_continuous = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [6]:
data.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


### Hypothesis
---

To avoid [HARKing](https://journals.sagepub.com/doi/abs/10.1207/s15327957pspr0203_4) (or Hypothesizing After the Results are Known) let's take a look at online guides on how heart disease is diagnosed, and look up some of the terms above.

**Diagnosis**: Heart disease is diagnosed from a combination of clinical signs and test results. The physician chooses which tests to run based on the suspected condition [[1](https://www.mayoclinic.org/diseases-conditions/heart-disease/diagnosis-treatment/drc-20353124)], ranging from electrocardiograms and cardiac computerized tomography (CT) scans to blood tests and exercise stress tests [[2](https://www.heartfoundation.org.au/your-heart/living-with-heart-disease/medical-tests)].

Sources on heart disease risk factors list the following: **high cholesterol, high blood pressure, diabetes, weight, family history and smoking** [[3](https://www.bhf.org.uk/informationsupport/risk-factors)]. According to another source [[4](https://www.heart.org/en/health-topics/heart-attack/understand-your-risks-to-prevent-a-heart-attack)], the major factors that can't be changed are: **increasing age, male gender and heredity**. Note that thalassemia, one of the variables in this dataset, is hereditary. Major factors that can be modified are: **smoking, high cholesterol, high blood pressure, physical inactivity, being overweight and having diabetes**. Other factors include **stress, alcohol and poor diet/nutrition**.

We found no reference to the 'number of major vessels', but given that heart disease is defined as **"...what happens when your heart's blood supply is blocked or interrupted by a build-up of fatty substances in the coronary arteries"**, it seems logical that having more major vessels is a good thing and will therefore reduce the probability of heart disease.

Given the above, we hypothesize that, if the model has some predictive ability, these factors will stand out as the most important.

## Exploratory Data Analysis

### Data Manipulation

In [7]:
data.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 
                'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 
                'max_heart_rate_achieved', 'exercise_induced_angina', 
                'st_depression', 'st_slope', 'num_major_vessels', 
                'thalassemia', 'target']

col_discrete = ['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg', 
                'exercise_induced_angina', 'st_slope', 'num_major_vessels', 
                'thalassemia']
col_continuous = ['age', 'resting_blood_pressure', 'cholesterol', 
                  'max_heart_rate_achieved', 'st_depression']

In [8]:
# Drop rows where thalassemia is coded 0; reset_index returns a new frame,
# which avoids modifying a slice of the original in place
data = data[data['thalassemia'] > 0].reset_index(drop=True)
df = df[df['thal'] > 0].reset_index(drop=True)

In [9]:
# Replace numeric codes with readable labels, one whole column at a time
# (chained indexing such as data['sex'][mask] = ... may not modify data)
labels = {
    'sex': {0: 'female', 1: 'male'},
    'chest_pain_type': {0: 'typical angina', 1: 'atypical angina',
                        2: 'non-anginal pain', 3: 'asymptomatic'},
    'fasting_blood_sugar': {0: 'lower than 120mg/dl',
                            1: 'greater than 120mg/dl'},
    'rest_ecg': {0: 'normal', 1: 'ST-T wave abnormality',
                 2: 'left ventricular hypertrophy'},
    'exercise_induced_angina': {0: 'no', 1: 'yes'},
    'st_slope': {0: 'upsloping', 1: 'flat', 2: 'downsloping'},
    'thalassemia': {1: 'normal', 2: 'fixed defect', 3: 'reversible defect'},
}
for column, mapping in labels.items():
    data[column] = data[column].replace(mapping)

In [10]:
# Treat the discrete features as categorical so that get_dummies encodes them later
df[feature_discrete] = df[feature_discrete].astype('object')

### Data Distribution

In [11]:
data[col_discrete].iplot(kind='hist', subplots=True, title='Discrete Attribute Distribution')

In [12]:
data[col_continuous].iplot(kind='hist', subplots=True, title='Continuous Attribute Distribution')

In [13]:
data['target'].iplot(kind='hist', subplots=True, title='Target Distribution')

### Discrete Attribute

In [14]:
def plot_discrete(variable, term):
    cross_table = pd.crosstab(data['target'], data[variable])

    # Visualisation
    x = cross_table.index
    data_bar = []
    for col in cross_table.columns:
        bar = go.Bar(name=col, x=x, y=cross_table[col])
        data_bar.append(bar)

    fig = go.Figure(data=data_bar)
    fig.update_layout(barmode='group', 
                    title='The number of heart disease sufferers in terms of ' + term, 
                    xaxis_title='Diagnosis of Heart Disease (0=no; 1=yes)', 
                    yaxis_title='Frequency')
    fig.show()

In [15]:
plot_discrete('sex', 'Gender')

In [16]:
plot_discrete('chest_pain_type', 'Chest Pain Type')

In [17]:
plot_discrete('fasting_blood_sugar', 'Fasting Blood Sugar')

In [18]:
plot_discrete('rest_ecg', 'Resting ECG Results')

In [19]:
plot_discrete('exercise_induced_angina', 'Exercise Induced Angina')

In [20]:
plot_discrete('st_slope', 'the slope of the peak exercise ST segment')

In [21]:
plot_discrete('num_major_vessels', 'Number of Major Vessels')

In [22]:
plot_discrete('thalassemia', 'Thalassemia')

### Continuous Attribute

In [23]:
trace_splom = go.Splom(dimensions=[dict(label='age', values=data['age']),
                                   dict(label='resting_blood_pressure', values=data['resting_blood_pressure']),
                                   dict(label='cholesterol', values=data['cholesterol']),
                                   dict(label='max_heart_rate_achieved', values=data['max_heart_rate_achieved']),
                                   dict(label='st_depression', values=data['st_depression'])],
                      text=data['target'],
                      marker=dict(color=data['target'],
                                  colorscale='Bluered', 
                                  showscale=False, 
                                  line_color='white', 
                                  line_width=0.5)
                      )
fig = go.Figure(data=trace_splom)
fig.update_layout(title='Continuous Attribute Scatter', width=1000, height=1000)
fig.show()

### Attribute Correlation

In [24]:
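# Pearson product-moment correlation, computed on the numeric columns only
# (the relabelled categorical columns now hold strings)
corr = data.select_dtypes('number').corr()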

hm = go.Heatmap(z=corr, x=corr.index, y=corr.columns, colorscale='RdBu', reversescale=True)
layout = go.Layout(title='Correlation Heatmap')
fig = go.Figure(data=hm, layout=layout)
fig.show()
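
The heatmap shows every pairwise correlation, but for the hypothesis the column of interest is the one against `target`. As a rough sketch, the snippet below ranks numeric features by the absolute value of their Pearson correlation with the target. It runs on synthetic data, and `rank_by_target_correlation` and the `toy` column names are hypothetical; applying it to `data` is left as an assumption about how the notebook might be extended.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic stand-in for the notebook's data frame.
toy = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),  # closely tied to the target
    "weak": target + rng.normal(0, 3.0, size=n),    # mostly noise
    "noise": rng.normal(0, 1.0, size=n),            # unrelated to the target
    "target": target,
})

def rank_by_target_correlation(frame, target_col="target"):
    """Sort numeric features by |Pearson r| with the target, keeping the sign."""
    corr = frame.select_dtypes("number").corr()[target_col].drop(target_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranking = rank_by_target_correlation(toy)
print(ranking)
```

Pearson correlation only captures linear association, so a low rank here does not rule out a feature being useful to a non-linear model.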

## Preprocessing

### Duplication Check

In [25]:
df[df.duplicated(keep=False)]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
162,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1
163,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1


In [26]:
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)

### Missing Values Check

In [27]:
df.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Outlier Detection

In [28]:
box = [go.Box(y=df[col], name=col) for col in df]
layout = go.Layout(title='Outlier Detection')
fig = go.Figure(data=box, layout=layout)
fig.show()

In [29]:
# Q1 = df.quantile(0.25)
# Q3 = df.quantile(0.75)
# IQR = Q3 - Q1

# lowerBound = Q1 - 1.5*IQR
# upperBound = Q3 + 1.5*IQR

# original = df.copy()
# df = df[~((df.iloc[:, :-1] < lowerBound) | (df.iloc[:, :-1] > upperBound)).any(axis=1)]
# df.reset_index(drop=True, inplace=True)

# print('Data with outliers          : ', original.shape)
# print('Data after outlier removal  : ', df.shape)
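
The disabled cell above describes the usual 1.5 × IQR rule. A small self-contained sketch of that rule follows, restricted to an explicit list of columns, since quantiles are only meaningful for the continuous attributes. The function name `remove_iqr_outliers`, the parameter `k` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def remove_iqr_outliers(frame, columns, k=1.5):
    """Drop rows where any listed column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outside = (frame[columns] < lower) | (frame[columns] > upper)
    return frame[~outside.any(axis=1)].reset_index(drop=True)

# Toy frame: one cholesterol value is far above the rest.
toy = pd.DataFrame({
    "chol": [200, 210, 220, 230, 240, 250, 564],
    "age": [40, 45, 50, 55, 60, 65, 70],
})

cleaned = remove_iqr_outliers(toy, ["chol"])
print("With outliers    :", toy.shape)
print("Outliers removed :", cleaned.shape)
```

With only around 300 rows in this dataset, removing outliers also removes training signal, which may be why the original cell was left disabled.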

### Feature Selection

### Feature Scaling

In [30]:
df = pd.get_dummies(df)
target = df['target'].copy()
df.drop(columns='target', inplace=True)
df['target'] = target
df.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,sex_0,sex_1,cp_0,cp_1,cp_2,cp_3,fbs_0,fbs_1,restecg_0,restecg_1,restecg_2,exang_0,exang_1,slope_0,slope_1,slope_2,ca_0,ca_1,ca_2,ca_3,ca_4,thal_1,thal_2,thal_3,target
0,63,145,233,150,2.3,0,1,0,0,0,1,0,1,1,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1
1,37,130,250,187,3.5,0,1,0,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,1,0,1
2,41,130,204,172,1.4,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,1,0,1
3,56,120,236,178,0.8,0,1,0,1,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,0,0,0,1,0,1
4,57,120,354,163,0.6,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1


In [31]:
mms = MinMaxScaler()
mms_df = mms.fit_transform(df[feature_continuous])
mms_df = pd.DataFrame(data=mms_df, columns=df[feature_continuous].columns)

In [32]:
mms_df.iplot(kind='hist', subplots=True, title='Min-Max Distribution')

In [33]:
ss = StandardScaler()
ss_df = ss.fit_transform(df[feature_continuous])
ss_df = pd.DataFrame(ss_df, columns=df[feature_continuous].columns)

In [34]:
ss_df.iplot(kind='hist', subplots=True, title='Z-Transform Distribution')

In [35]:
# df[feature_continuous] = mms_df.copy()
df[feature_continuous] = ss_df.copy()

## Split the data

In [36]:
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1], test_size=0.3, stratify=df.iloc[:, -1])
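
Note that the scalers above were fitted on the full data frame before this split, so the test rows influenced the scaling statistics. One common way to avoid that is to put the scaler inside a scikit-learn pipeline, so that it is refitted on the training part of every fold. The sketch below shows the idea on synthetic data from `make_classification`. It scales every column for simplicity, whereas the notebook scales only the continuous ones.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart disease features (an assumption).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# The scaler is fitted on the training folds only, then applied to the held-out fold.
model = make_pipeline(StandardScaler(), SVC())
scores = cross_validate(model, X, y, cv=10, return_train_score=True)

mean_test = scores["test_score"].mean()
mean_train = scores["train_score"].mean()
print(round(mean_train, 3), round(mean_test, 3))
```

Passing `random_state` to `train_test_split` would similarly make the single hold-out scores below repeatable between runs.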

## Build Models

### Support Vector Machine

In [37]:
svc = SVC()
svc.fit(x_train, y_train)
print('Train Score : ', svc.score(x_train, y_train))
print('Test Score : ', svc.score(x_test, y_test))

Train Score :  0.9523809523809523
Test Score :  0.8111111111111111


In [38]:
cvSVC = pd.DataFrame(cross_validate(svc, df.iloc[:, :-1], df.iloc[:, -1], cv=10, return_train_score=True))
cvSVC

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.008703,0.001648,0.9,0.922222
1,0.00562,0.001539,0.833333,0.918519
2,0.005498,0.001509,0.866667,0.925926
3,0.005451,0.001568,0.966667,0.92963
4,0.005562,0.001509,0.833333,0.925926
5,0.005313,0.0017,0.8,0.933333
6,0.005888,0.001561,0.866667,0.92963
7,0.00601,0.001549,0.833333,0.92963
8,0.005329,0.001637,0.766667,0.944444
9,0.006259,0.001493,0.766667,0.937037


In [39]:
cvSVC.mean()

fit_time       0.005963
score_time     0.001571
test_score     0.843333
train_score    0.929630
dtype: float64
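
`GridSearchCV` is imported at the top of the notebook but not used. A possible next step, sketched here on synthetic data, is to search over the SVC regularisation strength `C` and kernel width `gamma`. The grid values are arbitrary starting points chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the heart disease features (an assumption).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# Candidate hyperparameters; every combination is scored with 5-fold CV.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
    "kernel": ["rbf"],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The best cross-validated score is optimistic for the chosen setting, so a final check on data not used in the search is still needed.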

### Random Forest

In [40]:
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
print('Train Score : ', rf.score(x_train, y_train))
print('Test Score : ', rf.score(x_test, y_test))

Train Score :  1.0
Test Score :  0.7777777777777778


In [41]:
cvRF = pd.DataFrame(cross_validate(rf, df.iloc[:, :-1], df.iloc[:, -1], cv=10, return_train_score=True))
cvRF

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.162819,0.012153,0.9,1.0
1,0.153963,0.011491,0.766667,1.0
2,0.153641,0.011653,0.866667,1.0
3,0.152365,0.011723,0.933333,1.0
4,0.155862,0.012966,0.833333,1.0
5,0.155094,0.011668,0.833333,1.0
6,0.153904,0.011952,0.733333,1.0
7,0.154502,0.020263,0.866667,1.0
8,0.155096,0.011955,0.766667,1.0
9,0.151886,0.011892,0.8,1.0


In [42]:
cvRF.mean()

fit_time       0.154913
score_time     0.012772
test_score     0.830000
train_score    1.000000
dtype: float64
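
The hypothesis section expects certain risk factors to stand out as the most important. A fitted random forest exposes impurity-based importances through `feature_importances_`, which gives one way to check that. The sketch below uses synthetic data in which only the first three columns carry signal, and the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative features are the first three columns.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances are normalised to sum to 1 across features.
importances = pd.Series(rf.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the one-hot encoded frame used in this notebook, the importance of a categorical attribute is split across its dummy columns, so the values would need to be summed per original attribute before comparing them with the hypothesis.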

### Multi Layer Perceptron

In [43]:
mlp = MLPClassifier()
mlp.fit(x_train, y_train)
print('Train Score : ', mlp.score(x_train, y_train))
print('Test Score : ', mlp.score(x_test, y_test))

Train Score :  0.9476190476190476
Test Score :  0.8111111111111111



Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.



In [44]:
cvMLP = pd.DataFrame(cross_validate(mlp, df.iloc[:, :-1], df.iloc[:, -1], cv=10, return_train_score=True))
cvMLP


Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.





Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.508048,0.002481,0.9,0.959259
1,0.545923,0.002451,0.833333,0.959259
2,0.49547,0.002463,0.833333,0.955556
3,0.520412,0.002545,0.866667,0.948148
4,0.497756,0.002483,0.833333,0.959259
5,0.490587,0.002501,0.866667,0.948148
6,0.499081,0.002459,0.8,0.962963
7,0.501263,0.002398,0.833333,0.962963
8,0.485117,0.002424,0.766667,0.962963
9,0.513563,0.002409,0.733333,0.944444


In [45]:
cvMLP.mean()

fit_time       0.505722
score_time     0.002462
test_score     0.826667
train_score    0.956296
dtype: float64
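
The convergence warnings above indicate that the default limit of 200 iterations was reached before the optimizer settled. A simple thing to try is a higher `max_iter`. The sketch below does so on synthetic data and records whether a `ConvergenceWarning` is still raised. The value 2000 is an arbitrary assumption.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the heart disease features (an assumption).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

mlp = MLPClassifier(max_iter=2000, random_state=0)

# Capture warnings so that we can tell whether training converged.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mlp.fit(X, y)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations used:", mlp.n_iter_, "| converged:", converged)
```

Training longer can also widen the gap between train and test scores, so the cross-validated test score remains the number to watch.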

### K-Nearest Neighbors

In [46]:
from sklearn.neighbors import KNeighborsClassifier

# k-nearest neighbours classifier with k = 13
KNN = KNeighborsClassifier(n_neighbors=13)
KNN.fit(x_train, y_train)
print('Train Score : ', KNN.score(x_train, y_train))
print('Test Score : ', KNN.score(x_test, y_test))

Train Score :  0.861904761904762
Test Score :  0.8111111111111111


In [47]:
cvKNN = pd.DataFrame(cross_validate(KNN, df.iloc[:, :-1], df.iloc[:, -1], cv=10, return_train_score=True))
cvKNN

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.004455,0.005005,0.9,0.855556
1,0.003231,0.003502,0.8,0.862963
2,0.002859,0.003267,0.833333,0.851852
3,0.003166,0.003855,0.966667,0.844444
4,0.003103,0.003562,0.9,0.848148
5,0.003048,0.003313,0.733333,0.866667
6,0.003028,0.003304,0.866667,0.855556
7,0.003087,0.00445,0.866667,0.851852
8,0.003166,0.003299,0.766667,0.862963
9,0.002981,0.003249,0.8,0.855556


In [48]:
cvKNN.mean()

fit_time       0.003213
score_time     0.003681
test_score     0.843333
train_score    0.855556
dtype: float64
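
The notebook fixes the number of neighbours at 13 without showing how it was chosen. One way such a value could be selected is to compare the mean cross-validated accuracy over a range of odd `k` values, as sketched below on synthetic data. The candidate range is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart disease features (an assumption).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# Odd values avoid tied votes in binary classification.
candidate_k = range(1, 26, 2)
mean_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    for k in candidate_k
}

best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```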

## Model Evaluation
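
As a starting point for this section, the mean 10-fold scores reported in the `cvSVC`, `cvRF`, `cvMLP` and `cvKNN` outputs of this notebook can be collected into one table. The sketch below copies those numbers by hand and adds the train/test gap as a rough indicator of overfitting. The column name `gap` is introduced here for illustration.

```python
import pandas as pd

# Mean cross-validation scores copied from the outputs of this notebook.
summary = pd.DataFrame(
    {
        "test_score": [0.843333, 0.830000, 0.826667, 0.843333],
        "train_score": [0.929630, 1.000000, 0.956296, 0.855556],
    },
    index=["SVC", "Random Forest", "MLP", "KNN (k=13)"],
)

# A large gap between train and test accuracy suggests overfitting.
summary["gap"] = summary["train_score"] - summary["test_score"]
summary = summary.sort_values("gap")
print(summary.round(3))
```

On these figures SVC and KNN share the highest mean test accuracy, and KNN shows the smallest train/test gap, while the random forest fits the training folds perfectly. The differences in test accuracy are small relative to the fold-to-fold spread, so they should be read with caution.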

## Conclusion