<a href="https://colab.research.google.com/github/byunsy/heart-disease-diagnosis/blob/main/Heart_Disease_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Heart Disease Classification


---



## 01. Import Packages

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## 02. Upload Dataset

Unzip the uploaded dataset.

In [None]:
!unzip /content/heart_disease.zip

## 03. Understanding the Data

Read the csv file as a dataframe.

In [None]:
df = pd.read_csv("/content/heart.csv")

Let's see what kind of data we have.

In [None]:
df.head()

The dataset includes a total of **14 attributes**:

- **age**: The person's age in years

- **sex**: The person's sex (1 = male, 0 = female)

- **cp**: The chest pain experienced (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)

- **trestbps**: The person's resting blood pressure (mm Hg on admission to the hospital)

- **chol**: The person's cholesterol measurement in mg/dl

- **fbs**: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

- **restecg**: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)

- **thalach**: The person's maximum heart rate achieved

- **exang**: Exercise induced angina (1 = yes; 0 = no)

- **oldpeak**: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)

- **slope**: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)

- **ca**: The number of major vessels (0-3)

- **thal**: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)

- **target**: Heart disease (0 = no, 1 = yes)

In [None]:
print("Total number of patient data:" , len(df))

So, the dataset contains 303 patient records in total, which is fairly small.

For readability, let's rename the columns to names that are easier to understand.

In [None]:
df.columns = ['age', 
              'sex', 
              'chest_pain_type', 
              'resting_blood_pressure', 
              'cholesterol', 
              'fasting_blood_sugar', 
              'rest_ecg', 
              'max_heart_rate_achieved',
              'exercise_induced_angina', 
              'st_depression', 
              'st_slope', 
              'num_major_vessels', 
              'thalassemia', 
              'target']
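
Assigning a list to `df.columns` relies on the columns being in exactly the expected order. As a hedged alternative, the sketch below renames by explicit mapping with `DataFrame.rename`, so each new name is tied to its original column. The toy frame and its values are made up for illustration, and only three of the columns are shown.

```python
import pandas as pd

# Toy frame with three of the original column names (values are made up).
toy = pd.DataFrame({"cp": [0, 2], "chol": [233, 250], "thalach": [150, 187]})

# Map each original name to its readable replacement.
readable_names = {
    "cp": "chest_pain_type",
    "chol": "cholesterol",
    "thalach": "max_heart_rate_achieved",
}

# rename returns a new frame; the original is left unchanged.
renamed = toy.rename(columns=readable_names)
print(renamed.columns.tolist())
```

With a mapping, a column that is missing or out of order does not silently receive the wrong name.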

## 04. Explore the Dataset

In [None]:
print("TOTAL NUMBER OF PATIENTS       :", len(df))
print("Patients with    heart disease :", df.target.value_counts()[1])
print("Patients without heart disease :", df.target.value_counts()[0])
print()
sns.countplot(x='target', data=df, palette='BrBG')
plt.xticks(ticks=[0, 1], labels=['Without HD', 'With HD'])

plt.show()

We see that the two target groups are approximately balanced.

In [None]:
# Male Patients
male_table = df[df.sex == 1]
male_count = len(male_table)
male_patients = male_table.target.value_counts()
print("Number of MALE patients :", male_count)
print("WITH    heart disease   :", male_patients[1])
print("WITHOUT heart disease   :", male_patients[0])
print()

# Female Patients
female_table = df[df.sex == 0]
female_count = len(female_table)
female_patients = female_table.target.value_counts()
print("Number of FEMALE patients :", female_count)
print("WITH    heart disease     :", female_patients[1])
print("WITHOUT heart disease     :", female_patients[0])
print()

Notice that the dataset contains significantly more male patients, but a higher proportion of the female patients have heart disease.
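
The comparison above is about proportions rather than raw counts. Because `target` is coded 0/1, its mean within each group is the share of patients with heart disease. The sketch below shows the idea on a few made-up records (not taken from `heart.csv`).

```python
import pandas as pd

# Made-up records for illustration only.
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 0, 0],
    "target": [1, 0, 0, 0, 1, 1],
})

# Mean of a 0/1 column per group = fraction of that group with target == 1.
rate_by_sex = toy.groupby("sex")["target"].mean()
print(rate_by_sex)
```

On the real dataframe, the same call would be `df.groupby('sex')['target'].mean()`.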

In [None]:
# Percentage
print("Percentage of Male Patients   : {:.2f}%"
       .format(male_count / (male_count + female_count) * 100))
print("Percentage of Female Patients : {:.2f}%"
       .format(female_count / (male_count + female_count) * 100))

Let's have a look at the mean values of each attribute grouped by target groups.

In [None]:
df.groupby('target').mean()

### Visualizations
We now create some visualizations to better understand our data.

In [None]:
pd.crosstab(df.sex, df.target).plot(kind="bar",
                                   figsize=(10,7), 
                                   color=['#dbb972', '#76c6ba'],
                                   alpha=0.65)

plt.title('Heart Disease Frequency for Sex', fontweight='bold', fontsize='x-large')
plt.xlabel('Sex')
plt.xticks(ticks=[0, 1], labels=['Female', 'Male'], rotation=0)
plt.ylabel('Frequency')
plt.legend(['Without HD', 'With HD'])

plt.show()

Once again, we can clearly see that there are more male patients in total, but a higher proportion of the female patients have heart disease.

Now, let's look at the age distribution of patients.

In [None]:
sns.set(rc={ 'figure.figsize':(16.0, 8.0) }, style='white',
        font_scale=0.8)


ax = sns.countplot(x='age', data=df, color='#76c6ba', alpha=0.65)
ax.set(xlabel="Age", ylabel="Frequency")
plt.title('Patient Age Distribution', fontweight='bold', fontsize='x-large')

plt.show()

We can see a high percentage of patients in their 50s and a very low percentage in their 20s and 30s, which makes sense since older age groups generally have a higher risk of heart disease.

In [None]:
pd.crosstab(df.age, df.target).plot(kind="bar",
                                    figsize=(20,7), 
                                    color=['#dbb972', '#76c6ba'],
                                    alpha=0.65)

plt.title('Heart Disease Frequency for Ages', 
          fontweight='bold', 
          fontsize='x-large')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend(['Without HD', 'With HD'])

plt.show()

Interestingly, this chart shows that a fair number of patients in their 40s were also diagnosed with heart disease.

In [None]:
ax = sns.relplot(data=df, x="age", y="max_heart_rate_achieved", hue="target", 
                 palette="BrBG",
                 height=6)

ax.set(xlabel="Age", ylabel="Maximum Heart Rate")
plt.title('Maximum Heart Rate and Age', fontweight='bold', fontsize='x-large')
plt.show()

While the pattern is not completely clear from this dataset, the chart does show that patients with a heart disease tend to have slightly higher maximum heart rates.

In [None]:
pd.crosstab(df.fasting_blood_sugar, df.target).plot(kind="bar",
                                                    figsize=(15,6),
                                                    color=['#dbb972','#76c6ba'],
                                                    alpha=0.65)

plt.title('Heart Disease Frequency Based on Fasting Blood Sugar Levels', fontweight='bold', 
          fontsize='x-large')
plt.xlabel('Fasting Blood Sugar Level')
plt.xticks(ticks=[0, 1], 
           labels=['0 (Less than or equal to 120 mg/dl)', '1 (Greater than 120 mg/dl)'], 
           rotation=0)
plt.ylabel('Frequency')
plt.legend(['Without HD', 'With HD'])

plt.show()

The chart above is also quite surprising, because I assumed that higher fasting blood sugar levels would naturally be correlated with a higher risk of heart disease. However, the chart shows no such correlation and, if anything, suggests the opposite. Once again, this may be due to the small dataset size.
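
Raw counts can be hard to compare when one group is much smaller than the other. A possible check, sketched here on made-up records, is to normalize the crosstab by row so that each fasting blood sugar group shows shares instead of counts.

```python
import pandas as pd

# Made-up records for illustration only; the group sizes are deliberately unequal.
toy = pd.DataFrame({
    "fasting_blood_sugar": [0, 0, 0, 0, 1, 1],
    "target":              [1, 1, 1, 0, 1, 0],
})

# normalize="index" divides each row by its row total, so every row sums to 1.
shares = pd.crosstab(toy["fasting_blood_sugar"], toy["target"], normalize="index")
print(shares)
```

On the real dataframe, `pd.crosstab(df.fasting_blood_sugar, df.target, normalize='index')` gives the share of each target value within each blood sugar group.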

In [None]:
pd.crosstab(df.chest_pain_type, df.target).plot(kind="bar",
                                                figsize=(15,6),
                                                color=['#dbb972','#76c6ba'],
                                                alpha=0.65)

plt.title('Heart Disease Frequency According To Chest Pain Type', fontweight='bold', 
          fontsize='x-large')
plt.xlabel('Chest Pain Type')
plt.xticks(ticks=[0, 1, 2, 3], 
           labels=['0 (typical angina)', '1 (atypical angina)', 
                   '2 (non-anginal pain)', '3 (asymptomatic)'], 
           rotation=0)
plt.ylabel('Frequency')
plt.legend(['Without HD', 'With HD'])

plt.show()

This chart strongly suggests that a chest pain type of 1, 2, or 3 is associated with a high probability of heart disease.

## 05. Data Preprocessing

In [None]:
print(df.dtypes)

# Change the following columns to categorical variables (object dtype)
df['sex'] = df['sex'].astype('object')
df['chest_pain_type'] = df['chest_pain_type'].astype('object')
df['fasting_blood_sugar'] = df['fasting_blood_sugar'].astype('object')
df['rest_ecg'] = df['rest_ecg'].astype('object')
df['exercise_induced_angina'] = df['exercise_induced_angina'].astype('object')
df['st_slope'] = df['st_slope'].astype('object')
df['thalassemia'] = df['thalassemia'].astype('object')

print("*"*50)
print(df.dtypes)

With these columns, we now create dummy variables, which will make the model analysis later on easier to interpret.

In [None]:
df = pd.get_dummies(df)
print(df.columns)

Here we can see the dummy variables. For example, the variable, "sex", which had 0/1 representing female/male, has turned into two variables sex_0 and sex_1 (sex_0: 1 and sex_1: 0 means the patient is female and vice versa).
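
As a small illustration of this encoding, the sketch below applies `pd.get_dummies` to a two-row toy frame (values made up). Only the object-typed column is expanded; the numeric column passes through unchanged.

```python
import pandas as pd

# Two made-up patients: the first male (sex = 1), the second female (sex = 0).
toy = pd.DataFrame({"age": [63, 41], "sex": [1, 0]})

# get_dummies only expands object/categorical columns, so cast first.
toy["sex"] = toy["sex"].astype("object")

dummies = pd.get_dummies(toy)
print(dummies)
```

Depending on the pandas version, the dummy columns are stored as 0/1 integers or as booleans.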

In [None]:
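# Get Y values
y = df["target"]

# Normalize and get X values (min-max scaling of each column to [0, 1])
# Cast to float first, since dummy columns may be boolean in recent pandas versions
x_data = df.drop(['target'], axis=1).astype(float)
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())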
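
The normalization above is min-max scaling: each column is mapped to the range [0, 1] via (x - min) / (max - min), computed per column. A minimal worked example on a made-up frame:

```python
import pandas as pd

# Made-up values for two numeric columns.
toy = pd.DataFrame({
    "age":         [30.0, 50.0, 70.0],
    "cholesterol": [200.0, 250.0, 400.0],
})

# Per-column minimum and maximum (each is a Series indexed by column name).
col_min = toy.min()
col_max = toy.max()

# Subtraction and division align on column names, so each column
# is scaled with its own minimum and range.
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

The smallest value in each column becomes 0 and the largest becomes 1. A column with a single constant value would produce a division by zero.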

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=0)

# X values
print(x_train.head(), "\n")
print(x_test.head(), "\n")
print("*"*50)

# Y values
print(y_train, "\n")
print(y_test, "\n")

## 06. Model Creation and Prediction

In [None]:
# Import the packages needed for the various classification algorithms
from sklearn.linear_model import LogisticRegression  # for Logistic Regression algorithm
from sklearn.ensemble import RandomForestClassifier  # for Random Forest Classifier
from sklearn.tree import DecisionTreeClassifier      # for Decision Tree algorithm
from sklearn import svm                              # for Support Vector Machine (SVM) algorithm
from sklearn import metrics                          # for checking the model accuracy

In [None]:
# Use Support Vector Machine
model = svm.SVC()

# Train the algorithm with training input and output data
model.fit(x_train, y_train) 

# Pass the testing data to the trained model for prediction
prediction = model.predict(x_test) 

print('The accuracy of the SVM Model is', metrics.accuracy_score(y_test, prediction))

In [None]:
model2 = LogisticRegression()

# Train the algorithm with training input and output data
model2.fit(x_train, y_train) 

# Pass the testing data to the trained model for prediction
prediction = model2.predict(x_test) 

print('The accuracy of the Logistic Regression Model is', metrics.accuracy_score(y_test, prediction))

In [None]:
model3 = RandomForestClassifier(n_estimators = 1000, random_state = 1)

# Train the algorithm with training input and output data
model3.fit(x_train, y_train)

# Pass the testing data to the trained model for prediction
prediction = model3.predict(x_test) 

print('The accuracy of the Random Forest Classifier Model is', metrics.accuracy_score(y_test, prediction))

In [None]:
from sklearn.metrics import roc_curve, auc #for model evaluation

y_pred_quant = model3.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

In [None]:
# Area under the curve
auc(fpr, tpr)

In [None]:
from sklearn.metrics import confusion_matrix #for model evaluation

# Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_test, prediction)

# Sensitivity (true positive rate) = TP / (TP + FN)
sensitivity = cm[1,1]/(cm[1,1]+cm[1,0])
print('Sensitivity : ', sensitivity )

# Specificity (true negative rate) = TN / (TN + FP)
specificity = cm[0,0]/(cm[0,0]+cm[0,1])
print('Specificity : ', specificity)
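
In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, so for 0/1 labels the matrix is laid out as `[[TN, FP], [FN, TP]]`. The sketch below works through sensitivity and specificity on made-up labels (not the notebook's predictions).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives that were detected
specificity = tn / (tn + fp)  # share of actual negatives that were cleared

print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```

Here 3 of the 4 actual positives are detected and 4 of the 6 actual negatives are correctly cleared.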

In [None]:
!pip install eli5
!pip install shap
!pip install pdpbox

In [None]:
import eli5 
from eli5.sklearn import PermutationImportance
import shap
from pdpbox import pdp, info_plots

perm = PermutationImportance(model3, random_state=1).fit(x_test, y_test)
eli5.show_weights(perm, feature_names=x_test.columns.tolist())
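
Permutation importance shuffles one feature at a time and measures how much the model's score drops. scikit-learn also provides this in `sklearn.inspection.permutation_importance`. The sketch below swaps that function in for eli5 purely as an illustration, using synthetic data in which only the first feature carries signal. The feature names `f0` to `f2` and the data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the label depends only on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Train on the first 150 rows, evaluate importance on the remaining 50.
clf = RandomForestClassifier(n_estimators=50, random_state=1)
clf.fit(X[:150], y[:150])

# Shuffle each feature 10 times and record the mean drop in accuracy.
result = permutation_importance(clf, X[150:], y[150:], n_repeats=10, random_state=1)

for name, mean_drop in zip(["f0", "f1", "f2"], result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```

With the notebook's objects, the corresponding call would be `permutation_importance(model3, x_test, y_test, random_state=1)`.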