# Heart Attack Prediction

This Jupyter Notebook is created for the **Biotech Final Year Project** of **MNNIT Allahabad, Dept of Biotechnology**.   
The notebook contains code to predict the risk of heart attack from health and cardiac parameters using several machine learning techniques.

This notebook and all other relevant files are available on [GitHub](https://github.com/agg-geek/HeartAttackPrediction).



### Project Supervisor:
Dr. Ashutosh Mani,  
Associate Professor, Department of Biotechnology

### Project team members:
- Abhinav Aggarwal, 20200003
- Ratna Rathaur, 20200041
- Shivam Pandey, 20200049

### Import packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
matplotlib.style.use('ggplot')
# matplotlib.style.use('fivethirtyeight')
# matplotlib.style.use('seaborn-v0_8')

# # For Chi square test in feature selection
# from sklearn.feature_selection import SelectKBest
# from sklearn.feature_selection import chi2

# # For ANOVA test in feature selection
# from sklearn.feature_selection import f_classif

# For Data scaling
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

### Import dataset

In [2]:
column_names = ['age', 'sex', 'cp', 'bp', 'chol', 'fbs', 'ecg', 'maxhr', 'angina', 'oldpeak', 'stslope', 'vessel', 'thal', 'attack']
heart = pd.read_csv('dataset/cleveland.data', names=column_names, sep=',', na_values='?')
heart.sample(5)

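| | age | sex | cp | bp | chol | fbs | ecg | maxhr | angina | oldpeak | stslope | vessel | thal | attack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41 | 40.0 | 1.0 | 1.0 | 140.0 | 199.0 | 0.0 | 0.0 | 178.0 | 1.0 | 1.4 | 1.0 | 0.0 | 7.0 | 0 |
| 263 | 44.0 | 1.0 | 3.0 | 120.0 | 226.0 | 0.0 | 0.0 | 169.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0 |
| 105 | 54.0 | 1.0 | 2.0 | 108.0 | 309.0 | 0.0 | 0.0 | 156.0 | 0.0 | 0.0 | 1.0 | 0.0 | 7.0 | 0 |
| 159 | 68.0 | 1.0 | 3.0 | 118.0 | 277.0 | 0.0 | 0.0 | 151.0 | 0.0 | 1.0 | 1.0 | 1.0 | 7.0 | 0 |
| 274 | 59.0 | 1.0 | 1.0 | 134.0 | 204.0 | 0.0 | 0.0 | 162.0 | 0.0 | 0.8 | 1.0 | 2.0 | 3.0 | 1 |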


In [None]:
# column_names = ['age', 'sex', 'cp', 'bp', 'chol', 'fbs', 'ecg', 'maxhr', 'angina', 'oldpeak', 'stslope', 'vessel', 'thal', 'attack']
# heart2 = pd.read_csv('dataset/cleveland.data', names=column_names, sep=',')
# heart2.sample(5)
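
The effect of `na_values='?'` can be seen on a tiny inline sample. The two rows below are made up for illustration and are not taken from `cleveland.data`; this is a minimal sketch, not part of the project pipeline:

```python
import io
import pandas as pd

# Two made-up rows in the same comma-separated layout; '?' marks a missing value.
raw = "63.0,1.0,0.0\n52.0,?,3.0\n"
names = ["age", "vessel", "thal"]

# Without na_values, '?' is kept as text, so the column is no longer numeric.
as_text = pd.read_csv(io.StringIO(raw), names=names)

# With na_values='?', the marker is parsed as NaN and the column stays numeric.
as_nan = pd.read_csv(io.StringIO(raw), names=names, na_values="?")

print(as_text["vessel"].dtype)
print(as_nan["vessel"].dtype)
print(as_nan.isnull().sum())
```

Reading the file without `na_values` would therefore hide the missing entries from `isnull()` and break numeric operations on the affected columns.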

### About the dataset


- `age`: Age of the patient (years)
- `sex`: Sex of the patient (1: Male or 0: Female)
- `cp`:  Chest pain type (0: Typical Angina, 1: Atypical Angina, 2: Non-Anginal Pain, 3: Asymptomatic)
- `bp`:  Resting blood pressure (mm Hg)
- `chol`:  Cholesterol level (mg/dL)
- `fbs`: Fasting blood sugar (1: if fbs > 120 mg/dl, 0: otherwise)
- `ecg`: Resting ECG results
    - 0: Normal
    - 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- `maxhr`: Maximum heart rate achieved (bpm)  
- `angina`: Exercise Induced Angina (1: Yes, 0: No)
- `oldpeak`: ST depression induced by exercise relative to rest
- `stslope`: Slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
- `vessel`: Number of major vessels colored by fluoroscopy (0 - 3)
- `thal`: Thallium stress test result; original codes, with the recoded values used in this notebook in parentheses: 3 (0) = normal; 6 (1) = fixed defect; 7 (2) = reversible defect
- `attack`: Target variable (0 = no heart attack, 1 - 4: heart attack)

## Initial Inference

In [None]:
heart.info()

**Observations:**
- There are 303 instances.
- There are 13 features and 1 target variable.
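- All features have datatype `float64`. Many of them hold only whole numbers and could be stored as integers; note that `int64` takes the same 8 bytes as `float64`, so saving space requires a narrower type such as `int8`.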
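
One way such a conversion could look, sketched on a made-up frame rather than on `heart`. The choice of `int8` and the whole-number check are assumptions for illustration; the cast only works once missing values are gone, because NumPy integer types cannot hold NaN:

```python
import pandas as pd

# Made-up sample: two columns of whole numbers stored as float64, one truly fractional.
df = pd.DataFrame({
    "sex": [1.0, 0.0, 1.0],
    "fbs": [0.0, 1.0, 0.0],
    "oldpeak": [2.3, 1.5, 0.0],
})
before = df.memory_usage(index=False).sum()

# Cast only the columns whose values are all whole numbers; oldpeak stays float64.
# int8 covers -128..127, enough for 0/1 flags; wider columns would need int16.
whole = [c for c in df.columns if (df[c] % 1 == 0).all()]
df = df.astype({c: "int8" for c in whole})
after = df.memory_usage(index=False).sum()

print(df.dtypes)
print(before, "bytes ->", after, "bytes")
```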

In [None]:
heart.isnull().sum()

**Observations:**  
There are 4 missing values in `vessel` and 2 missing values in `thal`.  
We can either remove the rows containing missing values or fill the gaps, for example with the mean of the corresponding column.  
Since only 6 of the 303 instances are affected, we simply remove them.

In [3]:
heart.dropna(inplace=True)
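
The two options weighed above, sketched on a made-up frame (not the real data). Mean filling is shown only for comparison; for coded columns such as `vessel` and `thal` the mean is generally not a valid category, which is a further reason to prefer dropping here:

```python
import numpy as np
import pandas as pd

# Made-up values with one gap per column.
df = pd.DataFrame({
    "vessel": [0.0, 3.0, np.nan, 1.0],
    "thal": [3.0, np.nan, 7.0, 7.0],
})

dropped = df.dropna()          # option 1: remove rows with any missing value
filled = df.fillna(df.mean())  # option 2: fill each gap with its column mean

print(len(df), "rows ->", len(dropped), "rows after dropna")
print(filled)
```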

In [None]:
heart.duplicated().sum()

**Observation:**  
There are no duplicated rows.


#### Update dataset columns

In [4]:
heart['cp'] = heart['cp'].apply(lambda x: x-1)

In [5]:
def change_thal(x):
    if x == 3:
        return 0
    elif x == 6:
        return 1
    else:
        return 2

heart['thal'] = heart['thal'].apply(change_thal)
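
An equivalent recoding can be written with a dictionary and `Series.map`, shown here on a made-up series. One behavioural difference to keep in mind: `change_thal` sends every value other than 3 and 6 to 2, while `map` returns NaN for values missing from the dictionary.

```python
import pandas as pd

# Original code -> recoded value (keys are floats because the column is float64).
thal_codes = {3.0: 0, 6.0: 1, 7.0: 2}

thal = pd.Series([3.0, 6.0, 7.0, 3.0])  # made-up sample values
recoded = thal.map(thal_codes)

print(recoded.tolist())
```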

In [None]:
heart['attack'].value_counts()

**Observations:**  
As mentioned in the dataset description, the target column `attack` is a categorical column with 0 denoting no heart attack and other values denoting heart attack.  
We will transform the column such that 0 indicates no heart attack and 1 indicates heart attack.

In [6]:
heart['attack'] = heart['attack'].apply(lambda x: 0 if x == 0 else 1)
heart['attack'].value_counts()

attack
0    160
1    137
Name: count, dtype: int64

In [None]:
heart.describe()

## Exploratory Data Analysis

#### Utility functions

In [7]:
def check_balance(df, target_column, risk_value, not_risk_value):
    risk = len(df[df[target_column] == risk_value])
    no_risk = len(df[df[target_column] == not_risk_value])
    total = risk + no_risk
    # print(risk, no_risk, total)
    print(f"Percentage Risk: {risk / total * 100}%")
    print(f"Percentage Not Risk: {no_risk / total * 100}%")

#### Create copy of dataset for EDA

In [8]:
heart_copy = heart.copy()
heart_copy['attack'] = heart_copy['attack'].apply(lambda x: 'Attack' if x == 1 else 'No attack')
heart_copy

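| | age | sex | cp | bp | chol | fbs | ecg | maxhr | angina | oldpeak | stslope | vessel | thal | attack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 0.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 1 | No attack |
| 1 | 67.0 | 1.0 | 3.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 0 | Attack |
| 2 | 67.0 | 1.0 | 3.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 2 | Attack |
| 3 | 37.0 | 1.0 | 2.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 0 | No attack |
| 4 | 41.0 | 0.0 | 1.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 0 | No attack |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 297 | 57.0 | 0.0 | 3.0 | 140.0 | 241.0 | 0.0 | 0.0 | 123.0 | 1.0 | 0.2 | 2.0 | 0.0 | 2 | Attack |
| 298 | 45.0 | 1.0 | 0.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 | 2.0 | 0.0 | 2 | Attack |
| 299 | 68.0 | 1.0 | 3.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 2 | Attack |
| 300 | 57.0 | 1.0 | 3.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 | 2.0 | 1.0 | 2 | Attack |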


#### Create lists for features

In [9]:
categorical_features = []
numerical_features = []

for col in list(heart_copy.columns)[:-1]:
    if heart_copy[col].nunique() > 5:
        numerical_features.append(col)
    else:
        categorical_features.append(col)

print('Categorical Features :', *categorical_features, len(categorical_features))
print('Numerical Features :', *numerical_features, len(numerical_features))

Categorical Features : sex cp fbs ecg angina stslope vessel thal 8
Numerical Features : age bp chol maxhr oldpeak 5


In [None]:
# features = numerical_features.copy()
# features.append('attack')
# features

### Univariate Analysis

In [None]:
# # %matplotlib inline
# heart_copy.hist(bins=15, figsize=(16,10)) #figsize = (width, height)
# plt.show()

#### Univariate analysis on categorical columns

In [None]:
plt.figure(figsize=(10,8))
for i, col in enumerate(categorical_features, 1):
    plt.subplot(3,3,i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(heart_copy[col])
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
plt.show()

In [None]:
heart_copy[categorical_features].skew().sort_values(ascending=False)

**Observations:**  
- The frequency of feature values is not uniform. This may be because some categories genuinely occur more often than others, or it may be attributed to poor data collection.
- The features are not normally distributed (i.e. not Gaussian). This will limit the performance of models that assume normally distributed data.
<!-- Standardization using `StandardScaler` shouldn't be used to scale the data.  Normalization should be performed so something like `MinMaxScaler` can be used instead. -->
<!-- - Scales for the features are different, will require feature scaling.  -->
<!-- - Several numeric features are actually categorical. -->
<!-- - **Categorical Features:** `sex`, `cp`, `fbs`, `recg`, `angina`, `stslope`, `vessel`, `thal`, and `attack`.   -->
<!-- - **Continuous Features:** `age`, `bp`, `chol`, `maxhr`, `oldpeak`. -->
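
As a rough guide to reading the `skew()` output: values near 0 indicate a roughly symmetric distribution, positive values a longer right tail, and negative values a longer left tail. A small illustration on made-up numbers:

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])      # evenly spread around the mean
right_tailed = pd.Series([1, 1, 1, 2, 10])  # one large value stretches the right tail

print(symmetric.skew())     # close to zero
print(right_tailed.skew())  # positive
```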

#### Univariate analysis on numerical columns

In [None]:
plt.figure(figsize=(10,8))
for i, col in enumerate(numerical_features, 1):
    plt.subplot(3,3,i)
    plt.title(f"Distribution of {col} Data")
    # sns.kdeplot(heart_copy[col], linewidth=1)
    sns.histplot(heart_copy[col], kde=True, line_kws={'lw':1.5}, stat='density')
    # sns.histplot(heart_copy[col], kde=True, line_kws={'lw':1.5}, stat='density', kde_kws=dict(cut=3))
    # sns.distplot(heart_copy[col], kde_kws={'bw':1})
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
plt.show()

In [None]:
heart_copy[numerical_features].skew().sort_values(ascending=False)

**Observations:**  
- The features are on different scales, so feature scaling is required.
- The distributions are not Gaussian, so standardization with `StandardScaler` is a poor fit; normalization with something like `MinMaxScaler` can be used instead.
- Non-Gaussian features will limit the performance of models that assume normally distributed data.
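
Min-max normalization rescales each feature to the range [0, 1] via (x - min) / (max - min), which is also what `MinMaxScaler` computes with its default settings. A minimal sketch in plain pandas on made-up values:

```python
import pandas as pd

# Made-up values on very different scales.
df = pd.DataFrame({
    "age": [29.0, 54.0, 77.0],
    "chol": [126.0, 240.0, 564.0],
})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1.
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```

In a modelling pipeline the minimum and maximum would normally be taken from the training split only and then reused for the test split.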

#### Univariate analysis on target column

In [None]:
# check_balance(heart_copy, 'attack', 1, 0)

In [None]:
# value_counts() sorts by frequency, so take labels from its index
# instead of hard-coding them; this keeps each slice paired with its class
counts = heart_copy['attack'].value_counts()

fig, ax = plt.subplots(nrows = 1,ncols = 2,figsize = (14,5))
plt.subplot(1,2,1)
sns.histplot(heart_copy['attack'])
plt.title('Cases of Heart Disease');

plt.subplot(1,2,2)
plt.pie(counts, labels = counts.index, autopct='%1.1f%%',startangle = 90,explode = (0.1,0))
plt.title('Heart Disease %');
plt.show()

**Observations:**
- The two classes are of similar size (160 instances without and 137 with heart attack, about 54% vs 46%), so the dataset is reasonably balanced.
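
The class shares can be computed directly from the counts printed earlier (160 and 137). The series below is rebuilt from those counts for illustration rather than read from the dataset:

```python
import pandas as pd

# Rebuild a target column with the reported counts: 160 zeros and 137 ones.
attack = pd.Series([0] * 160 + [1] * 137, name="attack")

# normalize=True returns fractions instead of counts.
share = attack.value_counts(normalize=True) * 100
print(share.round(1))
```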

### Bivariate Analysis

#### Target variable vs Categorical features

In [None]:
fig, ax = plt.subplots(nrows = 3,ncols = 3,figsize = (16,18))
# fig, ax = plt.subplots(nrows = 4,ncols = 2,figsize = (10, 20))
for i in range(len(categorical_features)):
    plt.subplot(3,3,i+1)
    # plt.subplot(4,2,i+1)
    ax = sns.countplot(data = heart_copy, x = categorical_features[i], hue = "attack", edgecolor = 'black')
    for rect in ax.patches:
        ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
    title = categorical_features[i] + ' vs attack'
    plt.legend(['No Heart Disease','Heart Disease'])
    plt.title(title);

**Observations:**
- The number of **males** with heart attack is greater than the number of males without.
- The number of **females** with heart attack is lower than the number of females without.
- **Asymptomatic chest pain** is associated with a greater risk of heart attack.
- **Fasting blood sugar** level shows no direct relation with heart attack.
- A normal **resting ECG** indicates a slightly lower risk of heart attack.
- Patients with **exercise-induced angina** have a higher risk of heart attack.
- Patients with a **flat** ST slope have a very high probability of heart attack.
- Patients with a non-zero **number of major vessels colored by fluoroscopy** have a greater risk of heart attack.
- Patients with a **reversible defect** in `thal` are at high risk.
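
Count plots can be complemented with per-category proportions, which are easier to compare when group sizes differ. A minimal sketch with `pd.crosstab` on made-up rows; the values are illustrative only and do not come from the dataset:

```python
import pandas as pd

# Made-up sample: four rows per sex with different attack rates.
df = pd.DataFrame({
    "sex": [1, 1, 1, 1, 0, 0, 0, 0],
    "attack": ["Attack", "Attack", "Attack", "No attack",
               "Attack", "No attack", "No attack", "No attack"],
})

# normalize="index" turns each row of the table into fractions that sum to 1.
rates = pd.crosstab(df["sex"], df["attack"], normalize="index")
print(rates)
```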

#### Target variable vs Numerical features

In [None]:
# pairplot already adds a legend for the hue variable, so no plt.legend call is needed
sns.pairplot(heart_copy, hue='attack')
plt.show()

To make trends easier to see, we bin each numerical feature into fixed-width groups and then plot these groups against the target.

In [None]:
scaling_factors = {
    'age': 5,
    'bp': 10,
    'chol': 50,
    'maxhr': 20,
    'oldpeak': 0.5 # non-integer bin width; needs special handling when building tick labels
}
# scale['age']

In [None]:
# heart_copy = heart.copy()
# heart_copy['attack'] = heart_copy['attack'].apply(lambda x: 'Attack' if x == 1 else 'No attack')
# # heart_copy

for i in numerical_features:
    heart_copy[f"{i}_grp"] = [int(j/scaling_factors[i]) for j in heart_copy[i]]
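
`pd.cut` is an alternative way to build such groups, and it produces readable interval labels directly. A sketch on made-up ages; the bin edges are an assumption chosen to mirror the width of 5 used for `age`:

```python
import numpy as np
import pandas as pd

age = pd.Series([29, 41, 44, 57, 63, 77])  # made-up ages

# Edges 25, 30, ..., 80; right=False gives left-closed bins such as [25, 30).
edges = np.arange(25, 85, 5)
age_grp = pd.cut(age, bins=edges, right=False)

# sort=False keeps the bins in their natural order, including empty ones.
print(age_grp.value_counts(sort=False))
```

Note that `int(j / width)` truncates toward zero, which matches flooring only for non-negative values; that holds for the features binned here.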

In [None]:
fig, ax = plt.subplots(nrows = 5,ncols = 1,figsize = (12,28))

for idx, grp in enumerate(numerical_features):
    grp_name = f"{grp}_grp"
    scale = scaling_factors[grp]
    # Sorted group ids that actually occur; there may be gaps where a bin is empty
    groups = sorted(heart_copy[grp_name].unique())

    plt.subplot(5,1,idx+1)
    ax = sns.countplot(data = heart_copy, x = grp_name, hue = "attack", order = groups, edgecolor = 'black')
    for rect in ax.patches:
        ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 1, rect.get_height(), horizontalalignment='center', fontsize = 11)

    # Build each tick label from its own group id, so gaps cannot shift the labels
    labels = []
    for g in groups:
        lo = float(g) * scale
        labels.append(f"{lo:g}-{lo + scale - (1 if scale > 1 else 0.1):g}")
    ax.set_xticks(range(len(groups)))
    ax.set_xticklabels(labels)
    plt.legend(['No Heart Disease','Heart Disease'])
    plt.title(f"{grp_name} vs attack");

**Observations:**
- Patients with **age** > 55 have a very high risk of heart attack.
- Resting **blood pressure** of 110 and above shows a slight risk of heart attack, whereas BP > 160 shows a very high risk.
- **Cholesterol level** > 250 poses a high risk.
- A **maximum heart rate** between 80 and 140 poses a very high risk.
- Patients with **oldpeak** > 1 also have a high probability of heart attack.

### Multivariate Analysis

#### Target variable and Categorical features vs Numerical features

##### Sex vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'sex',y = numerical_features[i],data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs sex")

**Observations:**
- **Males** show a marked heart attack risk for age > 50 and maximum heart rate < 140. For blood pressure, cholesterol and oldpeak, heart attack occurrence does not concentrate in any particular range.
- Since there are far fewer **female** data points than male data points, we cannot point to specific ranges or values that indicate heart attack in females.

##### Chest pain type vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'cp',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs cp")

**Observations:**
- **Asymptomatic** chest pain shows a very high heart attack risk. Other chest pain types do not show significant risk.

##### Fasting Blood Sugar vs Numerical features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'fbs',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs fbs")

**Observations:**
- Above the **age** of 50, heart attack occurs irrespective of the fasting blood sugar level.
- Fasting blood sugar < 120 combined with **resting BP > 130** is associated with significant risk.
- Heart attack occurs across **cholesterol** levels irrespective of the fasting blood sugar level.
- Patients with fasting blood sugar < 120 and **maximum heart rate below 140** are more prone to heart attack.
- Heart attack occurs at any **oldpeak** value irrespective of the fasting blood sugar level.

##### Resting ECG vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'ecg',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs ecg")

**Observations:**


##### Exercise Angina vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'angina',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs angina")

**Observations:**


##### ST Slope vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'stslope',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs stslope")

**Observations:**


#### Target variable and Numerical features vs Numerical features

In [None]:
a = 0
fig,ax = plt.subplots(nrows = 5,ncols = 2,figsize = (15,25))
for i in range(len(numerical_features)):
    for j in range(len(numerical_features)):
        if j > i:  # visit each unordered pair of features once
            a += 1
            plt.subplot(5,2,a)
            sns.scatterplot(x = numerical_features[i],y = numerical_features[j],data = heart_copy,hue = 'attack', edgecolor = 'black');
            # plt.legend(['No Heart attack', 'Heart attack']) # why does this not work?
            plt.title(f"{numerical_features[i]} vs {numerical_features[j]}")

**Observations:**


## Feature Engineering

### Collinearity Analysis

cp, ecg, thal

In [None]:
heart.describe()

In [None]:
cols = numerical_features + ['attack']

In [None]:
sns.pairplot(heart[cols], hue='attack')

In [None]:
corr_matrix = heart[cols].corr()
corr_matrix

In [None]:
sns.heatmap(round(corr_matrix, 2), annot=True)

**Observations:**
- Correlations between the numerical features are weak.
- The strongest correlation between two features is `-0.39`, between `age` and `maxhr`.
- The features most strongly correlated with `attack` are `oldpeak` (`0.42`) and `maxhr` (`-0.42`).
- We can check whether removing `oldpeak` gives any performance boost.
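
Ranking features by the absolute value of their correlation with the target can be easier to read than scanning the full matrix. A sketch on synthetic data (random numbers, not the heart dataset); the column names are reused only for familiarity and the effect sizes are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
attack = rng.integers(0, 2, size=n)  # synthetic binary target

df = pd.DataFrame({
    "oldpeak": attack * 2.0 + rng.normal(0, 1, n),  # built to correlate positively
    "maxhr": -attack * 2.0 + rng.normal(0, 1, n),   # built to correlate negatively
    "noise": rng.normal(0, 1, n),                   # unrelated to the target
    "attack": attack,
})

# Pearson correlation of every feature with the target, ranked by magnitude.
ranking = df.corr()["attack"].drop("attack").abs().sort_values(ascending=False)
print(ranking)
```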

## Train ML algorithms

### Split data into train and test
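
A minimal sketch of how such a split could look with `train_test_split` (imported at the top of the notebook), run on synthetic data. The 80/20 ratio, the `random_state` value and the use of stratification are assumptions for illustration, not the project's settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # synthetic feature matrix: 100 rows, 5 features
y = np.array([0] * 54 + [1] * 46)  # synthetic binary target

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```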

### Utility function for Prediction

### K Nearest Neighbours Classifier

### Logistic Regression