# Heart Attack Prediction

This Jupyter Notebook is created for the **Biotech Final Year Project** of **MNNIT Allahabad, Dept of Biotechnology**.   
The notebook contains code to predict the risk of heart attack from health and heart-related parameters using several Machine Learning techniques.

This notebook and all other relevant files are available on [GitHub](https://github.com/agg-geek/HeartAttackPrediction).



### Project Supervisor:
Dr. Ashutosh Mani,  
Associate Professor, Department of Biotechnology

### Project team members:
- Abhinav Aggarwal, 20200003
- Ratna Rathaur, 20200041
- Shivam Pandey, 20200049

### Import packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
matplotlib.style.use('ggplot')
# matplotlib.style.use('fivethirtyeight')
# matplotlib.style.use('seaborn-v0_8')

from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier




In [None]:
import warnings
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

In [None]:
np.random.seed(42)

### Import dataset

In [None]:
column_names = ['age', 'sex', 'cp', 'bp', 'chol', 'fbs', 'ecg', 'maxhr', 'angina', 'oldpeak', 'stslope', 'attack']
heart = pd.read_csv('dataset/processed.data', names=column_names, sep=',', skiprows=1)
heart.head(5)

In [None]:
# column_names = ['age', 'sex', 'cp', 'bp', 'chol', 'fbs', 'ecg', 'maxhr', 'angina', 'oldpeak', 'stslope', 'attack']
# heart2 = pd.read_csv('dataset/processed.data', names=column_names, sep=',', skiprows=1)
# heart2.sample(5)

### About the dataset


- `age`: Age of the patient (years)
- `sex`: Sex of the patient (1: Male or 0: Female)
- `cp`:  Chest pain type (0: Typical Angina, 1: Atypical Angina, 2: Non-Anginal Pain, 3: Asymptomatic)
- `bp`:  Resting blood pressure (mm Hg)
- `chol`:  Cholesterol level (mg/dL)
- `fbs`: Fasting blood sugar (1: if fbs > 120 mg/dL, 0: otherwise)
- `ecg`: Resting ECG results
    - 0: Normal
    - 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- `maxhr`: Maximum heart rate achieved (bpm)  
- `angina`: Exercise Induced Angina (1: Yes, 0: No)
- `oldpeak`: ST depression induced by exercise relative to rest
- `stslope`: Slope of the peak exercise ST segment (0: upsloping, 1: flat, 2: downsloping)
- `attack`: Target variable (0: no heart attack, 1-4: heart attack)

## Initial Inference

In [None]:
heart.info()

**Observations:**
- There are 918 instances.
- There are 11 features and 1 target variable.
- Several features are stored as `float64` or `object`. Many of these could be converted to more compact dtypes to save memory.

In [None]:
heart.isnull().sum()

**Observations:**  
There are no missing values in the dataset.  
If there were missing values, we could either have dropped the affected rows (if there were only a few) or imputed them, for example with scikit-learn's `SimpleImputer`.
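
The two options mentioned above can be sketched as follows. This is a minimal illustration on a made-up frame (`toy` and its values are assumptions, not the project data), and the median strategy is only one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame reusing two of the notebook's column names; the values are made up.
toy = pd.DataFrame({
    'bp':   [120.0, np.nan, 140.0, 130.0],
    'chol': [200.0, 240.0, np.nan, 220.0],
})

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: replace each missing value with the median of its column.
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(dropped.shape)
print(imputed)
```

Dropping is only reasonable when few rows are affected, since every dropped row also discards its valid values.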

In [None]:
heart.duplicated().sum()

**Observation:**  
There are no duplicated rows.


In [None]:
heart.describe()

## Exploratory Data Analysis

#### Create copy of dataset for EDA

In [None]:
heart_copy = heart.copy()
# Any non-zero target value (1-4) counts as a heart attack, matching the dataset description
heart_copy['attack'] = heart_copy['attack'].apply(lambda x: 'Attack' if x > 0 else 'No attack')
heart_copy

#### Create lists for features

In [None]:
categorical_features = []
numerical_features = []

for col in list(heart_copy.columns)[:-1]:
    if heart_copy[col].nunique() > 5:
        numerical_features.append(col)
    else:
        categorical_features.append(col)

print('Categorical Features :', *categorical_features, len(categorical_features))
print('Numerical Features :', *numerical_features, len(numerical_features))

### Univariate Analysis

#### Univariate analysis on categorical columns

In [None]:
plt.figure(figsize=(10,8))
for i, col in enumerate(categorical_features, 1):
    plt.subplot(2,3,i)
    plt.title(f"Distribution of {col}")
    sns.histplot(heart_copy[col])
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
plt.show()

In [None]:
# heart_copy[categorical_features].skew().sort_values(ascending=False)

**Observations:**  
- The frequency of feature values is not uniform. This may be because some types occur more often than others, or it may be due to poor data collection techniques.
- The distributions are not normal (i.e. Gaussian). This will limit the performance of models which assume normally distributed data.
<!-- Standardization using `StandardScaler` shouldn't be used to scale the data.  Normalization should be performed so something like `MinMaxScaler` can be used instead. -->
<!-- - Scales for the features are different, will require feature scaling.  -->
<!-- - Several numeric features are actually categorical. -->
<!-- - **Categorical Features:** `sex`, `cp`, `fbs`, `recg`, `angina`, `stslope`, `vessel`, `thal`, and `attack`.   -->
<!-- - **Continuous Features:** `age`, `bp`, `chol`, `maxhr`, `oldpeak`. -->

#### Univariate analysis on numerical columns

In [None]:
plt.figure(figsize=(10,8))
for i, col in enumerate(numerical_features, 1):
    plt.subplot(3,3,i)
    plt.title(f"Distribution of {col}")
    sns.histplot(heart_copy[col], kde=True, line_kws={'lw':1.5}, stat='density')
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
plt.show()

In [None]:
# heart_copy[numerical_features].skew().sort_values(ascending=False)

**Observations:**  
- The features are on different scales, so feature scaling is required.
- The distributions are not normal (i.e. Gaussian). This will limit the performance of models which assume normally distributed data.
- For this reason, standardization with `StandardScaler` is not used. The data is normalized instead, with something like `MinMaxScaler`.
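
As a small sketch of the difference between the two scalers, the example below applies both to a made-up, right-skewed column (the array `x` is an assumption, not project data). Min-max normalization computes $(x - x_{min}) / (x_{max} - x_{min})$, whereas standardization computes $(x - \mu) / \sigma$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up, right-skewed values standing in for a feature such as `chol`.
x = np.array([[100.0], [150.0], [200.0], [250.0], [600.0]])

# Min-max normalization: (x - min) / (max - min), so the result lies in [0, 1].
x_minmax = MinMaxScaler().fit_transform(x)

# Standardization: (x - mean) / std, so the result has mean 0 and unit variance.
x_std = StandardScaler().fit_transform(x)

print(x_minmax.ravel())
print(x_std.ravel())
```

Neither scaler changes the shape of the distribution. They only change its location and scale.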

#### Univariate analysis on target column

In [None]:
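counts = heart_copy['attack'].value_counts()
# Look up each class by label so the percentages always match the pie labels below
circle = [counts.get('No attack', 0) / counts.sum() * 100, counts.get('Attack', 0) / counts.sum() * 100]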

fig, ax = plt.subplots(nrows = 1,ncols = 2,figsize = (14,5))
plt.subplot(1,2,1)
sns.histplot(heart_copy['attack'])
plt.title('Cases of Heart Disease');

plt.subplot(1,2,2)
plt.pie(circle, labels = ['No Heart Disease','Heart Disease'],autopct='%1.1f%%',startangle = 90,explode = (0.1,0))
plt.title('Heart Disease %');
plt.show()

**Observations:**
- The two target classes occur with similar frequency, so the dataset is reasonably balanced.
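
One way to put a number on this balance is sketched below. The labels are made up (a 55/45 split is an assumption, not the project's actual class counts), and `balance_ratio` is a hypothetical helper name.

```python
import pandas as pd

# Made-up labels using the same strings as `heart_copy['attack']`.
labels = pd.Series(['Attack'] * 55 + ['No attack'] * 45, name='attack')

# Share of each class, indexed by label so the order cannot be mixed up.
shares = labels.value_counts(normalize=True)

# Ratio of the rarer class to the more common one (1.0 means perfectly balanced).
balance_ratio = shares.min() / shares.max()

print(shares)
print(round(balance_ratio, 2))
```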

### Bivariate Analysis

#### Target variable vs Categorical features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (16,18))
for i in range(len(categorical_features)):
    plt.subplot(2,3,i+1)
    ax = sns.countplot(data = heart_copy, x = categorical_features[i], hue = "attack", edgecolor = 'black')
    # for rect in ax.patches:
    #     ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
    title = categorical_features[i] + ' vs attack'
    # seaborn builds the hue legend from the 'attack' labels; overriding it with plt.legend([...]) can mislabel the bars
    plt.title(title);

**Observations:**
- More **males** in the dataset have had a heart attack than have not.
- Fewer **females** in the dataset have had a heart attack than have not.
- **Asymptomatic** **chest pain** is associated with a greater risk of heart attack.
- **Fasting blood sugar** level shows no direct relation with heart attack.
- A normal **resting ECG** indicates a slightly lower risk of heart attack.
- Patients with **exercise-induced angina** have a higher risk of heart disease.
- Patients with a **flat** ST slope have a very high probability of having a heart attack.
- Patients with a non-zero **number of major vessels colored by fluoroscopy** have a greater risk of heart attack.
- Patients with **reversible thalassemia** are at high risk.

#### Target variable vs Numerical features

In [None]:
# sns.pairplot(heart_copy, hue='attack')
# plt.legend('attack')
# plt.show()

We divide the numerical data into groups and then plot these groups.
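
A minimal sketch of this fixed-width grouping on made-up values is shown below (`oldpeak` here is a toy series and `width` a hypothetical bin width). It also shows why flooring and truncating with `int()` differ once a value is negative.

```python
import numpy as np
import pandas as pd

# Made-up `oldpeak`-like values, including a negative one.
oldpeak = pd.Series([-0.3, 0.0, 0.4, 0.5, 1.2, 2.0])
width = 0.5  # bin width, playing the role of a scaling factor

# Flooring assigns each value to the bin [k * width, (k + 1) * width).
grp = np.floor(oldpeak / width).astype(int)

# Truncating with int() rounds toward zero, so -0.3 lands in bin 0 instead of bin -1.
grp_trunc = oldpeak.apply(lambda v: int(v / width))

print(grp.tolist())        # [-1, 0, 0, 1, 2, 4]
print(grp_trunc.tolist())  # [0, 0, 0, 1, 2, 4]
```

For non-negative data the two approaches give identical groups.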

In [None]:
scaling_factors = {
    'age': 5,
    'bp': 10,
    'chol': 50,
    'maxhr': 20,
    'oldpeak': 0.5 # creates a problem
}
# scale['age']

In [None]:
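for i in numerical_features:
    # Floor instead of int(): int() truncates toward zero and would merge any negative values into group 0
    heart_copy[f"{i}_grp"] = np.floor(heart_copy[i] / scaling_factors[i]).astype(int)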

In [None]:
fig, ax = plt.subplots(nrows = 5,ncols = 1,figsize = (12,28))

for idx, grp in enumerate(numerical_features):
    # if idx+1 == len(numerical_features):
    #     break
    
    grp_name = f"{grp}_grp"
    plt.subplot(5,1,idx+1)
    ax = sns.countplot(data = heart_copy, x = grp_name, hue = "attack", edgecolor = 'black')
    for rect in ax.patches:
        ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 1, rect.get_height(), horizontalalignment='center', fontsize = 11)

    # Build tick labels from the group values actually present, so a missing group cannot shift the labels
    scale = scaling_factors[grp]
    groups = sorted(heart_copy[grp_name].unique())
    ax.set_xticks(range(len(groups)))
    bounds = [(float(g) * scale, float(g) * scale + scale - (1 if scale > 1 else 0.1)) for g in groups]
    ax.set_xticklabels([f"{lo:g}-{hi:g}" for lo, hi in bounds])
    # The hue legend is generated from the 'attack' labels; overriding it with plt.legend([...]) can mislabel the bars
    plt.title(f"{grp_name} vs attack");

**Observations:**
- Patients with **age** > 55 have a very high risk of heart attack.
- A resting **blood pressure** of 110 and above shows a slight risk of heart attack, whereas BP > 160 shows a very high risk.
- **Cholesterol level** > 250 poses a high risk.
- A **maximum heart rate** between 80 and 140 poses a very high risk.
- Patients with **old peak** > 1 also have a high probability of having a heart attack.

### Multivariate Analysis

#### Target variable and Categorical features vs Numerical features

##### Sex vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'sex',y = numerical_features[i],data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs sex")

**Observations:**
- **Males** show a markedly higher heart attack risk for age > 50 and maximum heart rate < 140. For blood pressure, cholesterol and oldpeak, heart attack occurrence does not show any particular range.
- Since there are far fewer **female** data points than male data points, we cannot point to specific ranges or values that indicate heart attack among females.

##### Chest pain type vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'cp',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs cp")

**Observations:**
- **Asymptomatic** chest pain shows a very high heart attack risk. Other chest pain types do not show significant risk.

##### Fasting Blood Sugar vs Numerical features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'fbs',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs fbs")

**Observations:**
- Above the **age** of 50, heart attacks occur irrespective of the fasting blood sugar level.
- Fasting blood sugar < 120 combined with **resting BP > 130** is associated with significant risk.
- Heart attacks occur across the whole **cholesterol** range irrespective of the fasting blood sugar level.
- Patients with fasting blood sugar < 120 and **maximum heart rate below 140** are more prone to heart attack.
- Heart attacks occur at any **oldpeak** value irrespective of the fasting blood sugar level.

##### Resting ECG vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'ecg',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs ecg")

**Observations:**
- Patients with `age` > 50 are more prone to heart disease irrespective of the ECG type.
- Heart disease is diagnosed irrespective of the values of `ecg` and `bp`.
- `chol` > 200 combined with an ST-T wave abnormality in `ecg` shows a higher chance of heart disease.
- `maxhr` > 130 is associated with a high probability of heart attack for all ECG types.

##### Exercise Angina vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'angina',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs angina")

**Observations:**
- The presence of `angina` in a patient is associated with a high risk of heart disease.


##### ST Slope vs Numerical Features

In [None]:
fig, ax = plt.subplots(nrows = 2,ncols = 3,figsize = (15,10))
for i in range(5):
    plt.subplot(2,3,i+1)
    sns.stripplot(x = 'stslope',y = numerical_features[i], data = heart_copy,hue = 'attack')
    # plt.legend(['No Heart attack', 'Heart attack']) # why this does not work?
    plt.title(f"{numerical_features[i]} vs stslope")

**Observations:**
- A flat `stslope` is associated with a very high probability of heart disease.

#### Target variable and Numerical features vs Numerical features

In [None]:
a = 0
fig,ax = plt.subplots(nrows = 5,ncols = 2,figsize = (15,25))
for i in range(len(numerical_features)):
    for j in range(len(numerical_features)):
        if j > i:  # visit each unordered pair of features once
            a += 1
            plt.subplot(5,2,a)
            sns.scatterplot(x = numerical_features[i],y = numerical_features[j],data = heart_copy,hue = 'attack', edgecolor = 'black');
            # plt.legend(['No Heart attack', 'Heart attack']) # why does this not work?
            plt.title(f"{numerical_features[i]} vs {numerical_features[j]}")

**Observations:**
- `age` > 50, `bp` values between 100 and 175, a `chol` level of 200 to 300, `maxhr` < 140 and `oldpeak` > 0 indicate a high risk of heart disease.

## Feature Engineering

### Collinearity Analysis

In [None]:
cols = numerical_features + ['attack']  # new list; numerical_features itself is not modified
cols

In [None]:
sns.pairplot(heart[cols], hue='attack')

In [None]:
corr_matrix = heart[cols].corr()
corr_matrix

In [None]:
sns.heatmap(round(corr_matrix, 2), annot=True)

**Observations:**
- There is little to no correlation between the variables.
- The strongest correlation between two features is `-0.38`, for `age` and `maxhr`.
- The strongest correlations between `attack` and a feature are for `maxhr` (`-0.4`) and `oldpeak` (`0.4`).
- We can check whether removing `oldpeak` gives any performance boost.
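
Reading the strongest pairs off a heatmap by eye can be error-prone. The sketch below ranks feature pairs by absolute Pearson correlation instead. It runs on synthetic data (the frame `toy` and the relationship built into it are assumptions for illustration, not the project's values).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: `maxhr` is built to fall with `age`; `chol` is independent.
age = rng.uniform(30, 75, n)
toy = pd.DataFrame({
    'age': age,
    'maxhr': 200 - 0.8 * age + rng.normal(0, 15, n),
    'chol': rng.normal(230, 40, n),
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair appears once, then rank by |r|.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)

print(pairs)
```

The same ranking could be applied to `corr_matrix` from the cell above.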

### Split data into train and test

In [None]:
heart2 = heart.copy()

In [None]:
heart = pd.get_dummies(heart, drop_first=True)

In [None]:
X = heart.drop(["attack"], axis=1)
y = heart["attack"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify = y, random_state = 101)
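
The effect of `stratify` in the split above can be checked on a small example. The sketch below uses made-up data with a 60/40 class ratio (an assumption, not the project's class distribution) and verifies that both splits keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 60% class 1 and 40% class 0.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 60 + [0] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=101
)

# With stratification, both splits keep the original class ratio.
print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```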

### Data Scaling

In [None]:
scaler = MinMaxScaler()
scaler

In [None]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### K Nearest Neighbours Classifier

In [None]:
KNN_model = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
KNN_model.fit(X_train_scaled, y_train)
y_pred = KNN_model.predict(X_test_scaled)
y_train_pred = KNN_model.predict(X_train_scaled)

# knn_f1 = f1_score(y_test, y_pred)
# knn_acc = accuracy_score(y_test, y_pred)
# knn_recall = recall_score(y_test, y_pred)
# knn_auc = roc_auc_score(y_test, y_pred)

print("Confusion matrix for KNN is:")
print(confusion_matrix(y_test, y_pred))
print("\n")
print("Classification Report for KNN is:")
print(classification_report(y_test, y_pred))

### Logistic Regression

In [None]:
LR_model = LogisticRegression() # The default hyperparameters gave the best accuracy, so we keep them
LR_model.fit(X_train_scaled, y_train)
y_pred = LR_model.predict(X_test_scaled)
y_train_pred = LR_model.predict(X_train_scaled)

# log_f1 = f1_score(y_test, y_pred)
# log_acc = accuracy_score(y_test, y_pred)
# log_recall = recall_score(y_test, y_pred)
# log_auc = roc_auc_score(y_test, y_pred)

print("Confusion matrix for Logistic Regression is:")
print(confusion_matrix(y_test, y_pred))
print("\n")
print("Classification Report for Logistic Regression is:")
print(classification_report(y_test, y_pred))

### SVM

In [None]:
SVM_model = SVC(random_state=42)
SVM_model.fit(X_train_scaled, y_train)
y_pred = SVM_model.predict(X_test_scaled)
y_train_pred = SVM_model.predict(X_train_scaled)

# svm_f1 = f1_score(y_test, y_pred)
# svm_acc = accuracy_score(y_test, y_pred)
# svm_recall = recall_score(y_test, y_pred)
# svm_auc = roc_auc_score(y_test, y_pred)

print("Confusion matrix for SVM is:")
print(confusion_matrix(y_test, y_pred))
print("\n")
print("Classification Report for SVM is:")
print(classification_report(y_test, y_pred))

### Decision Tree

In [None]:
DT_model = DecisionTreeClassifier(class_weight="balanced", random_state=42)
DT_model.fit(X_train_scaled, y_train)
y_pred = DT_model.predict(X_test_scaled)
y_train_pred = DT_model.predict(X_train_scaled)

# dt_f1 = f1_score(y_test, y_pred)
# dt_acc = accuracy_score(y_test, y_pred)
# dt_recall = recall_score(y_test, y_pred)
# dt_auc = roc_auc_score(y_test, y_pred)

print("Confusion matrix for Decision Tree is:")
print(confusion_matrix(y_test, y_pred))
print("\n")
print("Classification Report for Decision Tree is:")
print(classification_report(y_test, y_pred))
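
The commented-out lines in the model cells refer to `f1_score`, `accuracy_score`, `recall_score` and `roc_auc_score`, which are not imported at the top of the notebook. A minimal sketch of how such metrics relate to the confusion matrix is given below, on made-up binary labels (`y_true` and `y_hat` are assumptions, not model output). If `attack` takes more than two values, `f1_score` and `recall_score` would additionally need an `average=` argument.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Made-up true labels and predictions for a binary problem.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

acc = accuracy_score(y_true, y_hat)  # (TP + TN) / total
rec = recall_score(y_true, y_hat)    # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)         # harmonic mean of precision and recall

print(tn, fp, fn, tp)
print(acc, rec, f1)
```

For a screening task like this one, recall on the positive class is often worth reporting alongside accuracy, since a missed positive case is usually the costlier error.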