# Heart Attack Prediction

This Jupyter Notebook was created for the **Biotech Final Year Project** at **MNNIT Allahabad, Dept of Biotechnology**.   
It contains code to predict the risk of a heart attack from health and heart-related parameters using several Machine Learning techniques.

This notebook and all other relevant files are available on [GitHub](https://github.com/agg-geek/HeartAttackPrediction).



### Project Supervisor:
Dr. Ashutosh Mani,  
Associate Professor, Department of Biotechnology

### Project team members:
- Abhinav Aggarwal, 20200003
- Ratna Rathaur, 20200041
- Shivam Pandey, 20200049

### Import packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
matplotlib.style.use('ggplot')
# matplotlib.style.use('fivethirtyeight')
# matplotlib.style.use('seaborn-v0_8')

### Import dataset

In [None]:
column_names = ['age', 'sex', 'cp', 'bp', 'chol', 'fbs', 'ecg', 'maxhr', 'angina', 'oldpeak', 'stslope', 'vessel', 'thal', 'attack']
heart = pd.read_csv('dataset/cleveland.data', names=column_names, sep=',', na_values='?')
heart.sample(5)

### About the dataset


- `age`: Age of the patient (years)
- `sex`: Sex of the patient (1: Male or 0: Female)
- `cp`:  Chest pain type (0: Typical Angina, 1: Atypical Angina, 2: Non-Anginal Pain, 3: Asymptomatic)
- `bp`:  Resting blood pressure (mm Hg)
- `chol`:  Cholesterol level (mg/dL)
- `fbs`: Fasting blood sugar (1: if fbs > 120 mg/dl, 0: otherwise)
- `ecg`: Resting ECG results
    - 0: Normal
    - 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- `maxhr`: Maximum heart rate achieved (bpm)  
- `angina`: Exercise Induced Angina (1: Yes, 0: No)
- `oldpeak`: ST depression induced by exercise relative to rest
- `stslope`: Slope of the peak exercise ST segment (0: upsloping, 1: flat, 2: downsloping)
- `vessel`: Number of major vessels colored by fluoroscopy (0 - 4)
- `thal`: Thalassemia blood disorder (3: normal, 6: fixed defect, 7: reversible defect)
- `attack`: Target variable (0: no heart attack, 1 - 4: heart attack)
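
Integer codes for columns such as `cp`, `stslope` and `vessel` can differ between distributions of this dataset, so it may be worth comparing the codes actually present in the file with the list above. The following is a minimal sketch: `summarize_codes` is a hypothetical helper, and the small frame is made-up stand-in data rather than rows from `heart`.

```python
import pandas as pd

def summarize_codes(df, columns):
    """Return the sorted distinct non-missing values of each listed column."""
    return {col: sorted(df[col].dropna().unique().tolist()) for col in columns}

# Made-up stand-in rows; in the notebook, pass `heart` instead.
sample = pd.DataFrame({
    'cp': [1.0, 4.0, 3.0, 4.0],
    'stslope': [3.0, 2.0, 2.0, 1.0],
    'vessel': [0.0, 3.0, None, 2.0],
})

codes = summarize_codes(sample, ['cp', 'stslope', 'vessel'])
for col, values in codes.items():
    print(f"{col}: {values}")
```

Running the same helper on `heart` with the categorical columns shows at a glance whether any code falls outside the documented range.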

## Initial Inference

In [None]:
heart.info()

**Observations:**
- There are 303 instances.
- There are 13 features and 1 target variable.
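- All features have datatype `float64`. Many of them hold only whole numbers and could be stored as integers. Note that `int64` uses the same 8 bytes per value as `float64`, so saving space requires a narrower type such as `int8` or `int16`.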
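
A sketch of such a conversion is shown below. It assumes missing values have already been removed, since a `float64` column containing `NaN` cannot be cast to a NumPy integer type. `to_int_where_whole` is a hypothetical helper and the frame holds made-up values.

```python
import pandas as pd

def to_int_where_whole(df):
    """Cast float64 columns to the smallest integer type when all values are whole."""
    out = df.copy()
    for col in out.select_dtypes(include='float64').columns:
        values = out[col]
        # Only convert complete columns whose values have no fractional part.
        if values.notna().all() and (values % 1 == 0).all():
            out[col] = pd.to_numeric(values.astype('int64'), downcast='integer')
    return out

# Made-up stand-in rows; in the notebook, pass `heart` after dropping NaNs.
sample = pd.DataFrame({
    'age': [63.0, 67.0, 41.0],
    'oldpeak': [2.3, 1.5, 0.0],
    'vessel': [0.0, 3.0, 2.0],
})
converted = to_int_where_whole(sample)
print(converted.dtypes)
```

Columns with fractional values, such as `oldpeak` here, are left as `float64`.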

In [None]:
heart.isnull().sum()

**Observations:**  
There are 4 missing values in `vessel` and 2 missing values in `thal`.  
We can either remove the instances that contain them or fill the missing values.  
No additional data is available to recover the true values, so filling would mean imputing them, for example with the mean of the corresponding column.  
Since only 6 instances are affected, we simply remove them.
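
For comparison, both options can be sketched on a toy frame. Because `vessel` and `thal` hold category codes, the fill variant below uses the column mode rather than the mean. That is a swapped-in choice for illustration, not what this notebook does, and the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with one missing value per column.
sample = pd.DataFrame({
    'vessel': [0.0, 0.0, np.nan, 2.0, 1.0],
    'thal': [3.0, 7.0, 7.0, np.nan, 7.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = sample.dropna()

# Option 2: fill each column with its most frequent value (mode).
modes = sample.mode().iloc[0]
filled = sample.fillna(modes)

print(f"rows after dropping: {len(dropped)} of {len(sample)}")
print(filled)
```

Dropping keeps only observed values at the cost of fewer rows, while filling keeps every row but introduces values that were never measured.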

In [None]:
heart.dropna(inplace=True)

In [None]:
heart.duplicated().sum()

**Observation:**  
There are no duplicated rows.


In [None]:
# heart['attack']
heart['attack'].value_counts()

**Observations:**  
As mentioned in the dataset description, the target column `attack` is a categorical column with 0 denoting no heart attack and other values denoting heart attack.  
We will transform the column such that 0 indicates no heart attack and 1 indicates heart attack.

In [None]:
heart['attack'] = heart['attack'].apply(lambda x: 0 if x == 0 else 1)
# heart['attack']
heart['attack'].value_counts()

In [None]:
heart.describe()

## Exploratory Data Analysis

### Utility functions

In [None]:
def check_balance(df, target_column, risk_value, not_risk_value):
    risk = len(df[df[target_column] == risk_value])
    no_risk = len(df[df[target_column] == not_risk_value])
    total = risk + no_risk
    # print(risk, no_risk, total)
    # Round to two decimals so the output stays readable.
    print(f"Percentage Risk: {risk / total * 100:.2f}%")
    print(f"Percentage Not Risk: {no_risk / total * 100:.2f}%")

In [None]:
categorical_features = []
numerical_features = []

for col in list(heart.columns)[:-1]:
    if heart[col].nunique() > 5:
        numerical_features.append(col)
    else:
        categorical_features.append(col)

print('Categorical Features :', *categorical_features)
print('Numerical Features :', *numerical_features)

In [None]:
# # %matplotlib inline
# heart.hist(bins=15, figsize=(16,10)) #figsize = (width, height)
# plt.show()

### Univariate Analysis

#### Univariate analysis on categorical columns

In [None]:
plt.figure(figsize=(10,8))
for i, col in enumerate(categorical_features, 1):
    plt.subplot(3,3,i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(heart[col])
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
plt.show()

In [None]:
heart[categorical_features].skew().sort_values(ascending=False)

**Observations:**  
- The feature values do not occur with uniform frequency. This may be because some categories are genuinely more common than others, or it may reflect poor data collection.
- The features are not normally distributed (Gaussian). This may limit the performance of models that assume normally distributed data.
<!-- Standardization using `StandardScaler` shouldn't be used to scale the data.  Normalization should be performed so something like `MinMaxScaler` can be used instead. -->
<!-- - Scales for the features are different, will require feature scaling.  -->
<!-- - Several numeric features are actually categorical. -->
<!-- - **Categorical Features:** `sex`, `cp`, `fbs`, `recg`, `angina`, `stslope`, `vessel`, `thal`, and `attack`.   -->
<!-- - **Continuous Features:** `age`, `bp`, `chol`, `maxhr`, `oldpeak`. -->

#### Univariate analysis on numerical columns

In [None]:
plt.figure(figsize=(10,8))
for i, col in enumerate(numerical_features, 1):
    plt.subplot(3,3,i)
    plt.title(f"Distribution of {col} Data")
    # sns.kdeplot(heart[col], linewidth=1)
    sns.histplot(heart[col], kde=True, line_kws={'lw':1.5}, stat='density')
    # sns.histplot(heart[col], kde=True, line_kws={'lw':1.5}, stat='density', kde_kws=dict(cut=3))
    # sns.distplot(heart[col], kde_kws={'bw':1})
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
plt.show()

In [None]:
heart[numerical_features].skew().sort_values(ascending=False)

**Observations:**  
- The features are on different scales, so feature scaling is required.
- The distributions are not normal (Gaussian), which may limit the performance of models that assume normally distributed data.
- For the same reason, normalization (e.g. `MinMaxScaler`) is preferred here over standardization with `StandardScaler`.
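
Min-max normalization rescales each feature to the range [0, 1] using (x - min) / (max - min). A minimal pandas sketch on made-up values follows. `min_max_scale` is a hypothetical helper, and scikit-learn's `MinMaxScaler` with its default `feature_range=(0, 1)` applies the same formula.

```python
import pandas as pd

def min_max_scale(df):
    """Rescale every column to [0, 1] using its own minimum and maximum."""
    return (df - df.min()) / (df.max() - df.min())

# Made-up values on two very different scales.
sample = pd.DataFrame({
    'age': [29.0, 54.0, 77.0],
    'chol': [126.0, 240.0, 564.0],
})
scaled = min_max_scale(sample)
print(scaled)
```

This helper divides by zero for a constant column, so such columns would need separate handling. When a model is trained later, the minimum and maximum should be taken from the training split only.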

#### Univariate analysis on target column

In [None]:
check_balance(heart, 'attack', 1, 0)

**Observations:**
- The two target classes occur with similar frequency, so the dataset is reasonably balanced.
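
The class shares can also be read off directly with `value_counts(normalize=True)`. The sketch below uses a made-up target column, not the real data.

```python
import pandas as pd

# Made-up binary target; in the notebook, use heart['attack'] instead.
target = pd.Series([0, 1, 0, 1, 1, 0, 0, 1, 0, 0], name='attack')

# Relative frequency of each class, expressed as a percentage.
shares = target.value_counts(normalize=True).sort_index() * 100
for label, pct in shares.items():
    print(f"attack={label}: {pct:.1f}%")
```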