# Heart Attack Analysis and Prediction

## About Dataset

This dataset contains medical records for patients, with health indicators that can be used to analyze and predict the risk of a heart attack.

## Source

The dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

## Data Dictionary

The dataset includes the following features:

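- **age**: Age of the patient.
- **sex**: Sex of the patient.
- **cp**: Chest pain type. The rows previewed under Load Data contain the codes 0 to 3, so the numbering in the file differs from the 1 to 4 listed here.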
  - Value 1: Typical angina
  - Value 2: Atypical angina
  - Value 3: Non-anginal pain
  - Value 4: Asymptomatic
- **trtbps**: Resting blood pressure (in mm Hg)
- **chol**: Cholesterol in mg/dl fetched via BMI sensor
- **fbs**: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- **rest_ecg**: Resting electrocardiographic results
  - Value 0: Normal
  - Value 1: Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: Showing probable or definite left ventricular hypertrophy by Estes' criteria
- **thalachh**: Maximum heart rate achieved
- **exng**: Exercise-induced angina (1 = yes; 0 = no)
- **oldpeak**: ST depression induced by exercise relative to rest (numeric).
- **slp**: Slope of the peak exercise ST segment (values 0, 1, and 2).
- **caa**: Number of major vessels (0-3)
- **thall**: Thalassemia level in the patient's blood (categorical; values 0, 1, 2, and 3).
- **output**: Heart attack risk indicator (0 = less chance of heart attack, 1 = more chance of heart attack)

## Problem Statements

1. **Model Training**: Train a classification model on this data to predict the `output` label.
2. **Model Evaluation**: Evaluate the model with several metrics: accuracy, precision, recall, and F1 score.
3. **Model Optimization**: Tune the hyperparameters to find a model that avoids both overfitting and underfitting and improves performance.
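
The four metrics named above all derive from the counts of a binary confusion matrix. As a minimal sketch, the example below computes them by hand. The counts `tp`, `fp`, `fn`, `tn` are made up for illustration and are not taken from this dataset, and `classification_metrics` is a hypothetical helper, not part of scikit-learn.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn, tn = 40, 5, 10, 45


def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Precision penalizes false alarms, recall penalizes missed positive cases, and F1 balances the two.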

### Load Libraries

In [80]:
# General Libraries
import pandas as pd
import numpy as np
import os
import warnings

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model
from sklearn.tree import DecisionTreeClassifier

# Evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

# Model Save
import pickle

### Settings

In [84]:
# Warning
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
model_path = "../models"
# csv_path = os.path.join(data_path, "heart_wo.csv")
# csv_path = os.path.join(data_path, "heart.csv")
csv_path = os.path.join(data_path, "heart_selected.csv")

### Load Data

In [65]:
df = pd.read_csv(csv_path)

In [66]:
df.head()

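| Unnamed: 0 | age | sex | cp | trtbps | chol | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |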


### Data Preparation for Training

In [67]:
# Separate the input and output features
# Note: this positional slice also keeps "Unnamed: 0" as an input feature
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
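
The positional slice above keeps every column except the last, which includes `Unnamed: 0`. pandas gives that name to a header-less first column, which is what a CSV saved together with its row index contains. If the column is only a row counter, one option is to read it back as the index so it never reaches the model. This is a sketch under that assumption; the first three preview rows are inlined to keep it self-contained, and the `*_demo` names are hypothetical.

```python
import io

import pandas as pd

# First three rows of the preview, with the empty header a saved index produces.
csv_text = """,age,sex,cp,trtbps,chol,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,150,0,2.3,0,0,1,1
1,37,1,2,130,250,187,0,3.5,0,0,2,1
2,41,0,1,130,204,172,0,1.4,2,0,2,1
"""

# index_col=0 treats the first column as the row index instead of a feature.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Select features by name rather than by position.
feature_cols = [col for col in df_demo.columns if col != "output"]
X_demo = df_demo[feature_cols]
y_demo = df_demo["output"]

print(list(X_demo.columns))
```

Selecting by name also keeps the split correct if the column order of the file changes.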

In [68]:
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [69]:
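# Scale the data
# X_s (all rows, scaled with statistics from all rows) is used only by the grid search
X_s = StandardScaler().fit_transform(X)
# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)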
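
One way to keep the scaling statistics tied to the training split is to wrap the scaler and the classifier in a scikit-learn pipeline. The sketch below is an assumption about how that could look, not the notebook's own method. It uses synthetic data from `make_classification` in place of the CSV, and every `*_demo` name is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (11 features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=42)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only;
# score() and predict() reuse those statistics on unseen rows.
pipe_demo = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
pipe_demo.fit(X_tr_demo, y_tr_demo)

test_accuracy = pipe_demo.score(X_te_demo, y_te_demo)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Decision trees split on thresholds and are largely insensitive to feature scaling, so the scaler matters most if the classifier is later swapped for a scale-sensitive one.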

In [70]:
# Train a model and evaluate
def train_evaluate(model):
    """Fit `model` on the scaled training split and print train and test metrics."""
    # Train the model
    model.fit(X_train_s, y_train)

    splits = [
        ("Training Scores", X_train_s, y_train),
        ("Testing Scores", X_test_s, y_test),
    ]
    # Predict on each split and print its evaluation metrics
    for title, X_split, y_true in splits:
        y_pred = model.predict(X_split)
        print(title)
        print(f"Accuracy: {accuracy_score(y_true, y_pred): .2f}")
        print(f"Precision: {precision_score(y_true, y_pred): .2f}")
        print(f"Recall: {recall_score(y_true, y_pred): .2f}")
        print(f"F1: {f1_score(y_true, y_pred): .2f}")

In [72]:
model = DecisionTreeClassifier()
train_evaluate(model)

Training Scores
Accuracy:  1.00
Precision:  1.00
Recall:  1.00
F1:  1.00
Testing Scores
Accuracy:  0.85
Precision:  0.93
Recall:  0.78
F1:  0.85
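
Perfect training scores next to lower test scores are the usual signature of an unpruned tree memorizing its training data. Cross-validation can expose the same gap with less dependence on a single split. The following is a hedged sketch on synthetic data from `make_classification`, not on the heart dataset; the `*_demo` names and the choice of five folds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (11 features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=42)

# return_train_score=True reports accuracy on the training folds as well,
# so the train/validation gap is visible for every fold.
cv_result = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X_demo,
    y_demo,
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)

train_mean = cv_result["train_score"].mean()
test_mean = cv_result["test_score"].mean()
print(f"Mean train accuracy: {train_mean:.2f}")
print(f"Mean validation accuracy: {test_mean:.2f}")
```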


### Model Optimization

Find the optimal model using hyperparameter tuning.

In [55]:
# Define the hyperparameter dictionary
dt_dict = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [None, 2, 3, 4, 5, 6, 7, 8, 9],
    "min_samples_split": [2, 3, 5],
    "min_samples_leaf": [1, 2, 3, 4, 5],
    # Note: scikit-learn requires max_leaf_nodes to be None or >= 2, so candidates with 1 fail to fit
    "max_leaf_nodes": [None, 1, 2, 3],
    "min_impurity_decrease": [0.0, 0.001, 0.01, 0.1]
}
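
The size of a grid is the product of the number of values per hyperparameter, and it can be checked before launching a search. The sketch below counts the candidates with scikit-learn's `ParameterGrid`. The grid is copied from the cell above, and `n_candidates`, `n_fits`, and `n_valid` are illustrative names. It also counts how many candidates remain once the `max_leaf_nodes=1` setting, which scikit-learn rejects, is excluded.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
dt_dict = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [None, 2, 3, 4, 5, 6, 7, 8, 9],
    "min_samples_split": [2, 3, 5],
    "min_samples_leaf": [1, 2, 3, 4, 5],
    "max_leaf_nodes": [None, 1, 2, 3],
    "min_impurity_decrease": [0.0, 0.001, 0.01, 0.1],
}

grid = ParameterGrid(dt_dict)
n_candidates = len(grid)   # 2 * 2 * 9 * 3 * 5 * 4 * 4
n_fits = n_candidates * 8  # one fit per candidate per fold, with cv=8

# Candidates that a decision tree can actually be fitted with.
n_valid = sum(1 for params in grid if params["max_leaf_nodes"] != 1)

print(n_candidates)  # 8640
print(n_fits)        # 69120
print(n_valid)       # 6480
```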

In [56]:
# Tune the hyperparameters using GridSearchCV and find the best parameters
def tune_hyperparameter(model):
    # Configure the grid search over dt_dict
    gsc = GridSearchCV(estimator=model,
                       param_grid=dt_dict,
                       cv=8,
                       verbose=1,
                       scoring="precision")
    # Run the search; note that X_s and y cover all rows, including the test split
    gsc.fit(X_s, y)

    print(f"Best Score: {gsc.best_score_}")
    return gsc.best_params_

In [57]:
dtree = DecisionTreeClassifier()
best_params = tune_hyperparameter(dtree)
print(f"Best Parameters: {best_params}")

Fitting 8 folds for each of 8640 candidates, totalling 69120 fits
Best Score: 0.8602106889666622
Best Parameters: {'criterion': 'gini', 'max_depth': 4, 'max_leaf_nodes': 3, 'min_impurity_decrease': 0.001, 'min_samples_leaf': 5, 'min_samples_split': 3, 'splitter': 'random'}
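
The search above is fitted on all rows, so the held-out test split also influences which parameters are chosen. A possible alternative, sketched here as an assumption and not as the notebook's method, is to tune on the training split only and to put the scaler inside the pipeline so that it is refitted within each fold. The data is synthetic, the grid is deliberately small, and the step names `scaler` and `tree` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (11 features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=42)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid_demo = {
    "tree__max_depth": [2, 4, None],
    "tree__min_samples_leaf": [1, 5],
}

search_demo = GridSearchCV(pipe_demo, param_grid_demo, cv=5, scoring="precision")
search_demo.fit(X_tr_demo, y_tr_demo)  # the test split is never seen during tuning

# GridSearchCV refits the best candidate on the full training split by default.
test_precision = precision_score(y_te_demo, search_demo.predict(X_te_demo))
print(search_demo.best_params_)
print(f"Held-out precision: {test_precision:.2f}")
```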


In [79]:
# Define the model with best params
omodel = DecisionTreeClassifier(**best_params)

# Train and evaluate the tuned model
train_evaluate(omodel)

Training Scores
Accuracy:  0.76
Precision:  0.80
Recall:  0.74
F1:  0.77
Testing Scores
Accuracy:  0.80
Precision:  0.92
Recall:  0.69
F1:  0.79


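### Conclusion

The model with default hyperparameters gives the best test performance on this dataset (accuracy 0.85 and F1 0.85, against 0.80 and 0.79 for the tuned model), even though its perfect training scores show that it overfits. The default model is therefore the one saved below.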

In [85]:
# Save the default-hyperparameter model (`model`)
dt_model_path = os.path.join(model_path, "dt_model.pkl")
with open(dt_model_path, "wb") as model_file:
    pickle.dump(model, model_file)
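
The pickle above stores only the classifier, while `train_evaluate` trained it on inputs transformed by `scaler`. If predictions will later be made from raw feature values, one option is to save the fitted scaler alongside the model. This is a sketch under that assumption, using synthetic data and a temporary directory; the file name `dt_bundle.pkl` and the `*_demo` names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (11 features, binary target).
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=42)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = DecisionTreeClassifier(random_state=42).fit(
    scaler_demo.transform(X_demo), y_demo
)
expected = model_demo.predict(scaler_demo.transform(X_demo))

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "dt_bundle.pkl")
    # Store both fitted objects so the preprocessing travels with the model.
    with open(bundle_path, "wb") as bundle_file:
        pickle.dump({"scaler": scaler_demo, "model": model_demo}, bundle_file)
    # Load them back as a consumer of the saved file would.
    with open(bundle_path, "rb") as bundle_file:
        bundle = pickle.load(bundle_file)

restored = bundle["model"].predict(bundle["scaler"].transform(X_demo))
round_trip_ok = bool(np.array_equal(expected, restored))
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.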