# Capstone project: Predicting heart disease

* [INTRODUCTION](#0)
* [IMPORTING MODULES, LOADING DATA & DATA REVIEW](#1)
* [PREPROCESSING](#2)
* [EXPLORATORY DATA ANALYSIS (EDA)](#3)    
* [SCALING, CATEGORICAL VARIABLES, SPLITTING](#4)
* [MODELS](#5)
* [CONCLUSION](#6)

## 1.1 Information About the Project
Cardiovascular disease, which includes conditions such as coronary artery disease and stroke, is the leading cause of mortality worldwide. The global burden of heart disease has also increased in recent years, from around 12.4 million deaths in 1990 to 19.8 million deaths in 2022, reflecting both population growth and ageing. The prevalence of cardiovascular disease varies considerably by country and is associated with demographic factors such as age, sex and ethnicity, as well as health-related and behavioural factors such as diet, cholesterol, diabetes, air pollution, obesity, tobacco use, kidney disease, physical inactivity, harmful use of alcohol, and stress.

**Objective:**  
Being able to detect a patient's risk of heart disease from these factors could support earlier clinical and behavioural interventions. This project aims to build, fine-tune and deploy a binary classification model that accurately predicts whether or not a patient has had a heart attack, given information about that patient's demographic, health and behavioural characteristics.


**Scope:**  
The project covers exploratory data analysis of this large dataset and feature engineering to maximise the utility and efficiency of the available data for predicting heart attacks. Different types of binary classification models will be tested, compared and optimised, with the best-performing model taken through to deployment.

## 1.2 Description of the Dataset
The dataset includes demographic, health and behavioural characteristics for more than 237,000 patients from the USA, as well as information on whether or not they have suffered a heart attack.

- **Source:** The dataset is an open dataset from [Kaggle](https://www.kaggle.com/datasets/tarekmuhammed/patients-data-for-medical-field/data)
- **Size:** Total number of records: 237630, total number of columns: 35
- **Type:** Tabular

## 1.3 Description of the Columns

- **Target Variable:** 
HadHeartAttack: Indicator of whether the patient had a heart attack. This is a binary label (0 = no heart attack, 1 = had heart attack). We are trying to predict, from a patient's demographic, health and behavioural characteristics, whether or not they have had a heart attack.

- **Feature Variables:** A brief description of the columns.
  - PatientID: Unique identifier for each patient.
  - State: Geographic state of residence.
  - Sex: Gender of the patient.
  - GeneralHealth: Self-reported health status.
  - AgeCategory: Categorized age group of the patient.
  - HeightInMeters: Height of the patient (in meters).
  - WeightInKilograms: Weight of the patient (in kilograms).
  - BMI: Body Mass Index, calculated from height and weight.
  - HadAngina: Indicator of whether the patient experienced angina.
  - HadStroke: Indicator of whether the patient had a stroke.
  - HadAsthma: Indicator of whether the patient has asthma.
  - HadSkinCancer: Indicator of whether the patient had skin cancer.
  - HadCOPD: Indicator of whether the patient had chronic obstructive pulmonary disease (COPD).
  - HadDepressiveDisorder: Indicator of whether the patient was diagnosed with a depressive disorder.
  - HadKidneyDisease: Indicator of whether the patient had kidney disease.
  - HadArthritis: Indicator of whether the patient had arthritis.
  - HadDiabetes: Indicator of whether the patient had diabetes.
  - DeafOrHardOfHearing: Indicator of hearing impairment.
  - BlindOrVisionDifficulty: Indicator of vision impairment.
  - DifficultyConcentrating: Indicator of concentration difficulties.
  - DifficultyWalking: Indicator of walking difficulties.
  - DifficultyDressingBathing: Indicator of difficulties in dressing or bathing.
  - DifficultyErrands: Indicator of difficulties in running errands.
  - SmokerStatus: Status of whether the patient is a smoker.
  - ECigaretteUsage: Indicator of e-cigarette usage.
  - ChestScan: Indicator of whether the patient had a chest scan.
  - RaceEthnicityCategory: Race or ethnicity of the patient.
  - AlcoholDrinkers: Status of whether the patient consumes alcohol.
  - HIVTesting: Status of whether the patient was tested for HIV.
  - FluVaxLast12: Status of whether the patient received a flu vaccine in the last 12 months.
  - PneumoVaxEver: Status of whether the patient ever received a pneumococcal vaccine.
  - TetanusLast10Tdap: Status of whether the patient received a tetanus vaccine in the last 10 years.
  - HighRiskLastYear: Indicator of whether the patient was at high risk in the last year.
  - CovidPos: Status of whether the patient tested positive for COVID-19.

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Modules, Load Data & Data Review</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [2]:
# Load common libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [3]:
df= pd.read_excel(r"C:\Users\BalintStewart\OneDrive - Health Data Research\Desktop\magnimind-assignments\Capstone project\Heart Disease\data\Patients Data ( Used for Heart Disease Prediction ).xlsx")

In [4]:
df.head()

Unnamed: 0,PatientID,State,Sex,GeneralHealth,AgeCategory,HeightInMeters,WeightInKilograms,BMI,HadHeartAttack,HadAngina,...,ECigaretteUsage,ChestScan,RaceEthnicityCategory,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,1,Alabama,Female,Fair,Age 75 to 79,1.63,84.82,32.099998,0,1,...,Never used e-cigarettes in my entire life,1,"White only, Non-Hispanic",0,0,0,1,"No, did not receive any tetanus shot in the pa...",0,1
1,2,Alabama,Female,Very good,Age 65 to 69,1.6,71.669998,27.99,0,0,...,Never used e-cigarettes in my entire life,0,"White only, Non-Hispanic",0,0,1,1,"Yes, received Tdap",0,0
2,3,Alabama,Male,Excellent,Age 60 to 64,1.78,71.209999,22.530001,0,0,...,Never used e-cigarettes in my entire life,0,"White only, Non-Hispanic",1,0,0,0,"Yes, received tetanus shot but not sure what type",0,0
3,4,Alabama,Male,Very good,Age 70 to 74,1.78,95.25,30.129999,0,0,...,Never used e-cigarettes in my entire life,0,"White only, Non-Hispanic",0,0,1,1,"Yes, received tetanus shot but not sure what type",0,0
4,5,Alabama,Female,Good,Age 50 to 54,1.68,78.019997,27.76,0,0,...,Never used e-cigarettes in my entire life,1,"Black only, Non-Hispanic",0,0,1,0,"No, did not receive any tetanus shot in the pa...",0,0


In [5]:
df.columns

Index(['PatientID', 'State', 'Sex', 'GeneralHealth', 'AgeCategory',
       'HeightInMeters', 'WeightInKilograms', 'BMI', 'HadHeartAttack',
       'HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD',
       'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis',
       'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty',
       'DifficultyConcentrating', 'DifficultyWalking',
       'DifficultyDressingBathing', 'DifficultyErrands', 'SmokerStatus',
       'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory',
       'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12', 'PneumoVaxEver',
       'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos'],
      dtype='object')

## <p style="background-color:#fea162; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Preprocessing</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Check for missing values in the data

In [6]:
df.isna().sum()
# No missing values in the data

PatientID                    0
State                        0
Sex                          0
GeneralHealth                0
AgeCategory                  0
HeightInMeters               0
WeightInKilograms            0
BMI                          0
HadHeartAttack               0
HadAngina                    0
HadStroke                    0
HadAsthma                    0
HadSkinCancer                0
HadCOPD                      0
HadDepressiveDisorder        0
HadKidneyDisease             0
HadArthritis                 0
HadDiabetes                  0
DeafOrHardOfHearing          0
BlindOrVisionDifficulty      0
DifficultyConcentrating      0
DifficultyWalking            0
DifficultyDressingBathing    0
DifficultyErrands            0
SmokerStatus                 0
ECigaretteUsage              0
ChestScan                    0
RaceEthnicityCategory        0
AlcoholDrinkers              0
HIVTesting                   0
FluVaxLast12                 0
PneumoVaxEver                0
TetanusL

## Check for duplicate values

In [69]:
df.duplicated().sum()

0

There are no missing or duplicate values in the dataset.

In [7]:
df.SmokerStatus.value_counts()

SmokerStatus
Never smoked                             142390
Former smoker                             66193
Current smoker - now smokes every day     21148
Current smoker - now smokes some days      7899
Name: count, dtype: int64

In [None]:
# Quick and dirty RF and XGBoost, using age (ordinally encoded), sex, BMI and smoker status

In [9]:
df.AgeCategory.value_counts()

AgeCategory
Age 65 to 69       27547
Age 60 to 64       25685
Age 70 to 74       24946
Age 55 to 59       21422
Age 50 to 54       19154
Age 75 to 79       17679
Age 80 or older    17544
Age 40 to 44       16228
Age 45 to 49       16095
Age 35 to 39       14982
Age 30 to 34       12825
Age 18 to 24       12777
Age 25 to 29       10746
Name: count, dtype: int64

In [10]:
# Encode AgeCategory. Try ordinal encoding (should work well for tree models, might not work so well for linear models)
df['age'] = df['AgeCategory'].map({'Age 18 to 24': 0,
                                   'Age 25 to 29': 1,
                                   'Age 30 to 34': 2,
                                   'Age 35 to 39': 3,
                                   'Age 40 to 44': 4,
                                   'Age 45 to 49': 5,
                                   'Age 50 to 54': 6,
                                   'Age 55 to 59': 7,
                                   'Age 60 to 64': 8,
                                   'Age 65 to 69': 9,
                                   'Age 70 to 74': 10,
                                   'Age 75 to 79': 11,
                                   'Age 80 or older': 12})

df.age.value_counts()

age
9     27547
8     25685
10    24946
7     21422
6     19154
11    17679
12    17544
4     16228
5     16095
3     14982
2     12825
0     12777
1     10746
Name: count, dtype: int64
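
The same age encoding could also be produced from a single ordered list of labels using an ordered pandas categorical, which avoids maintaining a hand-written dictionary. This is only a sketch on a small made-up Series; the names `age_order` and `sample` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Age bands in ascending order, as they appear in AgeCategory
age_order = [
    "Age 18 to 24", "Age 25 to 29", "Age 30 to 34", "Age 35 to 39",
    "Age 40 to 44", "Age 45 to 49", "Age 50 to 54", "Age 55 to 59",
    "Age 60 to 64", "Age 65 to 69", "Age 70 to 74", "Age 75 to 79",
    "Age 80 or older",
]

# Hypothetical sample values, for illustration only
sample = pd.Series(["Age 75 to 79", "Age 18 to 24", "Age 80 or older"])

# An ordered categorical assigns integer codes 0..12 following age_order
codes = pd.Categorical(sample, categories=age_order, ordered=True).codes
print(codes.tolist())  # [11, 0, 12]
```

Any label missing from `age_order` would receive the code -1, which makes unexpected categories easy to spot.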

In [11]:
# One-hot encode smoker status? Try ordinal encoding first (since we assume some ordinality in risk) to make model training and testing more efficient
# Ordinal encoding for SmokerStatus
df['smoker'] = df['SmokerStatus'].map({"Never smoked": 0,
                                       "Former smoker": 1,
                                       "Current smoker - now smokes some days": 2,
                                       "Current smoker - now smokes every day": 3})


In [16]:
# Encode sex through binary encoding (no need to one-hot encode as only 2 categories)
df['sex'] = df['Sex'].map({'Female': 0, 'Male': 1})


# Try some models for training and eval speed

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score

In [14]:
df['HadHeartAttack'].value_counts()

HadHeartAttack
0    224429
1     13201
Name: count, dtype: int64
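
The counts above show a strongly imbalanced target. As a rough worked example using only those two counts, the positive rate and the ratio of negatives to positives can be computed directly. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
n_negative = 224429
n_positive = 13201

# Share of patients who had a heart attack
positive_rate = n_positive / (n_negative + n_positive)

# Number of negative cases for every positive case
imbalance_ratio = n_negative / n_positive

print(f"Positive rate: {positive_rate:.2%}")             # 5.56%
print(f"Negatives per positive: {imbalance_ratio:.1f}")  # 17.0
```

With only about 5.6% positives, a model that always predicts "no heart attack" would already reach roughly 94% accuracy, which is one reason to look at recall and ROC AUC rather than accuracy.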

In [27]:
df['HadAngina'].value_counts()

HadAngina
0    223013
1     14617
Name: count, dtype: int64

In [33]:
# split the data into features and target
X = df[['age', 'sex', 'smoker', 'BMI']]
y = df['HadHeartAttack']

In [34]:
X.shape

(237630, 4)

In [35]:
y.shape

(237630,)

In [None]:
# Split the data into train and test sets. Stratify on the target so that both splits keep the same class proportions as the full dataset (the classes remain imbalanced in each split)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)
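
To illustrate what `stratify` does, here is a minimal sketch on synthetic labels with 5% positives (not the real data; `X_demo` and `y_demo` are made-up names). Both splits should end up with approximately the same positive rate as the full set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 50 positives and 950 negatives (illustrative only)
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Default test_size is 0.25, so 750 rows go to train and 250 to test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Both splits keep roughly the original 5% positive rate
print(f"Train positive rate: {y_tr.mean():.3f}")
print(f"Test positive rate:  {y_te.mean():.3f}")
```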

In [48]:
# initialise the RF classifier
rf_clf = RandomForestClassifier(class_weight = 'balanced',n_estimators = 100, random_state = 42)

# train the model
rf_clf.fit(X_train, y_train)

# make predictions
y_pred = rf_clf.predict(X_test)
y_pred_proba = rf_clf.predict_proba(X_test)[:,1] # get the predicted probabilities for the positive class

# Recall
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")

# ROC AUC
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC AUC: {roc_auc:.2f}")

Recall: 0.32
ROC AUC: 0.63
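
The two metrics measure different things: recall is computed from hard class predictions (by default using a 0.5 probability cut-off), while ROC AUC is computed from the predicted probabilities and does not depend on any single cut-off. A small toy example with made-up labels and scores, not taken from the model above, shows the difference.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.90])

# Hard predictions at a 0.5 cut-off: only one of the two positives is caught
y_hat = (y_score >= 0.5).astype(int)

toy_recall = recall_score(y_true, y_hat)   # 1 of 2 positives found
toy_auc = roc_auc_score(y_true, y_score)   # 7 of 8 positive/negative pairs ranked correctly

print(f"Recall: {toy_recall:.2f}")   # 0.50
print(f"ROC AUC: {toy_auc:.3f}")     # 0.875
```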


## XGBoost model

XGBoost training is quite a lot faster than Random Forest.

In [58]:
# Try XGBoost (faster)
from xgboost import XGBClassifier

# Calculate scale_pos_weight for imbalanced dataset
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

# Initialize the model
xgb_clf = XGBClassifier(
    n_estimators=100,         # Number of trees
    learning_rate=0.1,        # Step size for boosting
    max_depth=3,              # Maximum tree depth
    scale_pos_weight=scale_pos_weight,  # Handle class imbalance
    random_state=42,
    eval_metric='logloss'     # Evaluation metric
)

# train the model
xgb_clf.fit(X_train, y_train)

#generate predictions
y_train_pred = xgb_clf.predict(X_train)
y_train_pred_proba = xgb_clf.predict_proba(X_train)[:, 1]
y_pred = xgb_clf.predict(X_test)  # Predicted classes
y_pred_proba = xgb_clf.predict_proba(X_test)[:, 1]  # Predicted probabilities

# train vs test performance (recall and ROC AUC)

# Recall train
train_recall = recall_score(y_train, y_train_pred)
print(f"Train Recall: {train_recall:.2f}")
train_roc_auc = roc_auc_score(y_train, y_train_pred_proba)
print(f"Train ROC AUC: {train_roc_auc:.2f}")

# Recall test
recall = recall_score(y_test, y_pred)
print(f"Test Recall: {recall:.2f}")

# ROC AUC
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"Test ROC AUC: {roc_auc:.2f}")


Train Recall: 0.77
Train ROC AUC: 0.77
Test Recall: 0.76
Test ROC AUC: 0.77


XGBoost seems to be working better than the non-optimised Random Forest (test recall 0.76 vs 0.32, test ROC AUC 0.77 vs 0.63), so we can attempt to fine-tune XGBoost first.

In [68]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200, 500],
              'learning_rate': [0.001, 0.01,0.05, 0.1,0.2, 0.3],
              'max_depth': [3, 5, 7, 10],
              'scale_pos_weight': [scale_pos_weight]}

xgb_clf = XGBClassifier(random_state = 42, eval_metric = 'logloss')

#initialise grid_search
grid_search = GridSearchCV(estimator = xgb_clf,
                           param_grid = param_grid,
                           scoring = 'recall',
                           n_jobs = -1, #use all processors
                           cv = 5,
                           )

# fit the grid search
grid_search.fit(X_train, y_train)

# display best parameters and the best score
print(f'Best params are: {grid_search.best_params_}')
print(f'Best recall score: {grid_search.best_score_}')


Best params are: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'scale_pos_weight': 17.000403999596}
Best ROC AUC score: 0.7997169065720302
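
To get a feel for the cost of this search, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. The sketch below reuses the grid shape from the cell above, with a placeholder value of 17.0 standing in for `scale_pos_weight`; the names `grid`, `n_candidates` and `n_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid shape as the search above; scale_pos_weight uses a placeholder value
grid = {
    "n_estimators": [50, 100, 200, 500],
    "learning_rate": [0.001, 0.01, 0.05, 0.1, 0.2, 0.3],
    "max_depth": [3, 5, 7, 10],
    "scale_pos_weight": [17.0],
}

# 4 * 6 * 4 * 1 candidate combinations
n_candidates = len(ParameterGrid(grid))

# Each candidate is fitted once per cross-validation fold
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 96 480
```

With the default `refit=True`, `GridSearchCV` also performs one additional fit of the best candidate on the full training set.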

