# Heart Attack - Kaggle competition V 5.0  
### Author: Aniko Maraz, PhD

<br> <br> Note: This is the 2nd version of the improved model, currently running in production: 
<br> <br>
https://fake-heart-attack.streamlit.app/ <br> <br> 
This notebook includes an **XGBoost** model with **probability estimation**. This version is optimised for __*precision*__ (not accuracy as required on Kaggle).

I created this version because, when predicting heart attack risk, precision is a more informative metric than accuracy: it measures what share of the cases flagged as high risk are truly positive, whereas accuracy can look good even when the model never predicts the positive class. I also wanted to overcome a possible shortcoming of the challenge: the best model (= highest accuracy) in Version 1 predicted all zeros (= low risk) on the ~1700 test cases, although ~35% of the cases in the train dataset are positive. This suggests that the train and test datasets may be very different. The current version predicts 339 positive cases on the Kaggle test set (see the last cell).
<br> <br>
Data exploration, feature engineering, etc. can be found in [a prettier version](https://github.com/anikomaraz/heart_attack_kaggle/blob/main/notebooks/heart_attack_v3_clean_KaggleV1.ipynb). 

Further info and versions in my [Git Repo](https://github.com/anikomaraz/heart_attack_kaggle). 
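
To make the accuracy-versus-precision argument above concrete, here is a minimal sketch in plain Python. The 100 labels are made up for illustration (35 positives, mirroring the ~35% rate mentioned above), and the helper names `accuracy` and `precision` are hypothetical, not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Share of cases where the prediction equals the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive (0.0 if none are predicted)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(y_pred)
    return true_pos / pred_pos if pred_pos else 0.0


# Illustrative labels: 35 positive and 65 negative cases
y_true = [1] * 35 + [0] * 65

# A "model" that always predicts low risk
all_zero = [0] * 100

accuracy_all_zero = accuracy(y_true, all_zero)
precision_all_zero = precision(y_true, all_zero)

print(f"accuracy:  {accuracy_all_zero}")
print(f"precision: {precision_all_zero}")
```

The all-zeros predictor is right about every negative case, so its accuracy equals the share of negatives, yet it never flags a single patient, which is why accuracy alone can be misleading here.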

# Imports

In [12]:
import sys
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    OneHotEncoder,
)

from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, classification_report, precision_score

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

import jupyter_black

%load_ext jupyter_black



## DATA: GET AND EXPLORE

In [2]:
df_raw_train = pd.read_csv("../data/train.csv")

In [3]:
# Split the blood pressure column (format: 129/90) into numeric Systolic and Diastolic columns
def split_blood_pressure(df):
    df[["Systolic", "Diastolic"]] = df["Blood Pressure"].str.split("/", expand=True)
    df["Systolic"] = pd.to_numeric(df["Systolic"])
    df["Diastolic"] = pd.to_numeric(df["Diastolic"])
    df.drop(columns=["Blood Pressure"], inplace=True)


# split cholesterol according to sample mean
cholesterol_sample_mean = df_raw_train["Cholesterol"].mean()


def split_cholesterol_sample(df):
    df["Cholesterol_sample_split"] = np.where(
        df["Cholesterol"] > cholesterol_sample_mean, 1, 0
    )


# create the new variables
df = df_raw_train.copy()

split_blood_pressure(df=df)
split_cholesterol_sample(df=df)
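
The helpers above modify the DataFrame in place. As a hedged alternative, the sketch below shows non-mutating variants that return a new DataFrame and take the cholesterol cut-off as an explicit argument, so the value computed on the training data can be reused for other data. The function names and the toy values are assumptions for illustration only.

```python
import pandas as pd


def with_blood_pressure_split(df):
    """Return a copy where 'Blood Pressure' is replaced by numeric Systolic/Diastolic."""
    out = df.copy()
    parts = out["Blood Pressure"].str.split("/", expand=True)
    out["Systolic"] = pd.to_numeric(parts[0])
    out["Diastolic"] = pd.to_numeric(parts[1])
    return out.drop(columns=["Blood Pressure"])


def with_cholesterol_flag(df, threshold):
    """Return a copy with a 0/1 flag for cholesterol above the given threshold."""
    out = df.copy()
    out["Cholesterol_sample_split"] = (out["Cholesterol"] > threshold).astype(int)
    return out


# Toy data (made up) in the same format as the competition columns
toy = pd.DataFrame({"Blood Pressure": ["129/90", "110/70"], "Cholesterol": [250, 180]})

toy_threshold = toy["Cholesterol"].mean()  # in the notebook: the training-set mean
toy_clean = with_cholesterol_flag(with_blood_pressure_split(toy), toy_threshold)
print(toy_clean)
```

Because the input frame is left untouched, the cell can be re-run without raising a `KeyError` on the already-dropped `Blood Pressure` column.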

### Define features

In [4]:
# Defining the features and the target
X = df.drop(columns="Heart Attack Risk")
y = df["Heart Attack Risk"]

# Opt-in continuous and categorical variables
continuous_vars = [
    "Age",
    "Heart Rate",
    "Exercise Hours Per Week",
    "Stress Level",
    "Sedentary Hours Per Day",
    "Income",
    "BMI",
    "Triglycerides",
    "Physical Activity Days Per Week",
    "Sleep Hours Per Day",
    "Systolic",
    "Diastolic",
]

categorical_vars = [
    "Diabetes",
    "Family History",
    "Obesity",
    "Alcohol Consumption",
    "Previous Heart Problems",
    "Medication Use",
    "Cholesterol_sample_split",
    "Sex",
    "Continent",
    "Diet",
    "Hemisphere",
]

X_selected = X[continuous_vars + categorical_vars]

### Create preprocessing pipeline and train/test data

In [10]:
# Define preprocessing steps for continuous and categorical features
num_transformer = MinMaxScaler()
cat_transformer = OneHotEncoder(drop="first")

preproc_basic = ColumnTransformer(
    transformers=[
        ("num", num_transformer, continuous_vars),
        ("cat", cat_transformer, categorical_vars),
    ],
    remainder="passthrough",
)

# Create the pipeline for XGBoost
xgb_pipe = make_pipeline(preproc_basic, XGBClassifier(random_state=6))

# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.3, random_state=6
)
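
The split above is purely random. With an imbalanced target, one option worth considering is a stratified split, which keeps the share of positive cases the same in both parts. The following is a minimal sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up); it is not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 35% positive labels
rng = np.random.default_rng(6)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 350 + [0] * 650)

# stratify=y_toy asks for the same class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=6, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```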

## TRAIN AND TUNE THRESHOLD FOR PRECISION

In [13]:
# Fit the pipeline
xgb_pipe.fit(X_train, y_train)

# Get predicted probabilities for the training set
train_probs = xgb_pipe.predict_proba(X_train)[:, 1]

# Evaluate thresholds
thresholds = np.linspace(0.4, 0.6, 50)
best_threshold = None
best_precision = 0.0

for threshold in thresholds:
    # Convert probabilities to binary predictions based on the threshold
    train_predictions = (train_probs > threshold).astype(int)

    # Evaluate precision
    precision = precision_score(y_train, train_predictions)

    # Check if this threshold gives better precision
    if precision > best_precision:
        best_precision = precision
        best_threshold = threshold

# Print the best threshold found
print(f"Best threshold: {best_threshold} with precision score: {best_precision}")

Best threshold: 0.5306122448979592 with precision score: 0.9994206257242179
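
The threshold above is tuned on the same data the model was fitted on, where XGBoost separates the classes almost perfectly (precision ~0.999), while the held-out precision reported further below is ~0.36. One possible remedy is to search for the threshold on probabilities the model has not been trained on. The sketch below is an assumption-laden illustration: `pick_threshold_for_precision` is a hypothetical helper, the validation labels and probabilities are made up, and the `min_positives` guard is an added design choice to avoid picking a threshold that flags almost nobody.

```python
import numpy as np
from sklearn.metrics import precision_score


def pick_threshold_for_precision(y_true, probs, thresholds, min_positives=1):
    """Return (threshold, precision) with the highest precision on the given data.

    Thresholds that flag fewer than `min_positives` cases are skipped.
    """
    best_threshold, best_precision = None, 0.0
    for threshold in thresholds:
        preds = (probs > threshold).astype(int)
        if preds.sum() < min_positives:
            continue
        score = precision_score(y_true, preds, zero_division=0)
        if score > best_precision:
            best_threshold, best_precision = float(threshold), score
    return best_threshold, best_precision


# Made-up validation labels and predicted probabilities
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
val_probs = np.array([0.10, 0.43, 0.48, 0.52, 0.56, 0.58, 0.41, 0.59])

val_threshold, val_precision = pick_threshold_for_precision(
    y_val, val_probs, thresholds=np.linspace(0.4, 0.6, 5)
)
print(f"Best threshold: {val_threshold:.2f} with precision: {val_precision:.2f}")
```

In this notebook, out-of-fold probabilities could be obtained with `cross_val_predict(xgb_pipe, X_train, y_train, cv=5, method="predict_proba")[:, 1]` from `sklearn.model_selection` and passed in place of `val_probs`, together with `y_train`.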


## Apply the best threshold to the test set

In [24]:
# Predict probabilities for the test set
test_probs = xgb_pipe.predict_proba(X_test)[:, 1]
test_predictions = (test_probs > best_threshold).astype(int)

# Evaluate precision on the test set with the tuned threshold
test_precision = precision_score(y_test, test_predictions)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test precision: {test_precision}")
print(f"Test accuracy: {test_accuracy}")

Test precision: 0.3554987212276215
Test accuracy: 0.5891583452211127
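
A test precision of ~0.36 is close to the ~35% positive rate reported for the training data. If the held-out split has a similar rate, flagged cases are only marginally more likely to be positive than a randomly chosen case. One way to express this is the ratio of precision to the base rate (often called lift). The sketch below uses made-up labels, and `precision_lift` is a hypothetical helper.

```python
import numpy as np


def precision_lift(y_true, y_pred):
    """Precision divided by the positive base rate; ~1.0 means no better than chance."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    base_rate = y_true.mean()
    flagged = y_pred == 1
    if flagged.sum() == 0 or base_rate == 0:
        return float("nan")
    precision = y_true[flagged].mean()  # share of flagged cases that are positive
    return float(precision / base_rate)


# Made-up example: 3 of 10 cases are positive, 2 of the 3 flagged cases are correct
toy_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

toy_lift = precision_lift(toy_true, toy_pred)
print(f"lift: {toy_lift:.2f}")
```

Calling `precision_lift(y_test, test_predictions)` in the notebook would give the corresponding figure for the held-out split.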


## Preprocess input data

In [25]:
df_kaggle_test = pd.read_csv("../data/test.csv")  # read in test data provided by Kaggle

# Preprocess input data with the same helpers used for the training set

split_blood_pressure(df=df_kaggle_test)
split_cholesterol_sample(df=df_kaggle_test)

X_df_kaggle_test_selected = df_kaggle_test[continuous_vars + categorical_vars]
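
Since the intro suspects that the train and test datasets differ, a quick check before predicting is whether the new data contains category values the encoder never saw. With its default `handle_unknown="error"`, `OneHotEncoder` raises on such values at transform time. The sketch below is a hypothetical helper run on made-up frames; the category values are illustrative and not taken from the competition data.

```python
import pandas as pd


def find_unseen_categories(train_df, new_df, columns):
    """Return {column: [values present in new_df but absent from train_df]}."""
    unseen = {}
    for col in columns:
        train_values = set(train_df[col].dropna().unique())
        new_values = set(new_df[col].dropna().unique())
        extra = new_values - train_values
        if extra:
            unseen[col] = sorted(extra)
    return unseen


# Made-up frames for illustration
toy_train = pd.DataFrame({"Diet": ["Healthy", "Average"], "Sex": ["Male", "Female"]})
toy_new = pd.DataFrame({"Diet": ["Healthy", "Unhealthy"], "Sex": ["Female", "Female"]})

unseen = find_unseen_categories(toy_train, toy_new, ["Diet", "Sex"])
print(unseen)
```

In the notebook this could be called as `find_unseen_categories(X_train, X_df_kaggle_test_selected, categorical_vars)`; an empty dictionary would mean the encoder has seen every category.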

## Predict on Kaggle test set and save submission

In [27]:
# Predict probabilities for the Kaggle test set
kaggle_test_probs = xgb_pipe.predict_proba(X_df_kaggle_test_selected)[:, 1]

# Apply the best threshold to Kaggle test set predictions
kaggle_test_predictions = (kaggle_test_probs > best_threshold).astype(int)

# Prepare submission dataframe
df_kaggle_test = pd.read_csv("../data/test.csv")
df_kaggle_predicted_V5 = {
    "Patient ID": df_kaggle_test["Patient ID"],
    "Heart Attack Risk": kaggle_test_predictions,
}
df_kaggle_predicted_V5_xgb_precision = pd.DataFrame(df_kaggle_predicted_V5)

In [28]:
# Save submission to CSV
df_kaggle_predicted_V5_xgb_precision.to_csv(
    "../submission/df_kaggle_predicted_V5_xgb_precision.csv", index=False
)

In [29]:
# Number of cases in the unseen Kaggle test set
len(df_kaggle_test)

1753

In [30]:
sum(kaggle_test_predictions)

339
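
For context, the two numbers above can be turned into a predicted positive share and compared with the ~35% positive rate mentioned in the intro. This is plain arithmetic on the outputs shown; the 0.35 figure is the approximate training rate quoted earlier, not a measured test-set value.

```python
n_cases = 1753  # rows in the Kaggle test set (output above)
n_predicted_positive = 339  # predicted high-risk cases (output above)
approx_train_positive_rate = 0.35  # approximate rate quoted in the intro

predicted_positive_share = n_predicted_positive / n_cases
print(f"Predicted positive share: {predicted_positive_share:.1%}")
print(f"Approximate train positive rate: {approx_train_positive_rate:.0%}")
```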