# Heart Attack - Kaggle competition V 4.0

Author: _Aniko Maraz, PhD_

<div class="alert alert-block alert-info">
    
This is an <a href="https://www.kaggle.com/competitions/heart-attack-risk-analysis/overview">active Kaggle competition</a> for Kudos.
The task is to predict, on an unseen dataset, whether a patient is at low or high risk of heart attack. <br>

This notebook is the <b>final, clean version</b> of earlier attempts to train an accurate ML model for this prediction (see Versions 1-3). It contains the fine-tuned model; data exploration (i.e. visualisations) is omitted.
</div>

## Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    OneHotEncoder,
)

from sklearn.metrics import accuracy_score, classification_report

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

import jupyter_black

%load_ext jupyter_black

## DATA: GET

In [2]:
df_raw_train = pd.read_csv("data/train.csv")

## PREPROCESSING PIPELINE

In [3]:
# Split the "Blood Pressure" string (format: 129/90) into numeric
# "Systolic" and "Diastolic" columns; modifies df in place
def split_blood_pressure(df):
    df[["Systolic", "Diastolic"]] = df["Blood Pressure"].str.split("/", expand=True)
    df["Systolic"] = pd.to_numeric(df["Systolic"])
    df["Diastolic"] = pd.to_numeric(df["Diastolic"])
    df.drop(columns=["Blood Pressure"], inplace=True)


# Binarise cholesterol at the training-sample mean (1 = above the mean, 0 = otherwise)
cholesterol_sample_mean = df_raw_train["Cholesterol"].mean()


def split_cholesterol_sample(df):
    df["Cholesterol_sample_split"] = np.where(
        df["Cholesterol"] > cholesterol_sample_mean, 1, 0
    )


# create the new variables
df = df_raw_train.copy()

split_blood_pressure(df=df)
split_cholesterol_sample(df=df)
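
The blood pressure split above can be checked in isolation. The following is a minimal sketch on made-up values (the `toy` frame and its contents are assumptions, not competition data); unlike `split_blood_pressure`, it builds a new frame instead of modifying the input in place.

```python
import pandas as pd

# Toy data in the same "systolic/diastolic" string format (values are made up)
toy = pd.DataFrame({"Blood Pressure": ["129/90", "110/70", "145/95"]})

# Split on "/" into two string columns, then convert each column to numbers
parts = toy["Blood Pressure"].str.split("/", expand=True).apply(pd.to_numeric)
parts.columns = ["Systolic", "Diastolic"]

# Attach the numeric columns and drop the original string column
toy_split = toy.drop(columns=["Blood Pressure"]).join(parts)
print(toy_split)
```

Returning a new frame avoids surprises when the same raw DataFrame is reused later, at the cost of one extra copy.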

### Define features

In [4]:
# Defining the features and the target
X = df.drop(columns="Heart Attack Risk")
y = df["Heart Attack Risk"]

# Opt-in continuous and categorical variables
continuous_vars = [
    "Age",
    # "Cholesterol",
    "Heart Rate",
    "Exercise Hours Per Week",
    "Stress Level",
    "Sedentary Hours Per Day",
    "Income",
    "BMI",
    "Triglycerides",
    "Physical Activity Days Per Week",
    "Sleep Hours Per Day",
    "Systolic",
    "Diastolic",
    # "Exercise Total",
    # "Systolic_Diastolic_Ratio",
]

categorical_vars = [
    "Diabetes",
    "Family History",
    "Obesity",
    "Alcohol Consumption",
    "Previous Heart Problems",
    "Medication Use",
    "Cholesterol_sample_split",
    # "Smoking",
    "Sex",
    "Continent",
    "Diet",
    "Hemisphere",
    # "Country",
]

X_selected = X[continuous_vars + categorical_vars]

### Create preprocessing pipeline and train/test data

In [5]:
# Define preprocessing steps for continuous and categorical features
num_transformer = MinMaxScaler()
cat_transformer = OneHotEncoder(drop="first")

preproc_basic = ColumnTransformer(
    transformers=[
        ("num", num_transformer, continuous_vars),
        ("cat", cat_transformer, categorical_vars),
    ],
    remainder="passthrough",
)


# Create the SVM pipeline (preprocessing + classifier)

svm_pipe = make_pipeline(preproc_basic, SVC(random_state=6))


# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.3, random_state=6
)

## FIT and EVALUATE pipeline with competing classification models

In [6]:
# Fit on the training split and compute hold-out accuracy on the 30% test split
svm_pipe.fit(X_train, y_train)
score = svm_pipe.score(X_test, y_test)

# Cross-validate the pipeline
cv_score = cross_val_score(svm_pipe, X_train, y_train, cv=5, scoring="accuracy").mean()
print(f"Cross-validated accuracy for svm_pipe: {cv_score}")

# Fit the preprocessing step on the training split
X_train_preprocessed = preproc_basic.fit_transform(X_train)

# Convert the transformed data to a DataFrame
X_train_preprocessed_df = pd.DataFrame(
    X_train_preprocessed,
    columns=continuous_vars
    + list(
        preproc_basic.named_transformers_["cat"].get_feature_names_out(categorical_vars)
    ),
)

Cross-validated accuracy for svm_pipe: 0.6427553246925807
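
One way to put a cross-validated accuracy into context is to compare it with a majority-class baseline. The sketch below uses synthetic, imbalanced data (an assumption; it is not the competition data, and the class ratio is chosen arbitrarily) and scikit-learn's `DummyClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (roughly 65% class 0); NOT the competition data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, weights=[0.65, 0.35], random_state=6
)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline_cv = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy").mean()
print(f"Majority-class baseline accuracy: {baseline_cv:.3f}")
```

If a model's accuracy is close to this baseline, it may be learning little beyond the class frequencies.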


## PREPROCESS INPUT DATA

In [7]:
df_kaggle_test = pd.read_csv("data/test.csv")  # read in test data provided by Kaggle

# Preprocess the input data with the same helpers used for training
# (the cholesterol split reuses the training-sample mean)

split_blood_pressure(df=df_kaggle_test)
split_cholesterol_sample(df=df_kaggle_test)

X_df_kaggle_test_selected = df_kaggle_test[continuous_vars + categorical_vars]

# Create SVM pipeline with best parameters
best_params = {"C": 0.0001, "kernel": "linear", "gamma": "scale", "class_weight": None}

svm_pipe = Pipeline(
    [
        ("preprocessor", preproc_basic),
        ("classifier", SVC(**best_params, random_state=6)),
    ]
)
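
The search that produced `best_params` is not part of this notebook. As a rough sketch of how such parameters could be obtained with `GridSearchCV`, the example below uses synthetic data and a hypothetical grid; neither is taken from the earlier versions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data; NOT the competition data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=6)

demo_pipe = Pipeline(
    [("preprocessor", MinMaxScaler()), ("classifier", SVC(random_state=6))]
)

# Hypothetical grid; the "classifier__" prefix routes each parameter to the SVC step
demo_grid = {
    "classifier__C": [0.01, 1, 100],
    "classifier__kernel": ["linear", "rbf"],
}

search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# Strip the step prefix so the dict can be unpacked into SVC(**...)
demo_best_params = {
    key.split("__", 1)[1]: value for key, value in search.best_params_.items()
}
print(demo_best_params, round(search.best_score_, 3))
```

Searching over the whole pipeline, not just the classifier, keeps the scaler fitted inside each cross-validation fold.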

## FIT

In [8]:
# Train the SVM pipeline on the entire training dataset
# (preprocessing is applied inside the pipeline)
svm_pipe.fit(X_selected, y)

## PREDICT

In [9]:
# np.set_printoptions(threshold=sys.maxsize)  # would also need `import sys`

prediction = svm_pipe.predict(X_df_kaggle_test_selected)
prediction

array([0, 0, 0, ..., 0, 0, 0])
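
Because the printed array is truncated, the class distribution of the predictions is not visible. A quick way to inspect it is `np.unique` with `return_counts=True`; the sketch below runs on a made-up array (`demo_prediction` is an assumption standing in for `prediction`).

```python
import numpy as np

# Made-up predictions standing in for the real `prediction` array
demo_prediction = np.array([0, 0, 1, 0, 0, 1, 0, 0])

# Count how often each predicted label occurs
labels, counts = np.unique(demo_prediction, return_counts=True)
class_counts = {int(label): int(count) for label, count in zip(labels, counts)}
print(class_counts)
```

Applying the same two lines to `prediction` would show whether the model ever predicts the high-risk class.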