# ML Zoomcamp 2023 – Midterm Project

The _Heart failure clinical records_ dataset contains the medical records of 299 heart failure patients, collected during their follow-up period. Each patient profile has 13 clinical features. The dataset is available from the UCI Machine Learning Repository: [Heart failure clinical records](http://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records). The accompanying research article is [Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone](https://www.semanticscholar.org/paper/Machine-learning-can-predict-survival-of-patients-Chicco-Jurman/e64579d8593140396b518682bb3a47ba246684eb) ([PDF](https://bmcmedinformdecismak.biomedcentral.com/counter/pdf/10.1186/s12911-020-1023-5.pdf)).

* [Data preparation](#Data-preparation)
* [Data transformation](#Data-transformation)
    * [Modeling with logistic regression, decision trees and random forests](#Modeling-with-logistic-regression,-decision-trees-and-random-forests)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import warnings
%matplotlib inline

## Data preparation

In [2]:
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
df.rename(columns={'DEATH_EVENT': 'death_event'}, inplace=True)
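# Map the 0/1 flags to readable labels. Assigning the result back avoids
# calling replace(inplace=True) on a column selection, which recent pandas
# versions warn about as chained assignment.
for col in ['anaemia', 'diabetes', 'high_blood_pressure', 'smoking']:
    df[col] = df[col].replace({0: 'No', 1: 'Yes'})
df['sex'] = df['sex'].replace({0: 'Female', 1: 'Male'})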
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| age | 75.0 | 55.0 | 65.0 | 50.0 | 65.0 |
| anaemia | No | No | No | Yes | Yes |
| creatinine_phosphokinase | 582 | 7861 | 146 | 111 | 160 |
| diabetes | No | No | No | No | Yes |
| ejection_fraction | 20 | 38 | 20 | 20 | 20 |
| high_blood_pressure | Yes | No | No | No | No |
| platelets | 265000.0 | 263358.03 | 162000.0 | 210000.0 | 327000.0 |
| serum_creatinine | 1.9 | 1.1 | 1.3 | 1.9 | 2.7 |
| serum_sodium | 130 | 136 | 129 | 137 | 116 |
| sex | Male | Male | Male | Male | Female |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    object 
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    object 
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    object 
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    object 
 10  smoking                   299 non-null    object 
 11  time                      299 non-null    int64  
 12  death_event               299 non-null    int64  
dtypes: float64(3), int64(5), object(5)
memory usage: 30.5+ KB


## Data transformation

In [4]:
X = df.drop('death_event', axis=1)
y = df.death_event

In [5]:
X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_full_train, y_full_train, test_size=0.25, random_state=1)
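The two-step split above holds out 20% of the rows for testing and then 25% of the remainder for validation, which works out to roughly 60%/20%/20%. The sketch below checks the resulting sizes on stand-in data with the same number of rows as the dataset; the `_demo` names and the placeholder arrays are assumptions for illustration only, not part of the project code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the dataset (299).
# The values are placeholders; only the row count matters here.
X_demo = np.arange(299).reshape(-1, 1)
y_demo = np.zeros(299, dtype=int)

# Step 1: hold out 20% of all rows as the test set.
X_full_demo, X_test_demo, y_full_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# Step 2: hold out 25% of the remaining rows as the validation set.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_full_demo, y_full_demo, test_size=0.25, random_state=1)

sizes = {
    'train': len(X_train_demo),
    'val': len(X_val_demo),
    'test': len(X_test_demo),
}
print(sizes)
```

Because 25% of the remaining 80% equals 20% of the whole, the validation and test sets end up the same size.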

In [6]:
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
X_val = X_val.reset_index(drop=True)
y_val = y_val.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [7]:
def data_transformation(X_train, X_val, X_test):
    """One-hot encode the three splits with a vectorizer fitted on the training split."""
    # DictVectorizer expects one dict per row.
    X_train_dicts = X_train.to_dict(orient='records')
    X_val_dicts = X_val.to_dict(orient='records')
    X_test_dicts = X_test.to_dict(orient='records')
    dv = DictVectorizer(sparse=True)
    # Fit on the training split only; the other splits reuse the same encoding.
    X_cat_tr = dv.fit_transform(X_train_dicts)
    feature_names = list(dv.get_feature_names_out())
    X_cat_val = dv.transform(X_val_dicts)
    X_cat_test = dv.transform(X_test_dicts)
    return X_cat_tr, X_cat_val, X_cat_test, feature_names

In [8]:
X_cat_train, X_cat_val, X_test_val, feature_names = data_transformation(X_train, X_val, X_test)

### Modeling with logistic regression, decision trees and random forests

In [9]:
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    lr = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
    rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=5, max_depth=8, random_state=1)
    dt = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=1)
    lr.fit(X_cat_train, y_train)
    dt.fit(X_cat_train, y_train)
    rf.fit(X_cat_train, y_train)
    y_pred_lr = lr.predict_proba(X_cat_val)[:, 1]
    y_pred_dt = dt.predict_proba(X_cat_val)[:, 1]
    y_pred_rf = rf.predict_proba(X_cat_val)[:, 1]
    auc_result_lr = roc_auc_score(y_val, y_pred_lr)
    auc_result_dt = roc_auc_score(y_val, y_pred_dt)
    auc_result_rf = roc_auc_score(y_val, y_pred_rf)
# Round via the format spec, which works for both NumPy and built-in floats.
print(f'AUC using logistic regression: {auc_result_lr:>7.3f}')
print(f'AUC using decision trees: {auc_result_dt:>12.3f}')
print(f'AUC using random forests: {auc_result_rf:>12.3f}')

AUC using logistic regression:   0.841
AUC using decision trees:        0.853
AUC using random forests:        0.924
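One way to read the AUC values above: ROC AUC is the probability that a randomly chosen positive case (here, a death event) receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes that quantity directly by comparing all positive/negative pairs. The helper `pairwise_auc` and the toy labels and scores are hypothetical and only illustrate the definition; they are not the notebook's data.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy labels and scores (made up for illustration).
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# Three of the four positive/negative pairs are ordered correctly.
print(pairwise_auc(y_true_demo, y_score_demo))  # 0.75
```

This pairwise loop is quadratic in the number of samples, so it is only meant as an explanation; `roc_auc_score` is the practical choice for real evaluation.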
