**<title>Classification</title>**

**<h1>Heart Failure Clinical Records</h1>**

<hr>

*Original dataset reference: https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records# <br>
*Original paper reference: https://doi.org/10.1186/s12911-020-1023-5 <br>
*Statsmodels: https://www.statsmodels.org/stable/index.html

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Data Viz
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.tree import DecisionTreeClassifier

# Metrics Evaluation
from sklearn.metrics import accuracy_score

## **Background**
This case study is motivated by Chicco and Jurman (2020), who proposed a machine learning model for predicting the survival of patients with heart failure from serum creatinine and ejection fraction. The model learns from data with several kinds of features: the patient's physical description, health description, habits, and compounds in the patient's blood. The study used data from 299 patients with heart failure collected in 2015. The dataset has 12 features and 1 target.

**Objective**
1. **Build a Decision Tree model to classify patient survival**
2. **Split the data into training and testing sets (80:20)**
3. **Drop missing values (if any)**
4. **Tune the hyperparameters and obtain the best Decision Tree model**
5. **Compare the results with other models (KNN & Logistic Regression)**
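
Objectives 4 and 5 could be approached roughly as in the sketch below. It is a minimal illustration, not the notebook's actual result: the data is a synthetic stand-in generated with `make_classification` (an assumption, so the block runs without the CSV), the parameter grid values are hypothetical, and scikit-learn's `KNeighborsClassifier` and `LogisticRegression` are assumed as the comparison models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart failure data (assumption: 299 rows, 6 features)
X, y = make_classification(n_samples=299, n_features=6, n_informative=4,
                           n_redundant=0, random_state=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y)

# Hypothetical search space for the Decision Tree
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=100),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# KNN and Logistic Regression are scale-sensitive, so they get a scaler in a pipeline
models = {
    "Decision Tree (tuned)": search.best_estimator_,
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

# Fit each model on the training split and score it on the held-out test split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print("Best Decision Tree parameters:", search.best_params_)
for name, score in scores.items():
    print(f"{name}: test accuracy = {score:.3f}")
```

Cross-validation inside `GridSearchCV` uses only the training split, so the test split remains untouched until the final comparison.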

## **Dataset Description**

- Dependent variable:
    - Death event (whether the patient died during the follow-up period)
- Patient physical description:
    - Age
    - Sex
- Patient health description:
    - Anaemia (decrease of red blood cells or hemoglobin, yes/no)
    - High blood pressure (whether the patient has hypertension, yes/no)
    - Diabetes (whether the patient has diabetes, yes/no)
    - Ejection fraction (percentage of blood leaving the heart at each contraction, in %, range [14, ..., 80])
- Patient habit:
    - Smoking (whether the patient smokes, yes/no)
- Blood compounds:
    - Creatinine phosphokinase (level of the CPK enzyme in the blood, in mcg/L, range [23, ..., 7861])
    - Platelets (platelets in the blood, in kiloplatelets/mL)
    - Serum creatinine (level of creatinine in the blood, in mg/dL)
    - Serum sodium (level of sodium in the blood, in mEq/L)
- Other:
    - Time (follow-up period, in days)

In [2]:
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

df.head()

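| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |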
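
Objective 3 (drop missing values, if any) can be checked with `isna()` and `dropna()`. The sketch below is illustrative only: it rebuilds a few columns of the five preview rows above by hand, and then deliberately blanks one value (an artificial gap that is not in the real data) to show what the cleanup does.

```python
import numpy as np
import pandas as pd

# A few columns of the five preview rows shown by df.head()
sample = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "ejection_fraction": [20, 38, 20, 20, 20],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9, 2.7],
    "DEATH_EVENT": [1, 1, 1, 1, 1],
})

# Assumption for illustration only: remove one value to create a missing entry
sample_with_gap = sample.copy()
sample_with_gap.loc[2, "serum_creatinine"] = np.nan

# Count missing values per column, then drop every row that contains one
missing_per_column = sample_with_gap.isna().sum()
cleaned = sample_with_gap.dropna().reset_index(drop=True)

print(missing_per_column)
print("Shape before:", sample_with_gap.shape, "after:", cleaned.shape)
```

On the full dataset the same two calls would be applied to `df`; if `df.isna().sum()` reports zeros everywhere, `dropna()` leaves the data unchanged.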


In [3]:
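from sklearn.model_selection import train_test_split

# Use only the continuous columns as features; DEATH_EVENT is the target
feature = df[['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium']]
label = df['DEATH_EVENT']

# 80:20 split, stratified so both sets keep the same class proportions
xtrain, xtest, ytrain, ytest = train_test_split(feature, label, test_size=0.2, random_state=100, stratify=label)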
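
The effect of `stratify=label` can be verified by comparing the class proportions before and after the split. The sketch below uses synthetic labels (an assumption: 299 rows with roughly one third positives, generated at random) and `demo_`-prefixed names that are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 299 rows, about one third in class 1
rng = np.random.default_rng(100)
demo_label = pd.Series((rng.random(299) < 0.32).astype(int), name="DEATH_EVENT")
demo_feature = pd.DataFrame({"x": rng.normal(size=299)})

demo_xtrain, demo_xtest, demo_ytrain, demo_ytest = train_test_split(
    demo_feature, demo_label, test_size=0.2, random_state=100, stratify=demo_label)

# With stratification, the share of class 1 is nearly identical in every subset
print("Rows (train, test):", len(demo_ytrain), len(demo_ytest))
print("Positive share overall:", round(demo_label.mean(), 3))
print("Positive share train:  ", round(demo_ytrain.mean(), 3))
print("Positive share test:   ", round(demo_ytest.mean(), 3))
```

Without `stratify`, a random split of a small, imbalanced dataset can leave the test set with a noticeably different class ratio, which makes accuracy harder to interpret.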

In [4]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

minmax = MinMaxScaler()
stdsclr = StandardScaler()
robstsclr = RobustScaler()

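In [None]:
# Fit the scaler on the training features only, then transform them
minmax.fit(xtrain)
xtrain_minmax = minmax.transform(xtrain)

# transform() returns a NumPy array, so restore the column names and index
xtrain_minmax = pd.DataFrame(xtrain_minmax, columns=xtrain.columns, index=xtrain.index)
display(xtrain.head())
display(xtrain_minmax.head())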
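
The three scalers created earlier can be compared side by side. The sketch below is a minimal illustration under stated assumptions: the data is synthetic (two made-up columns named after real features, with arbitrary distributions), and each scaler is fitted on the training part only and then reused on the test part so that no test-set statistics leak into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in features (assumption: distributions chosen arbitrarily)
rng = np.random.default_rng(100)
demo_train = pd.DataFrame({
    "age": rng.uniform(40, 95, 80),
    "serum_creatinine": rng.lognormal(0.2, 0.5, 80),
})
demo_test = pd.DataFrame({
    "age": rng.uniform(40, 95, 20),
    "serum_creatinine": rng.lognormal(0.2, 0.5, 20),
})

scalers = {
    "minmax": MinMaxScaler(),      # rescales each column to [0, 1] on the training data
    "standard": StandardScaler(),  # zero mean, unit variance on the training data
    "robust": RobustScaler(),      # centers on the median, scales by the IQR
}

scaled = {}
for name, scaler in scalers.items():
    scaler.fit(demo_train)  # statistics come from the training data only
    train_scaled = pd.DataFrame(scaler.transform(demo_train),
                                columns=demo_train.columns, index=demo_train.index)
    test_scaled = pd.DataFrame(scaler.transform(demo_test),
                               columns=demo_test.columns, index=demo_test.index)
    scaled[name] = (train_scaled, test_scaled)

# Summarize the scaled training data for each scaler
for name, (train_scaled, _) in scaled.items():
    print(name)
    print(train_scaled.describe().loc[["mean", "50%", "min", "max"]].round(3))
```

Because the test set is transformed with training statistics, its scaled values may fall slightly outside the training range (for example below 0 or above 1 with `MinMaxScaler`); this is expected.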