## Predicting Cardiovascular Diseases (CVDs) using TensorFlow

Original dataset: https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data

Dataset details:
1. age = Age
2. anaemia = Decrease of red blood cells or hemoglobin (boolean)
3. creatinine_phosphokinase = Level of the CPK enzyme in the blood (mcg/L)
4. diabetes = If the patient has diabetes (boolean)
5. ejection_fraction = Percentage of blood leaving the heart at each contraction (percentage)
6. high_blood_pressure = If the patient has hypertension (boolean)
7. platelets = Platelets in the blood (kiloplatelets/mL)
8. serum_creatinine = Level of serum creatinine in the blood (mg/dL)
9. serum_sodium = Level of serum sodium in the blood (mEq/L)
10. sex = Woman or man (binary)
11. smoking = If the patient smokes or not (boolean)
12. time = Follow-up period (days)
13. DEATH_EVENT = If the patient died during the follow-up period (boolean)


In [23]:
# Import relevant libraries
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
import numpy as np


#### 1. Import Data

In [3]:
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
print(df.head())

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  DEATH_EVENT  
0        0     4            1  
1        0     6            1  
2       

In [4]:
# Check duplicate and missing values
print(df.duplicated().sum())
print(df.isnull().sum())


0
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64


In [9]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  death_event               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB
None


In [8]:
df.rename(columns={'DEATH_EVENT':'death_event'}, inplace=True)
print(df.head())

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  death_event  
0        0     4            1  
1        0     6            1  
2       

In [12]:
# Print the distribution of death_event using Counter
death_distribution = Counter(df['death_event'])
print(death_distribution)

Counter({0: 203, 1: 96})


From the death_event distribution, we can see that there were 96 deaths among the 299 patients (about 32%), so the two classes are moderately imbalanced.
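
The share of each class can be checked directly from the counts printed above. The sketch below is illustrative only: it rebuilds the label column from those counts instead of reading the CSV, and the names `death_event`, `class_share`, and `majority_baseline` are assumptions introduced here. The majority-class share is a useful floor to compare any model's accuracy against.

```python
import pandas as pd

# Rebuild the label counts reported above (203 survivors, 96 deaths).
# With the real data, use df['death_event'] instead of this stand-in Series.
death_event = pd.Series([0] * 203 + [1] * 96, name="death_event")

# Share of each class, as fractions that sum to 1
class_share = death_event.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of a baseline that always predicts the majority class (no death)
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```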

In [13]:
# Extract the label column 'death_event' and assign it to y
y = df['death_event']

# Extract the feature columns and assign them to X
X = df.drop(columns=['death_event'])

#### 2. Preprocessing Data

In [14]:
# One-hot encode any categorical columns with pandas get_dummies
# (every column here is already numeric, so X comes back unchanged)
X = pd.get_dummies(X)
print(X.head())

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  
0        0     4  
1        0     6  
2        1     7  
3        0     7  
4        
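
The output above has the same columns as before the call, which is expected: by default `pd.get_dummies` only expands object, string, and category columns. A minimal sketch on a small stand-in frame (the `sample` and `encoded` names are hypothetical, and only three of the dataset's columns are reproduced) shows the same behaviour:

```python
import pandas as pd

# Small stand-in frame with the same dtypes as the dataset (float and int only)
sample = pd.DataFrame({
    "age": [75.0, 55.0, 65.0],
    "anaemia": [0, 0, 0],
    "sex": [1, 1, 1],
})

encoded = pd.get_dummies(sample)

# With no object/category columns present there is nothing to expand,
# so the column list and the values are the same as the input
print(list(encoded.columns))
print(encoded.equals(sample))
```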

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
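
Because only about a third of the patients belong to the positive class, a plain random split can leave the train and test sets with somewhat different death rates. One option, shown here as a sketch and not as part of the original notebook, is to pass `stratify` to `train_test_split`. The data below is synthetic (same size and class counts as the dataset), and `X_demo`, `y_demo`, and the split variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same size and class counts as the dataset (203 / 96)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature": rng.normal(size=299)})
y_demo = pd.Series([0] * 203 + [1] * 96, name="death_event")

# stratify=y_demo keeps the death rate roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```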


In [32]:
X_train = pd.DataFrame(X_train)
print(type(X_train))


<class 'pandas.core.frame.DataFrame'>


In [34]:
# Apply StandardScaler to scale the numerical features
numerical_columns = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
                     'platelets', 'serum_creatinine', 'serum_sodium', 'time']
ct = ColumnTransformer(transformers=[('numerical', StandardScaler(), numerical_columns)])

In [35]:
# Fit the transformer to the training data
X_train = ct.fit_transform(X_train)

ValueError: A given column is not a column of the dataframe
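
This error means the frame passed to `fit_transform` does not carry the column names listed in the transformer. One likely cause, stated here as an assumption because the earlier cell runs are not shown: the cell was executed more than once, so `X_train` had already been overwritten by the NumPy array that `fit_transform` returns, and `pd.DataFrame(X_train)` then re-wrapped it with integer column labels (0, 1, 2, ...). A minimal sketch of one way around this is to keep the raw frame intact and store the scaled result under a new name. The names `X_train_demo`, `X_train_scaled`, `unnamed`, and `error_message` are hypothetical, the rows are copied from the first four rows printed earlier with only a subset of columns, and `remainder='passthrough'` is an addition that keeps the binary columns, which the transformer would otherwise drop.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

numeric_cols = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
                'platelets', 'serum_creatinine', 'serum_sodium', 'time']

# Stand-in rows taken from the first four rows shown earlier (subset of columns)
X_train_demo = pd.DataFrame({
    'age': [75.0, 55.0, 65.0, 50.0],
    'anaemia': [0, 0, 0, 1],
    'creatinine_phosphokinase': [582, 7861, 146, 111],
    'ejection_fraction': [20, 38, 20, 20],
    'platelets': [265000.00, 263358.03, 162000.00, 210000.00],
    'serum_creatinine': [1.9, 1.1, 1.3, 1.9],
    'serum_sodium': [130, 136, 129, 137],
    'sex': [1, 1, 1, 1],
    'time': [4, 6, 7, 7],
})

ct = ColumnTransformer(
    transformers=[('numerical', StandardScaler(), numeric_cols)],
    remainder='passthrough',  # keep the binary columns instead of dropping them
)

# Store the result under a new name so the DataFrame keeps its column labels
# even if this cell is run again
X_train_scaled = ct.fit_transform(X_train_demo)
print(type(X_train_scaled), X_train_scaled.shape)

# Re-wrapping the array without column names reproduces the failure mode
unnamed = pd.DataFrame(X_train_scaled)
error_message = None
try:
    ct.fit_transform(unnamed)
except ValueError as err:
    error_message = str(err)
print("ValueError:", error_message)
```

If the notebook state is already in this condition, re-running the `train_test_split` cell restores `X_train` as a DataFrame with its original column names before the transformer is fitted.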