## Predicting Cardiovascular Diseases (CVDs) Using TensorFlow

Original dataset: https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data

Dataset details:
1. age = Age
2. anaemia = Decrease of red blood cells or hemoglobin (boolean)
3. creatinine_phosphokinase = Level of the CPK enzyme in the blood (mcg/L)
4. diabetes = If the patient has diabetes (boolean)
5. ejection_fraction = Percentage of blood leaving the heart at each contraction (percentage)
6. high_blood_pressure = If the patient has hypertension (boolean)
7. platelets = Platelets in the blood (kiloplatelets/mL)
8. serum_creatinine = Level of serum creatinine in the blood (mg/dL)
9. serum_sodium = Level of serum sodium in the blood (mEq/L)
10. sex = Woman or man (binary)
11. smoking = If the patient smokes or not (boolean)
12. time = Follow-up period (days)
13. DEATH_EVENT = If the patient died during the follow-up period (boolean)


In [82]:
# Import relevant libraries
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential # For simple model creation with a single input and output
from tensorflow.keras.layers import Dense, InputLayer 
from sklearn.metrics import classification_report

#### 1. Import Data

In [3]:
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
print(df.head())

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  DEATH_EVENT  
0        0     4            1  
1        0     6            1  
2       

In [4]:
# Check duplicate and missing values
print(df.duplicated().sum())
print(df.isnull().sum())


0
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64


In [9]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  death_event               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB
None


In [8]:
df.rename(columns={'DEATH_EVENT':'death_event'}, inplace=True)
print(df.head())

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  death_event  
0        0     4            1  
1        0     6            1  
2       

In [12]:
# Print the distribution of death_event using Counter
death_distribution = Counter(df['death_event'])
print(death_distribution)

Counter({0: 203, 1: 96})


The death_event distribution shows 96 deaths among 299 patients (about 32%), so the two classes are imbalanced.
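
The proportions behind this count can be checked with a few lines of plain Python. The sketch below hard-codes the counts printed above and also derives the accuracy of a trivial classifier that always predicts the majority class, which is a useful reference point for any later accuracy figure. The variable names are illustrative.

```python
from collections import Counter

# Counts copied from the Counter output above
death_distribution = Counter({0: 203, 1: 96})

total = sum(death_distribution.values())
death_rate = death_distribution[1] / total
# Accuracy of a trivial model that always predicts the most common class
majority_baseline = max(death_distribution.values()) / total

print(f"Patients: {total}")
print(f"Death rate: {death_rate:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```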

In [13]:
# Extract the label column 'death_event' and assign it to y
y = df['death_event']

# Extract the features columns and assign it to X
X = df.drop(columns=['death_event'])

#### 2. Preprocessing Data

In [14]:
# One-hot encode any categorical columns with pandas get_dummies (every feature here is already numeric, so X is unchanged)
X = pd.get_dummies(X)
print(X.head())

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  
0        0     4  
1        0     6  
2        1     7  
3        0     7  
4        
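
Because every column in this dataset is already numeric, `pd.get_dummies` has nothing to encode and returns the features unchanged. A minimal sketch on made-up toy frames (the values and the string column `sex_label` are hypothetical, not from the dataset) shows the difference:

```python
import pandas as pd

# Toy frame with numeric columns only, mirroring the dtypes in the dataset
numeric_only = pd.DataFrame({"age": [75.0, 55.0], "sex": [1, 0]})
encoded_numeric = pd.get_dummies(numeric_only)
print(list(encoded_numeric.columns))

# Hypothetical frame with a string column, which get_dummies does expand
with_strings = pd.DataFrame({"age": [75.0, 55.0], "sex_label": ["man", "woman"]})
encoded_strings = pd.get_dummies(with_strings)
print(list(encoded_strings.columns))
```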

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(239, 12) (60, 12) (239,) (60,)
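
With only 299 rows and roughly one third positive labels, a purely random split can leave the two subsets with different class ratios. One option, sketched below on synthetic labels that only reproduce the 203/96 class counts, is the `stratify` argument of `train_test_split`. Whether it changes the results on the real data is an assumption that would need to be checked.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 203 zeros and 96 ones, 12 dummy features
y_demo = np.array([0] * 203 + [1] * 96)
X_demo = np.zeros((len(y_demo), 12))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("Positive rate overall:", round(y_demo.mean(), 3))
print("Positive rate in train:", round(y_tr.mean(), 3))
print("Positive rate in test:", round(y_te.mean(), 3))
```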


In [55]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

In [56]:
# Reassign the feature column names (a safeguard only; train_test_split already returns DataFrames that keep them)
X_train.columns = X.columns
X_test.columns = X.columns

In [57]:
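# Apply StandardScaler to scale the numerical features.
# Note: ColumnTransformer defaults to remainder='drop', so the five binary columns are discarded and the model sees only these seven features.
numerical_features = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']
ct = ColumnTransformer(transformers=[('numerical', StandardScaler(), numerical_features)])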

In [58]:
# Fit the transformer on the training data only, then apply the same scaling to the test data
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)
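
Note that `ColumnTransformer` drops every column not named in `transformers` unless `remainder` is set. A minimal sketch on a toy frame (values invented for illustration) shows how `remainder='passthrough'` would keep a binary column alongside the scaled ones. This is an alternative configuration, not the one used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data: two numerical columns and one binary column (invented values)
toy = pd.DataFrame({
    "age": [50.0, 60.0, 70.0],
    "time": [4, 6, 8],
    "smoking": [0, 1, 0],
})

# Default behaviour (remainder='drop'): only the listed columns survive
ct_drop = ColumnTransformer([("numerical", StandardScaler(), ["age", "time"])])
dropped = ct_drop.fit_transform(toy)

# remainder='passthrough': unlisted columns are appended unscaled
ct_keep = ColumnTransformer(
    [("numerical", StandardScaler(), ["age", "time"])],
    remainder="passthrough",
)
kept = ct_keep.fit_transform(toy)

print(dropped.shape)
print(kept.shape)
```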

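#### 3. Prepare labels for classification

Encode the labels as integers. death_event is already stored as 0/1, so LabelEncoder leaves the values unchanged here; the step matters only when labels are non-numerical.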

In [60]:
# Initialize the LabelEncoder, fit it on the training labels, and apply it to the test labels
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [63]:
# Convert a class vector (integers) to binary class matrix using to_categorical
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
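
`to_categorical` turns each integer label into a one-hot row, so label 0 becomes `[1, 0]` and label 1 becomes `[0, 1]`. The same mapping can be reproduced in NumPy, which is a handy way to inspect it without TensorFlow. The label values below are made up.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])      # example integer labels (invented)
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# np.argmax reverses the encoding
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```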

#### 4. Design the model

In [67]:
# Initialize the model
model = Sequential()

In [69]:
# Create the input layer
model.add(InputLayer(input_shape=(X_train.shape[1],)))

In [70]:
# Create hidden layer using Dense
model.add(Dense(12, activation='relu')) # 12 neurons, relu activation function

In [71]:
# Create output layer using Dense
model.add(Dense(2, activation='softmax')) # 2 neurons, softmax activation function
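
The size of this network follows directly from the layer widths. Assuming the input has 7 features (the seven columns named in the ColumnTransformer above), a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper below is a hypothetical illustration, not a Keras function.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

n_features = 7  # assumed: number of columns kept by the ColumnTransformer
hidden = dense_params(n_features, 12)   # hidden layer with 12 units
output = dense_params(12, 2)            # output layer with 2 units

print(hidden, output, hidden + output)
```

If the assumption holds, `model.summary()` should report the same total.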

In [72]:
# Compile with categorical cross-entropy loss (matches the one-hot labels), the Adam optimizer, and accuracy as the metric
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
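
For reference, the softmax output and the categorical cross-entropy loss chosen here can be written out in NumPy. This is a minimal sketch of the textbook formulas, not Keras' internal implementation, and the logits are invented for illustration.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

logits = np.array([[2.0, 0.5], [0.1, 1.2]])   # invented raw outputs for two patients
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot labels

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs)
print(loss)
```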

#### 5. Train and evaluate the model

In [73]:
model.fit(X_train, y_train, epochs=100, batch_size=16, verbose=1)

Epoch 1/100
Epoch 2/100
...
Epoch 78

<keras.callbacks.History at 0x2481bcce740>

Analysis of Training Progress (training-set metrics only, since no validation data was passed to model.fit):
- Learning Trend: The loss decreases and the accuracy increases over the epochs, which indicates that the model is fitting the training data. It does not by itself show how well the model generalizes.
- Speed of Convergence: The improvement in accuracy and loss is steady rather than erratic, so training is stable with the chosen optimizer and batch size.
- Early Stages: The accuracy starts out low, as expected for a network with randomly initialized weights.
- Later Stages: Towards the end, the per-epoch improvement in loss and accuracy shrinks, suggesting the model is approaching what this architecture can fit on this data.
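
The "Later Stages" observation can be made concrete by looking at how much the loss drops from one epoch to the next. In Keras the per-epoch values are available as `history.history['loss']` when the return value of `model.fit` is stored. The sketch below uses an invented list of losses instead, since the real values are not shown in the output above, and the tolerance is a hypothetical threshold.

```python
# Invented per-epoch training losses, standing in for history.history['loss']
losses = [0.70, 0.62, 0.57, 0.54, 0.525, 0.52]

# Improvement achieved by each epoch relative to the previous one
improvements = [prev - curr for prev, curr in zip(losses, losses[1:])]

# First epoch (1-based) whose improvement falls below a chosen tolerance
tolerance = 0.01  # hypothetical threshold
plateau_epoch = next(
    (i + 2 for i, gain in enumerate(improvements) if gain < tolerance),
    None,
)

print([round(g, 3) for g in improvements])
print(plateau_epoch)
```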

In [77]:
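# Evaluate the model on the held-out test set
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Accuracy: %.2f' % (accuracy*100))
# Note: the loss is a cross-entropy value, not a percentage; it is multiplied by 100 for display only
print('loss: %.2f' % (loss*100))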

Accuracy: 68.33
loss: 60.20


#### 6. Generate a classification report

In [80]:
y_estimate = model.predict(X_test)

for i in range(10):
    print('Actual:', y_test[i], 'Predicted:', y_estimate[i])

Actual: [1. 0.] Predicted: [0.81400955 0.18599038]
Actual: [1. 0.] Predicted: [0.9967305 0.0032695]
Actual: [0. 1.] Predicted: [0.89943975 0.10056023]
Actual: [0. 1.] Predicted: [0.00875817 0.99124175]
Actual: [1. 0.] Predicted: [0.94671464 0.05328544]
Actual: [1. 0.] Predicted: [0.9930094  0.00699061]
Actual: [0. 1.] Predicted: [0.29900047 0.70099956]
Actual: [1. 0.] Predicted: [0.5475237 0.4524763]
Actual: [0. 1.] Predicted: [0.02328878 0.97671115]
Actual: [1. 0.] Predicted: [0.8561362 0.1438637]


In [81]:
# Select the index of the highest probability using np.argmax
y_estimate = np.argmax(y_estimate, axis=1)
y_true = np.argmax(y_test, axis=1)
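
Applied to the ten rows printed above, `np.argmax` picks class 0 or 1 for each patient. The sketch below copies those printed probabilities (rounded to three decimals) and compares the resulting classes with the actual labels.

```python
import numpy as np

# Predicted probabilities copied from the ten printed rows, rounded to 3 decimals
probs = np.array([
    [0.814, 0.186], [0.997, 0.003], [0.899, 0.101], [0.009, 0.991],
    [0.947, 0.053], [0.993, 0.007], [0.299, 0.701], [0.548, 0.452],
    [0.023, 0.977], [0.856, 0.144],
])
# Actual one-hot labels from the same rows
actual = np.array([
    [1, 0], [1, 0], [0, 1], [0, 1], [1, 0],
    [1, 0], [0, 1], [1, 0], [0, 1], [1, 0],
])

pred_class = np.argmax(probs, axis=1)
true_class = np.argmax(actual, axis=1)

print(pred_class)
print(true_class)
print("Correct:", int((pred_class == true_class).sum()), "of", len(true_class))
```

Only the third patient is misclassified in this sample: the model assigns about 0.90 probability to class 0 although the actual label is class 1.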

In [83]:
# Print the classification report
print(classification_report(y_true, y_estimate))

              precision    recall  f1-score   support

           0       0.69      0.83      0.75        35
           1       0.67      0.48      0.56        25

    accuracy                           0.68        60
   macro avg       0.68      0.65      0.66        60
weighted avg       0.68      0.68      0.67        60
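
The figures in the report can be traced back to four counts. The counts below are not printed by the notebook; they are inferred from the reported support and recall values (12 of the 25 class-1 patients and 29 of the 35 class-0 patients classified correctly), so treat them as a reconstruction.

```python
# Confusion-matrix counts inferred from the report, with class 1 as positive
tp, fn = 12, 13   # class-1 patients predicted as class 1 / as class 0 (support 25)
tn, fp = 29, 6    # class-0 patients predicted as class 0 / as class 1 (support 35)

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```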



#### Conclusion

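1. The model identifies class 0 (no death during follow-up) better than class 1 (death): recall is 0.83 versus 0.48 and F1-score is 0.75 versus 0.56, so roughly half of the deaths in the test set are missed.
2. The precision is similar for both classes (0.69 and 0.67), so a prediction of either class is correct about two times out of three.
3. The overall accuracy of 0.68 means the model predicts the correct class for 68% of the 60 test patients, not of the entire dataset. For comparison, always predicting class 0 would score 35/60 ≈ 0.58 on this test set.
4. The macro average (F1 of 0.66) weights both classes equally, while the weighted average (F1 of 0.67) weights each class by its support; both are pulled down by the weak recall on class 1.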
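
The macro and weighted averages differ only in how the classes are weighted. A short sketch, using per-class F1 values recomputed from confusion-matrix counts inferred from the report (an assumption, since the notebook does not print them):

```python
# Per-class F1 as 2*TP / (2*TP + FP + FN), from counts inferred from the report
f1_class0 = 2 * 29 / (2 * 29 + 13 + 6)   # class 0 treated as positive
f1_class1 = 2 * 12 / (2 * 12 + 6 + 13)   # class 1 treated as positive
support = {0: 35, 1: 25}

# Macro average: every class counts equally
macro_f1 = (f1_class0 + f1_class1) / 2

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_f1 = (support[0] * f1_class0 + support[1] * f1_class1) / total

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```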