<a href="https://colab.research.google.com/github/cbarron100/Neural-Networks/blob/main/CardiovacularClassificationTF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We are going to build a classification model that predicts the survival of patients with heart failure from serum creatinine and ejection fraction, together with other factors such as age, anaemia, and diabetes.

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.

In [46]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.compose import ColumnTransformer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
from sklearn.metrics import f1_score, classification_report
from tensorflow.keras.utils import to_categorical
import numpy as np

Import the data and take a look at its structure.

Several columns hold strings (dtype object) and will need to be encoded. They are listed in the object_col variable further down.

We will also define our features and our target here.

In [16]:
data = pd.read_csv('/content/cardiovascular_data.csv')
data.info()  # info() prints its own summary and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
Int64Index: 299 entries, 0 to 298
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    object 
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    object 
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    object 
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    object 
 10  smoking                   299 non-null    object 
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
 13  death_event               299 non-null    object 
dtypes: float64

In [28]:

# Class balance of the integer-coded target
print(Counter(data['DEATH_EVENT']))


X = data[['age',
          'anaemia',
          'creatinine_phosphokinase',
          'diabetes',
          'ejection_fraction',
          'high_blood_pressure',
          'platelets',
          'serum_creatinine',
          'serum_sodium',
          'sex',
          'smoking',
          'time']]

# String-valued copy of the target (dtype object); encoded to integers further down
y = data['death_event']

Counter({0: 203, 1: 96})


We now have our feature and label columns. Some of the feature columns contain strings, so we will one-hot encode them.

In [29]:
# String-valued columns that need one-hot encoding
object_col = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']

X = pd.get_dummies(X, columns=object_col)
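To make the encoding step concrete, here is a minimal sketch on a toy frame. The column values ("yes" and "no") are made up for illustration and may not match the strings in the real dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [60.0, 45.0, 70.0],
    "smoking": ["yes", "no", "yes"],
})

# Only the listed string column is expanded; the numeric column is left alone.
encoded = pd.get_dummies(toy, columns=["smoking"])
print(list(encoded.columns))
print(encoded)
```

Each distinct string becomes its own indicator column. For a two-valued column the two indicators carry the same information, so passing drop_first=True to get_dummies is an option if we want to keep only one of them.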


Now that all of the columns are numerical, we can split the data into training and test sets. We are also going to standardise the numeric columns so that no feature dominates the prediction simply because it is measured on a larger scale.

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

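# Standardise the numeric columns. With the default remainder='drop', the
# one-hot columns are not kept; pass remainder='passthrough' to keep them.
numeric_col = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
               'platelets', 'serum_creatinine', 'serum_sodium', 'time']
ct = ColumnTransformer([('numeric', StandardScaler(), numeric_col)])
X_train = ct.fit_transform(X_train)  # learn mean and std on the training split
X_test = ct.transform(X_test)        # reuse the training statistics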
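The usual convention is to learn the scaling statistics from the training split only and reuse them on the test split, so that both are on the same scale and nothing about the test set leaks into preprocessing. The sketch below reproduces the standardisation by hand with NumPy on made-up ages. It is an illustration of the arithmetic, not the notebook's pipeline.

```python
import numpy as np

# Made-up ages standing in for one numeric column.
train_age = np.array([40.0, 50.0, 60.0, 70.0])
test_age = np.array([55.0, 80.0])

# The statistics come from the training split only.
mean = train_age.mean()
std = train_age.std()  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train_age - mean) / std
test_scaled = (test_age - mean) / std  # reuse the training mean and std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction. The test column generally does not, and that is expected.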

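Next we initialise a LabelEncoder so that the label values become 0 or 1, depending on whether a death event was recorded for the patient. We then one-hot encode the labels with to_categorical to match the two-unit softmax output of the model.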

In [37]:
le = LabelEncoder()
y_train = le.fit_transform(y_train.astype(str))
y_test = le.transform(y_test.astype(str))  # reuse the mapping learned on y_train

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
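As a rough picture of what these two steps produce, the sketch below mimics them with NumPy alone. The label strings are hypothetical, and np.unique is used here as a stand-in for LabelEncoder, which also assigns integer codes in sorted class order.

```python
import numpy as np

# Hypothetical string labels; the real column's values may differ.
labels = np.array(["no", "yes", "no", "no", "yes"])

# Sorted unique classes plus one integer code per row.
classes, codes = np.unique(labels, return_inverse=True)

# One row of the identity matrix per code gives the one-hot matrix.
one_hot = np.eye(len(classes))[codes]

print(classes)
print(codes)
print(one_hot)
```

Learning the class-to-integer mapping once and reusing it for the test labels guarantees that a given code means the same class in both splits.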


Now that the data is prepared, we can create a model. The input layer has one unit per feature. The next layer has 12 nodes with a ReLU activation, and the output layer has 2 nodes with a softmax activation, so that we get a probability for each of the two classes.


In [44]:
model = Sequential()

model.add(InputLayer(input_shape=(X_train.shape[1],)))  # shape must be a tuple
model.add(Dense(12, activation = 'relu'))
model.add(Dense(y_train.shape[1], activation = 'softmax'))

model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
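To see what this architecture computes, here is a minimal NumPy sketch of one forward pass and the categorical cross-entropy loss. The weights are random stand-ins rather than trained values, the inputs and targets are invented, and the feature count of 7 is an assumption matching the seven scaled columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 7, 12, 2  # 7 is an assumed feature count

# Random weights stand in for trained ones.
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)  # Dense(12) followed by ReLU
    return softmax(hidden @ W2 + b2)       # Dense(2) followed by softmax

x = rng.normal(size=(4, n_features))        # four made-up patients
probs = forward(x)

# Categorical cross-entropy against made-up one-hot targets.
y_onehot = np.eye(n_classes)[[0, 1, 0, 1]]
loss = -np.mean(np.sum(y_onehot * np.log(np.clip(probs, 1e-12, 1.0)), axis=1))

print(probs.sum(axis=1))
print(loss)
```

Each row of probs sums to 1, which is why the two outputs can be read as class probabilities.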




Now we are going to train the model on our data. Then we will report the model's metrics to see how well it predicts the death event on the test set.

In [50]:
num_epochs = 100
batch = 16

model.fit(X_train, y_train, epochs= num_epochs, batch_size = batch, verbose = 0)
loss, acc = model.evaluate(X_test, y_test)

y_estimate = model.predict(X_test)
y_estimate = np.argmax(y_estimate, axis = 1)
y_true = np.argmax(y_test, axis = 1)

print(f'Model loss: {loss}\nModel accuracy: {acc}')
print('Model F1 Score:', f1_score(y_true, y_estimate))
print('Model Classification Report: \n', classification_report(y_true, y_estimate))

Model loss: 0.7557295560836792
Model accuracy: 0.6833333373069763
Model F1 Score: 0.5581395348837209
Model Classification Report: 
               precision    recall  f1-score   support

           0       0.69      0.83      0.75        35
           1       0.67      0.48      0.56        25

    accuracy                           0.68        60
   macro avg       0.68      0.65      0.66        60
weighted avg       0.68      0.68      0.67        60
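The numbers in the report can be checked by hand. The confusion-matrix counts below are not printed by the notebook. They are inferred from the supports (35 and 25) and the rounded precision and recall values, so treat them as a reconstruction.

```python
# Counts inferred from the report above (an assumption, not notebook output).
tn, fp, fn, tp = 29, 6, 13, 12
total = tn + fp + fn + tp

precision = tp / (tp + fp)              # of predicted deaths, how many were right
recall = tp / (tp + fn)                 # of actual deaths, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
majority_baseline = (tn + fp) / total   # accuracy of always predicting class 0

print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
print(f"accuracy:  {accuracy:.4f}")
print(f"baseline:  {majority_baseline:.4f}")
```

With these counts, the F1 score and accuracy agree with the values printed by the notebook. The baseline line is a useful reference point: always predicting class 0 would already be right for 35 of the 60 test patients.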

