Prompt: Can you analyse the "Heart Disease Cleveland" dataset and build a heart disease prediction model using deep neural networks with TF & Keras?


In [1]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
data = pd.read_csv(url, header=None, names=column_names, na_values="?")

In [2]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Impute missing values (SimpleImputer defaults to the column mean)
imputer = SimpleImputer()
data = pd.DataFrame(imputer.fit_transform(data), columns=column_names)

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X = data.drop('num', axis=1)
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Target: 'num' is used as loaded (values 0-4, as the classification report further down shows)
y = data['num']
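
One thing worth checking at this step: in the UCI description of this dataset, `num` is 0 for no disease and 1 to 4 for increasing degrees of disease, so it is not a 0/1 column as loaded. A minimal sketch of collapsing it into a binary "disease present" target is shown here on a small made-up frame; `demo` and `y_binary` are illustrative names, not part of the notebook, and whether to binarize at all is a modelling choice rather than something the original run did.

```python
import pandas as pd

# Hypothetical stand-in for the 'num' column (real values range over 0-4).
demo = pd.DataFrame({"num": [0.0, 2.0, 1.0, 0.0, 4.0, 3.0]})

# 0 stays 0 (no disease); any value from 1 to 4 becomes 1 (disease present).
y_binary = (demo["num"] > 0).astype(int)

print(y_binary.tolist())  # [0, 1, 1, 0, 1, 1]
```

With a target like this, a single sigmoid output and binary cross-entropy match the label format.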

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
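
The cells above impute and scale the full table before splitting, so the test rows influence the imputation means and scaling statistics. A hedged sketch of the alternative order (split first, then fit the preprocessing on the training rows only) follows. It uses synthetic data, and names such as `X_demo`, `prep`, and `X_tr_prepared` are placeholders; `stratify` is added on the assumption that keeping class proportions equal across the split is desirable for a small dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 4 features, roughly 5% missing values.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan
y_demo = np.array([0] * 60 + [1] * 40)

# Split first; stratify keeps the 60/40 class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit imputer and scaler on the training rows only, then reuse them on the test rows.
prep = make_pipeline(SimpleImputer(), StandardScaler())
X_tr_prepared = prep.fit_transform(X_tr)
X_te_prepared = prep.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
```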

In [4]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(32, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

# Note: a single sigmoid unit with binary_crossentropy assumes a 0/1 target
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
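
If the five original `num` levels are kept as separate classes, one possible alternative to the sigmoid head is a five-unit softmax output trained with `sparse_categorical_crossentropy`. The sketch below assumes integer labels 0 to 4 and runs on random data only to show the shapes involved; `multi_model`, `X_demo`, and `y_demo` are hypothetical names, and nothing here was run against the Cleveland data.

```python
import numpy as np
import tensorflow as tf

# Random stand-in data: 64 rows, 13 features, integer labels 0-4.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(64, 13)).astype("float32")
y_demo = rng.integers(0, 5, size=64)

multi_model = tf.keras.Sequential([
    tf.keras.Input(shape=(13,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),  # one probability per class
])

# sparse_categorical_crossentropy accepts integer class labels directly.
multi_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
multi_model.fit(X_demo, y_demo, epochs=2, batch_size=16, verbose=0)

probs = multi_model.predict(X_demo, verbose=0)
pred_classes = probs.argmax(axis=1)  # most likely class per row
print(probs.shape, pred_classes[:5])
```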


In [None]:
history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2)

In [6]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.5410


GPT-4: Note that the Heart Disease Cleveland dataset is relatively small, and deep learning models might not outperform traditional machine learning algorithms such as logistic regression or random forests. It's a good idea to compare the performance of different models and choose the one that works best for this specific problem.
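
The comparison suggested above can be sketched with cross-validation, which gives each candidate several train/validation splits instead of one. The data here is synthetic (`make_classification`), the two candidates are only examples, and `cv_scores` is an illustrative name; the scores printed say nothing about the Cleveland dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary problem with the same number of features (13) as the notebook.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean 5-fold accuracy per candidate.
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```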

Me: The accuracy is 0.5410, which looks very low for the model you generated. Can you tune the model using Keras Tuner, pls?

In [None]:
!pip install keras-tuner

In [8]:
from keras_tuner import RandomSearch  # the keras-tuner package is imported as keras_tuner
from tensorflow.keras import layers

def build_model(hp):
    model = tf.keras.Sequential()
    model.add(layers.Dense(units=hp.Int('units_1', min_value=32, max_value=256, step=32),
                           activation='relu', input_shape=(X_train.shape[1],)))
    model.add(layers.Dropout(rate=hp.Float('dropout_1', min_value=0.1, max_value=0.5, step=0.1)))
    
    model.add(layers.Dense(units=hp.Int('units_2', min_value=16, max_value=128, step=16),
                           activation='relu'))
    model.add(layers.Dropout(rate=hp.Float('dropout_2', min_value=0.1, max_value=0.5, step=0.1)))
    
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Float('learning_rate', min_value=1e-5, max_value=1e-2, sampling='log')),
        loss='binary_crossentropy',
        metrics=['accuracy'])

    return model



In [9]:
tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=50,
    executions_per_trial=3,
    directory='my_dir',
    project_name='helloworld')

tuner.search_space_summary()

tuner.search(X_train, y_train, epochs=50, validation_split=0.2)

Trial 50 Complete [00h 00m 16s]
val_accuracy: 0.5170067946116129

Best val_accuracy So Far: 0.557823121547699
Total elapsed time: 00h 14m 42s


In [10]:
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is {best_hps.get('units_1')} and the optimal rate of dropout is {best_hps.get('dropout_1')}.
The optimal number of units in the second densely-connected layer is {best_hps.get('units_2')} and
the optimal rate of dropout is {best_hps.get('dropout_2')}. The optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}.
""")

best_model = tuner.get_best_models(num_models=1)[0]


The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is 64 and the optimal rate of dropout is 0.5.
The optimal number of units in the second densely-connected layer is 16 and
the optimal rate of dropout is 0.30000000000000004. The optimal learning rate for the optimizer
is 4.144107425040524e-05.



In [11]:
history = best_model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

Epoch 1/50
...
Epoch 50/50


In [12]:
loss, accuracy = best_model.evaluate(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.5246


Me: Oh well, the accuracy after hyperparameter tuning is worse than before. Can you generate code for me using random forests, pls?

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [19]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

In [20]:
y_pred = rf_clf.predict(X_test)

In [21]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.4918
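
For scale, the test split holds 61 rows (the support total in the classification report near the end of this log), so each test row is worth about 1.6 percentage points of accuracy. The back-of-the-envelope sketch below converts the three accuracies reported so far into counts of correct predictions and adds a simple binomial standard error. The standard error formula is a rough approximation, and the names are illustrative.

```python
import math

n_test = 61  # rows in the test split, from the classification report support
reported = {
    "dense network": 0.5410,
    "tuned network": 0.5246,
    "random forest": 0.4918,
}

correct_counts = {}
for name, acc in reported.items():
    correct_counts[name] = round(acc * n_test)
    # Binomial standard error of an accuracy estimated from n_test rows.
    std_err = math.sqrt(acc * (1 - acc) / n_test)
    print(f"{name}: {correct_counts[name]}/{n_test} correct, std. error ~ {std_err:.3f}")

print(correct_counts)
# {'dense network': 33, 'tuned network': 32, 'random forest': 30}
```

The three models are separated by one to three test rows, which is well within the roughly 0.06 standard error, so the ranking between them is not reliable at this sample size.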


Me: Oh well, the accuracy is much worse with this one, at 0.4918. Can you select a suitable machine learning model, generate the code, and get me to an accuracy > 0.9?

GPT-4: It's important to note that achieving an accuracy of over 0.9 might not be feasible for every dataset. However, I can suggest trying a different machine learning model, such as a Support Vector Machine (SVM) with a radial basis function (RBF) kernel, to see whether it improves the performance. Here's how you can do that.

In [22]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [23]:
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001]
}

grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

Best hyperparameters: {'C': 1, 'gamma': 0.1}
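
When the classes are imbalanced, plain accuracy in a grid search can favour parameters that mostly predict the largest class. One hedged variation is sketched below on synthetic three-class data: the scaler sits inside a pipeline so it is refit per fold, `class_weight="balanced"` reweights rare classes, and the search optimizes `balanced_accuracy`. The names `pipe`, `param_grid_demo`, and `search` are placeholders, and this was not run on the Cleveland data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced 3-class problem (about 60% / 25% / 15%).
X_demo, y_demo = make_classification(
    n_samples=240, n_features=13, n_informative=6, n_classes=3,
    weights=[0.6, 0.25, 0.15], random_state=42,
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))

# make_pipeline names the SVC step 'svc', hence the 'svc__' prefix.
param_grid_demo = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 0.01]}

search = GridSearchCV(
    pipe,
    param_grid_demo,
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best hyperparameters:", search.best_params_)
print(f"Best balanced accuracy: {search.best_score_:.3f}")
```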


In [24]:
svm_clf = SVC(kernel='rbf', C=best_params['C'], gamma=best_params['gamma'])
svm_clf.fit(X_train, y_train)

In [25]:
y_pred = svm_clf.predict(X_test)

In [26]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.5410


In [27]:
from sklearn.metrics import classification_report

In [28]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.68      0.97      0.80        29
         1.0       0.43      0.25      0.32        12
         2.0       0.25      0.22      0.24         9
         3.0       0.00      0.00      0.00         7
         4.0       0.00      0.00      0.00         4

    accuracy                           0.54        61
   macro avg       0.27      0.29      0.27        61
weighted avg       0.45      0.54      0.48        61
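
The support column gives two useful reference points for every accuracy in this log. The sketch below computes the accuracy of always predicting the majority class, and the best accuracy available to any model that can only output 0 or 1 (which is the case for a thresholded sigmoid output scored against the 0-4 labels). The counts are copied from the report above; the variable names are illustrative.

```python
# Test-set support per class, copied from the classification report.
support = {0: 29, 1: 12, 2: 9, 3: 7, 4: 4}
n_test = sum(support.values())

# Accuracy of always predicting the most frequent class (class 0).
majority_baseline = max(support.values()) / n_test

# Best possible accuracy if only classes 0 and 1 can ever be predicted.
binary_ceiling = (support[0] + support[1]) / n_test

print(n_test)                       # 61
print(f"{majority_baseline:.4f}")   # 0.4754
print(f"{binary_ceiling:.4f}")      # 0.6721
```

Seen against these numbers, the reported accuracies between 0.49 and 0.54 sit only slightly above the majority-class baseline.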



  _warn_prf(average, modifier, msg_start, len(result))
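
The `_warn_prf` output above looks like the tail of scikit-learn's undefined-metric warning, most likely raised because classes 3.0 and 4.0 are never predicted, which leaves their precision undefined. Assuming that is the cause, `classification_report` accepts a `zero_division` argument that sets the value explicitly and silences the warning. A small sketch on made-up labels:

```python
from sklearn.metrics import classification_report

# Made-up labels: class 2 occurs in y_true_demo but is never predicted.
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 0, 0]

# zero_division=0 reports 0.0 for undefined precision instead of warning.
report_dict = classification_report(
    y_true_demo, y_pred_demo, zero_division=0, output_dict=True
)

print(report_dict["2"]["precision"])  # 0.0
```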
