Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.
Dataset link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link
Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.

Q2. Split the dataset into a training set (70%) and a test set (30%).

Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.

Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.

Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.

Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk.

Answer 1...

Preprocessing the dataset:

First, we need to load the dataset and import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('heart.csv')


The dataset contains missing values represented by '?'. We will replace these values with NaNs and then drop the rows that contain missing values:

In [None]:
df = df.replace('?', np.nan)
df = df.dropna()
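
Before dropping rows, it can help to check how many values are actually missing and in which columns. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in, not the real heart.csv) to walk through the same replace-and-drop steps. It also converts the affected columns back to numbers, since a column that contained '?' is read as text:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for heart.csv; the values are made up.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "ca": ["0", "?", "2", "0"],
    "thal": ["1", "2", "?", "3"],
})

# Mark '?' as missing and count the missing values per column
cleaned = toy.replace("?", np.nan)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were lost
rows_before = len(cleaned)
cleaned = cleaned.dropna()
rows_dropped = rows_before - len(cleaned)
print("Rows dropped:", rows_dropped)

# Columns that held '?' are still text, so convert everything to numeric
cleaned = cleaned.apply(pd.to_numeric)
print(cleaned.dtypes)
```

On a dataset of only 303 rows, the number of dropped rows is worth reporting, because it directly reduces the data available for training.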


Next, we will encode the categorical variables 'cp', 'restecg', 'slope', and 'thal' using LabelEncoder:

In [None]:
encoder = LabelEncoder()
df['cp'] = encoder.fit_transform(df['cp'])
df['restecg'] = encoder.fit_transform(df['restecg'])
df['slope'] = encoder.fit_transform(df['slope'])
df['thal'] = encoder.fit_transform(df['thal'])
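
LabelEncoder maps each category to an integer, which implies an ordering that a nominal feature such as chest pain type does not have. Tree-based models usually cope with this, but one-hot encoding is a common alternative. A minimal sketch using pandas' `get_dummies` on made-up values (the `toy` frame is hypothetical, and this is a swapped-in technique, not what the cells above do):

```python
import pandas as pd

# Hypothetical toy data: 'cp' is a nominal code, not a quantity
toy = pd.DataFrame({"age": [63, 37, 41, 56], "cp": [3, 2, 1, 1]})

# Replace 'cp' with one indicator column per observed category
encoded = pd.get_dummies(toy, columns=["cp"], dtype=int)
print(encoded)
```

Each row then has exactly one of the `cp_*` indicator columns set to 1.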


Finally, we will scale the numerical features using StandardScaler. Random forests split on thresholds and are not sensitive to feature scale, so this step is optional here:

In [None]:
# Standardise the continuous columns to zero mean and unit variance
num_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])


Q2. Split the dataset into a training set (70%) and a test set (30%).

In [None]:
# Answer 2...

# Splitting the dataset:

# We will split the dataset into a training set (70%) and a test set (30%):

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
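
With a small dataset, a purely random split can leave the two sets with different class proportions. One option is the `stratify` argument of `train_test_split`, sketched here on synthetic labels (`X_demo` and `y_demo` are made up for illustration and are not the heart data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 30% positive class
X_demo = np.arange(200).reshape(-1, 1)
y_demo = np.array([1] * 60 + [0] * 140)

# stratify=y_demo keeps the class ratio the same in both splits
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Train size:", len(ytr), "positive share:", ytr.mean())
print("Test size:", len(yte), "positive share:", yte.mean())
```

Applied to the split above, this would mean passing `stratify=y` as an extra argument.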


Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.

In [None]:
# Answer 3...

# Training the random forest classifier:

# We will train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each tree:

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rfc.fit(X_train, y_train)


Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

In [None]:
# Answer 4...
# Evaluating the performance of the model:

# We will evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = rfc.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 score:', f1)
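
To make the four metrics concrete, here is how they follow from confusion-matrix counts. The counts below are hypothetical and chosen only to illustrate the formulas; they are not results from this model:

```python
# Hypothetical confusion-matrix counts (not taken from the heart dataset)
tp, fp, fn, tn = 40, 10, 5, 35

# Accuracy: share of all predictions that are correct
accuracy_demo = (tp + tn) / (tp + fp + fn + tn)
# Precision: share of predicted positives that are truly positive
precision_demo = tp / (tp + fp)
# Recall: share of true positives that the model finds
recall_demo = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"Accuracy:  {accuracy_demo:.3f}")
print(f"Precision: {precision_demo:.3f}")
print(f"Recall:    {recall_demo:.3f}")
print(f"F1 score:  {f1_demo:.3f}")
```

For a screening task such as heart disease risk, recall is often the metric to watch, because a false negative means a patient at risk is missed.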


Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.

In [None]:
# Answer 5...

# Identifying the top 5 most important features:

# We will use the feature importance scores to identify the top 5 most important features in predicting heart disease risk:



importance_scores = rfc.feature_importances_
feature_importances = pd.DataFrame({'feature': X.columns, 'importance': importance_scores})
feature_importances = feature_importances.sort_values('importance', ascending=False).reset_index(drop=True)

print(feature_importances.head(5))


# We can also visualize the feature importances using a bar chart:

import matplotlib.pyplot as plt

plt.bar(x=feature_importances['feature'], height=feature_importances['importance'])
plt.xticks(rotation=90)
plt.show()



Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

In [None]:
# Answer 6...

# Tuning hyperparameters using GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Use a separate name so the fitted model from Q3 (`rfc`) is not overwritten
base_rfc = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(base_rfc, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print(f"Best parameters found: {best_params}")
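
The cost of an exhaustive grid search grows with the product of the list lengths in the grid. A quick count for the grid used above (the name `param_grid_demo` is only for this sketch):

```python
from math import prod

# Same candidate values as the grid above
param_grid_demo = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Number of hyperparameter combinations: 4 * 4 * 3 * 3
n_combinations = prod(len(values) for values in param_grid_demo.values())
# Each combination is fitted once per cross-validation fold
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 144 720
```

With the default `refit=True`, GridSearchCV adds one more fit of the best combination on the whole training set. If this becomes too slow, RandomizedSearchCV samples a fixed number of combinations instead.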



Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.

In [None]:
# Answer 7...

# Report the best set of hyperparameters found by the search and the corresponding performance metrics:

# Using the above code, the best hyperparameters reported by GridSearchCV were:

# Best parameters found: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}


In [None]:
# We can evaluate the performance of the tuned model using these hyperparameters:

from sklearn.metrics import classification_report

tuned_rfc = RandomForestClassifier(random_state=42, **best_params)
tuned_rfc.fit(X_train, y_train)

y_pred_tuned = tuned_rfc.predict(X_test)

print("Classification report for tuned model:\n")
print(classification_report(y_test, y_pred_tuned))
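
The question also asks for a comparison with the default model from Q3. One way to do this is to score both models with the same helper and put the results side by side. The sketch below runs on synthetic data from `make_classification`, so its numbers say nothing about the heart dataset; `score_model` is a hypothetical helper, and the tuned settings are the ones reported above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def score_model(model, X_eval, y_eval):
    """Return the four test-set metrics used in Q4 for a fitted model."""
    y_hat = model.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, y_hat),
        "precision": precision_score(y_eval, y_hat),
        "recall": recall_score(y_eval, y_hat),
        "f1": f1_score(y_eval, y_hat),
    }


# Synthetic stand-in for the heart data (13 features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Q3 settings versus the hyperparameters reported by the grid search
default_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
tuned_model = RandomForestClassifier(
    n_estimators=100, max_depth=10, min_samples_split=5, min_samples_leaf=1, random_state=42
)
default_model.fit(X_tr, y_tr)
tuned_model.fit(X_tr, y_tr)

# One column per model, one row per metric
comparison = pd.DataFrame({
    "default": score_model(default_model, X_te, y_te),
    "tuned": score_model(tuned_model, X_te, y_te),
})
print(comparison)
```

In the notebook itself, the same helper could be called with the fitted Q3 model and `tuned_rfc` on `X_test` and `y_test`. With a test set of roughly 90 rows, small differences between the two columns may not be meaningful.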


Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk.

Answer 8...

Interpret the model by analysing the decision boundaries of the random forest classifier:

To plot the decision boundaries of the random forest classifier, we select two of the most important features and create a scatter plot. A model trained on all features cannot be evaluated on a two-dimensional grid, so we fit a random forest on just these two features, using the entire dataset, and predict the class probabilities for each point on a grid covering the plot. We then visualize the decision boundaries by contouring the predicted class probabilities.



In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Select the two most important features, using the fitted tuned model from Q7
feature_names = df.drop(columns=['target']).columns
importances = tuned_rfc.feature_importances_
indices = np.argsort(importances)[::-1]
top_feature_indices = indices[:2]
top_feature_names = feature_names[top_feature_indices]
X = df.drop(columns=['target']).values[:, top_feature_indices].astype(float)
y = df['target'].values

# Train a fresh random forest (Q3 settings) on the two selected features, using the entire dataset
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rfc.fit(X, y)

# Create grid of points for scatter plot
x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = rfc.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

# Plot decision boundaries and scatter plot
plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu_r, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdBu_r, edgecolors='k')
plt.xlabel(top_feature_names[0])
plt.ylabel(top_feature_names[1])
plt.title("Decision boundaries of random forest classifier")
plt.show()


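The resulting plot shows the decision boundaries of the random forest classifier over a scatter plot of the two most important features. Because this classifier sees only two features, the boundaries describe a simplified two-feature model rather than the full model trained on all features. The blocky, axis-aligned regions reflect how the individual trees split on one feature at a time.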