Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.
Dataset link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
data_url = "https://drive.google.com/uc?id=1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ"
heart_df = pd.read_csv(data_url)

Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.

In [9]:
# Handle missing values
heart_df.dropna(inplace=True)

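# Encoding categorical variables (if any)
# Assumption: the categorical columns (e.g. sex, chest pain type) are already
# stored as integer codes, so no further encoding is applied here.

# Feature scaling (if necessary)
# Not needed for a random forest: tree splits depend only on the ordering of
# feature values, not on their scale.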


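The preprocessing steps named in Q1 can be sketched on a small made-up frame. This is only an illustration under stated assumptions: the column names (`age`, `cp`, `chol`, `target`) and all values are hypothetical, and the string-valued `cp` column is included purely to show what one-hot encoding would look like if the real data needed it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the heart disease data;
# column names and values are made up for illustration.
toy_df = pd.DataFrame({
    "age": [63, 37, 41, np.nan, 57],
    "cp": ["typical", "atypical", "non-anginal", "typical", "asymptomatic"],
    "chol": [233.0, 250.0, np.nan, 236.0, 354.0],
    "target": [1, 1, 1, 0, 0],
})

# Count missing values per column before deciding how to handle them
missing_counts = toy_df.isna().sum()

# Drop every row that has at least one missing value
clean_df = toy_df.dropna().reset_index(drop=True)

# One-hot encode the string-valued categorical column
encoded_df = pd.get_dummies(clean_df, columns=["cp"])

print(missing_counts)
print(encoded_df.columns.tolist())
```

Checking `isna().sum()` first shows how many rows `dropna` will discard; with only 303 instances, imputation may be preferable if many rows are affected.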

Q2. Split the dataset into a training set (70%) and a test set (30%).

In [10]:
X = heart_df.drop('target', axis=1)
y = heart_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


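With a small dataset, a purely random 70/30 split can leave the two partitions with different class ratios. The sketch below, on synthetic data (the arrays `X_demo` and `y_demo` are made up), shows how the `stratify` argument of `train_test_split` keeps the proportion of positive cases the same in both sets. It is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 30% positive class (made-up data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo preserves the class ratio in the train and test partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train size:", len(y_tr), "test size:", len(y_te))
print("positive rate (train, test):", y_tr.mean(), y_te.mean())
```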

Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.

In [11]:
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_classifier.fit(X_train, y_train)

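One way to confirm that the settings from Q3 took effect is to inspect the fitted forest. The sketch below uses synthetic data from `make_classification` (an assumption, standing in for the real dataset) and checks the number of trees and the depth of the deepest tree through `estimators_` and `get_depth()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart disease features (made-up data)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# estimators_ holds the individual fitted decision trees
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)

print("trees:", n_trees)
print("deepest tree:", deepest)
```

`max_depth` is an upper bound, so individual trees may stop earlier when their leaves become pure.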

Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

In [13]:
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Confusion Matrix:\n", conf_matrix)

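To make the relationship between the confusion matrix and the reported metrics concrete, the following sketch recomputes precision, recall, and F1 by hand from the matrix entries. The label arrays are made up for illustration; for binary labels scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration (1 = disease, 0 = no disease)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Unpack the 2x2 matrix in row-major order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)   # share of predicted positives that are correct
recall_manual = tp / (tp + fn)      # share of actual positives that are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("manual: ", precision_manual, recall_manual, f1_manual)
print("sklearn:", precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

For a heart disease screen, recall is often the metric to watch, since a false negative means a missed case.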

Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.

In [14]:
feature_importances = rf_classifier.feature_importances_
sorted_indices = np.argsort(feature_importances)[::-1]
top_features = X.columns[sorted_indices][:5]

# Visualizing feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importances[sorted_indices][:5], y=top_features)
plt.title("Top 5 Most Important Features")
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.show()

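An alternative to sorting with `np.argsort` is to wrap the importances in a labelled pandas Series, which keeps feature names and scores together. The sketch below assumes synthetic data and hypothetical feature names (`feature_0` to `feature_7`); the scores are the impurity-based importances exposed by `feature_importances_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (made up for illustration)
X_arr, y_arr = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_arr)

# Label each score with its feature name, then sort from most to least important
importance_series = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
top_5 = importance_series.head(5)

print(top_5)
```

Impurity-based importances are normalised to sum to 1 and can be diluted across correlated features, so the ranking is a guide rather than a causal statement.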

Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

In [15]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

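The grid above has 3 values for each of 4 hyperparameters, so an exhaustive search evaluates 3 × 3 × 3 × 3 = 81 combinations, or 405 model fits with 5-fold cross-validation. Q6 also allows random search, which samples only a fixed number of combinations. The sketch below counts the grid with `ParameterGrid` and then runs `RandomizedSearchCV` over the same space on synthetic data; the data and the choice of `n_iter=4` are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Cost of the exhaustive search: combinations times folds
n_combinations = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 = 81
n_folds = 5
total_fits = n_combinations * n_folds            # 81 * 5 = 405
print("combinations:", n_combinations, "total fits:", total_fits)

# Synthetic stand-in for the training set (made-up data)
X_demo, y_demo = make_classification(n_samples=150, n_features=13, random_state=0)

# Random search: evaluate only n_iter sampled combinations instead of all 81
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=4,
    cv=n_folds,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print("best params:", random_search.best_params_)
print("best CV accuracy:", random_search.best_score_)
```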

Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.

In [16]:
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
best_score = grid_search.best_score_

print("Best Hyperparameters:", best_params)
print("Best Cross-Validation Accuracy:", best_score)

# Evaluate best model on test set
y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
precision_best = precision_score(y_test, y_pred_best)
recall_best = recall_score(y_test, y_pred_best)
f1_best = f1_score(y_test, y_pred_best)

print("\nBest Model Performance on Test Set:")
print("Accuracy:", accuracy_best)
print("Precision:", precision_best)
print("Recall:", recall_best)
print("F1 Score:", f1_best)


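Q7 asks for a comparison with the default model, which is easier to read as a side-by-side table than as two separate printouts. The following is a minimal sketch on synthetic data: the helper `summarize` and the "candidate" hyperparameters are hypothetical, and in the notebook the candidate would instead be `grid_search.best_estimator_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split


def summarize(model, X_eval, y_eval):
    """Return the four test-set metrics of a fitted classifier as a dict."""
    pred = model.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, pred),
        "precision": precision_score(y_eval, pred),
        "recall": recall_score(y_eval, pred),
        "f1": f1_score(y_eval, pred),
    }


# Synthetic stand-in for the heart disease data (made-up data)
X_arr, y_arr = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_arr, y_arr, test_size=0.3, random_state=42)

# Baseline uses the Q3 settings; the candidate settings are hypothetical
baseline = RandomForestClassifier(
    n_estimators=100, max_depth=10, random_state=42
).fit(X_tr, y_tr)
candidate = RandomForestClassifier(
    n_estimators=200, max_depth=5, min_samples_leaf=2, random_state=42
).fit(X_tr, y_tr)

# One row per metric, one column per model, plus the difference
comparison = pd.DataFrame({
    "baseline": summarize(baseline, X_te, y_te),
    "candidate": summarize(candidate, X_te, y_te),
})
comparison["difference"] = comparison["candidate"] - comparison["baseline"]

print(comparison.round(3))
```

On a test set of roughly 90 patients, differences of one or two percentage points correspond to one or two predictions and should not be over-interpreted.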

Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk.

In [17]:
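# Pick the two most important features and refit a forest on them alone
# (a 2-D decision surface can only be drawn for a two-feature model)
top_2_features = top_features[:2]
X_top_2 = X[top_2_features]
rf_2d = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_2d.fit(X_top_2, y)

# Predict the class at every point of a grid covering both feature ranges
x_min, x_max = X_top_2.iloc[:, 0].min() - 1, X_top_2.iloc[:, 0].max() + 1
y_min, y_max = X_top_2.iloc[:, 1].min() - 1, X_top_2.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=top_2_features)
zz = rf_2d.predict(grid).reshape(xx.shape)

# Plot the decision regions with the observed data points on top
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, zz, alpha=0.3, cmap='Set1')
sns.scatterplot(x=X_top_2.iloc[:, 0], y=X_top_2.iloc[:, 1], hue=y, palette='Set1')
plt.xlabel(top_2_features[0])
plt.ylabel(top_2_features[1])
plt.title("Decision Boundaries of Random Forest Classifier (Top 2 Features)")
plt.legend(title='Target', loc='upper right')
plt.show()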
