In [None]:
Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.
Dataset link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link
Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.
ans-To preprocess the dataset, we will follow these steps:

1. Load the dataset
2. Check for missing values and handle them if present
3. Encode categorical variables
4. Scale the numerical features

Here's the code to preprocess the dataset:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the dataset
df = pd.read_csv('heart_disease.csv')

# Check for missing values
print(df.isnull().sum())

# There are no missing values in the dataset

# Encode categorical variables
categorical_vars = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
le = LabelEncoder()
for var in categorical_vars:
    df[var] = le.fit_transform(df[var].astype('str'))

# Scale numerical features
numerical_vars = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
scaler = StandardScaler()
df[numerical_vars] = scaler.fit_transform(df[numerical_vars])

# Print the preprocessed dataset
print(df.head())
In this code, we first load the dataset using the read_csv function from pandas. We then check for missing values using the isnull() function and find that there are no missing values in the dataset.

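Next, we encode the categorical variables using the LabelEncoder class from scikit-learn. We loop over each categorical variable, convert it to string type, and call fit_transform to map its values to integer codes. LabelEncoder is designed for target labels, so OrdinalEncoder or OneHotEncoder is the more idiomatic choice for feature columns, but integer codes are adequate for a tree-based model.

Finally, we scale the numerical features using the StandardScaler class from scikit-learn. We select the numerical features, apply the fit_transform method to them, and update the dataframe with the scaled values. Random forests split on thresholds and are insensitive to feature scale, so this step is optional here; it matters mainly if the same data is later fed to a scale-sensitive model.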

The preprocessed dataset is printed using the head() method to check if everything has been processed correctly.
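As a complement to the steps above, here is a minimal sketch of the same preprocessing expressed as a scikit-learn ColumnTransformer, which bundles imputation, encoding, and scaling into one reusable object. The toy dataframe and its values are hypothetical stand-ins for heart_disease.csv, and the choice of median and most-frequent imputation and of one-hot encoding is an assumption, not something the assignment specifies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for heart_disease.csv (values are made up)
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, np.nan, 57],
    "chol": [233, 250, 204, 236, 354, 192],
    "cp": [3, 2, 1, 1, 0, 0],
    "thal": [1, 2, 2, 2, 2, 1],
})

numeric_cols = ["age", "chol"]
categorical_cols = ["cp", "thal"]

# Numeric columns: fill gaps with the median, then standardise
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

processed = preprocessor.fit_transform(toy)
print(processed.shape)  # 6 rows; 2 scaled columns + 4 cp indicators + 2 thal indicators
```

One reason to prefer this form is that the fitted preprocessor can later be applied to new rows with transform, so the same imputation values and scaling statistics are reused.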







In [None]:
Q2. Split the dataset into a training set (70%) and a test set (30%).
ans-The dataset can be split into a training set and a test set with the train_test_split function from scikit-learn:

from sklearn.model_selection import train_test_split

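# Separate the feature matrix X from the target variable y
X = df.drop('target', axis=1)
y = df['target']

# Hold out 30% of the rows for testing; the remaining 70% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)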

# Here, X_train and y_train will be the training set and X_test and y_test will be the test set
In this code, the train_test_split function from scikit-learn's model_selection module splits the dataset into a training set and a test set. The function takes two arrays: X, the feature matrix, and y, the target variable. Setting test_size to 0.3 reserves 30% of the data for testing and leaves 70% for training. Setting random_state to 42 makes the split reproducible.

After the split, X_train and y_train hold the training set, and X_test and y_test hold the test set. These arrays can be used to train and evaluate machine learning models.
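The split sizes can be checked with a short, self-contained sketch. The random arrays below are hypothetical stand-ins with the same shape as the heart disease data (303 rows, 13 features). The stratify argument is an optional addition, not part of the answer above, that keeps the class balance similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 303 rows, 13 features, binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# 70/30 split; stratify keeps the share of each class roughly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (212, 13) (91, 13)
```

With 303 rows, scikit-learn rounds the test portion up to 91 rows and leaves 212 for training.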







In [None]:
Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.
ans-To train a random forest classifier on the preprocessed dataset, we use the same 70/30 train/test split as in Q2 and the RandomForestClassifier class from scikit-learn, configured with 100 trees and a maximum depth of 10 for each tree. Apart from a fixed random_state for reproducibility, all other hyperparameters keep their default values.

Here's the code:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split the dataset into training (70%) and testing (30%) sets, as in Q2
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a random forest classifier with 100 trees and max depth of 10
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

# Predict on the test set and calculate accuracy
y_pred = rf.predict(X_test)
accuracy = rf.score(X_test, y_test)
print("Accuracy:", accuracy)
In this code, we first split the preprocessed dataset into training and testing sets using the train_test_split function from scikit-learn. We then initialize a RandomForestClassifier object with 100 trees and a maximum depth of 10 for each tree.

Next, we fit the random forest classifier on the training set using the fit method. We then use the predict method to predict the target variable for the test set.

Finally, we calculate the accuracy of the model on the test set using the score method and print it.
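To confirm that the two hyperparameters took effect, the fitted forest can be inspected directly. The sketch below uses synthetic data from make_classification as a hypothetical stand-in for the heart disease dataset; estimators_ and get_depth() are standard scikit-learn attributes of a fitted forest and its trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data with the same shape as the heart disease dataset
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# One fitted decision tree per estimator; no tree may exceed max_depth
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
print("Trees:", n_trees)
print("Deepest tree:", deepest)
```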








In [None]:
Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.
ans-To evaluate the trained random forest classifier on the test set, we use accuracy, precision, recall, and F1 score. scikit-learn's classification_report and confusion_matrix functions compute these metrics.

Here's the code to compute the metrics:

from sklearn.metrics import classification_report, confusion_matrix

# Predict on the test set
y_pred = rf.predict(X_test)

# Compute the performance metrics
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
In this code, we first predict the target variable for the test set using the predict method of the trained random forest classifier.

We then use the confusion_matrix function from scikit-learn to compute the confusion matrix for the predictions. The confusion matrix is a table that shows the number of true positives, false positives, true negatives, and false negatives.

Next, we use the classification_report function to compute precision, recall, F1 score, and support for both classes (0 and 1) in the test set.

The output of this code will show the confusion matrix and classification report.
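To make the link between the confusion matrix and the four metrics concrete, here is a small worked sketch. The label vectors are made up for illustration and are not model output; the manual formulas are checked against scikit-learn's accuracy_score, precision_score, recall_score, and f1_score.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for illustration only (1 = disease, 0 = no disease)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75

# The library functions give the same values
print(
    accuracy_score(y_true, y_hat),
    precision_score(y_true, y_hat),
    recall_score(y_true, y_hat),
    f1_score(y_true, y_hat),
)
```

Here there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so every metric comes out to 0.75.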








In [None]:
Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.
ans-The following code identifies the top 5 most important features for predicting heart disease risk and visualises their importances using a bar chart:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assuming X is the feature matrix and y is the target variable array
# Initialize the Random Forest Classifier with n_estimators=100 and random_state=42
rfc = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier on the entire dataset
rfc.fit(X, y)

# Get the feature importances
feature_importances = rfc.feature_importances_

# Get the indices of the top 5 features, most important first
top_5_indices = np.argsort(feature_importances)[::-1][:5]

# Get the names of the top 5 features
top_5_features = X.columns[top_5_indices]

# Plot the feature importances as a bar chart
plt.figure(figsize=(10,5))
plt.bar(top_5_features, feature_importances[top_5_indices])
plt.title("Top 5 Most Important Features for Predicting Heart Disease Risk")
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()
In this example code, we first initialize a Random Forest Classifier with n_estimators=100 and random_state=42. We then train the classifier on the entire dataset X and y.

We then obtain the feature importances using the feature_importances_ attribute of the trained Random Forest Classifier. We use NumPy's argsort() function to obtain the indices of the top 5 features with the highest feature importances.

We then obtain the names of the top 5 features using the columns attribute of the feature matrix X. Finally, we plot the top 5 most important features as a bar chart using Matplotlib.
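The ranking step can also be written with a pandas Series, which keeps feature names and scores together. This is a minimal sketch on synthetic data; the feature names feature_0 to feature_12 and the make_classification settings are hypothetical stand-ins for the real columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data with named columns
feature_names = [f"feature_{i}" for i in range(13)]
X_arr, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name, then keep the five largest
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)
```

Impurity-based importances from a random forest sum to 1 across all features, so each value can be read as a share of the total. Calling top5.plot.bar() would draw the same kind of chart as the Matplotlib code above.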








In [None]:
Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.
ans-
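One possible approach is a grid search with GridSearchCV. The sketch below is an illustration under stated assumptions, not a finished answer: the data is synthetic (a hypothetical stand-in for X_train and y_train from Q3), and the candidate values in param_grid are illustrative choices that the assignment does not prescribe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical stand-in data; in the notebook, X_train and y_train from Q3 would be used
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Illustrative grid; the candidate values are assumptions, not taken from the assignment
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# cv=5 evaluates every combination with 5-fold cross-validation on the training set
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best hyperparameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
```

The search is run on the training split only, so the test set stays untouched for the final comparison. For larger grids, RandomizedSearchCV with an n_iter budget is a common alternative.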

In [None]:
Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.
ans-
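The report could be produced along the following lines. This is a hedged sketch on synthetic data, so it prints no results for the real dataset. It assumes that the "default model" means the Q3 configuration (100 trees, maximum depth 10), and it uses a deliberately small grid as a placeholder for the full search; the helper name evaluate is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical stand-in data; the notebook would reuse the Q3 train/test split
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)


def evaluate(model):
    """Return the four test-set metrics for a fitted classifier."""
    pred = model.predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
    }


# Baseline: assumed to be the Q3 configuration
baseline = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
baseline.fit(X_tr, y_tr)

# Small placeholder grid (an assumption); GridSearchCV refits the best model on X_tr
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100], "min_samples_leaf": [1, 2]},
    cv=5,
)
search.fit(X_tr, y_tr)

# One row per model, one column per metric
comparison = pd.DataFrame(
    {"baseline": evaluate(baseline), "tuned": evaluate(search.best_estimator_)}
).T

print("Best hyperparameters:", search.best_params_)
print(comparison.round(3))
```

On a dataset of about 300 rows, differences of one or two test-set predictions can move these metrics noticeably, so a small gap between the two rows should not be over-interpreted.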