Build a random forest classifier to predict the risk of heart disease from a dataset of patient
information. The dataset contains 303 instances and 14 columns: 13 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved, plus the target label.

In [96]:
import pandas as pd
import numpy as np

In [97]:
df=pd.read_csv('dataset.csv')
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.

In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [99]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
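
The checks above show no missing values, and tree-based models such as random forests do not need feature scaling, so the only remaining Q1 step is encoding. The following is a minimal sketch, assuming that `cp` and `thal` are nominal codes rather than ordered quantities (an assumption about the dataset, not something the notebook states). The `demo` frame and the `nominal_cols` list are hypothetical names; the values are the first three rows of `df.head()`.

```python
import pandas as pd

# Hypothetical mini-frame built from the first three rows shown by df.head()
demo = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "thal": [1, 2, 2],
})

# Assumed nominal columns: each integer code becomes its own indicator column
nominal_cols = ["cp", "thal"]
encoded = pd.get_dummies(demo, columns=nominal_cols, dtype=int)

print(encoded.columns.tolist())
```

Other integer-coded columns such as `restecg` or `slope` could be added to `nominal_cols` in the same way if they are also treated as unordered categories.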

Q2. Split the dataset into a training set (70%) and a test set (30%).

In [100]:
from sklearn.model_selection import train_test_split

In [101]:
X=df.drop('target',axis=1)
y=df['target']

In [102]:
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42)
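
On a dataset this small, a plain random split can leave the two classes in slightly different proportions in the training and test sets. One optional variation, sketched here on synthetic stand-in data (`X_demo` and `y_demo` are made up for illustration), is to pass `stratify` so both parts keep the same class balance. This is not what the notebook does above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows with a 60/40 class balance
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 60 + [0] * 40)

# stratify=y_demo keeps the class proportions (approximately) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```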

Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.

In [103]:
from sklearn.ensemble import RandomForestClassifier

In [104]:
random_clf=RandomForestClassifier(n_estimators=100,max_depth=10)
random_clf.fit(X_train,y_train)

In [106]:
y_pred=random_clf.predict(X_test)

Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

In [108]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

In [109]:
accuracy_score(y_test,y_pred),precision_score(y_test,y_pred),recall_score(y_test,y_pred),f1_score(y_test,y_pred)

(0.8131868131868132, 0.8367346938775511, 0.82, 0.8282828282828283)
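
To see how the four scores relate, it helps to work them out from confusion-matrix counts. With `test_size=0.30` on 303 rows the test set holds 91 rows, and the counts below are the ones that reproduce the reported values. They are inferred from the scores, not printed by the notebook.

```python
# Counts inferred from the four reported scores (not output by the notebook)
tp, fp, fn, tn = 41, 8, 9, 33

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct predictions / all rows
precision = tp / (tp + fp)                          # predicted positives that are correct
recall = tp / (tp + fn)                             # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```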

Q5. Use the feature importance scores to identify the top 5 most important features.

In [110]:
feature_names=random_clf.feature_names_in_

In [111]:
feature_importance=random_clf.feature_importances_

In [112]:
top_5_indices=np.argsort(feature_importance)[-5:][::-1]
top_5_indices

array([11,  7, 12,  9,  2])

In [113]:
top_5_features=[]
for i in top_5_indices:
    top_5_features.append(feature_names[i])
print(f'Top 5 most important features are: {top_5_features}')

Top 5 most important features are: ['ca', 'thalach', 'thal', 'oldpeak', 'cp']


Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

In [114]:
from sklearn.model_selection import GridSearchCV

In [115]:
parameters={'n_estimators':[100,200,300],
            'max_depth':[4,5,6,7],
            'min_samples_split':[2,3,5,7],
            'min_samples_leaf':[1,2,3]    
}

In [42]:
clf=GridSearchCV(RandomForestClassifier(),param_grid=parameters,cv=5,verbose=3,scoring='accuracy')
clf.fit(X_train,y_train)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
[CV 1/5] END max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=100;, score=0.860 total time=   0.2s
[CV 2/5] END max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=100;, score=0.860 total time=   0.2s
[CV 3/5] END max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=100;, score=0.690 total time=   0.2s
[CV 4/5] END max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=100;, score=0.905 total time=   0.2s
[CV 5/5] END max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=100;, score=0.738 total time=   0.2s
[CV 1/5] END max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=200;, score=0.884 total time=   0.4s
[CV 2/5] END max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=200;, score=0.837 total time=   0.4s
[CV 3/5] END max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=200;, score=0.738 total time=   0.4s
...
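
The "144 candidates, totalling 720 fits" line follows directly from the grid: 3 × 4 × 4 × 3 combinations, each fitted once per fold. A short check using scikit-learn's `ParameterGrid` (the names `n_candidates` and `n_fits` are introduced here for illustration):

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "n_estimators": [100, 200, 300],
    "max_depth": [4, 5, 6, 7],
    "min_samples_split": [2, 3, 5, 7],
    "min_samples_leaf": [1, 2, 3],
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * 5  # cv=5 means one fit per fold per candidate
print(n_candidates, n_fits)
```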

Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.

In [43]:
clf.best_params_

{'max_depth': 5,
 'min_samples_leaf': 3,
 'min_samples_split': 3,
 'n_estimators': 200}

In [116]:
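# Note: these values differ from clf.best_params_ above (max_depth=5, min_samples_split=3, n_estimators=200).
# To refit with the searched optimum, use RandomForestClassifier(**clf.best_params_) or clf.best_estimator_.
random_clf_2=RandomForestClassifier(max_depth=4,min_samples_leaf=3,min_samples_split=7,n_estimators=100)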
random_clf_2.fit(X_train,y_train)

In [117]:
y_pred_2=random_clf_2.predict(X_test)

In [119]:
print(f"Accuracy before tuning the parameters: {accuracy_score(y_test,y_pred)}")
print(f"Accuracy after tuning the parameters: {accuracy_score(y_test,y_pred_2)}")

Accuracy before tuning the parameters: 0.8131868131868132
Accuracy after tuning the parameters: 0.8351648351648352
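
Q7 also asks for the performance that corresponds to the best hyperparameters. The following is a minimal sketch of one way to report it, on synthetic data and with a deliberately small grid so it runs quickly. `small_grid`, `baseline`, and `tuned` are illustrative names, and the printed scores say nothing about the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, split 70/30 like the notebook
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

# Untuned reference model
baseline = RandomForestClassifier(
    n_estimators=20, max_depth=10, random_state=0
).fit(X_tr, y_tr)

# Small illustrative grid, scored with 5-fold cross-validation
small_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 3]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid=small_grid, cv=5, scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr
tuned = search.best_estimator_
print(search.best_params_, search.best_score_)

# Compare both models on the same held-out test set
for name, model in [("baseline", baseline), ("tuned", tuned)]:
    pred = model.predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred))
```

Using `best_estimator_` (or unpacking `best_params_`) keeps the refitted model in sync with the search result, so the reported parameters and the evaluated model cannot drift apart.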
