# RANDOM FOREST ALGORITHM:
* Random Forest is a popular supervised machine learning algorithm.
* It can be used for both Classification and Regression problems in ML.
* It is based on the concept of ensemble learning, which combines multiple models to solve a complex problem and improve overall performance.
* As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
* Instead of relying on a single decision tree, the random forest collects the prediction from each tree and outputs the class that receives the majority of the votes.
* Adding more trees generally makes the predictions more stable and reduces the risk of overfitting compared with a single decision tree, although the gains level off beyond a certain number of trees.
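
The majority-vote idea described above can be sketched in a few lines. The `tree_predictions` array and the `majority_vote` helper are hypothetical, made up for illustration; they are not part of any library. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this is a simplified picture.

```python
import numpy as np

# Hypothetical class predictions from 5 trees (rows) for 4 samples (columns)
tree_predictions = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
])

def majority_vote(predictions):
    """Return the most frequent class label in each column (one column per sample)."""
    n_classes = predictions.max() + 1
    # Count the votes per class for every sample; result has shape (n_classes, n_samples)
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return votes.argmax(axis=0)

final_prediction = majority_vote(tree_predictions)
print(final_prediction)  # [1 0 1 0]
```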



The diagram below illustrates how the Random Forest algorithm works:
<img src="Random Forest.jpg">

# Working of Random Forest Algorithm
### Random Forest works in two phases: the first creates the random forest by combining N decision trees, and the second makes a prediction with each tree created in the first phase.

* Step-1: Select K data points at random from the training set.

* Step-2: Build a decision tree on the selected data points (subset).

* Step-3: Choose the number N of decision trees that you want to build.

* Step-4: Repeat Steps 1 and 2 until N trees have been built.

* Step-5: For a new data point, get the prediction of each decision tree and assign the point to the category that wins the majority of the votes.
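
The steps above can be sketched by hand with individual decision trees. This is a minimal illustration under stated assumptions, not how scikit-learn implements `RandomForestClassifier` internally: the data is synthetic (`make_classification`), the samples are drawn with replacement (bootstrap), and names such as `forest`, `n_trees` and `k` are chosen only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the heart dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25       # N: number of trees to build (odd, so a binary vote cannot tie)
k = len(X_tr)      # K: number of points drawn for each tree

forest = []
for _ in range(n_trees):
    # Step 1: draw K training points at random, with replacement
    idx = rng.integers(0, len(X_tr), size=k)
    # Step 2: fit one tree on that sample; max_features="sqrt" also
    # randomizes which features are considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_tr[idx], y_tr[idx])
    forest.append(tree)

# Step 5: collect every tree's prediction and take the majority vote
all_votes = np.array([tree.predict(X_te) for tree in forest])  # (n_trees, n_test)
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)       # labels are 0/1
forest_accuracy = (forest_pred == y_te).mean()
print("Hand-rolled forest accuracy:", forest_accuracy)
```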

# Advantages of Random Forest:
 
* Random Forest is capable of performing both Classification and Regression tasks.
* It is capable of handling large datasets with high dimensionality.
* It enhances the accuracy of the model and reduces the risk of overfitting.

# Disadvantages of Random Forest Algorithm:
* Although random forest can be used for both classification and regression tasks, it is generally less well suited to regression tasks.
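
One commonly cited reason for the regression limitation is that each tree predicts an average of training targets, so a forest cannot predict values outside the range it saw during training. The sketch below illustrates this with `RandomForestRegressor` on synthetic data; the data and variable names are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, used only for illustration)
X_reg, y_reg = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr_reg, X_te_reg, y_tr_reg, y_te_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=0
)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_tr_reg, y_tr_reg)
y_pred_reg = reg.predict(X_te_reg)

print("R^2 on the test split:", r2_score(y_te_reg, y_pred_reg))
# Predictions are averages of training targets, so they stay inside the training range
print("Training target range:", y_tr_reg.min(), y_tr_reg.max())
print("Prediction range:     ", y_pred_reg.min(), y_pred_reg.max())
```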

# Implementation of Random Forest Algorithm

In [33]:
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

#Loading the dataset
data = pd.read_csv('New heart.csv')
np.shape(data)


(918, 12)

In [34]:
# print first 5 rows of the data
data.head()

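|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |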


In [35]:
# print last 5 rows of the data
data.tail()

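|     | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|-----|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 913 | 45 | M | TA | 110 | 264 | 0 | Normal | 132 | N | 1.2 | Flat | 1 |
| 914 | 68 | M | ASY | 144 | 193 | 1 | Normal | 141 | N | 3.4 | Flat | 1 |
| 915 | 57 | M | ASY | 130 | 131 | 0 | Normal | 115 | Y | 1.2 | Flat | 1 |
| 916 | 57 | F | ATA | 130 | 236 | 0 | LVH | 174 | N | 0.0 | Flat | 1 |
| 917 | 38 | M | NAP | 138 | 175 | 0 | Normal | 173 | N | 0.0 | Up | 0 |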


In [36]:
# getting some info about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [37]:
# checking for missing values
data.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

In [38]:
data[data.duplicated()]

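|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|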


In [39]:
len(data[data.duplicated()])

0

In [40]:
# statistical measure of data
data.describe()

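|       | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|-------|-----|-----------|-------------|-----------|-------|---------|--------------|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.06657 | 0.497414 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |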


In [41]:
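# The isnull() check above found no missing values, so both steps below are no-ops here.
# Drop rows with a missing target first, so the fill below cannot mask them
# (assuming 'HeartDisease' is the target column name)
data = data.dropna(subset=['HeartDisease'])

# Replace any remaining missing feature values with a placeholder string
data = data.fillna('missing_value')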

In [42]:
# Define features (X) and target (y)
X = data.drop('HeartDisease', axis=1)
y = data['HeartDisease']

# Convert non-numeric columns to numeric using pandas get_dummies
X = pd.get_dummies(X, columns=['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'])

# Display the encoded feature matrix
X


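|     | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | Sex_F | Sex_M | ChestPainType_ASY | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_LVH | RestingECG_Normal | RestingECG_ST | ExerciseAngina_N | ExerciseAngina_Y | ST_Slope_Down | ST_Slope_Flat | ST_Slope_Up |
|-----|-----|-----------|-------------|-----------|-------|---------|-------|-------|-------------------|-------------------|-------------------|------------------|----------------|-------------------|---------------|------------------|------------------|---------------|---------------|-------------|
| 0 | 40 | 140 | 289 | 0 | 172 | 0.0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 49 | 160 | 180 | 0 | 156 | 1.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 37 | 130 | 283 | 0 | 98 | 0.0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 48 | 138 | 214 | 0 | 108 | 1.5 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 54 | 150 | 195 | 0 | 122 | 0.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | 110 | 264 | 0 | 132 | 1.2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 914 | 68 | 144 | 193 | 1 | 141 | 3.4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 915 | 57 | 130 | 131 | 0 | 115 | 1.2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 916 | 57 | 130 | 236 | 0 | 174 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |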


In [43]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



### MODEL TRAINING

In [69]:
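import warnings
# Suppress UserWarnings (e.g. the feature-name warning raised when predicting on a plain NumPy array)
warnings.filterwarnings("ignore", category=UserWarning)

# Build the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)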

# Train the model
model.fit(X_train, y_train)



In [70]:
# Make predictions on the test set
y_pred = model.predict(X_test)



In [71]:
# Calculate evaluation metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print("Precision:", precision*100)
print("Recall:", recall*100)
print("Accuracy:", accuracy*100)
print("F1 Score:", f1*100)


Precision: 90.47619047619048
Recall: 88.78504672897196
Accuracy: 88.04347826086956
F1 Score: 89.62264150943396


#### ACCURACY

In [72]:
# model evaluation
# accuracy score on training data (scikit-learn metrics take y_true first, then y_pred)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction)

In [73]:
print('Accuracy on training data : ', training_data_accuracy*100)

Accuracy on training data :  100.0


In [74]:
# accuracy score on test data (y_true first, then y_pred)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction)

In [75]:
print('Test data accuracy : ', test_data_accuracy*100)

Test data accuracy :  88.04347826086956
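
The gap between 100% training accuracy and about 88% test accuracy is typical of forests built from fully grown trees, and a single train/test split gives only one estimate of how well the model generalizes. A hedged sketch of k-fold cross-validation with `cross_val_score` is shown below. It uses synthetic data so that it runs on its own; in this notebook one would presumably pass `X` and `y` instead of the hypothetical `X_cv` and `y_cv`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; replace with X, y from the notebook)
X_cv, y_cv = make_classification(n_samples=400, n_features=12, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: each fold is held out once while the model
# is trained on the remaining four folds
scores = cross_val_score(clf, X_cv, y_cv, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```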


#### CONFUSION MATRIX

In [76]:
from sklearn.metrics import confusion_matrix

In [77]:
z = confusion_matrix(y_test,X_test_prediction)
print(z)

[[67 10]
 [12 95]]
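
The four metrics reported earlier can be recomputed directly from this matrix. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so flattening a binary matrix yields TN, FP, FN, TP in that order. The variable names ending in `_cm` are chosen for this sketch only, so that they do not overwrite the notebook's own variables.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = np.array([[67, 10],
               [12, 95]])

tn, fp, fn, tp = cm.ravel()

precision_cm = tp / (tp + fp)          # 95 / 105
recall_cm = tp / (tp + fn)             # 95 / 107
accuracy_cm = (tp + tn) / cm.sum()     # 162 / 184
f1_cm = 2 * precision_cm * recall_cm / (precision_cm + recall_cm)

print("Precision:", precision_cm * 100)
print("Recall:", recall_cm * 100)
print("Accuracy:", accuracy_cm * 100)
print("F1 Score:", f1_cm * 100)
```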


#### PRECISION

Precision is the ratio of the number of True Positives to the total number of Predicted Positives. It measures, out of all predicted positives, how many are actually positive.

In [78]:
from sklearn.metrics import precision_score

In [79]:
# precision for training data predictions
precision_train = precision_score(y_train, X_train_prediction)
print('Training data Precision =', precision_train*100)

Training data Precision = 100.0


In [80]:
# precision for test data predictions
precision_test = precision_score(y_test, X_test_prediction)
print('Test data Precision =', precision_test*100)

Test data Precision = 90.47619047619048


#### RECALL

Recall is the ratio of the number of True Positives to the total number of Actual Positives. It measures, out of all actual positives, how many are correctly predicted as positive.

In [81]:
from sklearn.metrics import recall_score

In [82]:
# recall for training data predictions
recall_train = recall_score(y_train, X_train_prediction)
print('Training data Recall =', recall_train*100)

Training data Recall = 100.0


In [83]:
# recall for test data predictions
recall_test = recall_score(y_test, X_test_prediction)
print('Test data Recall =', recall_test*100)

Test data Recall = 88.78504672897196


#### F1 SCORE

F1 Score is an important evaluation metric for binary classification that combines Precision & Recall. F1 Score is the harmonic mean of Precision & Recall.
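
Written out, with $P$ for Precision and $R$ for Recall (symbols introduced here for brevity), the harmonic mean can be worked through using the test-set confusion matrix shown earlier (TP = 95, FP = 10, FN = 12):

```latex
\begin{aligned}
P &= \frac{TP}{TP + FP} = \frac{95}{105} \approx 0.9048, \qquad
R = \frac{TP}{TP + FN} = \frac{95}{107} \approx 0.8879, \\[4pt]
F_1 &= \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
     = \frac{190}{212} \approx 0.8962
\end{aligned}
```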

In [84]:
from sklearn.metrics import f1_score

In [85]:
# F1 score for training data predictions
f1_score_train = f1_score(y_train, X_train_prediction)
print('Training data F1 Score =', f1_score_train*100)

Training data F1 Score = 100.0


In [86]:
# F1 Score for test data predictions
f1_score_test = f1_score(y_test, X_test_prediction)
print('Test data F1 Score =', f1_score_test*100)

Test data F1 Score = 89.62264150943396


### BUILDING A PREDICTIVE SYSTEM

In [87]:
# feature values in the same order as the 20 columns of X
input_data = (50,170,245,1,150,1.0,1,0,1,0,0,0,0,1,0,1,0,1,0,0)

# change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

prediction = model.predict(input_data_reshaped)
print(prediction)

if (prediction[0]== 0):
  print('The Person does not have a Heart Disease')
else:
  print('The Person has Heart Disease')

[1]
The Person has Heart Disease
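
Typing twenty one-hot values by hand is error-prone. A possible alternative, sketched below, is to describe the person with the raw columns and let pandas build the encoded row. The `training_columns` list is copied from the encoded feature table shown earlier (in the notebook one would presumably use `X.columns`), and `raw_record` is a hypothetical record chosen to reproduce the tuple above.

```python
import pandas as pd

# Column order of the encoded training features (X.columns in the notebook)
training_columns = [
    'Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak',
    'Sex_F', 'Sex_M',
    'ChestPainType_ASY', 'ChestPainType_ATA', 'ChestPainType_NAP', 'ChestPainType_TA',
    'RestingECG_LVH', 'RestingECG_Normal', 'RestingECG_ST',
    'ExerciseAngina_N', 'ExerciseAngina_Y',
    'ST_Slope_Down', 'ST_Slope_Flat', 'ST_Slope_Up',
]

# One person described with the raw (unencoded) columns
raw_record = pd.DataFrame([{
    'Age': 50, 'RestingBP': 170, 'Cholesterol': 245, 'FastingBS': 1,
    'MaxHR': 150, 'Oldpeak': 1.0, 'Sex': 'F', 'ChestPainType': 'ASY',
    'RestingECG': 'Normal', 'ExerciseAngina': 'N', 'ST_Slope': 'Down',
}])

# One-hot encode, then align to the training columns; categories that do
# not occur in this single record are filled with 0
encoded = pd.get_dummies(
    raw_record,
    columns=['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'],
)
encoded = encoded.reindex(columns=training_columns, fill_value=0).astype(float)

print(encoded.iloc[0].tolist())
# In the notebook, the row could then be passed on as: model.predict(encoded)
```

Because the row keeps the training column names, predicting on it should also avoid the feature-name warning that arises with a plain NumPy array.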
