# **Random Forest Algorithm - Classification and Regression tasks**

## **Written by:** Aarish Asif Khan

## **Date:** 18 February 2024

Random Forest is an `ensemble learning method` used for both `classification` and `regression` tasks. 

It builds many decision trees during training, each fitted on a bootstrap sample of the data, and combines their outputs: the most common class across the trees for classification, or the mean of the trees' predictions for regression.
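The averaging step can be checked directly. The sketch below is a minimal illustration on synthetic data (the data and the names `X_demo`, `y_demo`, `forest` are assumptions, not part of this notebook): it fits a small `RandomForestRegressor` and compares the forest's prediction with the mean of its individual trees' predictions. For classification, scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, which usually, but not always, agrees with the majority vote.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each tree's prediction, then average across trees
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# Check whether the manual average matches the forest's own prediction
print(np.allclose(manual_mean, forest.predict(X_demo)))
```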

# **Classification**

In [67]:
# Import libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import sklearn

# Machine learning libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [68]:
# Load the Dataset
df = sns.load_dataset("tips")

# Print the first 5 rows of the dataset
df.head()

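|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |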


In [69]:
# Label-encode every categorical or object column.
# Note: `le` is refit on each column, so afterwards it only holds
# the mapping of the last column it encoded.
le = LabelEncoder()

for col in df.columns:
    if df[col].dtype == "object" or df[col].dtype == "category":
        df[col] = le.fit_transform(df[col])
        
df.head()

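|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.5 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |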
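Because a single `LabelEncoder` is refit on every column, the integer codes shown above cannot be mapped back to their labels afterwards. One possible alternative, sketched here on a small hand-made frame (the frame `toy` and the dictionary name `encoders` are assumptions for illustration), keeps one fitted encoder per column so the mapping can be inspected and inverted.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same kind of columns as `tips` (values assumed for illustration)
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "sex": ["Female", "Male", "Male"],
    "day": ["Sun", "Sat", "Sun"],
})

encoders = {}  # hypothetical name: one fitted encoder per encoded column
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        encoders[col] = LabelEncoder()
        toy[col] = encoders[col].fit_transform(toy[col])

# Classes are sorted, so the integer codes follow alphabetical order
for col, enc in encoders.items():
    print(col, dict(zip(enc.classes_.tolist(), range(len(enc.classes_)))))

# Recover the original labels from the integer codes
original_sex = encoders["sex"].inverse_transform(toy["sex"])
```

Keeping the encoders around also makes it easier to read the evaluation output later, since class `0` and class `1` can be translated back to their names.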


In [70]:
# Split the data into X and y for classification
X = df.drop("sex", axis=1)
y = df["sex"]

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
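When the classes are imbalanced, a plain random split can leave the train and test sets with different class ratios. As a hedged aside, `train_test_split` accepts a `stratify` argument that preserves the ratio in both splits. The sketch below uses made-up labels (`X_demo` and `y_demo` are assumptions, not the notebook's data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (assumed for illustration): 60 of class 1, 40 of class 0
y_demo = np.array([1] * 60 + [0] * 40)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of class 1 in each split; stratification keeps both at the original ratio
print(y_tr.mean(), y_te.mean())
```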

In [71]:
# Build the classifier
model = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=4,
    max_features='sqrt',
    bootstrap=True,
)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.6938775510204082
Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.37      0.48        19
           1       0.69      0.90      0.78        30

    accuracy                           0.69        49
   macro avg       0.70      0.63      0.63        49
weighted avg       0.70      0.69      0.67        49

Confusion Matrix:
 [[ 7 12]
 [ 3 27]]
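The numbers in the classification report follow directly from the confusion matrix. The short sketch below recomputes accuracy, recall and precision from the matrix printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes. Given the label encoding shown earlier, class `0` is `Female` and class `1` is `Male`.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predicted classes
cm = np.array([[7, 12],
               [3, 27]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column-wise)

print("Accuracy:", accuracy)
print("Recall per class:", recall.round(2))
print("Precision per class:", precision.round(2))
```

Read this way, the model recovers only 7 of the 19 true class `0` samples, which is why recall for that class is much lower than for class `1`.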


# **Regression**



In [72]:
# Split the (already label-encoded) data into X and y for regression
X = df.drop('tip', axis=1)
y = df['tip']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [73]:
# Build the model
model = RandomForestRegressor(n_estimators=500, random_state=42, max_depth=10, min_samples_split=10)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

In [74]:
# Evaluate the model
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 0.7614699965292709
Mean Squared Error: 0.9542648950456666
R2 Score: 0.23657058327413039
Root Mean Squared Error: 0.9768648294649913
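To make these metrics easier to interpret, the sketch below computes them by hand with NumPy and compares the results against scikit-learn. The values in `y_true` and `y_hat` are toy numbers assumed for illustration, not the notebook's predictions. RMSE is the square root of MSE and is expressed in the same units as the target, and R2 compares the model's squared error with that of always predicting the mean, so the R2 of about 0.24 reported above means the model explains roughly 24% of the variance in the test-set tips.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values, assumed for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true - y_hat
mae = np.abs(errors).mean()
mse = (errors ** 2).mean()
rmse = np.sqrt(mse)

# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1 - (errors ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

# Compare the manual values with scikit-learn's implementations
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```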
