# Random Forest Algorithm

Random Forest is a popular supervised machine learning algorithm. It is an ensemble method: it combines the predictions of many models to give a more accurate and stable prediction than any single one of them. Specifically, a Random Forest is a collection of decision trees, each trained on a bootstrap sample of the training data (bagging). The forest aggregates the trees' outputs, by voting for classification and by averaging for regression.
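
To make the bagging idea concrete, here is a minimal sketch that builds a small "forest" by hand from scikit-learn's `DecisionTreeClassifier`. The synthetic dataset, the number of trees, and all variable names are assumptions chosen for illustration; this is not how `RandomForestClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean vote is thresholded at 0.5.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (y_pred == y_test).mean()
print("bagged accuracy:", bagged_accuracy)
```

Each tree sees a slightly different dataset and a different random subset of features at each split, which is where the diversity of the ensemble comes from.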

The Random Forest algorithm has several advantages. First, it can be used for both classification and regression problems. Second, on top of bagging it typically considers only a random subset of the features at each split, which makes the trees more diverse and generally leads to a better model. Third, it is relatively robust to outliers and does not require features to be normalized, because tree splits depend only on the ordering of the values. Fourth, it has built-in feature importance, which helps identify the features that contribute most to the predictions.
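
As a rough illustration of the built-in feature importance, the sketch below fits a forest on synthetic data and ranks the features by the fitted model's `feature_importances_` attribute. The dataset and the feature names `f0` to `f5` are hypothetical; with `shuffle=False`, `make_classification` places the informative features in the first columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption): only the first two columns are informative.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These importances are impurity-based, so they are best read as a quick ranking rather than a precise measurement of each feature's effect.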

In [9]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [10]:
# Load the tips dataset that ships with seaborn
df = sns.load_dataset('tips')
df.head()

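|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |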


In [11]:
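# Encode every categorical or object column as integer codes.
# LabelEncoder assigns codes in sorted label order (e.g. Female=0, Male=1).
# The encoder is refit for each column, so only the last column's mapping
# stays available on `le` afterwards.
le = LabelEncoder()
for col in df.columns:
    if df[col].dtype == 'object' or df[col].dtype == 'category':
        df[col] = le.fit_transform(df[col])
df.head()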

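|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.5 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |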
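
Reusing one `LabelEncoder` for every column overwrites its learned mapping each time, so the integer codes cannot be decoded later. One possible alternative, sketched here on a tiny hypothetical frame (`toy` and `encoders` are illustrative names, not part of the notebook), is to keep one encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hypothetical frame that mimics two of the categorical columns.
toy = pd.DataFrame({"sex": ["Female", "Male", "Male"],
                    "day": ["Sun", "Sat", "Thur"]})

encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # codes follow sorted label order
    encoders[col] = enc                     # keep the fitted encoder per column

# Code -> label mapping for each column (the list position is the code).
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(mapping)

# Recover the original labels from the integer codes.
decoded_day = list(encoders["day"].inverse_transform(toy["day"]))
print(decoded_day)
```

The codes depend on which labels are present: this toy frame has only three distinct days, so its codes for `day` differ from those in the full dataset.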


In [12]:
# Classification: predict the encoded 'sex' column from all other columns
X = df.drop('sex', axis=1)
y = df['sex']
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the model, train it, and predict on the test set
model = RandomForestClassifier(n_estimators=200, random_state=42, criterion='entropy', max_depth=80)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# evaluate the model
print('accuracy_score', accuracy_score(y_test, y_pred))
print('confusion_matrix', confusion_matrix(y_test, y_pred))
print('classification_report', classification_report(y_test, y_pred))

accuracy_score 0.6122448979591837
confusion_matrix [[ 7 12]
 [ 7 23]]
classification_report               precision    recall  f1-score   support

           0       0.50      0.37      0.42        19
           1       0.66      0.77      0.71        30

    accuracy                           0.61        49
   macro avg       0.58      0.57      0.57        49
weighted avg       0.60      0.61      0.60        49
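
The numbers in the report follow directly from the confusion matrix. The short sketch below recomputes them in plain Python, using the matrix copied from the output above (rows are true classes, columns are predicted classes); the helper name `precision_recall_f1` is an illustrative choice.

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes.
cm = [[7, 12],
      [7, 23]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions


def precision_recall_f1(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: predicted as k
    actual_k = sum(cm[k])                    # row sum: truly k (the support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Baseline: always predict the most frequent true class in the test set.
majority_baseline = max(sum(row) for row in cm) / total

print(f"accuracy          {accuracy:.4f}")
for k in (0, 1):
    p, r, f = precision_recall_f1(cm, k)
    print(f"class {k}: precision {p:.2f}, recall {r:.2f}, f1 {f:.2f}")
print(f"majority baseline {majority_baseline:.4f}")
```

On this test split, always predicting class 1 would be right for 30 of the 49 rows, which is the same accuracy the model achieves. That suggests the other columns carry little information about `sex`.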



In [13]:
# Regression: predict 'tip' from all other columns
X = df.drop('tip', axis=1)
y = df['tip']
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the model, train it, and predict on the test set
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate the model
print('mean_squared_error', mean_squared_error(y_test, y_pred))
print('r2_score', r2_score(y_test, y_pred))
print('mean_absolute_error', mean_absolute_error(y_test, y_pred))


mean_squared_error 0.9625607446938791
r2_score 0.2299337514142753
mean_absolute_error 0.7750510204081635
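
To help interpret these metrics, the sketch below derives two further quantities from the reported values alone: the root mean squared error, which is in the same units as `tip`, and the error of a baseline that always predicts the mean tip of the test set. It relies on the identity R² = 1 − MSE / MSE_baseline; the variable names are illustrative.

```python
import math

# Values copied from the output above.
mse = 0.9625607446938791
r2 = 0.2299337514142753
mae = 0.7750510204081635

# RMSE is in the same units as the target (the tip amount).
rmse = math.sqrt(mse)

# R^2 = 1 - MSE / MSE_baseline, where the baseline always predicts the
# mean of the test targets, so MSE_baseline = MSE / (1 - R^2).
baseline_mse = mse / (1 - r2)
baseline_rmse = math.sqrt(baseline_mse)

print(f"RMSE          {rmse:.3f}")
print(f"baseline MSE  {baseline_mse:.3f}")
print(f"baseline RMSE {baseline_rmse:.3f}")
print(f"RMSE reduction vs baseline: {1 - rmse / baseline_rmse:.1%}")
```

An R² of about 0.23 means the forest's squared error is roughly 23% lower than that of the constant mean prediction on this split: an RMSE of roughly 0.98 against roughly 1.12 for the baseline.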
