# Random Forest Model

A random forest is an ensemble learning method that builds multiple decision trees during training. For classification, the forest returns the class chosen by the majority of the trees; for regression, it returns the average of the trees' predictions. Random forests are popular because they are easy to use and work for both classification and regression tasks. In this notebook, we use a random forest regressor to predict the tip amount in the seaborn `tips` dataset.
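
To illustrate how a forest combines its trees in the regression case, here is a minimal sketch on synthetic data. The data from `make_regression` and the names `X_demo`, `y_demo`, and `forest` are assumptions made for illustration and are not part of this notebook. The sketch checks that the prediction of scikit-learn's `RandomForestRegressor` matches the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration (not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with every fitted tree, then average across trees
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Prediction returned by the forest itself
forest_prediction = forest.predict(X_demo)

print(np.allclose(manual_average, forest_prediction))
```

Each row of `per_tree` holds one tree's predictions, so comparing rows shows how much the individual trees disagree before averaging.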

In [271]:
# Import libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [272]:
# Load the data
df = sns.load_dataset('tips')
df.head()

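| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |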


In [273]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [274]:
df.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [275]:
df.isnull().sum().sort_values(ascending=False)

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [276]:
# df.drop(['deck', 'alive'], axis=1, inplace=True)
# df.age.fillna(df.age.mean(), inplace=True)
# df.embarked.fillna(df.embarked.mode()[0], inplace=True)
# df.embark_town.fillna(df.embark_town.mode()[0], inplace=True)
# df.head()

In [277]:
# df.isnull().sum().sort_values(ascending=False)

In [278]:
# df.info()

In [279]:
# Encode each categorical column as integer codes, one column at a time
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for col in df.columns:
    if df[col].dtype == 'object' or df[col].dtype.name == 'category':
        df[col] = le.fit_transform(df[col])
df.head()

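| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.5 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |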
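
The loop above refits the same `LabelEncoder` on every column, so the label-to-code mapping of earlier columns is lost once the loop finishes. A minimal sketch of one way to keep those mappings follows; the frame `demo` and the dictionary `mappings` are hypothetical names, and the data is hand-made for illustration rather than loaded from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small hand-made frame that mimics two of the categorical columns
demo = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "day": ["Sun", "Sat", "Thur", "Fri"],
})

mappings = {}
for col in demo.columns:
    encoder = LabelEncoder()  # a fresh encoder per column
    demo[col] = encoder.fit_transform(demo[col])
    # classes_ is sorted, and each label's position is its integer code
    mappings[col] = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mappings)
print(demo)
```

Because `LabelEncoder` assigns codes in sorted label order, `Sun` maps to 2 here, which agrees with the encoded `day` column shown above. The integer codes imply an ordering that the categories do not really have; one-hot encoding (for example with `pd.get_dummies`) is a common alternative when that matters.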


In [280]:
# Separate the features from the target (tip), then hold out 20% of the rows for testing
X = df.drop('tip', axis=1)
y = df['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
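
The split above does not set `random_state`, so each run produces a different partition and slightly different error values. The sketch below, which uses placeholder arrays (`X_demo`, `y_demo`) with the same number of rows as the dataset and an arbitrarily chosen seed of 42, shows how fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 244 rows, matching the dataset's row count
X_demo = np.arange(244 * 2).reshape(244, 2)
y_demo = np.arange(244)

# Two splits with the same seed (42 is an arbitrary choice)
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Each split is [X_train, X_test, y_train, y_test]; compare them piece by piece
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(same_split)
print(split_a[0].shape, split_a[1].shape)
```

With 244 rows and `test_size=0.2`, scikit-learn rounds the test set up to 49 rows, which leaves 195 rows for training.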

In [281]:
# from imblearn.over_sampling import SMOTE
# smote = SMOTE()
# X_resampled, y_resampled = smote.fit_resample(X, y)

In [282]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    int64  
 3   smoker      244 non-null    int64  
 4   day         244 non-null    int64  
 5   time        244 non-null    int64  
 6   size        244 non-null    int64  
dtypes: float64(2), int64(5)
memory usage: 13.5 KB


In [283]:
# Random forest regressor: 50 trees, maximum depth 5, fixed seed for the forest
model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [284]:
# Evaluate the model on the held-out test set
print('Mean absolute error:', np.mean(np.abs(y_test - y_pred)))
print('Mean squared error:', np.mean((y_test - y_pred)**2))
print('Root mean squared error:', np.sqrt(np.mean((y_test - y_pred)**2)))

Mean absolute error: 0.7005315584594033
Mean squared error: 0.7898683621863786
Root mean squared error: 0.8887453865907707
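
The metrics above are computed by hand with NumPy. As a cross-check, the same quantities are available in `sklearn.metrics`. The sketch below uses made-up values (`y_true_demo`, `y_pred_demo`) in place of `y_test` and `y_pred`, so its numbers are illustrative and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values standing in for y_test and y_pred (illustration only)
y_true_demo = np.array([1.0, 2.0, 3.5, 4.0])
y_pred_demo = np.array([1.5, 2.0, 3.0, 5.0])

# Library versions of the metrics
mae = mean_absolute_error(y_true_demo, y_pred_demo)
mse = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)

# Manual versions, written the same way as in the evaluation cell
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))
manual_mse = np.mean((y_true_demo - y_pred_demo) ** 2)

print(mae, mse, rmse)
print(np.isclose(mae, manual_mae), np.isclose(mse, manual_mse))
```

In the notebook itself, the calls would take `y_test` and `y_pred` as arguments.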
