# Random forest algorithm

Random forest is a popular supervised machine learning algorithm that can be used for both classification and regression problems. It is based on ensemble learning: combining multiple models to solve a complex problem and to improve on the performance of any single model. As the name suggests, **"Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."** In practice, each tree is trained on a bootstrap sample of the rows and considers a random subset of the features at each split. For classification the forest combines the trees by majority vote (scikit-learn averages the trees' predicted class probabilities); for regression it averages their outputs.
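To make the idea concrete, here is a minimal hand-rolled sketch of the bagging-plus-voting scheme described above. It is not how `RandomForestClassifier` is implemented internally; the synthetic data from `make_classification`, the tree count, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    seed = int(rng.integers(0, 1_000_000))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X_demo[rows], y_demo[rows])
    trees.append(tree)

# Collect every tree's predictions: shape (n_trees, n_samples)
all_votes = np.array([t.predict(X_demo) for t in trees])

# Majority vote for binary 0/1 labels: class 1 wins if more than half the trees chose it
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)
demo_accuracy = (forest_pred == y_demo).mean()
print('Hand-rolled forest accuracy on its own training data:', round(float(demo_accuracy), 3))
```

The score printed here is measured on the same rows the trees were trained on, so it says nothing about generalisation; it only shows that the voting mechanics work.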

## Importing and loading data

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the random forest classifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics 
from sklearn.metrics import *
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

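# Silence library warnings to keep the notebook output readable
# (note that this also hides deprecation notices)
import warnings
warnings.filterwarnings('ignore')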

In [3]:
# Reading the data
data = pd.read_csv('datasets/data_cleaned.csv')

# Check the data
data.head()

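| | Survived | Age | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | SibSp_0 | SibSp_1 | ... | Parch_0 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 53.1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 8.05 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |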


## Model building

### Separating independent and dependent variables

In [4]:
# Independent variables
x = data.drop(['Survived'], axis=1)

# Dependent variable
y = data['Survived']

print(x.shape, y.shape)

(891, 24) (891,)


### Creating the training and testing sets

In [5]:
# Divide into train and test sets (test_size defaults to 0.25);
# stratify=y keeps the proportion of survivors the same in both sets
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=101, stratify=y)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(668, 24) (668,)
(223, 24) (223,)
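The effect of `stratify` is easy to check on a small example. The sketch below uses made-up labels (70 zeros and 30 ones, not this notebook's data) and the same `random_state` as above; the names `X_toy`, `y_toy`, and the split variables are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70% class 0, 30% class 1 (not the notebook's data)
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=101, stratify=y_toy)

# With the default test_size of 0.25, 75 rows go to training and 25 to testing
print(X_tr.shape, X_te.shape)

# Share of class 1 in each split; stratification keeps both close to the overall 0.30
print(y_tr.mean().round(3), y_te.mean().round(3))
```

Without `stratify`, the class shares in a small test set can drift noticeably from the full data, which makes scores harder to compare across splits.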


### Build the random forest model

In [6]:
# Creating a random forest instance
clf = RandomForestClassifier(random_state = 96)

# Train the model
clf.fit(train_x, train_y)

In [8]:
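# Calculating scores (mean accuracy); the built-in round() works whether
# score() returns a Python float or a NumPy scalar
print('Training Score:', round(clf.score(train_x, train_y), 3))
print('Testing Score:', round(clf.score(test_x, test_y), 3))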

Training Score: 0.988
Testing Score: 0.753
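A training score far above the testing score, as seen here, usually suggests that the fully grown trees are overfitting. One common way to investigate is to cap tree depth and compare training, testing, and out-of-bag (OOB) accuracy. The sketch below does that on synthetic data, since the results for this notebook's dataset are not known; the `max_depth=4` value, the noise level, and all variable names are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with label noise (an assumption, not the notebook's dataset)
X_syn, y_syn = make_classification(n_samples=600, n_features=20, n_informative=5,
                                   flip_y=0.15, random_state=0)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(X_syn, y_syn, random_state=101,
                                              stratify=y_syn)

results = {}
# None grows each tree fully (the default); 4 is an arbitrary illustrative cap
for depth in (None, 4):
    # oob_score=True scores each row using only trees that did not see it in training
    model = RandomForestClassifier(max_depth=depth, oob_score=True, random_state=96)
    model.fit(Xs_tr, ys_tr)
    results[depth] = {
        'train': round(float(model.score(Xs_tr, ys_tr)), 3),
        'test': round(float(model.score(Xs_te, ys_te)), 3),
        'oob': round(float(model.oob_score_), 3),
    }
    print('max_depth =', depth, results[depth])
```

The OOB score is computed from the training data alone, so it gives a rough estimate of test accuracy without touching the test set. Whether a depth cap actually helps on the Titanic data would need to be checked with cross-validation.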


In [9]:
# Looking at the feature importance
print('Feature Importances:', clf.feature_importances_)

Feature Importances: [0.23041043 0.23650759 0.02766848 0.01696085 0.0500349  0.13608447
 0.1681588  0.01320989 0.01617595 0.00627351 0.00425904 0.00480762
 0.00078874 0.00262281 0.01760913 0.01218149 0.01192783 0.00146993
 0.00190138 0.00290028 0.00060589 0.01284854 0.00908377 0.01550866]


In [10]:
# Feature importance against each variable
pd.Series(clf.feature_importances_, index = train_x.columns)

Age           0.230410
Fare          0.236508
Pclass_1      0.027668
Pclass_2      0.016961
Pclass_3      0.050035
Sex_female    0.136084
Sex_male      0.168159
SibSp_0       0.013210
SibSp_1       0.016176
SibSp_2       0.006274
SibSp_3       0.004259
SibSp_4       0.004808
SibSp_5       0.000789
SibSp_8       0.002623
Parch_0       0.017609
Parch_1       0.012181
Parch_2       0.011928
Parch_3       0.001470
Parch_4       0.001901
Parch_5       0.002900
Parch_6       0.000606
Embarked_C    0.012849
Embarked_Q    0.009084
Embarked_S    0.015509
dtype: float64
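Because the categorical variables were one-hot encoded, each original variable's importance is spread over several columns. A small sketch of one way to read the output: sum the importances by column prefix and sort. The numbers are copied from the output above; the grouping rule (everything before the first underscore) is an assumption that happens to fit these column names.

```python
import pandas as pd

# Importances copied from the output above (as printed, six decimals)
importances = pd.Series({
    'Age': 0.230410, 'Fare': 0.236508,
    'Pclass_1': 0.027668, 'Pclass_2': 0.016961, 'Pclass_3': 0.050035,
    'Sex_female': 0.136084, 'Sex_male': 0.168159,
    'SibSp_0': 0.013210, 'SibSp_1': 0.016176, 'SibSp_2': 0.006274,
    'SibSp_3': 0.004259, 'SibSp_4': 0.004808, 'SibSp_5': 0.000789,
    'SibSp_8': 0.002623,
    'Parch_0': 0.017609, 'Parch_1': 0.012181, 'Parch_2': 0.011928,
    'Parch_3': 0.001470, 'Parch_4': 0.001901, 'Parch_5': 0.002900,
    'Parch_6': 0.000606,
    'Embarked_C': 0.012849, 'Embarked_Q': 0.009084, 'Embarked_S': 0.015509,
})

# Group by the original variable name: 'Pclass_3' -> 'Pclass', 'Age' stays 'Age'
grouped = (
    importances
    .groupby(lambda name: name.split('_')[0])
    .sum()
    .sort_values(ascending=False)
)
print(grouped.round(3))
```

On these values the two `Sex` columns together account for about 0.304, ahead of `Fare` (about 0.237) and `Age` (about 0.230). Keep in mind that impurity-based importances tend to favour continuous or high-cardinality features, so this ranking is a rough guide rather than a definitive one.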