# Random Forest Methodology

##### Author information
- Name: Rubiga Kim
- email address: 
- GitHub: 

#### Part 1. Brief Background of Methodology 

Decision trees were widely utilized for machine learning problems before the Random Forest approach was introduced. Decision trees are straightforward, intuitive models that divide the feature space recursively according to a set of conditions. However, there are several drawbacks to decision trees that led to the creation of Random Forest. Overfitting is a major issue with decision trees. Decision trees overfit when they grow overly complicated and stick too closely to the training data, which leads to poor generalization to new data. They frequently have a large variation and are sensitive to noise or outliers. As a result, they aren't as capable to see the underlying patterns in the data and are more likely to mistake when dealing with novel, untested samples.

The instability of decision trees is another drawback. They are sensitive to data changes since a slight change in the training data might result in a drastically different decision tree structure. When generating forecasts or evaluating the significance of various characteristics, this instability may be a difficulty. Due to their tendency to build excessively complicated trees or to concentrate on some attributes more than others, decision trees frequently exhibit bias. This bias may result in inaccurate predictions or analyses of the data. Given these restrictions, a method that could deal with the bias, instability, and overfitting problems that decision trees had became necessary. As a result, Random Forest was created.

With the help of the ensemble learning technique Random Forest, predictions from many decision trees may be combined to produce predictions that are more reliable and accurate. Both the selection of training samples (bootstrapping) and the choice of characteristics used for each decision tree are subjected to randomization. Random Forest reduces overfitting and stabilizes the predictions by training numerous decision trees on various subsets of the data and characteristics.
By taking into account a wider range of attributes while creating decision trees in Random Forest, the bias problem with decision trees is also addressed. The randomization added to the feature selection process aids in preventing the dominance of particular characteristics and promotes a thorough investigation of the feature space.

##### Applications 
There are many applications and sectors where Random Forest may be employed. The following are two main domains where Random Forest is applicable: classification problems and regression problems. Problems with classification: Random Forest is frequently used to solve classification problems. Both binary and multiple-class classification issues are supported. Examples include the identification of spam emails, sentiment analysis, the diagnosis of illnesses, the prediction of customer turnover, and picture categorization. Random Forest may also be used to address problems with regression in which the objective is to forecast a continuous target variable. Complex interactions between features and targets can be handled by it. Examples include stock market research, demand forecasting, and property price prediction.

##### Methodology
<img src ='https://www.tibco.com/sites/tibco/files/media_entity/2021-05/random-forest-diagram.svg'>
Flowchart-like structure known as a decision tree, where each internal node stands in for a feature or attribute, each branch for a decision rule, and each leaf node for the result or class label. Using the feature values, a decision tree divides the input space into regions. It asks, “What attribute will enable me to divide the current observations into groups that are as distinct from one another as feasible (and whose members are as similar to one another as possible)?”

Then, bagging, also known as bootstrap integration, allows individual decision trees to generate very different results by randomly sampling from a dataset and replacing data, meaning that each tree uses only a subset of the data instead of including all of the available data. These individual trees then make decisions based on the data they have and predict outcomes based only on these data points.This means that each random forest has trees that are trained on different data and use different features to make decisions. After, the class with the most votes become the prediction made by our model from each individual tree in the random forest. 

Depending on whether you are using classification problems or regression problems, the Index or the formula used to determine how nodes on a decision tree branch are ordered. When solving regression problems, the MSE(mean squared error) is utilized. 

$$
  MSE = \frac{1}{N} \sum \limits _{i=1} ^{N} (fi-yi)^2
$$

For the N number of datapoints, fi is the predicted value returned by the model and yi is the actual value for datapoint i. This formula evaluates the distance between each node and the expected actual value. 

However, when using random forests on classification data, the Gini index or the entropy formula is utilized. 

$$ Gini = 1- \sum \limits _{i=1} ^{C} (p_{i} )^2 $$
 
The formula above calculates the Gini of each branch on a node based on the class and probability, indicating which branch is more likely to occur. Pi stands for the class’s relative frequency in the dataset and c stand for the total number of cases. 

$$ Entropy = \sum \limits _{i=1} ^{C} - p_{i} * log_{2}(p_{i}) $$ 

Entropy examines the likelihood of a particular result to determine which branch the node should take. It is more mathematically complex than the Gini index since a logarithmic function is utilized to calculate it.


##### Strength
The strength of this algorithm is as follows. There is a low correlation between models. Uncorrelated models have the ability to provide ensemble forecasts that are more accurate than any of the individual predictions. As long as they don't consistently all mistake in the same direction, the trees shield each other from their individual faults, which accounts for this result. Many trees will be right while some may be wrong, allowing the group of trees to travel in the proper direction. The following conditions must be met for random forest to function effectively. This allows the methodologies to be robust to outliers and noisy datapoints, and overfitting, as they are unlikely to consistently affect the construction of multiple decision trees, and the randomness in the selection of features helps create diversity in the forest. This also has an advantage of improving the accuracy of the model. 

Also, decision trees are sensitive to the data they are trained on, and even little adjustments to the training set can produce dramatically different tree architectures.This problem may be solve with the bagging procedure, where each individual tree produces its output from the different set of datasets.

Finally, random forests offer a measure of feature relevance, enabling researchers to pinpoint the features that will have the most impact on the prediction job. This data can help with feature selection, locating relevant variables, and getting a better understanding of the underlying patterns in the data.They are also capable of capturing complex nonlinear relationships. 



#### Example

This example aims to train a random forest classfier model on the  Titanic dataset to predict whether the passengers on the ship survived or not on the test data.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler


In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [3]:
train["Age"].fillna(train["Age"].median(skipna=True), inplace=True)
test["Age"].fillna(test["Age"].median(skipna=True), inplace=True)
train["Embarked"].fillna(train['Embarked'].value_counts().idxmax(), inplace=True)
test["Embarked"].fillna(test['Embarked'].value_counts().idxmax(), inplace=True)
train.drop('Cabin', axis=1, inplace=True)
test.drop('Cabin', axis=1, inplace=True)
test['Fare'].fillna(test['Fare'].dropna().median(), inplace=True)
train.drop('SibSp', axis=1, inplace=True)
train.drop('Parch', axis=1, inplace=True)
test.drop('SibSp', axis=1, inplace=True)
test.drop('Parch', axis=1, inplace=True)
train = pd.get_dummies(train, columns=["Pclass","Embarked","Sex"], drop_first=True)
test = pd.get_dummies(test, columns=["Pclass","Embarked","Sex"], drop_first=True)
train.drop('PassengerId', axis=1, inplace=True)
test.drop('PassengerId', axis=1, inplace=True)
train.drop('Name', axis=1, inplace=True)
train.drop('Ticket', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)
test.drop('Ticket', axis=1, inplace=True)
train.drop('Embarked_Q', axis=1, inplace=True)
test.drop('Embarked_Q', axis=1, inplace=True)
train.drop('Embarked_S', axis=1, inplace=True)
test.drop('Embarked_S', axis=1, inplace=True)

In [4]:
train.isnull().sum()

Survived    0
Age         0
Fare        0
Pclass_2    0
Pclass_3    0
Sex_male    0
dtype: int64

In [5]:
test.isnull().sum()

Age         0
Fare        0
Pclass_2    0
Pclass_3    0
Sex_male    0
dtype: int64

In [6]:
train_data = train.copy()
train_data.drop('Survived', axis=1, inplace=True)

train_data.describe()
cols = train_data.columns
print(cols)
test

Index(['Age', 'Fare', 'Pclass_2', 'Pclass_3', 'Sex_male'], dtype='object')


Unnamed: 0,Age,Fare,Pclass_2,Pclass_3,Sex_male
0,34.5,7.8292,0,1,1
1,47.0,7.0000,0,1,0
2,62.0,9.6875,1,0,1
3,27.0,8.6625,0,1,1
4,22.0,12.2875,0,1,0
...,...,...,...,...,...
413,27.0,8.0500,0,1,1
414,39.0,108.9000,0,0,0
415,38.5,7.2500,0,1,1
416,27.0,8.0500,0,1,1


In [7]:
scaler = MinMaxScaler() #transform data to 0 or 1 
train_data = scaler.fit_transform(train_data)
test_data = scaler.fit_transform(test)# 0--> did not survive, 1 --> survived

In [8]:
train_data = pd.DataFrame(train_data, columns=[cols])
test_data = pd.DataFrame(test_data, columns=[cols])

In [9]:
from sklearn.ensemble import RandomForestClassifier
#build a random foreset classifier model
y = train["Survived"] # predict the survived variable
X = train_data #uses every variable except the surived variable
X_test = test_data #assigns preprocessed test data to test variable
model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=7)
#create a randomforestclassifier with 200 decision tress, max_depth of 10, and random seed of 7. 
model.fit(X, y) #trains the random forest model using the training data (X as feature and y as label)
predictions = model.predict(X_test)#predicts the survival outcomes

In [10]:
predictions

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [11]:
test_data = pd.read_csv("test.csv")

In [12]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})

In [13]:
output

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0
