<a href="https://colab.research.google.com/github/avinashbisht1410/supervised-learning-w-python/blob/master/Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Case study 2: Petrol consumption using Random Forest

For the random forest regression problem, we will use the same case study we used for the decision tree. In the interest of space, we pick up after the training and testing datasets have been created.

Step 1: Import the libraries, load the dataset, and split it into training and testing sets. We already covered these steps while implementing the decision tree algorithm.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
petrol_data = pd.read_csv('petrol_consumption.csv')
X = petrol_data.drop('Petrol_Consumption', axis=1)
y = petrol_data['Petrol_Consumption']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Step 2: Import the RandomForestRegressor class and instantiate a random forest model.

In [3]:
from sklearn.ensemble import RandomForestRegressor
randomForestModel = RandomForestRegressor(n_estimators=200,
                               bootstrap = True,
                               max_features = 'sqrt')

Step 3: Now fit the model on the training data only. The testing data is held back for evaluation.

In [4]:
randomForestModel.fit(X_train, y_train)

Step 4: We will now predict the values for the testing data and measure the error of the model against the actual values.

In [5]:
rf_predictions = randomForestModel.predict(X_test)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, rf_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, rf_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, rf_predictions)))

Mean Absolute Error: 47.731999999999985
Mean Squared Error: 3449.5912699999994
Root Mean Squared Error: 58.733221178477855


Step 5: Now we will rank the features by importance. We get the list of all the columns present in the training data and pair each one with its numeric feature importance.

In [6]:
feature_list=X_train.columns
importances = list(randomForestModel.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: Population_Driver_licence(%) Importance: 0.45
Variable: Average_income       Importance: 0.27
Variable: Petrol_tax           Importance: 0.15
Variable: Paved_Highways       Importance: 0.14


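Step 6: We will now re-create the model with a subset of three variables. Note that the subset used here includes Paved_Highways (importance 0.14) rather than Petrol_tax (importance 0.15).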

In [7]:
rf_most_important = RandomForestRegressor(n_estimators= 500, random_state=5)
# Keep only the selected columns in both the training and the testing set
important_features = ['Paved_Highways','Average_income','Population_Driver_licence(%)']
train_important = X_train.loc[:, important_features]
test_important = X_test.loc[:, important_features]
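
Rather than typing the column names by hand, the selection could also be driven by the importances themselves. The following is a minimal sketch on synthetic data, not the petrol dataset: the DataFrame `X_demo`, its columns `f0` to `f3`, the target `y_demo`, and the cut-off `top_k` are all hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical); column names are illustrative only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=['f0', 'f1', 'f2', 'f3'])
# The target depends strongly on f0, weakly on f1, and not at all on f2 and f3.
y_demo = 5.0 * X_demo['f0'] + 1.0 * X_demo['f1'] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Pair each column with its impurity-based importance and sort in descending order.
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)

top_k = 2  # hypothetical cut-off; choose it to suit the problem
selected_columns = list(ranked.index[:top_k])
X_demo_selected = X_demo.loc[:, selected_columns]

print(ranked)
print('Selected columns:', selected_columns)
```

Selecting by rank keeps the column list consistent with the importances that were actually computed, which avoids mismatches between the ranking and the columns passed to the reduced model.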

Step 7: Train the random forest algorithm.

In [8]:
rf_most_important.fit(train_important, y_train)

Step 8: Make predictions and determine the error.

In [9]:
predictions = rf_most_important.predict(test_important)
predictions

array([605.83 , 484.104, 623.094, 589.88 , 628.962, 607.238, 604.546,
       572.176, 473.598, 510.536])

Step 9: Print the mean absolute error, mean squared error, and root mean squared error.

In [10]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Mean Absolute Error: 56.80640000000001
Mean Squared Error: 4410.0591032
Root Mean Squared Error: 66.40827586378072


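As we can observe, restricting the model to these three variables did not reduce the error: the mean absolute error rose from about 47.7 to 56.8, and the root mean squared error rose from about 58.7 to 66.4. The two models also differ in their hyperparameters (n_estimators and max_features), so this comparison does not isolate the effect of the variable selection.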
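
The two sets of metrics printed above can be compared directly. The following small sketch only does arithmetic on the reported values; the dictionary names `full_model` and `reduced_model` are introduced here for illustration.

```python
# Metrics copied from the two outputs printed above.
full_model = {'MAE': 47.731999999999985, 'RMSE': 58.733221178477855}
reduced_model = {'MAE': 56.80640000000001, 'RMSE': 66.40827586378072}

# Relative change of each metric when moving from all variables to the subset.
relative_change = {
    name: (reduced_model[name] - full_model[name]) / full_model[name]
    for name in full_model
}

for name, change in relative_change.items():
    # A positive value means the error increased.
    print(f'{name}: {change:+.1%}')
```

A positive relative change means the reduced model has a larger error on this test split. With only ten test rows, such differences should be read with caution.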

Ensemble learning allows us to combine the power of multiple models to make a prediction. The individual models may be weak, but together they act as a strong model, and that is the beauty of ensemble learning. We will now discuss the pros and cons of ensemble learning.

Advantages of ensemble learning:
1. An ensemble model can achieve lower variance and lower bias than a single model, so it generally captures the patterns in the data better.
2. The accuracy of ensemble methods is generally higher than that of individual models.
3. Random forest is used to tackle overfitting, which is generally a concern for decision trees. Boosting is used for bias reduction.
4. Most importantly, ensemble methods are a collection of individual models, so together they can capture more complex patterns in the data.
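
The overfitting point can be illustrated by comparing a single fully grown decision tree with a random forest under cross-validation. This is a minimal sketch on synthetic data generated with scikit-learn's `make_friedman1`, not on the petrol dataset; the names `X_syn`, `y_syn`, and `cv_mae` are hypothetical.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (an assumption made for illustration).
X_syn, y_syn = make_friedman1(n_samples=300, noise=1.0, random_state=0)

models = {
    'single tree': DecisionTreeRegressor(random_state=0),
    'random forest': RandomForestRegressor(n_estimators=200, random_state=0),
}

cv_mae = {}
for name, model in models.items():
    # scikit-learn reports the negated MAE so that larger is always better.
    scores = cross_val_score(model, X_syn, y_syn, cv=5,
                             scoring='neg_mean_absolute_error')
    cv_mae[name] = -scores.mean()
    print(f'{name}: cross-validated MAE = {cv_mae[name]:.3f}')
```

On data like this, averaging many trees typically lowers the cross-validated error relative to a single tree, although the size of the gap depends on the dataset.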


Challenges with ensemble learning:
1. Owing to its complexity, an ensemble model is difficult to interpret. For example, while we can easily visualize a single decision tree, it is difficult to visualize a random forest model.
2. The complexity of ensemble models makes them harder to train, test, deploy, and refresh than simpler models.
3. Ensemble models can take a long time to train.

Ensemble methods can be divided into two broad categories: bagging and boosting.
1. Bagging (bootstrap aggregation) improves overall accuracy by combining several weak models. The following are the major attributes of a bagging model:
   a. Bagging uses sampling with replacement to generate multiple datasets.
   b. It builds multiple predictors independently of each other, so they can be trained in parallel.
   c. The final decision is reached by averaging or voting: for a regression model, the average or median of the individual predictions is taken, while for a classification model a vote is held.
   d. Bagging is an effective way to reduce variance and overfitting.
   e. Random forest is an example of a bagging method (as shown in Figure 2-22).
2. Boosting: Like bagging, boosting is an ensemble method. The following are the main points about boosting algorithms:
   a. In boosting, the learners are built sequentially, each one after the previous one.
   b. Each subsequent learner improves on the previous iteration by focusing on its errors.
   c. During voting, a higher weight is given to learners that have performed better.
   d. Boosting is generally slower to train than bagging because its learners cannot be built in parallel, but it often performs better.
   e. Gradient boosting, extreme gradient boosting, and AdaBoost are a few examples.
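
The contrast between the two families can be sketched by hand with plain decision trees. This is an illustrative sketch on synthetic data, not a production implementation: the bagging half uses bootstrap samples and averaging, and the boosting half uses a simple residual-fitting scheme in the style of least-squares gradient boosting (it does not implement the weighted voting of AdaBoost). All variable names, `n_learners`, and `learning_rate` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption made for illustration).
X_syn, y_syn = make_friedman1(n_samples=400, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25,
                                          random_state=0)
rng = np.random.default_rng(0)
n_learners = 50  # hypothetical ensemble size

# Bagging: every tree gets its own bootstrap sample (drawn with replacement)
# and is fitted independently; the predictions are then averaged.
bagged_trees = []
for _ in range(n_learners):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    bagged_trees.append(DecisionTreeRegressor(random_state=0).fit(X_tr[idx], y_tr[idx]))
bagging_pred = np.mean([tree.predict(X_te) for tree in bagged_trees], axis=0)

# Boosting (residual fitting): shallow trees are fitted one after another,
# each on the errors left by the ensemble built so far.
learning_rate = 0.1  # hypothetical shrinkage factor
base_value = y_tr.mean()
train_pred = np.full(len(y_tr), base_value)
boosting_pred = np.full(len(y_te), base_value)
for _ in range(n_learners):
    residuals = y_tr - train_pred
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_tr, residuals)
    train_pred += learning_rate * weak_tree.predict(X_tr)
    boosting_pred += learning_rate * weak_tree.predict(X_te)

def mean_abs_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

print('Bagging MAE: ', mean_abs_error(y_te, bagging_pred))
print('Boosting MAE:', mean_abs_error(y_te, boosting_pred))
```

The bagging loop could run in any order or in parallel, because no tree depends on another. The boosting loop cannot, because each tree needs the residuals produced by all the trees before it.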



It is time for us to develop a solution using random forest. We will be exploring more on boosting in Chapter 4, where we study supervised classification algorithms.