# Random Forest Regression

## Intuition behind Random Forest Regression

Random Forest Regression is a form of Ensemble Learning (other forms include Gradient Boosting, Stacking, etc.). In Ensemble Learning, you combine multiple algorithms, or the same algorithm applied multiple times, to build a model that is more powerful than any of its individual components.

In Decision Tree Regression, the prediction comes from a single tree. In Random Forest Regression, the prediction comes from a forest of trees. The steps for Random Forest Regression are shown below:

![Random Forest Regression - Intuition](Random-Forest-Regression-Intuition-01.PNG)

This method has two inherent advantages:

1.   Since we take the average across a forest of trees, accuracy improves: the errors of individual trees tend to cancel out, so a few bad trees in the forest do not affect the prediction much.
2.   Ensemble Learning models are highly stable, i.e. a change in the dataset may affect several individual trees, but it has little impact on the forest as a whole.
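
The averaging idea above can be checked directly. The following is a minimal sketch, assuming scikit-learn's `RandomForestRegressor`; the data, the query value 6.5 and the settings `n_estimators=10` and `random_state=0` are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data: levels 1..10 with a made-up non-linear salary curve.
X_demo = np.arange(1, 11).reshape(-1, 1)
y_demo = 40000 * 1.4 ** X_demo.ravel()

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

query = np.array([[6.5]])  # hypothetical level to predict

# Each fitted tree is available through the estimators_ attribute.
tree_predictions = np.array([tree.predict(query)[0] for tree in forest.estimators_])

# The forest's prediction is the mean of the individual tree predictions.
forest_prediction = forest.predict(query)[0]

print("Per-tree predictions:", tree_predictions.round(0))
print("Mean of the trees:   ", tree_predictions.mean())
print("Forest prediction:   ", forest_prediction)
```

The individual trees usually disagree with each other, while the forest reports their mean.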



## Problem Statement

We have a dataset containing salary data of employees of different position levels (1 to 10) in a company. Here,


*   Independent Variable/Feature = Level
*   Dependent/Target Variable = Salary

Salary and Level have a non-linear relationship. We want to build a model that can predict the salary given the level of an employee.

**Important Note:** Tree-based models such as Decision Tree Regression and Random Forest Regression work well with complex datasets that have multiple features, but they are not well suited to 2D datasets with only one feature and one target variable. Here, we use a simple dataset with a single feature so that both X and y can be shown in a 2D plot. Hence, this dataset is not ideal for demonstrating the merits of Random Forest Regression.

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
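dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values  # feature matrix: every column except the first and the last, kept 2D
y = dataset.iloc[:, -1].values    # target vector: the last column (Salary), 1D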
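
The slice `1:-1` returns a 2D array even when it selects a single column, which is the shape scikit-learn expects for a feature matrix. A small sketch of this, using a hypothetical three-row frame (the rows are invented for illustration and are not the contents of `Position_Salaries.csv`):

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the slicing.
demo = pd.DataFrame({
    "Position": ["Junior", "Senior", "Manager"],
    "Level": [1, 2, 3],
    "Salary": [40000, 55000, 80000],
})

X_demo = demo.iloc[:, 1:-1].values  # column slice    -> 2D array
y_demo = demo.iloc[:, -1].values    # single position -> 1D array

print(X_demo.shape)  # (3, 1)
print(y_demo.shape)  # (3,)
```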

## Taking care of missing data

Here we have no missing data in the dataset.

**Important Note:** If missing data accounts for less than 1% of the dataset, we can discard the affected rows. In all other cases, we have to replace the missing values. They can be replaced with the mean, the median, the most frequent value or a constant using `SimpleImputer` from `sklearn.impute`. Other options include `IterativeImputer`, `KNNImputer` and `MissingIndicator`.
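
For a dataset that does have gaps, mean imputation with `SimpleImputer` could look like the sketch below. The array `X_missing` is a hypothetical example and is not taken from the salary dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature column with one missing value.
X_missing = np.array([[1.0], [2.0], [np.nan], [4.0]])

# strategy can also be "median", "most_frequent" or "constant".
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_missing)

# The missing entry is replaced by the mean of the observed values.
print(X_filled.ravel())
```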



## Encoding categorical data

If any of the columns contain categorical data, you should apply encoding to convert them into numerical data. The encoding should be:

*   One Hot Encoding if you know there is no ordered relationship in your categorical variable (e.g. Country, State, etc.)
*   Label Encoding if there is an ordered relationship (e.g. Position Levels in a company, Purchase Decisions, etc.)



In the dataset, you can see that the 'Position' column is effectively encoded in the 'Level' column. Thus, no explicit encoding is required. Also, since the 'Position' column is redundant, it is not included in the Feature Matrix X.
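
For datasets that do need encoding, the two cases above can be sketched as follows. The category values are hypothetical. Note one substitution: scikit-learn's `LabelEncoder` is intended for target labels, so this sketch uses `OrdinalEncoder` to give a feature column an explicit order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# No ordered relationship: one binary column per category.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])  # hypothetical values
one_hot = OneHotEncoder().fit_transform(countries).toarray()
print(one_hot)

# Ordered relationship: map each category to its rank in an explicit order.
positions = np.array([["Junior"], ["Manager"], ["Senior"], ["Junior"]])  # hypothetical values
ordinal = OrdinalEncoder(categories=[["Junior", "Senior", "Manager"]])
levels = ordinal.fit_transform(positions)
print(levels.ravel())
```

Passing `categories` explicitly matters here: without it, the encoder orders the categories alphabetically, which would rank "Manager" below "Senior".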

## Splitting the dataset into the Training set and Test set

Here, we want to predict the salary for a given employee level, which we treat as a continuous real value ranging from 1 to 10. For accurate predictions we need the data for all levels from 1 to 10, so in this scenario we do not split the dataset.
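
For comparison, on a larger dataset the split would typically be done with `train_test_split`. This is only a sketch on hypothetical data; the 80/20 ratio and `random_state=0` are illustrative choices, not settings from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical larger dataset: 100 samples, one feature.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = X_demo.ravel() ** 2

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
```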

## Feature Scaling

Since the predictions of a Decision Tree Regression model or a Random Forest Regression model result from successive splits of the dataset, neither model requires Feature Scaling. Hence, it is not applied.
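
One way to see this is to train the same forest on raw and on standardized features and compare the predictions. The sketch below assumes hypothetical data and query levels. `StandardScaler` is used only to run the comparison and is not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: levels 1..10 with a made-up non-linear salary curve.
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 40000 * 1.4 ** X_demo.ravel()
queries = np.array([[2.2], [6.3], [9.7]])  # hypothetical levels to predict

scaler = StandardScaler().fit(X_demo)

# Same hyperparameters and seed; only the scale of the feature differs.
raw_model = RandomForestRegressor(n_estimators=10, random_state=0)
raw_model.fit(X_demo, y_demo)
scaled_model = RandomForestRegressor(n_estimators=10, random_state=0)
scaled_model.fit(scaler.transform(X_demo), y_demo)

pred_raw = raw_model.predict(queries)
pred_scaled = scaled_model.predict(scaler.transform(queries))

# Splits depend only on the ordering of feature values, which scaling preserves.
print(np.allclose(pred_raw, pred_scaled))
```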

## Training the Random Forest Regression model on the whole dataset
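
The code cell for this step is not present in this copy. A minimal sketch of how the training could look, assuming scikit-learn's `RandomForestRegressor`; `n_estimators=10` and `random_state=0` are illustrative choices, and the data below is a hypothetical stand-in so that the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data; in the notebook, X and y come from Position_Salaries.csv.
X = np.arange(1, 11).reshape(-1, 1)
y = 40000 * 1.4 ** X.ravel()

# n_estimators is the number of trees in the forest;
# random_state fixes the bootstrap sampling so results are reproducible.
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

print(len(regressor.estimators_))  # 10
```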

## Predicting a new result
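
The code cell for this step is not present in this copy. As a sketch, a prediction for a single new level could be obtained as below. The level 6.5 is a hypothetical query, and the data and model settings are stand-ins chosen so that the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data; in the notebook, X and y come from Position_Salaries.csv.
X = np.arange(1, 11).reshape(-1, 1)
y = 40000 * 1.4 ** X.ravel()

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

# predict expects a 2D array of shape (n_samples, n_features), hence [[6.5]].
prediction = regressor.predict([[6.5]])
print(prediction)
```

Because each tree predicts an average of training targets, the result always lies between the smallest and largest salary seen during training.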

## Visualising the Random Forest Regression results (higher resolution)
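
The code cell for this step is not present in this copy. One possible sketch of a higher-resolution plot evaluates the model on a dense grid of levels instead of only at the ten training points. The grid step of 0.01, the colours and the stand-in data are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data; in the notebook, X and y come from Position_Salaries.csv.
X = np.arange(1, 11).reshape(-1, 1)
y = 40000 * 1.4 ** X.ravel()

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

# Dense grid of levels so the steps of the prediction curve become visible.
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)
y_grid = regressor.predict(X_grid)

plt.scatter(X, y, color='red', label='Data')
plt.plot(X_grid, y_grid, color='blue', label='Random Forest prediction')
plt.title('Random Forest Regression')
plt.xlabel('Level')
plt.ylabel('Salary')
plt.legend()
plt.show()
```

The curve is a staircase: each tree predicts a constant value within each interval, and the mean of several step functions is again a step function, only with more steps than a single tree produces.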