# MACHINE LEARNING
## RANDOM FOREST REGRESSOR

Random Forest Regressor (RFR) is an ensemble learning algorithm that combines the predictions of multiple decision trees to achieve better accuracy than a single tree.

Here's how RFR works:

1. **Random Sampling**: RFR starts by drawing a bootstrap sample, that is, rows selected at random with replacement from the training set.

2. **Decision Tree Creation**: RFR then fits a decision tree to this sample. The tree is grown by recursively splitting the data into smaller subsets, choosing at each split the feature and threshold that best separate the target values (in scikit-learn, by minimizing the squared error by default).

3. **Tree Ensemble**: The RFR algorithm repeats the above two steps (random sampling and decision tree creation) to create a large number of decision trees.

4. **Prediction**: Each decision tree predicts a value for the target variable based on the features of the input data. The final prediction of the RFR is the average of the predictions made by all the trees.

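5. **Bagging**: Training each tree on its own bootstrap sample and then averaging the trees is called bagging (bootstrap aggregating), and it reduces variance and overfitting. A random forest can additionally restrict each split to a random subset of features (the `max_features` parameter in scikit-learn), which reduces the correlation between the trees and helps to improve the accuracy of the model.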
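
The steps above can be sketched by hand. This is a minimal illustration, not scikit-learn's actual implementation: the synthetic data and the helper names `fit_forest` and `predict_forest` are assumptions made for the example. Each tree is fit on a bootstrap sample, `max_features` limits the features tried at each split, and the forest prediction is the mean of the tree predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest(X, y, n_trees=25, max_features="sqrt", seed=0):
    """Fit n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Random sampling: draw n_samples row indices with replacement (bootstrap).
        idx = rng.integers(0, n_samples, size=n_samples)
        # Decision tree creation: max_features limits the features tried per split.
        tree = DecisionTreeRegressor(
            max_features=max_features,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)  # Tree ensemble: collect the fitted trees
    return trees


def predict_forest(trees, X):
    """Prediction: average the predictions of all trees."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Synthetic data (an assumption for illustration): y depends on two of four features.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

forest = fit_forest(X_demo, y_demo)
forest_pred = predict_forest(forest, X_demo[:5])
print(forest_pred)
```

In practice `RandomForestRegressor` does all of this internally; the sketch only makes the sampling and averaging visible.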

Overall, random forests are a powerful family of algorithms that handle both regression (RFR) and classification problems. They are particularly effective on large datasets with many features, and they are fairly robust to noise. Support for missing values depends on the implementation and version, so missing data is often dropped or imputed before fitting.

### Implementing in Python

Below, a Random Forest Regressor is used to predict house prices in Melbourne, Australia.

In [3]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [5]:
melbourne_data = pd.read_csv('House Prices/melb_data.csv')
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


In [6]:
melbourne_data.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


In [8]:
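#Splitting the data into X (features) and y (target variable)
#Rows with missing feature values are dropped; y is selected by X's index so both stay aligned
X = melbourne_data[['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']].dropna(axis = 0)
y = melbourne_data.loc[X.index, 'Price']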
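
In the data above, the five selected columns have no missing values (each has a count of 13580 in the `describe()` output), so `dropna` removes nothing here. With a column such as `BuildingArea` it would, and the features and target would then need to keep the same rows. A minimal sketch on a made-up three-row frame (values invented for illustration) shows one way to keep them aligned by reusing the index of the features:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; one feature value is missing.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [202.0, np.nan, 120.0],
    "Price": [1480000.0, 1035000.0, 1600000.0],
})

X_toy = toy[["Rooms", "Landsize"]].dropna(axis=0)  # drops the row containing NaN
y_misaligned = toy["Price"]                        # still has all three rows
y_toy = toy.loc[X_toy.index, "Price"]              # keeps only the rows present in X_toy

print(len(X_toy), len(y_misaligned), len(y_toy))   # 2 3 2
```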

In [17]:
#Splitting the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, train_size = 0.7, random_state = 1)

In [25]:
#Creating an instance of the model 
model = RandomForestRegressor(random_state = 1)

In [26]:
#Fitting the model to the training data
model.fit(X_train, y_train)

In [27]:
#Evaluating the model; score() returns the coefficient of determination (R^2) on the test set
score = model.score(X_test, y_test)
print(score)

0.7710555110047177
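
For a regressor, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on four made-up values, assumed only for illustration; `r_squared` is a hypothetical helper, not a scikit-learn function.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot


# Made-up values for illustration
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]
print(round(r_squared(y_true_demo, y_pred_demo), 4))  # 0.9486
```

A value of 1 means perfect predictions, 0 is what always predicting the mean of the true values would give, and negative values are possible for models that do worse than that.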


In [28]:
#Using the model to predict on unseen data
y_pred = model.predict(X_test)

In [29]:
#Calculating the mean absolute error 
print(mean_absolute_error(y_test, y_pred))

183910.24657987937
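
The mean absolute error is expressed in the units of the target, so the value above is an average miss of roughly 184,000 in price, or about 17% of the mean price of 1,075,684 reported by `describe()`. A minimal hand computation on three made-up prices (assumed for illustration; `mean_abs_error` is a hypothetical helper) shows what the metric measures:

```python
def mean_abs_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up prices for illustration
actual = [1_000_000.0, 850_000.0, 1_200_000.0]
predicted = [900_000.0, 900_000.0, 1_350_000.0]
print(mean_abs_error(actual, predicted))  # 100000.0
```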


## Conclusion

An R² score of 0.771 and a mean absolute error of about 183,910 are a significant improvement over the score and mean absolute error obtained with the [Decision Tree Regressor](https://github.com/gacheruh/ML-DecisionTreeRegressor/blob/main/House_Prices.ipynb). This suggests that, on this dataset, the RandomForestRegressor model is better suited to predicting house prices than the Decision Tree Regressor.
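
One way to make such a comparison directly is to fit both models on the same split and compute the same metrics. The sketch below does this on synthetic data from `make_regression`, which is an assumption made for illustration; the numbers it prints say nothing about the Melbourne data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, assumed for illustration only.
X_cmp, y_cmp = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_cmp, y_cmp, test_size=0.3, random_state=1)

results = {}
for name, estimator in [
    ("decision tree", DecisionTreeRegressor(random_state=1)),
    ("random forest", RandomForestRegressor(random_state=1)),
]:
    estimator.fit(X_tr, y_tr)
    # Same test split and same metrics for both models
    results[name] = {
        "r2": estimator.score(X_te, y_te),
        "mae": mean_absolute_error(y_te, estimator.predict(X_te)),
    }
    print(name, results[name])
```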

The improvement in the evaluation score and mean absolute error can be attributed to the random forest's ability to reduce overfitting and variance by averaging many trees trained on different bootstrap samples. A random forest does not remove features itself, but the feature importance scores it provides can help identify irrelevant or redundant features that could negatively impact the model's performance.
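
In scikit-learn, a fitted forest exposes impurity-based importance scores through the `feature_importances_` attribute. A minimal sketch on synthetic data, assumed for illustration, where only the first of three features drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, assumed for illustration: only feature 0 influences the target.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(300, 3))
y_syn = 10.0 * X_syn[:, 0] + rng.normal(0, 0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_syn, y_syn)

# Impurity-based importances; they are non-negative and sum to 1.
for name, importance in zip(["x0", "x1", "x2"], rf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Features with scores near zero are candidates for removal, although that decision is left to the user rather than made by the model.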

Overall, the RandomForestRegressor model appears to be the better choice of the two for predicting house prices.