### **Objective**

In this notebook, we will apply ensemble techniques to a regression problem on the California housing dataset.


We have already applied different regressors to the California housing dataset. In this notebook, we will make use of:

  * Decision tree regressor 

  * Bagging regressor 

  * Random Forest regressor 

We will observe the performance improvement of a random forest over a single decision tree and over bagging, which also uses decision tree regressors as its base estimators.
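
To make the relationship between the three models concrete, here is a minimal sketch on synthetic data (an assumption; it does not use the California housing data). It checks that both ensembles are built from decision trees and that a forest's prediction is the average of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the California housing data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

bagging = BaggingRegressor(n_estimators=5, random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

# Both ensembles keep their fitted members in `estimators_`.
bagging_uses_trees = all(
    isinstance(tree, DecisionTreeRegressor) for tree in bagging.estimators_)
forest_uses_trees = all(
    isinstance(tree, DecisionTreeRegressor) for tree in forest.estimators_)
print(bagging_uses_trees, forest_uses_trees)

# The forest prediction is the average of the individual tree predictions.
manual_average = np.mean(
    [tree.predict(X) for tree in forest.estimators_], axis=0)
print(np.allclose(manual_average, forest.predict(X)))
```

Note that with scikit-learn's default settings `BaggingRegressor` builds 10 trees while `RandomForestRegressor` builds 100, so part of any gap between the two may simply reflect ensemble size.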


### **Importing basic libraries**

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import ShuffleSplit

In [2]:
np.random.seed(306)

Let's use `ShuffleSplit` as the cross-validation strategy, with 10 splits and 20% of the examples set aside as test examples in each split.

In [7]:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
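
As a quick illustration of what this splitter produces, the sketch below applies the same settings to 100 hypothetical samples (`X_demo` and `cv_demo` are names made up for this example) and inspects the size of each split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 100 hypothetical samples, used only to inspect the split sizes.
X_demo = np.arange(100).reshape(-1, 1)
cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# Each iteration yields a pair of index arrays: (train indices, test indices).
split_sizes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in cv_demo.split(X_demo)]

print(len(split_sizes))  # number of splits
print(split_sizes[0])    # (train size, test size) of the first split
```

Each split is drawn from an independent shuffle, so, unlike k-fold cross-validation, the test sets of different splits may overlap.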

Let's download the data and split it into training, development, and test sets.

In [3]:
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)
labels *= 100  # rescale the target from units of $100,000 to units of $1,000

In [4]:
com_train_features, test_features, com_train_labels, test_labels = train_test_split(
    features, labels, random_state=42)

train_features, dev_features, train_labels, dev_labels = train_test_split(
    com_train_features, com_train_labels, random_state=42)
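
Since no `test_size` is passed, `train_test_split` holds out 25% of the data at each stage. A minimal sketch with 1600 hypothetical samples (the `*_demo` and short variable names are made up for this example) shows the resulting proportions of the two-stage split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1600 hypothetical samples, chosen so that the split sizes are round numbers.
X_demo = np.arange(1600).reshape(-1, 1)
y_demo = np.arange(1600)

# Stage 1: hold out a test set (default test_size is 25%).
X_com, X_test, y_com, y_test = train_test_split(
    X_demo, y_demo, random_state=42)

# Stage 2: split the remainder into training and development sets.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_com, y_com, random_state=42)

sizes = {'train': len(X_train), 'dev': len(X_dev), 'test': len(X_test)}
print(sizes)
```

Under these defaults, roughly 56% of the data ends up in the training set, 19% in the development set, and 25% in the test set.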

### **Training different Regressors**

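Let's define a helper that cross-validates a regressor and reports its mean absolute error, and then use it to train different regressors: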

In [11]:
def train_regressor(estimator, X_train, y_train, cv, name):
    """Cross-validate `estimator` and print its mean absolute error (MAE)."""
    cv_results = cross_validate(estimator,
                                X_train,
                                y_train,
                                cv=cv,
                                scoring='neg_mean_absolute_error',
                                return_train_score=True,
                                return_estimator=True)

    # The scores are negated MAE values, so flip the sign to recover errors.
    cv_train_error = -1 * cv_results['train_score']
    cv_test_error = -1 * cv_results['test_score']

    print(f'On an average, {name} makes an error of ',
            f'{cv_train_error.mean():.3f} (+/-) {cv_train_error.std():.3f} on the training set.')

    print(f'On an average, {name} makes an error of ',
            f'{cv_test_error.mean():.3f} (+/-) {cv_test_error.std():.3f} on the testing set.')
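
The helper above relies on what `cross_validate` returns. The following sketch, run on synthetic data (an assumption; `X_demo`, `y_demo`, and `cv_demo` are made-up names), shows the keys of the returned dictionary and why the scores have to be negated.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the California housing data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)
cv_demo = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

results = cross_validate(DecisionTreeRegressor(random_state=0),
                         X_demo, y_demo, cv=cv_demo,
                         scoring='neg_mean_absolute_error',
                         return_train_score=True)

# One entry per split under each key.
print(sorted(results))

# 'neg_mean_absolute_error' follows the "higher is better" convention,
# so the raw scores are <= 0 and the MAE is their negation.
test_mae = -results['test_score']
print(test_mae)
```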


#### **Decision Tree Regressor**

In [12]:
train_regressor(DecisionTreeRegressor(), com_train_features, com_train_labels, cv, 'decision tree')

On an average, decision tree makes an error of  0.000 (+/-) 0.000 on the training set.
On an average, decision tree makes an error of  47.456 (+/-) 1.125 on the testing set.
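
A training error of zero alongside a much larger test error is what we would expect from an unrestricted tree, which by default keeps splitting until its leaves are pure. The sketch below reproduces this pattern on synthetic data (an assumption; the variable names are made up for this example).

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the California housing data).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Default settings: no depth limit, so the tree can memorize the training set.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
test_mae = mean_absolute_error(y_te, tree.predict(X_te))
print(f'train MAE: {train_mae:.3f}, test MAE: {test_mae:.3f}')
```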


#### **Bagging Regressor**

In [13]:
train_regressor(BaggingRegressor(), com_train_features, com_train_labels, cv, 'bagging regressor')


On an average, bagging regressor makes an error of  14.453 (+/-) 0.167 on the training set.
On an average, bagging regressor makes an error of  35.373 (+/-) 0.943 on the testing set.


#### **Random Forest Regressor**

In [14]:
train_regressor(RandomForestRegressor(), com_train_features, com_train_labels, cv, 'random forest regressor')

On an average, random forest regressor makes an error of  12.654 (+/-) 0.070 on the training set.
On an average, random forest regressor makes an error of  33.208 (+/-) 0.710 on the testing set.


### **Parameter search for random forest regressor**

In [15]:
param_grid = {
    'n_estimators': [1, 2, 5, 10, 20, 50, 100, 200, 500],
    'max_leaf_nodes': [2, 5, 10, 20, 50, 100]
}

In [16]:
search_cv = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2), param_grid,
    scoring='neg_mean_absolute_error', n_iter=10, random_state=0, n_jobs=-1)

search_cv.fit(com_train_features, com_train_labels)
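
To show how the randomized search behaves, here is a small sketch on synthetic data (an assumption; the reduced grid and the names `param_distributions` and `search` are made up for this example). It tries only `n_iter` of the possible combinations. Note that when no `cv` argument is given, as in the cell above, `RandomizedSearchCV` falls back to its default 5-fold cross-validation rather than the `ShuffleSplit` object defined earlier.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the California housing data).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=10.0, random_state=0)

# A reduced, hypothetical grid: 3 x 3 = 9 possible combinations.
param_distributions = {
    'n_estimators': [1, 5, 10],
    'max_leaf_nodes': [2, 10, 50],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0), param_distributions,
    scoring='neg_mean_absolute_error', n_iter=4, cv=3, random_state=0)
search.fit(X_demo, y_demo)

# Only n_iter of the 9 combinations are evaluated.
n_tried = len(search.cv_results_['params'])
print(n_tried)
print(search.best_params_)
```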

In [17]:
columns = [f'param_{name}' for name in param_grid.keys()]
columns += ['mean_test_error', 'std_test_error']

cv_results = pd.DataFrame(search_cv.cv_results_)

In [18]:
cv_results['mean_test_error'] = -cv_results['mean_test_score']
cv_results['std_test_error'] = cv_results['std_test_score']
cv_results[columns].sort_values(by='mean_test_error')

|   | param_n_estimators | param_max_leaf_nodes | mean_test_error | std_test_error |
|---|---|---|---|---|
| 0 | 500 | 100 | 40.594643 | 0.703924 |
| 2 | 10 | 100 | 40.901778 | 0.821947 |
| 7 | 100 | 50 | 43.889529 | 0.767655 |
| 8 | 1 | 100 | 45.292752 | 0.983549 |
| 9 | 10 | 20 | 49.497432 | 0.934705 |
| 6 | 50 | 20 | 49.512017 | 1.109143 |
| 1 | 100 | 20 | 49.562485 | 1.050607 |
| 3 | 500 | 10 | 54.974162 | 1.081898 |
| 4 | 5 | 5 | 61.522427 | 1.371272 |
| 5 | 5 | 2 | 72.953346 | 1.245182 |


In [20]:
error = - search_cv.score(test_features, test_labels)
print(f'On average, our random forest regressor makes an error of {error:.2f}.')

On average, our random forest regressor makes an error of 40.47.
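
The minus sign in front of `search_cv.score` is needed because the search object scores with the metric passed as `scoring`, here the negated mean absolute error. A minimal sketch on synthetic data (an assumption; the small grid and variable names are made up for this example) checks that the negated score matches the MAE of the refitted best model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the California housing data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {'n_estimators': [5, 10], 'max_leaf_nodes': [10, 50]},
    scoring='neg_mean_absolute_error', n_iter=2, cv=3, random_state=0)
search.fit(X_tr, y_tr)

# score() uses the `scoring` metric on the refitted best estimator.
score = search.score(X_te, y_te)
mae = mean_absolute_error(y_te, search.predict(X_te))
print(np.isclose(-score, mae))
```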
