## Ensemble techniques for solving regression problems  

`California housing dataset`  

[video link](https://youtu.be/cjf5b1dx6Tk)  

In this colab, we will make use of the following regressors:  
* Decision tree regressor  
* Bagging regressor  
* Random forest regressor  

In [2]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import ShuffleSplit

from sklearn.tree import DecisionTreeRegressor

In [3]:
np.random.seed(306)

Let's use `ShuffleSplit` for cross-validation, with 10 splits and 20% of the examples set aside as test examples in each split.  

In [4]:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
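To make the behaviour of `ShuffleSplit` concrete, here is a minimal sketch on hypothetical toy data (`X_toy` and `toy_cv` are illustration names, not part of the notebook). Each split draws a fresh random permutation, so test examples may repeat across splits, but train and test never overlap within one split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical toy data: 50 examples, used only to illustrate the splitting.
X_toy = np.arange(50).reshape(-1, 1)

# Same settings as the notebook's cv object.
toy_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

split_sizes = []
for train_idx, test_idx in toy_cv.split(X_toy):
    # Within a single split, train and test indices are disjoint.
    assert set(train_idx).isdisjoint(test_idx)
    split_sizes.append((len(train_idx), len(test_idx)))

print(len(split_sizes))  # number of splits generated
print(split_sizes[0])    # (train size, test size) of the first split
```

With 50 examples and `test_size=0.2`, every split holds out 10 examples and trains on the remaining 40.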

Let's download the data and split it into training and test sets.  

In [6]:
# fetch dataset
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)
# targets are median house values in units of $100,000; rescale to k$
labels *= 100

# train-test split
com_train_features, test_features, com_train_labels, test_labels = train_test_split(
    features, labels, random_state=42)

# split the combined training set further into train and dev sets
train_features, dev_features, train_labels, dev_labels = train_test_split(
    com_train_features, com_train_labels, random_state=42)
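Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the examples. The sketch below, on hypothetical toy data, shows the sizes that two successive default splits produce; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 examples with a single feature.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.arange(1000)

# First split: full data -> combined training set + test set.
X_com, X_test, y_com, y_test = train_test_split(X_toy, y_toy, random_state=42)

# Second split: combined training set -> train set + dev set.
X_tr, X_dev, y_tr, y_dev = train_test_split(X_com, y_com, random_state=42)

print(len(X_com), len(X_test))
print(len(X_tr), len(X_dev))
```

The final proportions are therefore roughly 56% train, 19% dev, and 25% test of the original data.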

### Training different regressors

In [10]:
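def train_regressor(estimator, X_train, y_train, cv, name):
    """Cross-validate `estimator` and print its mean absolute error.

    The "test set" in the printed report refers to the held-out part of
    each cross-validation split, not to the final test set.
    """
    cv_results = cross_validate(estimator,
                                X_train,
                                y_train,
                                cv=cv,
                                scoring='neg_mean_absolute_error',
                                return_train_score=True,
                                return_estimator=True)
    # Scores are negated MAE values, so flip the sign to get errors.
    cv_train_error = -1 * cv_results['train_score']
    cv_test_error = -1 * cv_results['test_score']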

    print(f"On an average, {name} makes an error of "
            f"{cv_train_error.mean():.3f}k +/- {cv_train_error.std():.3f}k on the training set.")
    print(f"On an average, {name} makes an error of "
            f"{cv_test_error.mean():.3f}k +/- {cv_test_error.std():.3f}k on the test set.")

### Decision Tree Regressor  

In [16]:
#@title Decision Tree Regressor
train_regressor(
    DecisionTreeRegressor(), com_train_features, com_train_labels, 
    cv, 'decision tree regressor')

On an average, decision tree regressor makes an error of 0.000k +/- 0.000k on the training set.
On an average, decision tree regressor makes an error of 47.357k +/- 1.074k on the test set.
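A training error of zero next to a much larger held-out error is the usual signature of a fully grown tree memorising its training data. The following sketch reproduces the effect on hypothetical synthetic data; `max_depth=2` is an arbitrary value chosen for contrast, not a recommended setting.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy toy data with continuous (hence distinct) feature values.
rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(200, 3))
y_toy = X_toy[:, 0] + rng.normal(scale=0.5, size=200)

# Default settings: the tree grows until every leaf is pure.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)
# Depth-limited tree: at most four leaves, so it cannot memorise the data.
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)

full_train_mae = mean_absolute_error(y_toy, full_tree.predict(X_toy))
shallow_train_mae = mean_absolute_error(y_toy, shallow_tree.predict(X_toy))

print(f"fully grown tree, training MAE: {full_train_mae:.3f}")
print(f"depth-2 tree, training MAE:     {shallow_train_mae:.3f}")
```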


### Bagging Regressor  

In [15]:
#@title Bagging Regressor  
train_regressor(
    BaggingRegressor(), com_train_features, com_train_labels, 
    cv, 'bagging regressor')


On an average, bagging regressor makes an error of 14.422k +/- 0.204k on the training set.
On an average, bagging regressor makes an error of 35.362k +/- 0.963k on the test set.
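The idea behind bagging can be sketched by hand: fit each tree on a bootstrap sample (drawn with replacement) and average the predictions. This is a simplified illustration on hypothetical synthetic data, not a reproduction of `BaggingRegressor`'s internals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic regression problem.
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0,
                               random_state=0)
rng = np.random.RandomState(0)

n_estimators = 10  # same number of trees as BaggingRegressor's default
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw len(X_toy) indices with replacement.
    idx = rng.randint(0, len(X_toy), size=len(X_toy))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_toy[idx], y_toy[idx]))

# The ensemble prediction is the average of the individual tree predictions.
manual_pred = np.mean([tree.predict(X_toy) for tree in trees], axis=0)
manual_train_mae = mean_absolute_error(y_toy, manual_pred)
print(f"hand-rolled bagging, training MAE: {manual_train_mae:.3f}")
```

Each tree sees only part of the data, so the averaged ensemble no longer fits the training set perfectly, which is consistent with the non-zero training error reported above.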


### Random Forest Regressor  

In [14]:
#@title Random Forest Regressor  
train_regressor(
    RandomForestRegressor(), com_train_features, com_train_labels, 
    cv, 'random forest regressor')

On an average, random forest regressor makes an error of 12.640k +/- 0.070k on the training set.
On an average, random forest regressor makes an error of 33.192k +/- 0.714k on the test set.
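When comparing the bagging and random forest results, keep in mind that both estimators were created with default settings, and those defaults differ. A quick way to check them in the installed scikit-learn version is sketched below; the exact values, especially `max_features`, depend on the version.

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Inspect the constructor defaults of both ensembles.
bagging_defaults = BaggingRegressor().get_params()
forest_defaults = RandomForestRegressor().get_params()

print("bagging n_estimators:", bagging_defaults['n_estimators'])
print("forest n_estimators: ", forest_defaults['n_estimators'])
# Controls how many features are considered at each split.
print("forest max_features: ", forest_defaults['max_features'])
```

In recent releases the forest builds ten times as many trees as the bagging ensemble by default, so part of the gap between the two results may come from the number of trees rather than from the algorithm itself.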


### Parameter search for random forest regressor  

In [19]:
param_distributions = {
    'n_estimators': [1, 2, 5, 10, 20, 50, 100, 200, 500],
    'max_leaf_nodes': [2, 5, 10, 20, 50, 100],
}

search_cv = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,
    scoring='neg_mean_absolute_error', n_iter=10, random_state=0, n_jobs=2,
)
search_cv.fit(com_train_features, com_train_labels)

# keep only the sampled parameters and the error statistics
columns = [f"param_{name}" for name in param_distributions.keys()]
columns += ['mean_test_error', 'std_test_error']
cv_results = pd.DataFrame(search_cv.cv_results_)
# scores are negated MAE values, so flip the sign to get errors
cv_results['mean_test_error'] = -cv_results['mean_test_score']
cv_results['std_test_error'] = cv_results['std_test_score']
cv_results[columns].sort_values(by='mean_test_error')


| | param_n_estimators | param_max_leaf_nodes | mean_test_error | std_test_error |
|---|---|---|---|---|
| 0 | 500 | 100 | 40.627965 | 0.730054 |
| 2 | 10 | 100 | 41.293541 | 0.746108 |
| 7 | 100 | 50 | 44.000227 | 0.776104 |
| 8 | 1 | 100 | 47.35078 | 0.965162 |
| 6 | 50 | 20 | 49.510499 | 1.142994 |
| 1 | 100 | 20 | 49.570059 | 1.091609 |
| 9 | 10 | 20 | 50.269445 | 1.320338 |
| 3 | 500 | 10 | 55.014675 | 1.108801 |
| 4 | 5 | 5 | 61.267048 | 1.08823 |
| 5 | 5 | 2 | 73.169905 | 1.349971 |


In [20]:
# score() uses the refitted best estimator and returns the negated MAE
error = -search_cv.score(test_features, test_labels)
print(f"On an average, our random forest regressor makes an error of {error:.2f} k$")

On an average, our random forest regressor makes an error of 40.31 k$
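The sign flip on `search_cv.score` is needed because the search was configured with `scoring='neg_mean_absolute_error'`. The sketch below, on hypothetical synthetic data with a deliberately tiny search space, checks that the negated score matches `mean_absolute_error` computed from the best estimator's predictions, and shows how to read the selected parameters.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Hypothetical synthetic data, kept small so the search runs quickly.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0,
                               random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

toy_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': [5, 10, 20],
                         'max_leaf_nodes': [5, 10, 20]},
    scoring='neg_mean_absolute_error', n_iter=4, cv=3, random_state=0,
)
toy_search.fit(X_tr, y_tr)

# score() applies the configured scorer to the refitted best estimator.
error_from_score = -toy_search.score(X_te, y_te)
# Same quantity computed explicitly from predictions.
error_from_metric = mean_absolute_error(y_te, toy_search.predict(X_te))

print(toy_search.best_params_)
print(f"{error_from_score:.3f} vs {error_from_metric:.3f}")
```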
