# LAB 07 - Random Forest for Regression

In this lab we extend the previous lab on decision trees and build a regression model using a random forest.

For simplicity, we will be using the same dataset as the previous lab (you can find it in ECLASS).

**IMPORTANT:** If you haven't finished your code from last week's lab on decision trees, you may use the sklearn implementation of a regression tree in this lab. This doesn't mean you should skip the previous lab; the option exists so that you don't fall behind on the content or spend all of today's time on last week's work.

In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

As mentioned before, use the Boston Housing data and prepare your train/val/test split as usual.

In [20]:
# Step 1: Load the dataset
data = pd.read_csv('bostonhousing.csv')

# Step 2: Separate the target variable from the features
X = data.drop(columns='medv')
y = data['medv']

# Step 3: Make an 80/10/10 train/validation/test split
# First, split into 80% train and 20% temp (validation + test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)

# Now split the 20% temp into 50% validation and 50% test (i.e., 10% of the original data each)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Display the shapes of the resulting datasets
print(f"Train set: {X_train.shape}, {y_train.shape}")
print(f"Validation set: {X_val.shape}, {y_val.shape}")
print(f"Test set: {X_test.shape}, {y_test.shape}")

Train set: (404, 13), (404,)
Validation set: (51, 13), (51,)
Test set: (51, 13), (51,)


## Exercise 1 -- Bootstrap

Bootstrapping is the sampling half of [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) (short for bootstrap aggregating): draw several samples of the original data with replacement, train one estimator on each sample, and then aggregate the predictions by averaging them. Bagging is a type of model ensemble.

In [21]:
def bootstrap(X, num_bags=10):
    """
    Given a dataset and a number of bags,
    sample the dataset with replacement.
    
    This function does not return a copy
    of the datapoints, but a list of indices
    with compatible dimensionality
    
    Parameters
    ----------
    X : ndarray
        A dataset
    num_bags : int, default 10
        The number of bags to create
    
    Returns
    -------
    list of ndarray
        The list contains `num_bags` integer one-dimensional ndarrays.
        Each of these contains the indices corresponding to the 
        sampled datapoints in `X`
    
    Notes
    -----
    * The number of datapoints in each bag will
      match the number of datapoints in the given
      dataset.
    * The
    """
    rng = np.random.default_rng(0)  # you can change the seed, or use 0 to replicate my results
    n_samples = X.shape[0]
    
    bags = [rng.integers(low=0, high=n_samples, size=n_samples) for _ in range(num_bags)]
    
    return bags

In [22]:
rng = np.random.default_rng(0)
X_small = rng.random(size=(100,2))
bags = bootstrap(X_small)
bags[0]

array([85, 63, 51, 26, 30,  4,  7,  1, 17, 81, 64, 91, 50, 60, 97, 72, 63,
       54, 55, 93, 27, 81, 67,  0, 39, 85, 55,  3, 76, 72, 84, 17,  8, 86,
        2, 54,  8, 29, 48, 42, 40,  2,  0, 12,  0, 67, 52, 64, 25, 61, 76,
       38, 46, 99, 80, 98, 37, 68, 95, 65, 84, 68, 70, 38, 87, 13, 57, 72,
       84, 52, 37, 31, 42, 48, 71, 88,  7, 93, 53, 35, 67, 57, 25, 32, 71,
       59, 50, 33, 76, 39, 32, 89, 26, 22, 71, 62,  4,  8, 37, 83],
      dtype=int64)
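
Because `bootstrap` returns positions rather than copies of the datapoints, the indices still have to be applied to the data. The following is a minimal sketch, assuming a small made-up DataFrame (the names `X_toy`, `y_toy` and the columns `a` and `b` are hypothetical, not part of the lab data). It shows positional selection with `.iloc` for pandas objects and direct integer-array indexing for NumPy arrays, and it checks what fraction of the original rows lands in one bag. For large datasets that fraction is expected to be close to 1 - 1/e, about 63%.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data standing in for X_train / y_train
X_toy = pd.DataFrame(rng.random(size=(1000, 2)), columns=["a", "b"])
y_toy = pd.Series(rng.random(size=1000))

# One bag of indices, drawn with replacement (same call as in `bootstrap`)
idx = rng.integers(low=0, high=len(X_toy), size=len(X_toy))

# pandas: select rows by position with .iloc
X_bag_df = X_toy.iloc[idx]
y_bag_sr = y_toy.iloc[idx]

# NumPy: integer-array indexing works directly
X_bag_np = X_toy.to_numpy()[idx]

# Fraction of distinct original rows that appear in the bag
unique_fraction = np.unique(idx).size / len(X_toy)
print(X_bag_df.shape, X_bag_np.shape, round(unique_fraction, 3))
```

Note that plain `X_toy[idx]` on a DataFrame would be read as a column selection rather than a row selection, which is why the sketch uses `.iloc`.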

## Exercise 2 -- Aggregation

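The second part of bagging: combine the predictions of the individual estimators into a single prediction, here by averaging them.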

In [23]:
def aggregate_regression(preds):
    """
    Aggregate predictions by several estimators
    
    Parameters
    ----------
    preds : list of ndarray
        Predictions from multiple estimators.
        All ndarrays in this list should have the same
        dimensionality.
        
    Returns
    -------
    ndarray
        The mean of the predictions
    """
    # Convert the list of ndarrays to a single 2D ndarray
    preds_array = np.array(preds)
    
    # Compute the mean along the first axis (axis=0)
    mean_preds = np.mean(preds_array, axis=0)
    
    return mean_preds
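
As a quick check of the averaging step, here is a small sketch with made-up predictions from three estimators for the same four datapoints. The numbers are illustrative only; the computation mirrors what `aggregate_regression` does.

```python
import numpy as np

# Made-up predictions from three estimators for the same four datapoints
preds = [
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([3.0, 4.0, 5.0, 6.0]),
    np.array([5.0, 6.0, 7.0, 8.0]),
]

stacked = np.array(preds)          # shape (3, 4): one row per estimator
mean_preds = stacked.mean(axis=0)  # average over estimators, shape (4,)
print(mean_preds)  # [3. 4. 5. 6.]
```

Averaging along `axis=0` collapses the estimator dimension and keeps one value per datapoint.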

## Exercise 3 -- Random Forest for regression

Using the functions you implemented above, it is now time to put them together: train several decision trees and ensemble them to output a single prediction. A random forest adds one more ingredient: at each split of each decision tree, only a random subset of the features is considered.

For this part, you can use the sklearn implementation of a decision tree for regression (`DecisionTreeRegressor`) as the estimator for each bag. See the example below, and always remember to check the documentation when using an external function.

Some parameters you will have to set are: 
* num_features: number of features per estimator
* min_samples: min number of samples per leaf node
* max_depth: maximum depth of the decision tree (each estimator)
* num_estimators: number of decision trees you will create using each bag and random set of features
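
One possible way to map these parameters onto sklearn's `DecisionTreeRegressor` is sketched below. The values and the synthetic data are arbitrary examples, not tuned settings. `num_estimators` has no counterpart on a single tree, since it is the number of trees you build in your own loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Example values only, not tuned
num_features = 4
min_samples = 2
max_depth = 5

tree = DecisionTreeRegressor(
    max_features=num_features,     # features considered at each split
    min_samples_leaf=min_samples,  # minimum number of samples per leaf node
    max_depth=max_depth,           # maximum depth of the tree
    random_state=0,
)

# Synthetic data, just to show that the configured tree can be fitted
rng = np.random.default_rng(0)
X_demo = rng.random(size=(200, 13))
y_demo = X_demo[:, 0] + 0.1 * rng.random(size=200)

tree.fit(X_demo, y_demo)
print(tree.get_depth(), tree.max_features_)
```

With `max_features` set to an integer, sklearn draws that many candidate features at every split, which is the per-split feature subsampling described above.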

In [None]:
# Example of an sklearn decision tree
max_depth = 5  # arbitrary example value
estimator = DecisionTreeRegressor(max_depth=max_depth)
estimator.fit(X, y)
estimator.predict(X)

In [29]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def bootstrap(X, num_bags=10):
    rng = np.random.default_rng(0)  # you can change the seed, or use 0 to replicate my results
    n_samples = X.shape[0]
    bags = [rng.integers(low=0, high=n_samples, size=n_samples) for _ in range(num_bags)]
    return bags

def aggregate_regression(preds):
    preds_array = np.array(preds)
    mean_preds = np.mean(preds_array, axis=0)
    return mean_preds

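def train_random_forest(X, y, num_features, min_samples, max_depth, num_estimators, num_bags):
    # X and y must be NumPy arrays: the bags are applied with integer-array indexing.
    # One tree is trained per bag, so the forest has `num_bags` trees;
    # `num_estimators` is accepted but not used by this implementation.
    # Generate bootstrap samples
    bags = bootstrap(X, num_bags=num_bags)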
    
    estimators = []

    for bag_indices in bags:
        # Select the bootstrap sample
        X_bag = X[bag_indices]
        y_bag = y[bag_indices]
        
        # Create and train the Decision Tree Regressor
        model = DecisionTreeRegressor(
            max_features=num_features,
            min_samples_leaf=min_samples,
            max_depth=max_depth,
            random_state=0  # fixed seed for reproducibility; note that every tree shares it
        )
        model.fit(X_bag, y_bag)
        estimators.append(model)
    
    return estimators

def predict_random_forest(estimators, X):
    # Collect predictions from all estimators
    all_preds = [estimator.predict(X) for estimator in estimators]
    # Aggregate the predictions
    final_preds = aggregate_regression(all_preds)
    return final_preds

# Example usage:
if __name__ == "__main__":
    # Load dataset
    cal_housing = fetch_california_housing()
    X, y = cal_housing.data, cal_housing.target

    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Parameters for the random forest
    num_features = 5
    min_samples = 2
    max_depth = 10
    num_estimators = 100
    num_bags = 10

    # Train the random forest
    estimators = train_random_forest(X_train, y_train, num_features, min_samples, max_depth, num_estimators, num_bags)

    # Make predictions on the test set
    y_pred = predict_random_forest(estimators, X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    print(f"Mean Squared Error: {mse}")



Mean Squared Error: 0.3092719587422976
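
The MSE is expressed in squared target units, so its square root (the RMSE) is often easier to interpret. The short sketch below converts the value printed above. Per scikit-learn's dataset description, the California housing target is the median house value in units of $100,000, so an RMSE of about 0.556 corresponds to a typical error of roughly $55,600.

```python
import math

mse = 0.3092719587422976  # value printed above
rmse = math.sqrt(mse)     # back to the units of the target
print(f"RMSE: {rmse:.3f}")  # RMSE: 0.556
```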
