# Random Forest Regression from Scratch

Group 18 Members:

- Clara Pichler, 11917694
- Hannah Knapp, 11901857 
- Sibel Toprakkiran, 09426341

### Overview

1. Bootstraping
- `make_bootstraps(df, n_bootstraps=100)`

2. Decision Tree Regression
- `mse(y)`
- `split_dataset(X, y, feature_idx, threshold)`
- `find_best_split(X, y)`
- `build_tree(X, y, max_depth, min_samples_split, depth=0)`
- `predict_tree(tree, X)`

3. Random Forest Regression
- 

4. Random Forest Regression - LLM
- 

5. Evaluation
- Ours
- LLM
- sklearn

In [44]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

We will use the data set `mountains_vs_beaches_preferences.csv` as a data frame for testing our functions.

In [40]:
df_pref = pd.read_csv('data/mountains_vs_beaches_preferences.csv')

df_pref_one_hot = pd.get_dummies(df_pref, columns=['Location', 'Favorite_Season', 'Gender'], drop_first=True)

target_mean = df_pref_one_hot.groupby('Preferred_Activities')['Preference'].mean()
df_pref_one_hot['Preferred_Activities_Encoded'] = df_pref_one_hot['Preferred_Activities'].map(target_mean)

education_mapping = {'high school': 0, 'bachelor': 1, 'master': 2, 'doctorate': 3}
df_pref_one_hot['Education_Level_Encoded'] = df_pref_one_hot['Education_Level'].map(education_mapping)

df_pref_one_hot = df_pref_one_hot.drop(["Education_Level", "Preferred_Activities"], axis=1)

display(df_pref_one_hot.info())
display(df_pref_one_hot.isnull().sum())
display(df_pref_one_hot.head())

display(df_pref_one_hot['Preference'].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52444 entries, 0 to 52443
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Age                           52444 non-null  int64  
 1   Income                        52444 non-null  int64  
 2   Travel_Frequency              52444 non-null  int64  
 3   Vacation_Budget               52444 non-null  int64  
 4   Proximity_to_Mountains        52444 non-null  int64  
 5   Proximity_to_Beaches          52444 non-null  int64  
 6   Pets                          52444 non-null  int64  
 7   Environmental_Concerns        52444 non-null  int64  
 8   Preference                    52444 non-null  int64  
 9   Location_suburban             52444 non-null  bool   
 10  Location_urban                52444 non-null  bool   
 11  Favorite_Season_spring        52444 non-null  bool   
 12  Favorite_Season_summer        52444 non-null  bool   
 13  F

None

Age                             0
Income                          0
Travel_Frequency                0
Vacation_Budget                 0
Proximity_to_Mountains          0
Proximity_to_Beaches            0
Pets                            0
Environmental_Concerns          0
Preference                      0
Location_suburban               0
Location_urban                  0
Favorite_Season_spring          0
Favorite_Season_summer          0
Favorite_Season_winter          0
Gender_male                     0
Gender_non-binary               0
Preferred_Activities_Encoded    0
Education_Level_Encoded         0
dtype: int64

Unnamed: 0,Age,Income,Travel_Frequency,Vacation_Budget,Proximity_to_Mountains,Proximity_to_Beaches,Pets,Environmental_Concerns,Preference,Location_suburban,Location_urban,Favorite_Season_spring,Favorite_Season_summer,Favorite_Season_winter,Gender_male,Gender_non-binary,Preferred_Activities_Encoded,Education_Level_Encoded
0,56,71477,9,2477,175,267,0,1,1,False,True,False,True,False,True,False,0.500114,1
1,69,88740,1,4777,228,190,0,1,0,True,False,False,False,False,True,False,0.0,2
2,46,46562,0,1469,71,280,0,0,1,False,True,False,False,True,False,False,0.500114,2
3,32,99044,6,1482,31,255,1,0,1,False,False,False,True,False,False,True,0.500916,0
4,60,106583,5,516,23,151,1,1,0,True,False,False,False,True,False,False,0.0,0


Preference
0    39296
1    13148
Name: count, dtype: int64

## Bootstrapping

Bootstrapping is a method to create multiple subsets of the original data by sampling with replacement. Each subset is used to train one decision tree in the forest.

- Introduces randomness, ensuring that trees see different views of the data.
- Helps in reducing overfitting by decorrelating the trees.

- __Sample Size__: The size of the bootstrap sample is the same as the original dataset.
- __Replacement__: Sampling with replacement ensures diversity between bootstrapped samples.


In [6]:
def make_bootstraps(df, n_bootstraps=100):
    
    dic_boot = {}
    sample_size = df.shape[0]
    idx = [i for i in range(sample_size)]

    for b in range(n_bootstraps):
        
        sidx   = np.random.choice(idx,replace=True,size=sample_size)
        b_samp = df.iloc[sidx,:]
        
        dic_boot['boot_'+str(b)] = {'boot':b_samp}
    
    return(dic_boot)

How our function `make_bootstraps` works:
1. Each data point has equal probability of being selected 
2. Selecting data points from the original sample for the current bootstrap sample, with replacement! Until we reached the same size as the original data
4. Repeating this process until we have `n_bootstraps` bootstrap samples which we save in a dictonary `dic_boot`

In [41]:
dic_boot_pref = make_bootstraps(df_pref_one_hot)

In [42]:
display(dic_boot_pref['boot_0']['boot'].head(5))
display(dic_boot_pref['boot_0']['boot'].info())

Unnamed: 0,Age,Income,Travel_Frequency,Vacation_Budget,Proximity_to_Mountains,Proximity_to_Beaches,Pets,Environmental_Concerns,Preference,Location_suburban,Location_urban,Favorite_Season_spring,Favorite_Season_summer,Favorite_Season_winter,Gender_male,Gender_non-binary,Preferred_Activities_Encoded,Education_Level_Encoded
25985,48,48889,2,3910,165,264,1,1,1,False,True,True,False,False,False,True,0.500114,1
46153,55,52705,7,3381,191,91,1,1,0,False,False,False,True,False,True,False,0.500114,0
3089,34,112580,0,2852,105,155,0,0,1,False,False,True,False,False,True,False,0.500916,1
29763,55,46489,0,4873,66,178,1,0,1,False,True,False,False,True,False,True,0.500916,0
29631,26,68844,4,4537,53,58,1,0,0,False,True,False,True,False,False,True,0.0,0


<class 'pandas.core.frame.DataFrame'>
Index: 52444 entries, 25985 to 48723
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Age                           52444 non-null  int64  
 1   Income                        52444 non-null  int64  
 2   Travel_Frequency              52444 non-null  int64  
 3   Vacation_Budget               52444 non-null  int64  
 4   Proximity_to_Mountains        52444 non-null  int64  
 5   Proximity_to_Beaches          52444 non-null  int64  
 6   Pets                          52444 non-null  int64  
 7   Environmental_Concerns        52444 non-null  int64  
 8   Preference                    52444 non-null  int64  
 9   Location_suburban             52444 non-null  bool   
 10  Location_urban                52444 non-null  bool   
 11  Favorite_Season_spring        52444 non-null  bool   
 12  Favorite_Season_summer        52444 non-null  bool   
 13  Fa

None

In [32]:
duplicates_boot_0 = dic_boot_pref['boot_0']['boot'].duplicated()
display(duplicates_boot_0.sum())
display(dic_boot_pref['boot_1']['boot'].duplicated().sum())
display(dic_boot_pref['boot_2']['boot'].duplicated().sum())

np.int64(19207)

np.int64(19255)

np.int64(19299)

## Decision Tree Regression

First, we define the functions to split the data and calculate metrics like the mean squared error (MSE).
And for the actual building of the trees we recursively split the data into smaller groups, based on feature thresholds, until a stopping condition is met (e.g., max depth or minimum samples per leaf).


In [33]:
def mse(y):
    return np.mean((y - np.mean(y))**2)

`mse(y)` calculates how spread out the data points are from their mean. A lower MSE indicates that the values in `y` are closer to the mean, which means the split effectively reduces variability. 

In [34]:
def split_dataset(X, y, feature_idx, threshold):
    
    left_mask = X[:, feature_idx] <= threshold
    right_mask = ~left_mask

    return X[left_mask], X[right_mask], y[left_mask], y[right_mask]

`split_dataset` divides X (features) and y (target) into two groups based on whether the value of a feature is less than or equal to a given threshold. This is used to evaluate potential splits during training. For example split on feature `Unemployment` at value 6 creates two groups: rows where `Unemployment <= 6` and rows where `Unemployment > 6`.

In [35]:
def find_best_split(X, y, feature_subset=None):

    best_feature, best_threshold = None, None
    best_mse = float("inf")
    features = feature_subset if feature_subset is not None else range(X.shape[1])
    
    for feature_idx in features:
        thresholds = np.unique(X[:, feature_idx])
        
        for threshold in thresholds:
            _, _, y_left, y_right = split_dataset(X, y, feature_idx, threshold)
            
            if len(y_left) == 0 or len(y_right) == 0:
                continue
                
            mse_split = (len(y_left) * mse(y_left) + len(y_right) * mse(y_right)) / len(y)
            
            if mse_split < best_mse:
                best_mse = mse_split
                best_feature = feature_idx
                best_threshold = threshold
                
    return best_feature, best_threshold

`find_best_split` evaluates all possible splits for every feature and every threshold. It selects the split that minimizes the weighted MSE of the left and right groups:
$$
  \text{Weighted MSE} = \frac{n_{\text{left}}}{n_{\text{total}}} \cdot \text{MSE(left)} + \frac{n_{\text{right}}}{n_{\text{total}}} \cdot \text{MSE(right)}
$$
This ensures splits are chosen based on their ability to reduce error. 
The output is the `best_feature` (column index of the splitting feature) and the `best_threshold` (value of the feature where the split happens).

In [36]:
def build_tree(X, y, max_depth, min_samples_split, depth=0):

    if depth >= max_depth or len(y) < min_samples_split or mse(y) == 0:
        return np.mean(y)  

    n_features = X.shape[1]
    feature_subset = np.random.choice(n_features, size=int(np.sqrt(n_features)), replace=False)
    feature_idx, threshold = find_best_split(X, y, feature_subset)
    
    if feature_idx is None:
        return np.mean(y)  
    
    X_left, X_right, y_left, y_right = split_dataset(X, y, feature_idx, threshold)
    left_subtree = build_tree(X_left, y_left, max_depth, min_samples_split, depth + 1)
    right_subtree = build_tree(X_right, y_right, max_depth, min_samples_split, depth + 1)
    
    return {
        "feature_idx": feature_idx,
        "threshold": threshold,
        "left": left_subtree,
        "right": right_subtree,
    }

- __Stopping Conditions__: Stops if the max depth is reached, if there are fewer samples than min_samples_split, or if the MSE is 0 (all values are the same).
- __Recursive Splitting__: For each split, the function creates a left and right subtree until the stopping conditions are met.
- __Leaf Node__: If the recursion stops, the tree stores the mean value of y for prediction.
- __Feature Selection__: Chooses $\sqrt{n}$ features randomly for each tree, where $n$ is the number of original features

In [37]:
def predict_tree(tree, X):

    if isinstance(tree, dict):

        feature_idx = tree["feature_idx"]
        threshold = tree["threshold"]

        if X[feature_idx] <= threshold:
            return predict_tree(tree["left"], X)
        
        else:
            return predict_tree(tree["right"], X)
        
    else:
        return tree  # Leaf node


Traverses the tree based on the input features until you reach a leaf node.
Returns the mean value of the target variable y at the leaf node.

## Random Forest Regression

In [38]:
class RandomForestRegressor:
    
    def __init__(self, n_trees=10, max_depth=5, min_samples_split=10):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []
        
    def fit(self, X, y):

        bootstraps = make_bootstraps(pd.DataFrame(np.hstack((X, y[:, None]))), n_bootstraps=self.n_trees)

        for b in range(self.n_trees):

            bootstrap = bootstraps[f'boot_{b}']['boot']
            X_boot = bootstrap.iloc[:, :-1].values
            y_boot = bootstrap.iloc[:, -1].values

            tree = build_tree(X_boot, y_boot, self.max_depth, self.min_samples_split)
            self.trees.append(tree)
    
    def predict(self, X):
        predictions = np.array([predict_tree(tree, x) for x in X for tree in self.trees])
        predictions = predictions.reshape(self.n_trees, len(X))

        return np.mean(predictions, axis=0)

## Evaluation

__Mean Squared Error (MSE)__

The average of the squared differences between the predicted values and the actual values. It gives more weight to larger errors.

- A smaller MSE value indicates that the model’s predictions are close to the actual values.
- Since it's based on squared differences, large prediction errors (outliers) have a greater impact.
- MSE is in the square of the unit of your target variable
- Useful when large errors are particularly undesirable and need to be penalized more heavily.



__Mean Absolute Error (MAE)__

The average of the absolute differences between predicted values and actual values. Unlike MSE, it treats all errors equally, regardless of size.

- A smaller MAE value indicates better model performance.
- MAE is in the same unit as the target variable, making it more interpretable compared to MSE.
- Good for understanding the typical size of prediction errors.
- Less sensitive to outliers compared to MSE.

__R-squared__

The proportion of variance in the target variable that the model explains. It ranges from:
- __1__: Perfect fit (model explains all variance in the data).
- __0__: Model does no better than predicting the mean of the target.
- __Negative__: Model performs worse than simply predicting the mean.


- A higher R-squared (close to 1) indicates a good fit.
- A low or negative R-squared suggests that your model is not capturing the relationship between the features and target effectively.
- Helps understand how well the model explains the variability in the target variable.
- Not ideal for measuring absolute error but useful for comparing models.


### Our Regressor

In [47]:
X = df_pref_one_hot.drop("Preference", axis=1).values
y = df_pref_one_hot["Preference"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_trees=10, max_depth=5, min_samples_split=10)
rf.fit(X_train, y_train)

predictions = rf.predict(X_test)

In [50]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [51]:
mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Model Performance:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R-squared: {r2:.2f}")

Model Performance:
Mean Squared Error (MSE): 0.20
Mean Absolute Error (MAE): 0.38
R-squared (R²): -0.05


## Notes

Max Depth: Prevents trees from growing too deep, which could lead to overfitting.

Min Samples Split: Controls the smallest group size allowed for further splitting, preventing unnecessary splits.


**Key Insights from These Metrics**
1. **MSE**:
   - If it's high, your model is making some large errors that need to be addressed.
   - If it's low, your model is capturing most of the relationship.

2. **MAE**:
   - Directly tells you the average prediction error. 
   - Compare it to the scale of your target variable; if MAE is relatively low, the model is performing well.

3. **R²**:
   - A high R² suggests the model explains a significant portion of the target variable's variance.
   - If R² is low (or negative), consider if the features are truly predictive of the target or if the model is too simple/complex.


**Next Steps Based on Metrics**

- **High MSE or MAE**:
  - Investigate outliers, or whether the model needs better hyperparameter tuning.
  - Consider adding more predictive features or improving feature engineering.

- **Low R²**:
  - Evaluate if features are relevant or add more features to capture the variance.
  - Consider if the model is underfitting or overfitting:
    - Underfitting: Increase `max_depth`, add more `n_trees`.
    - Overfitting: Decrease `max_depth` or regularize.
