## Name: {Claire Zhou}
## Section: {02}

# Lab 8: Exploring Random Forests

## Tools

#### Libraries:

- numpy: for processing
- sklearn: for model training  
- pandas: for data processing  
- rfpimp **version 1.3.7**: for feature importance

#### Datasets:

Boston housing 

## Setup

In [2]:
conda install rfpimp

Retrieving notices: ...working... done
Channels:
 - defaults
 - conda-forge
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/homebrew/anaconda3

  added / updated specs:
    - rfpimp


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-23.10.0              |  py311hca03da5_0         1.3 MB
    rfpimp-1.3.2               |             py_0          12 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.3 MB

The following NEW packages will be INSTALLED:

  rfpimp             conda-forge/noarch::rfpimp-1.3.2-py_0 

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge/noarch::certifi-2023.7.22~ --> pkgs/main/osx-arm64::certifi-2023.7.22-py311hca03da5_0 

In [4]:
conda install scikit-learn


In [1]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

from rfpimp import *

from types import SimpleNamespace
def load_boston(return_X_y=False):
    """Replacement function for loading in Boston House Prices"""
    df = pd.read_csv('boston_house_prices.csv')
    X = df.drop(columns=['MEDV'])
    y = df['MEDV'].to_numpy()

    if return_X_y:
        return X, y 
    
    dataset  = SimpleNamespace(data=X, target=y)
    
    return dataset

In [2]:
def boston():
    boston = load_boston()
    df = boston.data
    y = boston.target
    df['y'] = y
    return df

In [3]:
df_boston = boston()
X, y = df_boston.drop('y', axis=1), df_boston['y']
y *= 1000 # y is "Median value of owner-occupied homes in $1000's" so multiply by 1000
X.head()

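|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 |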


---
## Train random forests of different sizes in terms of number of trees

In this section we will see that as we increase the number of trees in the ensemble/forest, the variance of the model's predictions goes down, i.e., the predictions get more stable and, on average, better. The error on the testing set asymptotically approaches some minimum.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Here's how to train a random forest that has a single tree:

In [5]:
rf = RandomForestRegressor(n_estimators=1)
rf.fit(X_train, y_train)

---
**Task**: Compute the MAE for the training and the testing set, printing them out.

In [6]:
mae_train = mean_absolute_error(y_train, rf.predict(X_train))
mae = mean_absolute_error(y_test, rf.predict(X_test))
print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

MAE train 1287.4, test 3290.2


---
**Task**: Create a quick loop and run the training and testing cycle above several times to see the variance of the training and testing set errors. You should notice that the test set scores bounce around a lot.

In [10]:
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_estimators=1)
    rf.fit(X_train, y_train)
    mae_train = mean_absolute_error(y_train, rf.predict(X_train))
    mae = mean_absolute_error(y_test, rf.predict(X_test))
    print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

MAE train 1004.5, test 2912.7
MAE train 1210.9, test 2686.3
MAE train 1009.4, test 2946.1
MAE train 1577.0, test 2811.8
MAE train 990.3, test 2724.5


---
**Task**: Increase the number of trees (`n_estimators`) to 2, retrain, and print out the results.

In [11]:
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_estimators=2)
    rf.fit(X_train, y_train)
    mae_train = mean_absolute_error(y_train, rf.predict(X_train))
    mae = mean_absolute_error(y_test, rf.predict(X_test))
    print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

MAE train 1214.9, test 2309.3
MAE train 1307.1, test 3708.3
MAE train 1300.4, test 3214.7
MAE train 1108.8, test 2988.7
MAE train 1171.8, test 3480.4


You should notice that the scores don't bounce around as much as they did when you were only training one tree.

---
**Q.**  Why does the MAE score go down?

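### Each tree is trained on a different bootstrap sample, so its errors are partly independent of the other trees' errors. Averaging the predictions of several trees cancels some of that noise, which lowers the variance of the forest's prediction and therefore the test MAE. (A random forest does not learn from the mistakes of earlier trees; that is what boosting does.)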
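
A rough way to see why averaging more trees helps is to simulate it. The sketch below is a toy model, not a real forest: it assumes each "tree" returns the true value plus independent Gaussian noise (`TRUE_VALUE` and `TREE_NOISE` are made-up numbers), and measures how much the averaged prediction bounces around as the number of trees grows.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 20_000   # hypothetical "true" house price
TREE_NOISE = 3_000    # hypothetical spread of a single tree's prediction

def forest_prediction(n_trees):
    """Average n_trees independent noisy guesses of TRUE_VALUE."""
    guesses = [random.gauss(TRUE_VALUE, TREE_NOISE) for _ in range(n_trees)]
    return statistics.fmean(guesses)

def spread_of_forest(n_trees, n_repeats=2000):
    """Standard deviation of the averaged prediction over many repeats."""
    predictions = [forest_prediction(n_trees) for _ in range(n_repeats)]
    return statistics.pstdev(predictions)

spreads = {n: spread_of_forest(n) for n in (1, 2, 10, 200)}
for n, s in spreads.items():
    print(f"{n:3d} trees: spread of averaged prediction ~ {s:7.1f}")
```

In this idealized setting the spread shrinks roughly like one over the square root of the number of trees. Real trees are trained on overlapping data and are correlated, so the improvement in an actual forest is smaller and levels off.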

---
**Task**: Increase the number of trees (`n_estimators`) to 10, retrain, and print out the results.

In [12]:
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_estimators=10)
    rf.fit(X_train, y_train)
    mae_train = mean_absolute_error(y_train, rf.predict(X_train))
    mae = mean_absolute_error(y_test, rf.predict(X_test))
    print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

MAE train 867.6, test 2432.6
MAE train 951.9, test 2100.8
MAE train 934.1, test 2264.0
MAE train 1022.3, test 2081.4
MAE train 932.7, test 2299.7


---
**Q.**  What do you notice about the MAE scores?

### The MAE scores decrease as the number of trees increases.

---
**Q.**  After running several times, what else do you notice?

### The MAE scores vary less from run to run as the number of trees increases.

---
**Task**: Increase the number of trees (`n_estimators`) to 200, retrain, and print out the results.

In [13]:
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_estimators=200)
    rf.fit(X_train, y_train)
    mae_train = mean_absolute_error(y_train, rf.predict(X_train))
    mae = mean_absolute_error(y_test, rf.predict(X_test))
    print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

MAE train 868.5, test 1863.3
MAE train 816.9, test 2742.8
MAE train 848.0, test 1877.0
MAE train 833.7, test 2318.2
MAE train 816.8, test 2231.0


---
**Q.**  What do you notice about the MAE scores from a single run?

### The scores are much more stable.

<details>
<summary>Solution</summary>
Both training and testing error have dropped, but not as significantly as before, even with 200 trees.
</details>

---
**Task**: Notice that it took a little bit longer to train.  Do the exact same thing again but this time use `n_jobs=-1` as an argument to the `RandomForestRegressor` constructor.

This tells the library to use all processing cores available on the computer processor. As long as the data is not too huge (because it must pass it around), it often goes much faster using this argument. It should take less than two seconds.

In [15]:
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_jobs=-1, n_estimators=200)
    rf.fit(X_train, y_train)
    mae_train = mean_absolute_error(y_train, rf.predict(X_train))
    mae = mean_absolute_error(y_test, rf.predict(X_test))
    print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

MAE train 792.1, test 2548.3
MAE train 833.4, test 2005.3
MAE train 835.1, test 1932.3
MAE train 829.3, test 2347.1
MAE train 782.9, test 2265.2
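
To check the effect of `n_jobs=-1` on a particular machine, a timing sketch along these lines may help. It uses synthetic data from `make_regression` as a stand-in (an assumption, so that it runs without the Boston CSV), and the names `X_demo`, `y_demo`, and `time_fit` are illustrative only. The actual speedup depends on the number of cores and the data size.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; not the Boston data set)
X_demo, y_demo = make_regression(n_samples=1000, n_features=13,
                                 noise=10.0, random_state=0)

def time_fit(n_jobs):
    """Fit a 200-tree forest and return (model, elapsed seconds)."""
    rf = RandomForestRegressor(n_estimators=200, n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    rf.fit(X_demo, y_demo)
    return rf, time.perf_counter() - start

rf_serial, t_serial = time_fit(n_jobs=1)      # one core
rf_parallel, t_parallel = time_fit(n_jobs=-1)  # all available cores
print(f"n_jobs=1: {t_serial:.2f}s, n_jobs=-1: {t_parallel:.2f}s")
```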


---
**Q.**  What do you notice about the MAE scores from SEVERAL runs?

### Variance is even lower.

<details>
<summary>Solution</summary>
The error variance across runs is even lower (tighter).
</details>
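
The trend from the tasks above can also be seen in a single loop. This is a minimal sketch on synthetic data (`make_regression` is used as a stand-in, and `X_demo`, `y_demo`, and `test_mae` are illustrative names); the split and the forests are seeded so only the number of trees changes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Boston data set)
X_demo, y_demo = make_regression(n_samples=500, n_features=13,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=0)

test_mae = {}
for n_trees in (1, 2, 10, 200):
    rf = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1, random_state=0)
    rf.fit(X_tr, y_tr)
    test_mae[n_trees] = mean_absolute_error(y_te, rf.predict(X_te))
    print(f"{n_trees:3d} trees: test MAE {test_mae[n_trees]:.1f}")
```

Removing the `random_state` arguments and rerunning shows the run-to-run bounce discussed above.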

---
## Examining model size and complexity

The structure of a tree is affected by a number of hyperparameters, not just the data. The goal in this section is to see the effect of altering the number of observations per leaf and the maximum number of candidate features per split. Let's start out with a handy function that uses some support code from rfpimp to examine tree size and depth:

In [16]:
def showsize(ntrees, max_features=1.0, min_samples_leaf=1):
    rf = RandomForestRegressor(n_estimators=ntrees,
                               max_features=max_features,
                               min_samples_leaf=min_samples_leaf,
                               n_jobs=-1)
    rf.fit(X_train, y_train)
    n = rfnnodes(rf)                # from rfpimp
    h = np.median(rfmaxdepths(rf))  # rfmaxdepths from rfpimp
    mae_train = mean_absolute_error(y_train, rf.predict(X_train))
    mae = mean_absolute_error(y_test, rf.predict(X_test))
    print(f"MAE train {mae_train:6.1f}, test {mae:6.1f} using {n:9,d} tree nodes with {h:2.0f} median tree height")

### Effect of number of trees

For a single tree, we see about 480 nodes and a tree height of around 19:

In [17]:
showsize(ntrees=1)

MAE train 1033.4, test 2862.7 using       497 tree nodes with 20 median tree height


---
**Task**: Look at the metrics for 2 trees and then 100 trees.

In [18]:
showsize(ntrees=2)

MAE train 1203.6, test 2914.7 using       942 tree nodes with 18 median tree height


In [19]:
showsize(ntrees=100)

MAE train  822.1, test 2367.1 using    47,556 tree nodes with 19 median tree height


---
**Q.** Why does the median height of a tree stay the same when we increase the number of trees?

### The way the tree is constructed is not impacted by number of trees.

<details>
<summary>Solution</summary>
While the number of nodes increases with the number of trees, the height of any individual tree will stay the same because we have not fundamentally changed how it is constructing a single tree.
</details>
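
The same measurements can be taken with scikit-learn alone. The sketch below assumes synthetic stand-in data and sums `tree_.node_count` and takes the median of `get_depth()` over `rf.estimators_`; this is meant to approximate what `rfnnodes` and `rfmaxdepths` report, not to reproduce rfpimp's code. `forest_size` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; not the Boston data set)
X_demo, y_demo = make_regression(n_samples=400, n_features=13,
                                 noise=10.0, random_state=0)

def forest_size(n_trees):
    """Return (total node count, median tree depth) for a fitted forest."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(X_demo, y_demo)
    total_nodes = sum(t.tree_.node_count for t in rf.estimators_)
    median_depth = float(np.median([t.get_depth() for t in rf.estimators_]))
    return total_nodes, median_depth

sizes = {n: forest_size(n) for n in (1, 2, 50)}
for n, (nodes, depth) in sizes.items():
    print(f"{n:2d} trees: {nodes:7,d} nodes, median depth {depth:.0f}")
```

The total node count grows with the number of trees, while the median depth should stay in the same range because each tree is still built the same way.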

### Effect of increasing min samples / leaf

**Task**: Loop around a call to `showsize()` with 10 trees and min_samples_leaf=1..10 

In [20]:
for i in range(1,11):
    showsize(ntrees=10, min_samples_leaf=i)

MAE train  913.5, test 2381.3 using     4,734 tree nodes with 18 median tree height
MAE train 1036.4, test 2587.4 using     2,210 tree nodes with 16 median tree height
MAE train 1338.1, test 2434.9 using     1,394 tree nodes with 14 median tree height
MAE train 1522.2, test 2395.1 using     1,044 tree nodes with 11 median tree height
MAE train 1633.1, test 2528.4 using       814 tree nodes with 11 median tree height
MAE train 1803.9, test 2685.5 using       666 tree nodes with 11 median tree height
MAE train 1829.8, test 2576.8 using       572 tree nodes with 10 median tree height
MAE train 1919.9, test 2730.0 using       496 tree nodes with  9 median tree height
MAE train 2036.0, test 2472.0 using       426 tree nodes with  9 median tree height
MAE train 2085.1, test 2621.5 using       402 tree nodes with  8 median tree height


---
**Q.** Why do the median height of a tree and number of total nodes decrease as we increase the number of samples per leaf?

### Each leaf must hold more samples, so the tree is split less often, which gives fewer nodes and shallower trees.

---
**Q.**  Why does the MAE error increase?

### Each leaf's prediction is an average taken over more observations, so it is more general but less accurate for any single record.

<details>
<summary>Solution</summary>
If we include more observations in a single leaf, then the average is taken over more observations. That average is a more general prediction but less accurate.
</details> 
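
A toy calculation makes the averaging effect concrete. The sketch below is not a decision tree: it assumes a one-dimensional, already sorted target (`y_toy`, made up for illustration) and treats every run of `samples_per_leaf` consecutive values as one leaf that predicts their mean.

```python
def leaf_mean_mae(y_sorted, samples_per_leaf):
    """MAE when each run of `samples_per_leaf` consecutive targets
    shares a single prediction: the mean of that run."""
    total_abs_error = 0.0
    for start in range(0, len(y_sorted), samples_per_leaf):
        leaf = y_sorted[start:start + samples_per_leaf]
        prediction = sum(leaf) / len(leaf)
        total_abs_error += sum(abs(v - prediction) for v in leaf)
    return total_abs_error / len(y_sorted)

# Hypothetical targets: a simple straight line, 100 records
y_toy = [3.0 * i for i in range(100)]

leaf_maes = {k: leaf_mean_mae(y_toy, k) for k in (1, 2, 5, 10)}
for k, mae in leaf_maes.items():
    print(f"{k:2d} samples per leaf: training MAE {mae:.1f}")
```

With one sample per leaf the training error is zero, and it grows steadily as more records share a single averaged prediction.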

From that printout, `min_samples_leaf=1` looks like the best choice because it gives the minimum validation error, although the test errors are noisy from run to run.

### Effect of reducing max_features

**Task:** Do another loop from `max_features` = 13 down to 1, with 1 sample per leaf. (There are 13 total features.)

In [21]:
p = X_train.shape[1]
for i in range(p,0,-1):
    print(f"{i:2d} ",end='')
    showsize(ntrees=10, max_features=i)

13 MAE train  888.4, test 2443.2 using     4,784 tree nodes with 18 median tree height
12 MAE train  927.7, test 2255.0 using     4,770 tree nodes with 18 median tree height
11 MAE train  865.6, test 2467.5 using     4,880 tree nodes with 18 median tree height
10 MAE train  897.1, test 2337.4 using     4,746 tree nodes with 19 median tree height
 9 MAE train  909.7, test 2170.1 using     4,812 tree nodes with 18 median tree height
 8 MAE train  893.8, test 2565.4 using     4,742 tree nodes with 18 median tree height
 7 MAE train  921.1, test 2393.2 using     4,826 tree nodes with 18 median tree height
 6 MAE train  960.6, test 2202.9 using     4,768 tree nodes with 19 median tree height
 5 MAE train  894.2, test 2342.7 using     4,918 tree nodes with 18 median tree height
 4 MAE train  957.9, test 2567.2 using     4,864 tree nodes with 18 median tree height
 3 MAE train 1007.1, test 2235.2 using     4,888 tree nodes with 19 median tree height
 2 MAE train  992.3, test 2390.1 using     

For this data set, changing the number of candidate features does not change the height of the tree much nor do we see a very clear pattern for the test set error, e.g. the error clearly increasing or decreasing as we decrease the number of candidates. 
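
As a reminder of what `max_features` controls, here is a conceptual sketch of the idea, not scikit-learn's implementation: at each split only a random subset of the features is considered as candidates. `candidate_features` is a hypothetical helper; the feature names are the columns shown earlier.

```python
import random

FEATURES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

def candidate_features(max_features, rng):
    """Pick the features one split is allowed to consider (without repeats)."""
    return rng.sample(FEATURES, k=max_features)

rng = random.Random(0)  # seeded so the sketch is repeatable
for m in (13, 4, 1):
    print(f"max_features={m:2d}: {candidate_features(m, rng)}")
```

With `max_features=13` every split sees all features, so the trees differ only through their bootstrap samples; smaller values force the trees to differ more from one another.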

## RF prediction confidence

A random forest is a collection of decision trees, each of which contributes a prediction. The forest averages those predictions to provide the overall prediction (or, for classification, takes the most common vote). Let's dig inside the random forest to get the individual trees out and ask them what their predictions are.

**Task**: Train a random forest with 10 trees on `X_train`, `y_train`.  Use `for t in rf.estimators_` to iterate through the trees making predictions with `t` not `rf`. Print out the usual MAE scores for each tree predictor.

In [22]:
rf = RandomForestRegressor(n_estimators=10, n_jobs=-1)
rf.fit(X_train, y_train)

for t in rf.estimators_:
    mae_train = mean_absolute_error(y_train, t.predict(X_train))
    mae = mean_absolute_error(y_test, t.predict(X_test))
    print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

MAE train 1089.9, test 3200.0
MAE train 1064.4, test 3322.5
MAE train 1294.8, test 3052.9
MAE train 1376.7, test 2987.3
MAE train 1123.8, test 3318.6
MAE train 1082.7, test 3602.0
MAE train 1127.0, test 3590.2
MAE train 1406.2, test 3087.3
MAE train 1472.3, test 3019.6
MAE train 1066.1, test 3005.9




Notice that it bounces around quite a bit. 

---
**Task**: Select any one of the `X_test` rows and print out the predicted house price.

In [23]:
x = X_test.iloc[30,:] # pick single test case
x = x.values.reshape(1,-1)
print(f"{x} => {rf.predict(x)}")

[[  0.43571   0.       10.59      1.        0.489     5.344   100.
    3.875     4.      277.       18.6     396.9      23.09   ]] => [16020.]




---
**Task**: Now let's see how the forest came to that conclusion. Compute the average of the predictions obtained from every tree. Compare that to the prediction obtained directly from the random forest (`rf.predict(x)`). They should be the same.

In [24]:
y_pred = np.mean([t.predict(x) for t in rf.estimators_])
print(f"{x} => {y_pred}")

[[  0.43571   0.       10.59      1.        0.489     5.344   100.
    3.875     4.      277.       18.6     396.9      23.09   ]] => 16020.0


<details>
<summary>Solution</summary>
<pre>
y_pred = np.mean([t.predict(x) for t in rf.estimators_])
print(f"{x} => {y_pred}$")
</pre>
</details>

---
**Task**: Compute the standard deviation of the tree estimates and print that out.

In [25]:
np.std([t.predict(x) for t in rf.estimators_])

4721.39809802139

<details>
<summary>Solution</summary>
<pre>
np.std([t.predict(x) for t in rf.estimators_])
</pre>
</details>

The lower the standard deviation, the more tightly grouped the predictions were, which means we should have more confidence in our answer. 

Different records will often have different standard deviations, which means we could have different levels of confidence in the various answers. This might be helpful to a bank for example that wanted to not only predict whether to give loans, but how confident the model was.
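
The per-record idea can be applied to a whole test set at once. The sketch below assumes synthetic stand-in data and illustrative names (`per_tree`, `forest_mean`, `forest_std`); it stacks the predictions of every tree and computes the mean and standard deviation for each record.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Boston data set)
X_demo, y_demo = make_regression(n_samples=400, n_features=13,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)
rf_demo.fit(X_tr, y_tr)

# Rows are trees, columns are test records
per_tree = np.stack([t.predict(X_te) for t in rf_demo.estimators_])
forest_mean = per_tree.mean(axis=0)  # should match rf_demo.predict(X_te)
forest_std = per_tree.std(axis=0)    # disagreement among trees, per record

most_certain = int(np.argmin(forest_std))
least_certain = int(np.argmax(forest_std))
print(f"tightest record {most_certain}: std {forest_std[most_certain]:.1f}")
print(f"widest record  {least_certain}: std {forest_std[least_certain]:.1f}")
```

The standard deviation here is only a rough indicator of agreement among trees, not a calibrated confidence interval.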

## Altering bootstrap size

In this section we will tune one final hyperparameter and see how it affects the model: the `max_samples` which controls the size of the bootstrapped data set used for fitting each tree.

**Task**: There are only about 400 training records. Use `max_samples` to fit each tree on about 200 of them and check the error again.

In [26]:
rf = RandomForestRegressor(n_estimators=200) # don't compute in parallel so we can see timing
%time rf.fit(X_train, y_train)
mae_train = mean_absolute_error(y_train, rf.predict(X_train))
mae = mean_absolute_error(y_test, rf.predict(X_test))
print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

CPU times: user 368 ms, sys: 7.89 ms, total: 376 ms
Wall time: 375 ms
MAE train 809.3, test 2277.0


In [27]:
rf = RandomForestRegressor(n_estimators=200, max_samples=1/2)
%time rf.fit(X_train, y_train)
mae_train = mean_absolute_error(y_train, rf.predict(X_train))
mae = mean_absolute_error(y_test, rf.predict(X_test))
print(f"MAE train {mae_train:.1f}, test {mae:.1f}")

CPU times: user 261 ms, sys: 5.13 ms, total: 266 ms
Wall time: 264 ms
MAE train 1376.8, test 2293.6


It's a bit less accurate, but it's faster.

---
**Q.**  Why is it less accurate?

### Each tree is fit on only half of the training records, so individual trees see less data and fit it less closely.

---
**Task**: Turn off bootstrapping by adding `bootstrap=False` to the constructor of the model. In scikit-learn this means that each tree is fit on the whole training set rather than on a bootstrap sample. Remember that a bootstrap sample contains only about two thirds of the distinct records because it is drawn with replacement.

In [28]:
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, bootstrap=False)
%time rf.fit(X_train, y_train)
mae_train = mean_absolute_error(y_train, rf.predict(X_train))
mae = mean_absolute_error(y_test, rf.predict(X_test))
print(f"MAE train {mae_train:.1f}$, test {mae:.1f}$")

CPU times: user 728 ms, sys: 59.6 ms, total: 787 ms
Wall time: 192 ms
MAE train 0.0$, test 2975.6$


Notice what happened to the training set error. It dropped to zero when we turned off bootstrapping, while the test error got worse: we are overfitting by a lot.
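
The "about two thirds" figure for bootstrapping is easy to check by simulation. This is a small sketch with made-up sizes (`unique_fraction` is an illustrative helper): draw as many record indices as there are records, with replacement, and count how many distinct records appear.

```python
import random

def unique_fraction(n_records, rng):
    """Fraction of distinct records in one bootstrap sample of size n_records."""
    draws = [rng.randrange(n_records) for _ in range(n_records)]
    return len(set(draws)) / n_records

rng = random.Random(0)  # seeded so the sketch is repeatable
fractions = [unique_fraction(10_000, rng) for _ in range(5)]
for f in fractions:
    print(f"distinct records in bootstrap sample: {f:.3f}")
```

The fraction settles near 1 - 1/e, about 0.632, so roughly a third of the records are left out of each tree's sample when bootstrapping is on.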