## Regression: Tree-based Methods


**Functions**

`sklearn.ensemble.RandomForestRegressor`, `sklearn.model_selection.GridSearchCV`, `sklearn.ensemble.GradientBoostingRegressor`

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from statsmodels.api import OLS

### Exercise 59

Load the portfolio tracking data and compute the in- and out-of-sample SSE for OLS.

In [2]:
vwm = pd.read_csv("data/VWM.csv", index_col="Date")
vwm.index = pd.to_datetime(vwm.index, format="%Y%m")
vwm = vwm.resample("M").last()

industries = pd.read_csv("data/12_Industry_portfolios.csv", index_col="Date")
industries.index = pd.to_datetime(industries.index, format="%Y%m")
industries = industries.resample("M").last()

In [3]:
x = industries["1980":"2014"]
y = vwm["VWM"]["1980":"2014"]
t, p = x.shape

#### Explanation

We first show the OLS in-sample SSE as a benchmark value, and then its out-of-sample SSE.

In [4]:
tss = y.T @ y
res = OLS(y, x).fit()
print(f"OLS SSE is {tss * (1-res.rsquared):0.1f}")

OLS SSE is 134.5
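As a minimal sketch of why `tss * (1 - res.rsquared)` recovers the SSE: the regression above has no constant, and in that case statsmodels reports the uncentered $R^2 = 1 - SSE / y'y$. The example below uses synthetic data and plain NumPy least squares as a stand-in for `OLS`; the array names, sizes, and noise level are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t, p = 420, 12  # assumed sample size and number of regressors

# Synthetic "industry returns" and a target that is a noisy equal-weighted mix
x = rng.standard_normal((t, p))
beta = np.full(p, 1 / p)
y = x @ beta + 0.1 * rng.standard_normal(t)

# Least squares without a constant
params, *_ = np.linalg.lstsq(x, y, rcond=None)
resid = y - x @ params
sse = resid @ resid

# Uncentered total sum of squares and the matching R-squared
tss = y @ y
r2_uncentered = 1 - sse / tss
sse_from_r2 = tss * (1 - r2_uncentered)

# Out-of-sample SSE: apply the in-sample coefficients to new data
x_new = rng.standard_normal((60, p))
y_new = x_new @ beta + 0.1 * rng.standard_normal(60)
resid_new = y_new - x_new @ params
oos_sse = resid_new @ resid_new

print(f"SSE: {sse:0.3f}, SSE from R2: {sse_from_r2:0.3f}, OOS SSE: {oos_sse:0.3f}")
```

The two in-sample values agree up to floating-point error. With a constant in the model, the centered total sum of squares would be needed instead.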


In [5]:
# Select the out-of-sample data
y_oos = vwm.loc["2015":, "VWM"]
x_oos = industries["2015":]

resid = y_oos - x_oos @ res.params
ols_oos_sse = resid.T @ resid
print(f"The out-of-sample SSE for OLS is {ols_oos_sse:0.2f}")

The out-of-sample SSE for OLS is 20.15


### Exercise 60

Fit a default Random Forest in a reproducible manner to the portfolio tracking data and compute the in- and out-of-sample SSE.

**Warning**: This exercise is simply an example of how to use these methods. In general, tree-based models are terrible choices for constructing tracking portfolios since the final model is not a weighted combination of the returns but instead depends on non-linear transformations of the returns. This makes it virtually impossible to implement a tree-based estimator as an actual portfolio.
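To make the warning concrete, here is a small sketch with made-up numbers. The single-split `stump` function is a hypothetical stand-in for a fitted tree, not scikit-learn's implementation. A linear tracking model maps directly to portfolio weights because its prediction is `x @ w`, and so it is additive in the returns. A tree's prediction is piecewise constant, so no set of weights reproduces it.

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])  # hypothetical portfolio weights


def linear(x):
    # Prediction of a linear tracking model: a weighted sum of returns
    return x @ w


def stump(x):
    # One-split "tree": predicts 1.0 if the first return is positive, else -1.0
    return np.where(x[..., 0] > 0.0, 1.0, -1.0)


a = np.array([0.02, -0.01, 0.03])
b = np.array([0.01, 0.04, -0.02])

# A portfolio is additive: holding a + b pays the sum of the two payoffs
linear_additive = np.isclose(linear(a + b), linear(a) + linear(b))
stump_additive = np.isclose(stump(a + b), stump(a) + stump(b))

print(f"linear additive: {bool(linear_additive)}, stump additive: {bool(stump_additive)}")
```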

#### Explanation
Random Forests fit ensembles (combinations) of trees, where each tree considers only a random subset of the regressors at each split. Here we fit a Random Forest with otherwise default settings, using the $\sqrt{p}$ rule to choose the number of regressors considered. This should reduce the correlation between trees.

The in-sample SSE is very good and much smaller than the in-sample SSE of OLS.
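A short sketch of why lower correlation between trees helps. It relies on the standard result that the average of $B$ identically distributed predictors, each with variance $\sigma^2$ and pairwise correlation $\rho$, has variance $\rho\sigma^2 + (1-\rho)\sigma^2/B$. The function name and the numbers are illustrative assumptions, and the feature count assumes the square root of $p$ is truncated to an integer.

```python
import math


def ensemble_variance(rho, sigma2, n_trees):
    # Variance of the average of n_trees equally correlated predictors
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


p = 12  # number of industry portfolios
n_candidates = max(1, int(math.sqrt(p)))  # regressors considered under the sqrt rule

high_corr = ensemble_variance(0.9, 1.0, 100)
low_corr = ensemble_variance(0.3, 1.0, 100)

print(f"candidates per split: {n_candidates}")
print(f"variance with rho=0.9: {high_corr:0.3f}, with rho=0.3: {low_corr:0.3f}")
```

Adding trees only shrinks the second term, so the correlation $\rho$ sets a floor on the variance of the ensemble. Restricting the regressors available to each tree is one way to lower that floor.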

In [6]:
rfr = RandomForestRegressor(max_features="sqrt", random_state=20201231)
rfr = rfr.fit(x, y)
resid = y - rfr.predict(x)
print(f"The RandomForest SSE is {resid.T@resid:0.1f}")

The RandomForest SSE is 62.9


#### Explanation

The out-of-sample SSE, however, is quite a bit worse than OLS.  Tree-based models are not good models for tracking portfolio construction.

In [7]:
pred = rfr.predict(x_oos)
resid = y_oos - pred
rf_oos_sse = resid.T @ resid
print(f"The out-of-sample SSE for the default RF is {rf_oos_sse:0.2f}")

The out-of-sample SSE for the default RF is 33.93


### Exercise 61

Optimize the key tuning parameters of the Random Forest using cross-validation and compute the out-of-sample SSE of the preferred model.

In [8]:
parameters = {
    "n_estimators": [100, 250, 500, 1000],
    # "auto" means all features here; it was removed in scikit-learn 1.3 (use 1.0 instead)
    "max_features": ["auto", "sqrt"],
    "max_leaf_nodes": [50, 100, 200, 225, 250],
}

rfr = RandomForestRegressor(random_state=20201231)
gscv = GridSearchCV(
    rfr, parameters, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1
)

gscv = gscv.fit(x, y)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


#### Explanation

`GridSearchCV` allows us to compute the cross-validated score of a model for a combination of input parameters. This method is similar to writing a number of loops across each of the parameters and then cross-validating the model for each distinct combination.  

The key input to `GridSearchCV` is a dictionary where the keys are model parameter names and the values are the values that should be considered in the search.  The model is then automatically cross-validated for all combinations of the parameters.

**Note**: This cell may run for an extended period, depending on your system.

The best estimator, in the sense of maximizing the score (negative MSE here, so equivalently minimizing the cross-validated MSE), is available using the `best_estimator_` attribute. This is a `RandomForestRegressor` with the CV-optimized parameters. This estimator can then be fit to the data. With the default `refit=True`, `GridSearchCV` has already refit it on the full sample, so calling `fit` again is harmless but not required.
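The "number of loops" view of the grid search can be sketched with the standard library alone. This is only an illustration of how the candidates are enumerated, not how `GridSearchCV` is implemented, and the ordering of candidates here need not match the ordering in `cv_results_`.

```python
from itertools import product

parameters = {
    "n_estimators": [100, 250, 500, 1000],
    "max_features": ["auto", "sqrt"],
    "max_leaf_nodes": [50, 100, 200, 225, 250],
}

# Every distinct combination of the parameter values, one dict per candidate
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 5  # the number of folds reported in the output above
n_fits = n_folds * len(candidates)

print(f"{len(candidates)} candidates, {n_fits} fits")
print(candidates[0])
```

The counts match the log line printed by the search: 4 x 2 x 5 = 40 candidates, each fit on 5 folds.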

In [9]:
rfr_best = gscv.best_estimator_.fit(x, y)
rfr_best

RandomForestRegressor(max_features='sqrt', max_leaf_nodes=225, n_estimators=500,
                      random_state=20201231)

In [10]:
resid = y - rfr_best.predict(x)
print(f"The in-sample SSE of the best model is {resid.T @ resid:0.1f}")

The in-sample SSE of the best model is 58.2


#### Explanation

The in-sample SSE is very good, and is slightly better than the naive attempt.

Note that the cross-validated SSE is related to the negative MSE through the relationship below, where $n$ is the number of observations (`t` in the code):

$$ \text{Neg MSE} = -\frac{SSE_{xv}}{n} $$

The values are stored in a dictionary `gscv.cv_results_` using the key `"mean_test_score"`.  We can convert these to cross-validated SSE for comparison with other methods. These are all higher than what we saw with regression methods.
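A minimal sketch of the conversion, using synthetic held-out residuals in place of real cross-validation output. It assumes 420 monthly observations (1980 through 2014) split into 5 folds of equal size. When the folds are equal, averaging the per-fold negative MSEs and multiplying by $-n$ reproduces the pooled SSE exactly; with unequal folds the conversion is only approximate.

```python
import numpy as np

t = 420  # assumed number of monthly observations
n_folds = 5
fold_size = t // n_folds

rng = np.random.default_rng(1)
resid = rng.standard_normal(t)  # stand-in for held-out residuals
folds = resid.reshape(n_folds, fold_size)

# Negative MSE per fold, then averaged across folds
fold_neg_mse = -(folds**2).mean(axis=1)
mean_test_score = fold_neg_mse.mean()

# Convert back to a cross-validated SSE and compare with the direct sum
sse_xv = -t * mean_test_score
direct_sse = resid @ resid

print(f"SSE from score: {sse_xv:0.4f}, direct SSE: {direct_sse:0.4f}")
```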

In [11]:
sse_xv = -t * gscv.cv_results_["mean_test_score"]
sse_xv

array([543.45130064, 542.35097801, 538.82874551, 539.14500168,
       524.50255337, 523.72519581, 520.97157683, 521.23190854,
       523.24923978, 522.95215578, 520.43702707, 520.6435387 ,
       523.23792766, 522.94174719, 520.4335865 , 520.63867577,
       523.23792766, 522.94174719, 520.4335865 , 520.63867577,
       513.04864491, 510.34144513, 507.68919293, 512.52868972,
       498.21866735, 492.91300459, 490.63116003, 494.97361627,
       497.82659792, 492.46257337, 489.54352418, 493.0663823 ,
       497.82127479, 492.45466584, 489.52822532, 493.05273718,
       497.82127479, 492.45466584, 489.52822867, 493.05273887])

#### Explanation

`cv_results_` also contains the parameters used in each configuration. Here we build a `DataFrame` that examines the better parameterizations by merging these values with the $SSE_{xv}$ and sorting.  We see that the best configurations always used `"sqrt"` for `max_features`, and that 500 estimators consistently outperformed 250 or 1000.

In [12]:
df = pd.DataFrame(gscv.cv_results_["params"])
df["sse_xv"] = sse_xv
df.sort_values("sse_xv").head(10)

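|    | max_features | max_leaf_nodes | n_estimators | sse_xv |
|---:|:-------------|---------------:|-------------:|-----------:|
| 34 | sqrt | 225 | 500 | 489.528225 |
| 38 | sqrt | 250 | 500 | 489.528229 |
| 30 | sqrt | 200 | 500 | 489.543524 |
| 26 | sqrt | 100 | 500 | 490.63116 |
| 37 | sqrt | 250 | 250 | 492.454666 |
| 33 | sqrt | 225 | 250 | 492.454666 |
| 29 | sqrt | 200 | 250 | 492.462573 |
| 25 | sqrt | 100 | 250 | 492.913005 |
| 35 | sqrt | 225 | 1000 | 493.052737 |
| 39 | sqrt | 250 | 1000 | 493.052739 |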


#### Explanation

Finally, we can compute the OOS SSE using the `predict` method with the out-of-sample data. This value is poor when compared to OLS (36.28 versus 20.15), and is also slightly worse than the default Random Forest (33.93).  This indicates (not surprisingly) that tree-based methods are not good ways to fit financial return data.

In [13]:
pred = rfr_best.predict(x_oos)
resid = y_oos - pred
rf_oos_sse = resid @ resid
print(f"The out-of-sample SSE for the optimized RF is {rf_oos_sse:0.2f}")

The out-of-sample SSE for the optimized RF is 36.28


### Exercise 62

Boosting is often a better alternative to Random Forests since it limits tree depth and, in turn, variable interactions. Fit a default boosted regression tree to the portfolio tracking data, and compute the out-of-sample SSE.

In [14]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(random_state=20201231)
gbr.fit(x, y)
pred = gbr.predict(x_oos)
resid = y_oos - pred
gbr_oos_sse = resid @ resid
gbr_oos_sse

31.039847616420666

#### Explanation

Here we fit a default boosted regression tree using `GradientBoostingRegressor`.  It is always a good idea to set `random_state` to ensure results are reproducible. We compute the OOS SSE (31.04) and see that the default parameters perform well when compared to either Random Forest (33.93 and 36.28).

### Exercise 63

Optimize the key parameters of the boosted regression tree using cross-validation.

In [15]:
from sklearn.model_selection import GridSearchCV

parameters = {
    "learning_rate": [0.01, 0.025, 0.05, 0.1, 0.2],
    "n_estimators": [1000, 2000, 4000, 8000, 12000],
    "max_leaf_nodes": [2, 3, 4, 6],
}

gscv = GridSearchCV(
    gbr, parameters, n_jobs=-1, scoring="neg_mean_squared_error", verbose=1
)
gscv.fit(x, y)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


GridSearchCV(estimator=GradientBoostingRegressor(random_state=20201231),
             n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.025, 0.05, 0.1, 0.2],
                         'max_leaf_nodes': [2, 3, 4, 6],
                         'n_estimators': [1000, 2000, 4000, 8000, 12000]},
             scoring='neg_mean_squared_error', verbose=1)

#### Explanation

Boosted models can be tuned like any other approach.  Here we use `GridSearchCV` again to search for good choices of the learning rate ($\lambda$ in the notes), the number of estimators ($B$ in the notes), and the `max_leaf_nodes` ($d$ in the notes).

**Note**: This cell can take a while to run, depending on your machine.

The preferred configuration has a large number of estimators with a relatively low learning rate and small trees.
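The roles of the three tuning parameters can be seen in a stripped-down boosting loop for squared error. This is a simplified sketch on synthetic data with a single regressor, not scikit-learn's algorithm: each base learner is a two-leaf tree (the analogue of `max_leaf_nodes=2`), `learning_rate` plays the role of $\lambda$, and `n_estimators` plays the role of $B$. The helper names `fit_stump` and `boost` are hypothetical.

```python
import numpy as np


def fit_stump(x, r):
    # Find the single split on x that minimizes the squared error of fitting r
    best = None
    for c in np.unique(x)[:-1]:  # drop the largest value so both leaves are non-empty
        left = x <= c
        lo, hi = r[left].mean(), r[~left].mean()
        sse = ((r - np.where(left, lo, hi)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, c, lo, hi)
    return best[1:]


def boost(x, y, n_estimators, learning_rate):
    # Start from the mean, then repeatedly fit a stump to the current residuals
    pred = np.full_like(y, y.mean())
    for _ in range(n_estimators):
        c, lo, hi = fit_stump(x, y - pred)
        pred = pred + learning_rate * np.where(x <= c, lo, hi)
    return pred


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(200)

sse_few = ((y - boost(x, y, n_estimators=10, learning_rate=0.1)) ** 2).sum()
sse_many = ((y - boost(x, y, n_estimators=200, learning_rate=0.1)) ** 2).sum()

print(f"in-sample SSE with 10 stumps: {sse_few:0.2f}, with 200 stumps: {sse_many:0.2f}")
```

Each step removes only a fraction `learning_rate` of what the new stump explains, so a small learning rate needs many estimators to reach a good fit. The in-sample SSE keeps falling as estimators are added, which is why the number of estimators has to be chosen by cross-validation rather than by in-sample fit.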

In [16]:
best_gbr = gscv.best_estimator_.fit(x, y)
best_gbr

GradientBoostingRegressor(learning_rate=0.025, max_leaf_nodes=3,
                          n_estimators=8000, random_state=20201231)

#### Explanation

When we look at the top-performing estimators, we see that small trees combined with slow learning and many estimators consistently perform best.

In [17]:
sse_xv = -t * gscv.cv_results_["mean_test_score"]
df = pd.DataFrame(gscv.cv_results_["params"])
df["sse_xv"] = sse_xv
df = df.sort_values("sse_xv")
df.index = np.arange(1, df.shape[0] + 1)
df.head(10)

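|    | learning_rate | max_leaf_nodes | n_estimators | sse_xv |
|---:|--------------:|---------------:|-------------:|-----------:|
| 1 | 0.025 | 3 | 8000 | 291.99108 |
| 2 | 0.025 | 3 | 12000 | 292.238313 |
| 3 | 0.01 | 3 | 12000 | 292.523746 |
| 4 | 0.025 | 3 | 4000 | 294.850653 |
| 5 | 0.05 | 3 | 12000 | 296.995546 |
| 6 | 0.05 | 3 | 8000 | 297.068889 |
| 7 | 0.05 | 3 | 4000 | 297.568802 |
| 8 | 0.01 | 3 | 8000 | 298.710714 |
| 9 | 0.05 | 3 | 2000 | 300.142419 |
| 10 | 0.025 | 2 | 8000 | 306.097345 |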


### Exercise 64

Compute the out-of-sample SSE for the selected boosted regression tree.

In [18]:
pred = best_gbr.predict(x_oos)
resid = y_oos - pred
best_gbr_oos_sse = resid @ resid
best_gbr_oos_sse

23.778563339935815

#### Explanation 

We can generate the out-of-sample SSE for the optimized GBR. While it is a substantial improvement over both Random Forests and the default boosted regression tree, it is still about 18% worse than what plain OLS achieves (23.78 versus 20.15).
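For a side-by-side view, the out-of-sample SSEs reported in the exercises above can be expressed relative to OLS. The values are copied from the printed outputs and rounded to two decimals; the dictionary labels are just illustrative names.

```python
# Out-of-sample SSEs as printed in the exercises above
oos_sse = {
    "OLS": 20.15,
    "Default RF": 33.93,
    "Optimized RF": 36.28,
    "Default GBR": 31.04,
    "Optimized GBR": 23.78,
}

# Excess SSE of each model relative to OLS
relative = {name: sse / oos_sse["OLS"] - 1 for name, sse in oos_sse.items()}

for name, excess in relative.items():
    print(f"{name:>14}: {oos_sse[name]:6.2f} ({excess:+.0%} vs. OLS)")
```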