## Smooth Tree
Single decision trees tend to overfit, which leads to poor predictive performance. Tree ensembles such as random forests (RF) and gradient boosting machines (GBM) perform well but are black-box models. We investigate whether smoothing the node values of a single tree can improve prediction while keeping a traceable, white-box model.

In a smoothed regression tree, the node values, $s_n$, will be as follows:

$$s_n = y_n, n=0$$

$$s_n = \frac{w_n y_n + v_{ss} s_p}{w_n + v_{ss}} , n \gt 0$$

where $y_n$ is the mean of the targets in node $n$, $s_n$ is the smoothed value of node $n$, $s_p$ is the smoothed value of the parent of node $n$, $v_{ss}$ is the virtual sample size (a free parameter of the model), and $w_n$ is the total weight of the data in node $n$, or the count if the tree is unweighted. Nodes are numbered from 0, which is the root.
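The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not arboretum's implementation: the function name `smooth_values` and the flat `parent` list layout are assumptions made for the example, and it assumes every parent has a smaller index than its children so that a single forward pass suffices.

```python
def smooth_values(parent, y, w, vss):
    """Return smoothed node values s_n for a tree stored as flat lists.

    parent[n] is the index of the parent of node n (-1 for the root, n = 0).
    y[n] is the mean target in node n and w[n] its total weight (or count).
    Assumes parent[n] < n for every non-root node (hypothetical layout).
    """
    s = [0.0] * len(y)
    s[0] = y[0]  # root: s_0 = y_0
    for n in range(1, len(y)):
        s_p = s[parent[n]]  # the parent's value has already been smoothed
        s[n] = (w[n] * y[n] + vss * s_p) / (w[n] + vss)
    return s


# Toy tree: node 0 is the root, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
parent = [-1, 0, 0, 1]
y = [10.0, 20.0, 8.0, 30.0]
w = [100, 5, 95, 2]

s = smooth_values(parent, y, w, vss=5)
print(s)
```

With $v_{ss} = 5$, node 1 (5 samples) moves from 20 to 15, halfway toward the root value, whereas node 2 (95 samples) barely moves, from 8 to 8.1.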

Smoothed classification trees are similar, but operate on class probabilities instead of on the mean of the targets.
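As a sketch of how this might look for classification, the same update can be applied to each class probability separately. The helper name `smooth_probs` is hypothetical, and applying the formula element-wise is an assumption based on the description above rather than a statement about arboretum's internals.

```python
def smooth_probs(p_node, w_node, s_parent, vss):
    """Smooth a node's class probabilities toward its parent's smoothed ones.

    p_node and s_parent are lists of class probabilities that each sum to 1.
    """
    return [(w_node * p + vss * sp) / (w_node + vss)
            for p, sp in zip(p_node, s_parent)]


root = [0.5, 0.5]   # root probabilities are used unsmoothed
leaf = [1.0, 0.0]   # a pure leaf holding only 3 samples
smoothed = smooth_probs(leaf, w_node=3, s_parent=root, vss=5)
print(smoothed)       # [0.6875, 0.3125]
print(sum(smoothed))  # 1.0
```

Because the result is a weighted average of two probability vectors, it still sums to 1, and a small pure leaf no longer predicts a probability of exactly 0 or 1.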

### Diabetes
Our first example involves prediction on the *diabetes* dataset. This dataset has 442 examples with 10 predictors. We divide it into a 295-row training set and a 147-row test set. The targets are numeric, and predictions are evaluated by mean squared error (MSE).

In [6]:
from arboretum.datasets import load_diabetes
xtr, ytr, xte, yte = load_diabetes()
xtr.shape, xte.shape

((295, 10), (147, 10))

We will compare a smoothed regression tree from arboretum to a regression tree and a random forest from scikit-learn. First, we'll just run the models once, then we will investigate their performance in more detail.

In [8]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from arboretum import SmoothRegressionTree
from sklearn.model_selection import GridSearchCV
dtr = DecisionTreeRegressor(min_samples_leaf=5)
rf = RandomForestRegressor(n_estimators=30, min_samples_leaf=5)
mytree = SmoothRegressionTree(vss=5, min_leaf=5)

In [19]:
dtr.fit(xtr, ytr)
pred = dtr.predict(xte)
mse(yte, pred)

5160.6983167743037

In [18]:
mytree.fit(xtr, ytr)
pred = mytree.predict(xte)
mse(yte, pred)

4593.5736258148163

In [20]:
rf.fit(xtr, ytr)
pred = rf.predict(xte)
mse(yte, pred)

3009.8631643977915

So, on this single run, the smoothed regression tree (MSE of about 4594) lands between one tree (about 5161) and a random forest (about 3010) in terms of accuracy, but closer to one tree.

### ALS
Next, we consider the ALS dataset. This is a wide, noisy dataset with 369 predictors. The training set has 1197 rows and the test set has 625. It is a tough dataset for trees, which often underperform the constant model.

In [21]:
from arboretum.datasets import load_als
xtr, ytr, xte, yte = load_als()
xtr.shape, xte.shape

((1197, 369), (625, 369))

The constant model, which always predicts the mean of the training targets, gets an MSE of about 0.32.

In [22]:
mse(yte, 0 * yte + ytr.mean())

0.32037244903736767

In [23]:
dtr.fit(xtr, ytr)
pred = dtr.predict(xte)
mse(yte, pred)

0.4578925726636624

In [24]:
mytree.fit(xtr, ytr)
pred = mytree.predict(xte)
mse(yte, pred)

0.36398480589778592

In [32]:
rf.n_estimators = 100
rf.fit(xtr, ytr)
pred = rf.predict(xte)
mse(yte, pred)

0.25648452871621918

On an `arboretum.SmoothRegressionTree`, the $v_{ss}$ parameter can be changed without refitting. For this noisy data, much higher smoothing values are better: raising $v_{ss}$ from 5 to 100 brings the test MSE below that of the constant model.

In [27]:
mytree.vss = 100
pred = mytree.predict(xte)
mse(yte, pred)

0.2944062345391758
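One way to see why no refit is needed: $v_{ss}$ appears only in the formula for $s_n$, so a fitted tree can presumably keep the raw $y_n$ and $w_n$ of each node and recompute the smoothed values on demand. The sketch below illustrates this on a single root-to-leaf path. The function `path_value` and the numbers are made up for illustration and are not arboretum's code.

```python
def path_value(path, vss):
    """Smoothed value at the end of a root-to-leaf path.

    path is a list of (y_n, w_n) pairs from the root down to the leaf.
    Only these stored statistics and vss are needed, so nothing is refit.
    """
    s = path[0][0]  # the root value is unsmoothed
    for y_n, w_n in path[1:]:
        s = (w_n * y_n + vss * s) / (w_n + vss)
    return s


# Hypothetical path: root, one internal node, then a small leaf.
path = [(10.0, 100), (20.0, 5), (30.0, 2)]

for vss in (0, 5, 100, 10_000):
    print(vss, round(path_value(path, vss), 3))
```

At `vss = 0` the leaf mean (30) is returned unchanged, and as `vss` grows the prediction shrinks toward the root mean (10), which is the constant model.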

### Random Forest
We can monkey-patch an `arboretum.RFRegressor` to find out whether smoothed trees are helpful in random forests. We first fit the forest once on a small slice of the data so that numba JIT-compiles the RF code.

In [1]:
from arboretum import RFRegressor
myrf = RFRegressor(n_trees=10)
myrf.base_estimator = mytree
myrf.fit(xtr[:10], ytr[:10])

In [29]:
myrf.n_trees = 100
myrf.fit(xtr, ytr)
pred = myrf.predict(xte)
mse(yte, pred)

0.25948341242450079

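So it looks like smoothing doesn't help in an RF model: the forest of smoothed trees scores about 0.259, essentially the same as the 0.256 of the scikit-learn random forest above.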