## Ensembling

Think back to the original reasoning behind why random forests work so well: each tree has errors, but those errors are not correlated with each other, so the average of those errors should tend towards zero once there are enough trees.

Similar reasoning could be used to consider averaging the predictions of models trained using different algorithms (for example, a random forest and a neural network).
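The "uncorrelated errors average out" argument can be checked with a small simulation. This is only an illustrative sketch: the errors are drawn as independent standard-normal noise (an assumption, real model errors are rarely perfectly uncorrelated), and the names `spread`, `n_models` and `n_points` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 10_000  # number of hypothetical validation rows

spread = {}
for n_models in (1, 10, 100):
    # One row of simulated errors per model, independent across models.
    errors = rng.normal(loc=0.0, scale=1.0, size=(n_models, n_points))
    # Averaging the models' predictions averages their errors too.
    avg_error = errors.mean(axis=0)
    spread[n_models] = avg_error.std()

for n_models, std in spread.items():
    print(f"{n_models:>3} models: std of averaged error = {std:.3f}")
```

With independent errors, the spread of the averaged error shrinks roughly as one over the square root of the number of models. If the models' errors were strongly correlated, averaging would help much less, which is why combining models of different kinds is attractive.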

In [25]:
import pandas as pd
import numpy as np
from fastai.tabular.all import (
    Categorify,
    FillMissing,
    Normalize,
    F,
    TabularPandas,
    add_datepart,
    cont_cat_split,
    tabular_learner,
    to_np,
)
from fastbook import (
    Path,
    load_pickle,
)
from sklearn.ensemble import RandomForestRegressor

from evaluation import m_rmse, r_mse
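The local `evaluation` module is not shown in this notebook. A minimal sketch of what `r_mse` and `m_rmse` might look like, assuming they follow the usual root-mean-squared-error helpers used with this dataset (the bodies below are an assumption, not the module's actual source):

```python
import math

import numpy as np


def r_mse(pred, y):
    """Root mean squared error between predictions and targets, rounded to 6 places."""
    diff = np.asarray(pred) - np.asarray(y)
    return round(math.sqrt((diff ** 2).mean()), 6)


def m_rmse(model, xs, y):
    """RMSE of a fitted model's predictions on xs against y."""
    return r_mse(model.predict(xs), y)
```

Because the dependent variable is the log of the sale price, an RMSE computed this way is a root mean squared log error on the original prices.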

### Loading data and setup - Random forest

In [6]:
path = Path('/home/david/.fastai/archive/bluebook-for-bulldozers')
Path.BASE_PATH = path

to = load_pickle(path / "to.pkl")

xs_final = load_pickle(path / "xs_final.pkl")
y = to.train.y
valid_xs_final = load_pickle(path / "valid_xs_final.pkl")
valid_y = to.valid.y

In [7]:
time_vars = ["SalesID", "MachineID"]
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)

In [13]:
def random_forest(
    xs: pd.DataFrame,
    y: pd.Series,
    n_estimators: int = 40,
    max_samples: int = 200_000,
    max_features: float = 0.5,
    min_samples_leaf: int = 5,
    **kwargs
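) -> RandomForestRegressor:
    # Any extra keyword arguments are forwarded to RandomForestRegressor.
    return RandomForestRegressor(
        n_jobs=-1,
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        oob_score=True,
        **kwargs
    ).fit(xs, y)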

In [14]:
model = random_forest(xs_final_time, y)
m_rmse(model, valid_xs_time, valid_y)

0.229011

### Loading data and setup - Neural Network

In [16]:
df_nn = pd.read_csv(path / "TrainAndValid.csv", low_memory=False)

df_nn["ProductSize"] = df_nn["ProductSize"].astype("category")

sizes = "Large", "Large / Medium", "Medium", "Small", "Mini", "Compact"
# set_categories returns a new Series; the inplace argument is deprecated in pandas
df_nn["ProductSize"] = df_nn["ProductSize"].cat.set_categories(sizes, ordered=True)

dep_var = "SalePrice"

df_nn[dep_var] = np.log(df_nn[dep_var])
df_nn = add_datepart(df_nn, "saledate")

condition = (df_nn.saleYear < 2011) | (df_nn.saleMonth < 10)
# np.where is a useful function that returns (as the first element of a tuple) the indices of all True values
train_idx = np.where(condition)[0]
valid_idx = np.where(~condition)[0]

splits = (list(train_idx), list(valid_idx))


In [19]:
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]
cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)

batch_size = 1024

procs_nn = [Categorify, FillMissing, Normalize]

to_nn = TabularPandas(
    df_nn_final,
    procs_nn,
    cat_nn,
    cont_nn,
    splits=splits,
    y_names=dep_var
)

dls = to_nn.dataloaders(batch_size)


In [22]:
model_name = "nn"

learner = tabular_learner(
    dls,
    y_range=(8, 12),
    layers=[500, 250],
    n_out=1,
    loss_func=F.mse_loss
)

learner.load(model_name)

preds, targs = learner.get_preds()

### Combine models

One minor issue we have to be aware of is that our PyTorch model and our sklearn model produce predictions of different shapes: PyTorch gives us a rank-2 tensor (i.e., a column matrix), whereas sklearn gives us a rank-1 NumPy array (a vector). `squeeze` removes any unit axes from a tensor, and `to_np` converts it into a NumPy array.

In [24]:
rf_preds = model.predict(valid_xs_time)
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
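To see why the `squeeze` matters, here is a small sketch using plain NumPy arrays as stand-ins for the two models' predictions (the values and the names `nn_like` and `rf_like` are invented for illustration). Adding a column matrix to a vector does not fail; it silently broadcasts to a square matrix:

```python
import numpy as np

# Stand-in for the neural net's predictions: a column matrix of shape (3, 1).
nn_like = np.array([[10.0], [11.0], [12.0]])
# Stand-in for the random forest's predictions: a vector of shape (3,).
rf_like = np.array([10.2, 10.8, 12.4])

# Without squeeze, broadcasting pairs every row with every element: shape (3, 3).
wrong = (nn_like + rf_like) / 2
# With squeeze, the unit axis is dropped and the average is elementwise: shape (3,).
right = (nn_like.squeeze() + rf_like) / 2

print(wrong.shape, right.shape)
```

Because the mistake raises no error, checking the shape of the ensembled predictions is a cheap safeguard.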

In [26]:
# This gives us a better result than either model achieved on its own:

r_mse(ens_preds, valid_y)

0.220122

### Boosting

So far our approach to ensembling has been to use bagging, which involves combining many models (each trained on a different data subset) together by averaging them.

There is another important approach to ensembling, called boosting, where we add models instead of averaging them. Here is how boosting works:

1. Train a small model that underfits your dataset.
2. Calculate the predictions on the training set for this model.
3. Subtract the predictions from the targets; these are called the "residuals" and represent the error for each point in the training set.
4. Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.
5. Continue doing this until you reach some stopping criterion, such as a maximum number of trees, or you observe your validation set error getting worse.


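Using this approach, each new tree will be attempting to fit the error of all of the previous trees combined. Because each tree predicts only what the earlier trees left unexplained, the ensemble's prediction is the sum of the predictions of all the trees rather than their average.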

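Note that, unlike with random forests, with this approach there is nothing to stop us from overfitting. Adding more trees to a random forest does not lead to overfitting, because each tree is trained independently of the others. In a boosted ensemble, every additional tree reduces the training error further, so with enough trees the validation error will eventually start to get worse.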
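The boosting loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production gradient boosting implementation: the weak model is a one-split regression "stump" on a single feature, the data is synthetic, the only stopping criterion is a fixed number of rounds, and all function names (`fit_stump`, `predict_stump`, `boost`, `predict_boosted`) are hypothetical.

```python
import numpy as np


def fit_stump(x, y):
    """Fit a one-split regression tree: pick the threshold with the lowest squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = x <= threshold
        left_mean, right_mean = y[left].mean(), y[~left].mean()
        sse = ((y[left] - left_mean) ** 2).sum() + ((y[~left] - right_mean) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)


def boost(x, y, n_rounds):
    """Repeatedly fit a stump to the current residuals."""
    stumps = []
    residuals = y.copy()
    for _ in range(n_rounds):
        stump = fit_stump(x, residuals)  # train a small, underfitting model
        stumps.append(stump)
        residuals = residuals - predict_stump(stump, x)  # new targets for the next round
    return stumps


def predict_boosted(stumps, x):
    """The ensemble prediction is the sum of the stumps' predictions."""
    return sum(predict_stump(stump, x) for stump in stumps)


# Synthetic data: a noisy sine wave.
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

mse_one = ((y - predict_boosted(boost(x, y, n_rounds=1), x)) ** 2).mean()
mse_many = ((y - predict_boosted(boost(x, y, n_rounds=50), x)) ** 2).mean()
print(f"training MSE after 1 round: {mse_one:.4f}, after 50 rounds: {mse_many:.4f}")
```

The training error keeps falling as rounds are added, which is exactly why a validation set is needed to decide when to stop.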

### Conclusion


We suggest starting your analysis with a random forest. This will give you a strong baseline, and you can be confident that it's a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data.

From that foundation, you can try neural nets and gradient boosting machines (GBMs), and if they give you significantly better results on your validation set in a reasonable amount of time, you can use them. If decision tree ensembles are working well for you, try adding the embeddings for the categorical variables to the data, and see if that helps your decision trees learn better.