# **Background**

https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

<img src='https://miro.medium.com/max/1575/1*kISLC1Udq0m6g5kwHhMuJg@2x.png'>

# **Bagging**

<img src='https://gaussian37.github.io/assets/img/ml/concept/bagging/bagging.png'>

In [None]:
# bagging
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# train is the training data
# test is the test data
# y is the target variable
model = RandomForestRegressor()
bags = 10
seed = 1

# create array object to hold bagged predictions
bagged_prediction = np.zeros(test.shape[0])
# loop for as many times as we want bags
for n in range(bags):
    model.set_params(random_state=seed+n) # update seed
    model.fit(train,y)
    preds = model.predict(test)
    bagged_prediction += preds # add predictions to bagged
# take average of prediction
bagged_prediction /= bags

# **Boosting**

<img src='https://miro.medium.com/max/2936/1*jbncjeM4CfpobEnDO0ZTjw.png'>

**Weight based boosting parameters:**

* Learning rate (or shrinkage or eta):

predictionN = pred0*eta + pred1*eta + pred2*eta + ...

* Number of estimators

* Input model - can be anything that accepts weights

* Sub boosting type:
    
    * AdaBoost - Good implementation in sklearn (Python)
    
    * LogitBoost - Good implementation in Weka (Java)


**Boosting hyperparameter tuning:**

* Step 1: choose n_estimators = 100, eta = 0.01

* Step 2: double n_estimators to 200 and halve eta to 0.005
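
The two tuning steps above move the number of estimators and the learning rate in opposite directions by the same factor. A minimal sketch of that rule of thumb follows; the helper name `scale_boosting_params` is hypothetical and not part of any library.

```python
def scale_boosting_params(n_estimators, eta, factor=2):
    """Multiply the number of estimators and divide the learning rate by the same factor."""
    return n_estimators * factor, eta / factor


# Step 1 from the text
n_estimators, eta = 100, 0.01

# Step 2: twice as many estimators, half the learning rate
n_estimators, eta = scale_boosting_params(n_estimators, eta)
print(n_estimators, eta)
```

The returned values would then be passed to whichever boosting implementation is in use, and the step repeated for as long as validation scores keep improving.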

**Residual based boosting:**

<img src='https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLocciBK8NUcR_nuSs4p0F_v4bhS5zUTiK0FX87-hm7q0UuYO9&s'>

<img src='https://storage.googleapis.com/groundai-web-prod/media%2Fusers%2Fuser_14%2Fproject_400420%2Fimages%2Fx1.png'>

* Learning rate (or shrinkage or eta)

* Number of estimators

* Row (sub) sampling

* Column (sub) sampling

* Input model - better be trees

* Sub boosting type:
   
   * Fully gradient based
   
   * DART (boosting with dropout of trees)
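
To make the roles of the learning rate, the number of estimators and the tree-based input model concrete, here is a from-scratch sketch of residual-based boosting for squared error. It is an illustration under simplifying assumptions, not how XGBoost or LightGBM are implemented: the input model is a depth-1 tree (stump) on a single feature, row and column sampling are left out, and all function names and the toy data are made up for this example.

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree on one feature by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides stay non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return [left_mean if xi <= threshold else right_mean for xi in x]


def fit_boosting(x, y, n_estimators=100, eta=0.1):
    """Each new stump is fitted to the residuals of the current ensemble."""
    base = sum(y) / len(y)  # initial prediction: the mean of the target
    prediction = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # shrink the stump's contribution by the learning rate eta
        update = predict_stump(stump, x)
        prediction = [pi + eta * ui for pi, ui in zip(prediction, update)]
    return {"base": base, "eta": eta, "stumps": stumps}


def predict_boosting(model, x):
    prediction = [model["base"]] * len(x)
    for stump in model["stumps"]:
        update = predict_stump(stump, x)
        prediction = [p + model["eta"] * u for p, u in zip(prediction, update)]
    return prediction


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (assumption): one feature, roughly linear target
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1]

model = fit_boosting(x, y, n_estimators=100, eta=0.1)
baseline_error = mse(y, [model["base"]] * len(y))
boosted_error = mse(y, predict_boosting(model, x))
print(baseline_error, boosted_error)
```

The training error of the boosted ensemble falls below that of the constant baseline. A smaller `eta` slows that decrease, which is why it is usually paired with more estimators.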

**Residual based favourite implementations**

* XGBoost

* LightGBM

* H2O's GBM

* Catboost

* Sklearn's GBM

# **Stacking**

**What is Stacking?**

<img src='https://miro.medium.com/max/519/1*CixzyDU7lptMbXUXNEZEeA.png'>

<img src='https://www.researchgate.net/profile/Mahsa_Soufineyestani/publication/326224119/figure/download/fig1/AS:645263400112129@1530854191613/Stacking-model-1-S1.png'>

**Methodology:**

1. Split the train set into 2 disjoint sets

2. Train several base learners on the first part

3. Make predictions with the base learners on the second (validation) part

4. Use the predictions from the base learners as the input to train a higher-level learner

In [None]:
# example
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split

# train is the training data
# y is the target variable for the train data
# test is the test data

# Split train data in 2 parts, training and validation
training, valid, ytraining, yvalid = train_test_split(train,y,test_size=0.5)

# specify models
model1 = RandomForestRegressor()
model2 = LinearRegression()

# fit models
model1.fit(training, ytraining)
model2.fit(training, ytraining)

# make predictions for validation
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)

# make predictions for test data
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

# form new datasets for valid and test by stacking the predictions
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

# specify meta model
meta_model = LinearRegression()
# fit meta model on stacked predictions
meta_model.fit(stacked_predictions, yvalid)

# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)

# **StackNET**

<img src='https://zhangruochi.com/Ensemble-Methods/2019/07/17/18.png'>

<img src='https://www.researchgate.net/publication/332463178/figure/fig1/AS:748521477140480@1555472835085/The-architecture-of-proposed-StackNet-For-the-ensemble-based-regressor-the-number-of.ppm'>

# **Tips and tricks**

**1st level tips**

* Diversity based on algorithms:

    * 2-3 gradient boosted trees (lightgbm, xgboost, H2O, catboost)
    
    * 2-3 Neural nets (keras, pytorch)
    
    * 1-2 ExtraTrees/Random Forest (sklearn)
    
    * 1-2 linear models as in logistic/ridge regression, linear svm (sklearn)
    
    * 1-2 knn models (sklearn)
    
    * 1 Factorization machine (libfm)
    
    * 1 SVM with nonlinear kernel if size/memory allows (sklearn)
    
* Diversity based on input data:

    * Categorical features: One hot, label encoding, target encoding
    
    * Numerical features: outliers, binning, derivatives, percentiles...
    
    * Interactions: col1 `*`, `/`, `+`, `-` col2; groupby; unsupervised

**Subsequent level tips**

* Simpler (or shallower) algorithms:

    * gradient boosted trees with small depth (like 2 or 3)
    
    * Linear models with high regularization
    
    * Extra Trees
    
    * Shallow networks (as in 1 hidden layer)
    
    * knn with Bray-Curtis distance
    
    * Brute forcing a search for best linear weights based on cv
    
* Feature engineering:

    * pairwise differences between meta features
    
    * row-wise statistics like averages or stds
    
    * standard feature selection techniques
    
* For every 7.5 models in the previous level, add 1 model in the meta level

* Be mindful of target leakage
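
The tip about brute-forcing linear weights can be sketched as a grid search over a single blending weight. This is a minimal illustration, assuming two models whose out-of-fold predictions are already available; the function name `best_blend_weight` and the toy numbers are hypothetical.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def best_blend_weight(y_true, preds_a, preds_b, steps=100):
    """Try w = 0, 1/steps, ..., 1 and keep the w with the lowest error
    for the blend w * preds_a + (1 - w) * preds_b."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]
        err = mse(y_true, blended)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Toy out-of-fold predictions (assumption): one model overshoots, the other undershoots
y_true = [1.0, 2.0, 3.0, 4.0]
oof_a = [1.5, 2.5, 3.5, 4.5]
oof_b = [0.5, 1.5, 2.5, 3.5]

best_w, best_err = best_blend_weight(y_true, oof_a, oof_b)
print(best_w, best_err)  # 0.5 0.0
```

Scoring the weights on out-of-fold predictions rather than on in-sample fits is what keeps this search from leaking the target.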

**Software for Stacking**

* StackNet (https://github.com/kaz-Anova/StackNet)

There are a number of ways to validate second-level models (meta-models). This reading describes the most popular ones. Unless specified otherwise, we assume that the data does not have a time component. We also assume that the hyperparameters of the first-level models (models) have already been validated and fixed.

a) Simple holdout scheme

1. Split train data into three parts: partA, partB and partC.
2. Fit N diverse models on partA, predict for partB, partC and test_data, getting meta-features partB_meta, partC_meta and test_meta respectively.
3. Fit a meta-model to partB_meta while validating its hyperparameters on partC_meta.
4. When the meta-model is validated, fit it to [partB_meta, partC_meta] and predict for test_meta.
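
The simple holdout scheme can be sketched with scikit-learn as follows. This is an illustration under assumptions: the data is synthetic, the two first-level models and the Ridge meta-model are arbitrary choices, and `alpha` stands in for whatever meta-model hyperparameters are being validated.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for train data and test_data (assumption)
X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=100, random_state=0)

# 1. Split train data into partA, partB and partC
X_a, X_bc, y_a, y_bc = train_test_split(X_train, y_train, test_size=0.5, random_state=0)
X_b, X_c, y_b, y_c = train_test_split(X_bc, y_bc, test_size=0.5, random_state=0)

# 2. Fit the first-level models on partA and build the meta-features
models = [RandomForestRegressor(n_estimators=50, random_state=0), LinearRegression()]
for model in models:
    model.fit(X_a, y_a)


def meta_features(X_part):
    """One column of predictions per first-level model."""
    return np.column_stack([model.predict(X_part) for model in models])


partB_meta, partC_meta, test_meta = (meta_features(p) for p in (X_b, X_c, X_test))

# 3. Fit the meta-model on partB_meta, validate its hyperparameter on partC_meta
best_alpha, best_score = None, float("inf")
for alpha in (0.01, 1.0, 100.0):
    candidate = Ridge(alpha=alpha).fit(partB_meta, y_b)
    score = mean_squared_error(y_c, candidate.predict(partC_meta))
    if score < best_score:
        best_alpha, best_score = alpha, score

# 4. Refit on [partB_meta, partC_meta] and predict for test_meta
meta_model = Ridge(alpha=best_alpha)
meta_model.fit(np.vstack([partB_meta, partC_meta]), np.concatenate([y_b, y_c]))
final_predictions = meta_model.predict(test_meta)
```

The cost of this scheme is visible in the split sizes: the first-level models only ever see half of the train data.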

b) Meta holdout scheme with OOF meta-features

1. Split train data into K folds. Iterate through each fold: retrain N diverse models on all folds except the current fold, predict for the current fold. After this step, for each object in train_data we will have N meta-features (also known as out-of-fold predictions, OOF). Let's call them train_meta.
2. Fit models to the whole train data and predict for test data. Let's call these features test_meta.
3. Split train_meta into two parts: train_metaA and train_metaB. Fit a meta-model to train_metaA while validating its hyperparameters on train_metaB.
4. When the meta-model is validated, fit it to train_meta and predict for test_meta.
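
A sketch of the out-of-fold procedure, written as an explicit loop so that each step stays visible. The synthetic data, the choice of models and the `alpha` grid are assumptions made for illustration only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-ins for train data and test data (assumption)
X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=1)
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=100, random_state=1)

models = [RandomForestRegressor(n_estimators=30, random_state=1), LinearRegression()]

# 1. Out-of-fold predictions: every train row is predicted by models that never saw it
train_meta = np.zeros((X_train.shape[0], len(models)))
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for fit_idx, oof_idx in kfold.split(X_train):
    for j, model in enumerate(models):
        fold_model = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        train_meta[oof_idx, j] = fold_model.predict(X_train[oof_idx])

# 2. Fit the models to the whole train data and predict for test data
test_meta = np.column_stack(
    [clone(model).fit(X_train, y_train).predict(X_test) for model in models]
)

# 3. Split train_meta into two parts and validate the meta-model hyperparameter
meta_a, meta_b, y_a, y_b = train_test_split(
    train_meta, y_train, test_size=0.3, random_state=1
)
scores = {
    alpha: mean_squared_error(y_b, Ridge(alpha=alpha).fit(meta_a, y_a).predict(meta_b))
    for alpha in (0.01, 1.0, 100.0)
}
best_alpha = min(scores, key=scores.get)

# 4. Fit the validated meta-model to all of train_meta and predict for test_meta
final_predictions = Ridge(alpha=best_alpha).fit(train_meta, y_train).predict(test_meta)
```

scikit-learn's `cross_val_predict` produces the same kind of out-of-fold array as step 1 in a single call.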

c) Meta KFold scheme with OOF meta-features

1. Obtain OOF predictions train_meta and test meta-features test_meta using b.1 and b.2.
2. Use a KFold scheme on train_meta to validate hyperparameters for the meta-model. A common practice is to fix the seed for this KFold to be the same as the seed for the KFold used to get the OOF predictions.
3. When the meta-model is validated, fit it to train_meta and predict for test_meta.
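
One way to sketch this scheme is with scikit-learn's `GridSearchCV`, which runs the KFold validation and then refits the best meta-model on all of train_meta. The meta-features below are synthetic placeholders (noisy copies of the target), and `SEED` is assumed to match the seed of the KFold that produced the real OOF predictions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

SEED = 42  # assumed to equal the seed of the KFold used for the OOF predictions

# Placeholder meta-features (assumption): three noisy copies of the target
rng = np.random.default_rng(SEED)
y_train = rng.normal(size=200)
y_test = rng.normal(size=50)
train_meta = np.column_stack(
    [y_train + rng.normal(scale=s, size=200) for s in (0.3, 0.5, 0.8)]
)
test_meta = np.column_stack(
    [y_test + rng.normal(scale=s, size=50) for s in (0.3, 0.5, 0.8)]
)

# KFold validation of the meta-model hyperparameters on train_meta
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 1.0, 100.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=SEED),
    scoring="neg_mean_squared_error",
)
search.fit(train_meta, y_train)  # refit=True by default: best model is refit on all rows

# The refitted meta-model predicts for test_meta
final_predictions = search.predict(test_meta)
print(search.best_params_)
```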

d) Holdout scheme with OOF meta-features

1. Split train data into two parts: partA and partB.
2. Split partA into K folds. Iterate through each fold: retrain N diverse models on all folds except the current fold, predict for the current fold. After this step, for each object in partA we will have N meta-features (also known as out-of-fold predictions, OOF). Let's call them partA_meta.
3. Fit models to the whole partA and predict for partB and test_data, getting partB_meta and test_meta respectively.
4. Fit a meta-model to partA_meta, using partB_meta to validate its hyperparameters.
5. When the meta-model is validated, repeat steps 2 and 3 without dividing train_data into parts, and then train the meta-model. That is, first get out-of-fold predictions train_meta for train_data using the models. Then train the models on train_data and predict for test_data, getting test_meta. Finally, train the meta-model on train_meta and predict for test_meta.

e) KFold scheme with OOF meta-features

1. To validate the model, do d.1 to d.4, but divide the train data into partA and partB M times using a KFold strategy with M folds.
2. When the meta-model is validated, do d.5.

**Validation in the presence of a time component**

f) KFold scheme in time series

In a time-series task we usually have a fixed period of time that we are asked to predict, such as a day, a week, a month or an arbitrary period of duration T.

1. Split the train data into chunks of duration T. Select the first M chunks.
2. Fit N diverse models on those M chunks and predict for chunk M+1. Then fit those models on the first M+1 chunks and predict for chunk M+2, and so on, until you hit the end. After that, use all train data to fit the models and get predictions for test. Now we have meta-features for the chunks starting from number M+1, as well as meta-features for the test.
3. Now we can use the meta-features from the first K chunks [M+1, M+2, ..., M+K] to fit level 2 models and validate them on chunk M+K+1. Essentially we are back to step 1, with fewer chunks and with meta-features instead of features.
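
The chunk bookkeeping in this scheme is easy to get wrong, so here is a small sketch that only enumerates which chunks are used for fitting and which one is predicted at each step. The function name `expanding_window_splits` and the values of M and K are hypothetical; chunks are numbered from 1, as in the text.

```python
def expanding_window_splits(n_chunks, m):
    """Yield (fit_chunks, predict_chunk): fit on chunks 1..k, predict chunk k+1,
    starting with k = m and growing the window until the last chunk."""
    for k in range(m, n_chunks):
        yield list(range(1, k + 1)), k + 1


# Level 1: 6 chunks of duration T in train, the first M = 3 used for the initial fit
splits = list(expanding_window_splits(n_chunks=6, m=3))
for fit_chunks, predict_chunk in splits:
    print("fit on", fit_chunks, "-> predict chunk", predict_chunk)

# Chunks that now carry meta-features: M+1 onwards
meta_chunks = [predict_chunk for _, predict_chunk in splits]

# Level 2: fit on the first K meta chunks, validate on the next one
K = 2
level2_fit_chunks, level2_valid_chunk = meta_chunks[:K], meta_chunks[K]
print("level 2: fit on", level2_fit_chunks, "-> validate on", level2_valid_chunk)
```

With these numbers, level 1 yields meta-features for chunks 4, 5 and 6, and the level 2 model is fitted on chunks 4 and 5 and validated on chunk 6. scikit-learn's `TimeSeriesSplit` offers a similar expanding-window split over row indices.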

g) KFold scheme in time series with limited amount of data

We may often encounter a situation where scheme f) is not applicable, especially with a limited amount of data. For example, when we have only the years 2014, 2015 and 2016 in train and we need to predict for the whole year 2017 in test. In such cases scheme c) could be of help, but with one constraint: the KFold split should be done with respect to the time component. For example, in the case of data spanning several years, we would treat each year as a fold.
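
Treating each year as a fold amounts to holding out one year at a time when building the OOF predictions. A minimal sketch, assuming a list with one year per row; the helper `year_folds` is hypothetical.

```python
def year_folds(years):
    """Yield (held_out_year, fit_idx, oof_idx), holding out one year per fold."""
    for held_out in sorted(set(years)):
        fit_idx = [i for i, year in enumerate(years) if year != held_out]
        oof_idx = [i for i, year in enumerate(years) if year == held_out]
        yield held_out, fit_idx, oof_idx


# Toy example (assumption): two rows per year
years = [2014, 2014, 2015, 2015, 2016, 2016]
folds = list(year_folds(years))
for held_out, fit_idx, oof_idx in folds:
    print(held_out, "fit on rows", fit_idx, "predict rows", oof_idx)
```

In scikit-learn the same split can be obtained with `LeaveOneGroupOut` by passing the year column as `groups`.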