# Ensembling


## Definition

Combining different machine learning models to get a better prediction.

## Methods

- Averaging (or blending)
- Weighted averaging
- Conditional averaging
- Bagging 
- Boosting
- Stacking
- StackNet

## Averaging


![image.png](images/ensemble_averaging_1.png)
![image.png](images/ensemble_averaging_2.png)
![image.png](images/ensemble_averaging_3.png)


#### Caution: This example shows how it works in theory, but there is no way to find the `if age < 50` rule without using the real target values.
![image.png](images/ensemble_averaging_4.png)
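
The three averaging variants can be sketched in a few lines of NumPy. All numbers below are made up for illustration, and the `age < 50` threshold is simply assumed; as the caution above says, in practice such a rule cannot be read off without the target.

```python
import numpy as np

# Hypothetical predictions of two models on the same five rows.
pred_1 = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
pred_2 = np.array([14.0, 18.0, 34.0, 36.0, 54.0])
age = np.array([25, 35, 45, 55, 65])  # hypothetical feature column

# Simple averaging: every model gets the same weight.
simple_avg = (pred_1 + pred_2) / 2

# Weighted averaging: the weights sum to 1 (values assumed here,
# normally they would be tuned on validation data).
weighted_avg = 0.7 * pred_1 + 0.3 * pred_2

# Conditional averaging: choose a model per row based on a feature.
conditional = np.where(age < 50, pred_1, pred_2)

print(simple_avg, weighted_avg, conditional)
```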

## Bagging

### Definition

Bagging means averaging slightly different versions of the same model to improve accuracy.

### Why bagging 

There are 2 main sources of errors in modeling:
1. Errors due to Bias (underfitting)
2. Errors due to Variance (overfitting)

Bagging mainly targets the variance part: averaging many slightly different models, each of which may overfit on its own, gives a more stable prediction.

Random Forest is a bagged ensemble of decision trees.

### Parameters that control bagging

- Changing the seed
- Row (sub)sampling or bootstrapping
- Shuffling (for some models the order of rows matters)
- Column (sub)sampling
- Model-specific parameters (like regularization parameter for linear models)
- Number of models (or bags)
- (Optionally) parallelism. Bags are independent of each other.

![image.png](images/example_bagging.png)
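
A minimal sketch of bagging by hand, assuming scikit-learn is available and using synthetic data. The model choice (a decision tree), the number of bags and the seeds are illustrative assumptions; only the seed and the bootstrap row sample change between bags.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

n_bags = 10
rng = np.random.default_rng(0)
bag_preds = []
for seed in range(n_bags):
    # Bootstrapping: sample rows with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    model = DecisionTreeRegressor(random_state=seed)
    model.fit(X_train[rows], y_train[rows])
    bag_preds.append(model.predict(X_test))

# The bagged prediction is the plain average over all bags.
bagged_pred = np.mean(bag_preds, axis=0)

single_mse = np.mean([np.mean((p - y_test) ** 2) for p in bag_preds])
bagged_mse = np.mean((bagged_pred - y_test) ** 2)
print(f"average single-tree MSE: {single_mse:.1f}, bagged MSE: {bagged_mse:.1f}")
```

Because the bags are independent, the loop body could also run in parallel.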

## Boosting

### Definition

A form of weighted averaging of models where each model is built sequentially, taking into account the performance of the previous models.

### Main boosting types

- Weight based
- Residual based

### Weight based boosting

A weight is calculated for each row from the absolute error of the current model's prediction, and the next model uses these weights to build its training sample. For example, if a row has weight=2, that row appears twice in the next model's sample. In general, each sequential model concentrates on the training rows where the previous model performed poorly.

Example provided below:

![image.png](images/wbb_1.png)

![image.png](images/wbb_2.png)

![image.png](images/wbb_3.png)

![image.png](images/wbb_4.png)

![image.png](images/wbb_5.png)

Weight based boosting parameters:

- Learning rate (or shrinkage, or eta). Choose the value through cross-validation to avoid overfitting.
- Number of estimators (if you increase the number of estimators, decrease the learning rate). Choose the value through cross-validation to avoid overfitting.
- Input model: can be anything that accepts weights.
- Sub boosting type:
    - AdaBoost - Good implementation in sklearn (Python)
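
As a rough sketch of the parameters above, here is AdaBoost from scikit-learn on synthetic data. The values of `n_estimators` and `learning_rate` are placeholders rather than recommendations; the cross-validated score is what you would compare across candidate values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Learning rate and number of estimators are the two main knobs;
# the default input model is a shallow decision tree.
model = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0)

# Cross-validation gives an estimate to compare parameter settings with.
scores = cross_val_score(model, X, y, cv=3)
print(scores.mean())
```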
    
### Residual based boosting

The idea of residual based boosting: Model_1 predicts `y`, then we compute its error (the signed error, not the absolute error, because the direction of the error matters). Model_2 is then trained to predict Model_1's error. The final prediction is Model_1 `y_pred` + Model_2 `model_1_error_pred` * learning rate (eta). We multiply by the learning rate because we don't want to over-rely on any single model. In general, `predictionN = pred0 + pred1*eta + ... + predN*eta`.

![image.png](images/rbb_1.png)

![image.png](images/rbb_2.png)

![image.png](images/rbb_3.png)

![image.png](images/rbb_4.png)
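
The update rule `predictionN = pred0 + pred1*eta + ... + predN*eta` can be written as a short loop. This is a simplified sketch on synthetic data with shallow trees from scikit-learn, not a replacement for the libraries listed further down; `eta`, the tree depth and the number of estimators are assumed values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

eta = 0.1          # learning rate (assumed value)
n_estimators = 50  # number of residual models (assumed value)

# Model 0 predicts y directly; this is pred0.
model_0 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
prediction = model_0.predict(X)
initial_mse = np.mean((y - prediction) ** 2)

models = []
for i in range(n_estimators):
    # Signed error of the current ensemble: direction matters.
    residual = y - prediction
    model = DecisionTreeRegressor(max_depth=3, random_state=i).fit(X, residual)
    # Add the new model's prediction of the error, shrunk by eta.
    prediction = prediction + eta * model.predict(X)
    models.append(model)

final_mse = np.mean((y - prediction) ** 2)
print(f"train MSE before: {initial_mse:.1f}, after: {final_mse:.1f}")
```

The errors here are measured on the training data only, so they show the mechanism rather than generalization; the learning rate and number of estimators would still be chosen by cross-validation.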

Residual based boosting parameters:

- Learning rate (or shrinkage, or eta). Choose the value through cross-validation to avoid overfitting.
- Number of estimators (if you increase the number of estimators, decrease the learning rate). Choose the value through cross-validation to avoid overfitting.
- Row (sub)sampling.
- Column (sub)sampling.
- Input model: preferably trees.
- Sub boosting type:
    - Fully gradient based (explained above).
    - Dart. Uses dropout. For example, if we have built 10 models and `dropout_rate = 0.2`, then model_11 is built using only 8 of them (20% of the existing trees are dropped at random). Works as a form of regularization. Very efficient for classification.
    
    
Favorite residual based implementations:
- XGBoost
- LightGBM
- H2O's GBM (handles categorical variables out of the box)
- CatBoost (handles categorical variables + strong initial set of parameters, meaning less tuning needed)
- Sklearn's GBM

# Stacking


### Definition

`Stacking` means making predictions with a number of models on a hold-out set and then training a different (meta) model on these predictions.


![image.png](images/stacking_1.png)

![image.png](images/stacking_2.png)

![image.png](images/stacking_3.png)

![image.png](images/stacking_4.png)

![image.png](images/stacking_5.png)

![image.png](images/stacking_6.png)


### Explanation:

`Base learners` are the models whose predictions make up the train set for the `Meta model`.

`Meta model` is a higher-level learner trained on data composed of the base learners' predictions. The prefix `meta` means `after`.

In the example below:
- `Base learners`: algorithm_0, algorithm_1, algorithm_2
- `Meta model`: algorithm_3
![image.png](images/stacking_7.png)

![image.png](images/stacking_8.png)

### Good things about Stacking

In the model averaging example, we concluded: if age<50 then use model_1, else use model_2. But there is no way to discover that rule without looking at the target data, and looking at the target defeats the purpose, because the target is what we want to predict. As a result, the model averaging example is purely theoretical and not implementable.

Stacking learns how to combine the base learners so that each one contributes where it is strong, and in the example it reaches nearly the same R^2 score. This is why stacking is very powerful and almost always worth trying.
![image.png](images/stacking_9.png)


### Things to be mindful of

- With time-sensitive data, respect time. The validation set should lie in the future relative to the train set, and the test set should lie in the future relative to the validation set.

- Diversity is as important as performance, because each model has its own strengths and finds different hidden relationships. The more diverse your base models are, the more hidden relationships you can capture, and the higher the score you can reach on the test set.

- Diversity may come from:
    - Different algorithms. Use different classes of algorithms to bring diversity. For example, use base_learner_1 = linear model to capture linear dependencies, and base_learner_2 = tree model to capture non-linear relationships.
    - Different input features. Try different subsets of columns and rows. Try different encodings for categorical variables. For example, use one-hot encoding for base_learner_1 and label encoding for base_learner_2.

- Performance plateaus after N models.

- The meta model is normally modest. Use complex models for the base learners, because they should find different patterns. Use a modest (not so complex) model for the meta model, because the patterns have already been found by the base learners and the meta model only learns how to combine them.

# StackNet

### Definition

A scalable (because base learners can be trained in parallel) meta modeling technology that utilizes stacking to combine multiple models in a neural network architecture of multiple levels.

The example illustrates how StackNet works:

![image.png](images/stacknet_1.png)

![image.png](images/stacknet_2.png)

![image.png](images/stacknet_3.png)

![image.png](images/stacknet_4.png)

![image.png](images/stacknet_5.png)

![image.png](images/stacknet_6.png)

![image.png](images/stacknet_7.png)

We do not use a hold-out set, because we would need to re-split the data on each layer. Instead we use the K-Fold method: for each fold, we train on the remaining folds, predict the held-out fold, and write those predictions into a new pred column for the rows of that fold. Once all folds have been predicted, the pred column can be used to train the meta model on the given layer.
![image.png](images/stacknet_8.png)
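
The K-Fold procedure described above can be sketched as follows, assuming scikit-learn and synthetic data. The two base models, the meta model and the number of folds are illustrative choices; `oof_preds` plays the role of the pred columns.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=0)]
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# One pred column per base model, filled in fold by fold.
oof_preds = np.full((len(X), len(base_models)), np.nan)
for train_idx, valid_idx in kfold.split(X):
    for col, model in enumerate(base_models):
        # Train on the remaining folds, predict the held-out fold.
        model.fit(X[train_idx], y[train_idx])
        oof_preds[valid_idx, col] = model.predict(X[valid_idx])

# Every row now has predictions from models that never saw it in training,
# so the pred columns can serve as the train set of the next layer.
meta_model = Ridge(alpha=1.0).fit(oof_preds, y)
print(meta_model.coef_)
```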


![image.png](images/stacknet_9.png)

![image.png](images/stacknet_10.png)

# Tips


### 1st level tips:

- Diversity based on algorithms:
    - 2-3 gradient boosted trees (LightGBM, XGBoost, H2O, CatBoost): one deep, one of medium depth, one shallow. Then tune the other parameters around those depths so that the models perform as similarly as possible.
    - 2-3 neural nets (Keras, PyTorch): one deep (3 hidden layers), one medium (2 hidden layers), one shallow (1 hidden layer). The point is to diversify and pick up new information.
    - 1-2 ExtraTrees/RandomForest (sklearn).
    - 1-2 linear models, as in logistic/ridge regression or linear SVM (sklearn).
    - 1-2 kNN models (sklearn). kNN models usually add quite a lot of value in a meta-modeling context, even though on their own they perform worse than models such as XGBoost.
    - 1 factorization machine (libfm). Factorizes all pairwise interactions.
    - 1 SVM with a non-linear kernel (RBF), if size/memory allows (sklearn).
    
- Diversity based on input data:
    - Categorical features: one-hot encoding, label encoding, target encoding.
    - Numerical features: outliers (one version with outliers handled, one without), binning (for example, one bin from x to z and another from z onward), derivatives (a way of smoothing your variables), percentiles, scaling.
    - Interactions: `col1 */+- col2`, group-by statistics (for example, the average of one group), unsupervised methods (k-means, SVD, PCA).
    
    
### Subsequent level tips

- Simpler(or shallower) Algorithms:
    - Gradient boosted trees with small depth (like 2 or 3).
    - Linear models with high regularization.
    - ExtraTrees(don't make them too big)
    - Shallow networks (as in 1 hidden layer)
    - knn with BrayCurtisDistance
    
- Feature engineering:
    - Pairwise differences between the predictions of the previous-level models. (When you create a difference, you essentially force the model to focus on what each new model brings.)
    - Row-wise statistics like averages or standard deviations.
    - Standard feature selection techniques (for example, feature importance from RandomForest/XGBoost).

- Rule of thumb: for every 7.5 models in the previous level, add 1 model in the meta level. (For example, with 7 models use 1 meta-model, with 15 models use 2 meta-models, and so on.)

- Be mindful of target leakage. This is controlled by choosing the right `k` in k-fold cross-validation. A very high `k` means that each model is trained on more of the training data when it makes its out-of-fold predictions, which exhausts the information in the training data and may not generalize well. There is no easy way to spot this mistake. Normally you have a test data set: if your cross-validation shows an improvement that you do not see on the test data, go back and try to reduce the number of folds.
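
The feature engineering tips above (pairwise differences and row-wise statistics) can be sketched with NumPy. The `meta` matrix below is made-up data standing in for the predictions of three previous-level models.

```python
import numpy as np
from itertools import combinations

# Hypothetical meta-features: 4 rows, predictions of 3 previous-level models.
meta = np.array([
    [0.2, 0.4, 0.3],
    [0.9, 0.7, 0.8],
    [0.5, 0.5, 0.2],
    [0.1, 0.3, 0.2],
])

# Pairwise differences between the models' predictions.
pairs = list(combinations(range(meta.shape[1]), 2))
pairwise_diffs = np.column_stack([meta[:, i] - meta[:, j] for i, j in pairs])

# Row-wise statistics over the models' predictions.
row_mean = meta.mean(axis=1, keepdims=True)
row_std = meta.std(axis=1, keepdims=True)

# Original predictions plus the engineered columns.
meta_extended = np.hstack([meta, pairwise_diffs, row_mean, row_std])
print(meta_extended.shape)
```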



### Software for Stacking

- StackNet (https://github.com/kaz-Anova/StackNet). It has a parameters section that is useful even outside of StackNet, for example to see which parameters are important for XGBoost (https://github.com/kaz-Anova/StackNet/blob/master/parameters/PARAMETERS.MD).
- Stacked ensembles from H2O
- Xcessiv (https://github.com/reiinakano/xcessiv)

## Validation schemes for 2nd level models

There are a number of ways to validate second level models (meta-models). In this reading material you will find a description for the most popular ones. If not specified, we assume that the data does not have a time component. We also assume we already validated and fixed hyperparameters for the first level models (models).

a) Simple holdout scheme

    1. Split train data into three parts: partA and partB and partC.
    2. Fit N diverse models on partA, predict for partB, partC, test_data getting meta-features partB_meta, partC_meta and test_meta respectively.
    3. Fit a metamodel to partB_meta while validating its hyperparameters on partC_meta.
    4. When the metamodel is validated, fit it to [partB_meta, partC_meta] and predict for test_meta.
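
Scheme a) can be sketched end to end as follows, assuming scikit-learn and synthetic data. The three base models, the Ridge metamodel and the candidate `alpha` values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration; the test target is not used.
X, y = make_regression(n_samples=600, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 1: split train data into partA, partB and partC.
X_a, X_bc, y_a, y_bc = train_test_split(X_train, y_train, test_size=0.5, random_state=0)
X_b, X_c, y_b, y_c = train_test_split(X_bc, y_bc, test_size=0.5, random_state=0)

# Step 2: fit diverse models on partA, predict partB, partC and test.
models = [Ridge(), DecisionTreeRegressor(max_depth=4, random_state=0), KNeighborsRegressor()]
for m in models:
    m.fit(X_a, y_a)

def meta_features(X_part):
    """Stack the base models' predictions as columns."""
    return np.column_stack([m.predict(X_part) for m in models])

b_meta, c_meta, test_meta = meta_features(X_b), meta_features(X_c), meta_features(X_test)

# Step 3: fit the metamodel on partB_meta, validate its hyperparameter on partC_meta.
scores = {
    alpha: mean_squared_error(y_c, Ridge(alpha=alpha).fit(b_meta, y_b).predict(c_meta))
    for alpha in (0.1, 1.0, 10.0)
}
best_alpha = min(scores, key=scores.get)

# Step 4: refit on [partB_meta, partC_meta] and predict for test_meta.
meta_model = Ridge(alpha=best_alpha).fit(
    np.vstack([b_meta, c_meta]), np.concatenate([y_b, y_c])
)
test_pred = meta_model.predict(test_meta)
```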

b) Meta holdout scheme with OOF meta-features

    1. Split train data into K folds. Iterate through each fold: retrain N diverse models on all folds except current fold, predict for the current fold. After this step for each object in train_data we will have N meta-features (also known as out-of-fold predictions, OOF). Let's call them train_meta.
    2. Fit models to whole train data and predict for test data. Let's call these features test_meta.
    3. Split train_meta into two parts: train_metaA and train_metaB. Fit a meta-model to train_metaA while validating its hyperparameters on train_metaB.
    4. When the meta-model is validated, fit it to train_meta and predict for test_meta.

c) Meta KFold scheme with OOF meta-features

    1. Obtain OOF predictions train_meta and test metafeatures test_meta using b.1 and b.2.
    2. Use KFold scheme on train_meta to validate hyperparameters for meta-model. A common practice is to fix the seed for this KFold to be the same as the seed for the KFold used to get OOF predictions.
    3. When the meta-model is validated, fit it to train_meta and predict for test_meta.
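
A compact sketch of scheme c), assuming scikit-learn and synthetic data. `cross_val_predict` is used here as a shortcut for the fold loop of b.1, and the same `KFold` object (same seed) is reused for validating the meta-model; the base models and `alpha` candidates are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import (
    KFold, cross_val_predict, cross_val_score, train_test_split
)
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration; the test target is not used.
X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=0.25, random_state=0)

models = [Ridge(), DecisionTreeRegressor(max_depth=4, random_state=0)]
kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # same seed for both uses

# c.1 (b.1): out-of-fold predictions for every train row.
train_meta = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=kfold) for m in models]
)
# c.1 (b.2): fit on the whole train data, predict for test.
test_meta = np.column_stack(
    [clone(m).fit(X_train, y_train).predict(X_test) for m in models]
)

# c.2: validate the meta-model's hyperparameter with KFold on train_meta.
cv_scores = {
    alpha: cross_val_score(
        Ridge(alpha=alpha), train_meta, y_train,
        cv=kfold, scoring="neg_mean_squared_error",
    ).mean()
    for alpha in (0.1, 1.0, 10.0)
}
best_alpha = max(cv_scores, key=cv_scores.get)

# c.3: fit the validated meta-model on train_meta, predict for test_meta.
test_pred = Ridge(alpha=best_alpha).fit(train_meta, y_train).predict(test_meta)
```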

d) Holdout scheme with OOF meta-features

    1. Split train data into two parts: partA and partB.
    2. Split partA into K folds. Iterate through each fold: retrain N diverse models on all folds except current fold, predict for the current fold. After this step for each object in partA we will have N meta-features (also known as out-of-fold predictions, OOF). Let's call them partA_meta.
    3. Fit models to whole partA and predict for partB and test_data, getting partB_meta and test_meta respectively.
    4. Fit a meta-model to partA_meta, using partB_meta to validate its hyperparameters.
    5. When the meta-model is validated basically do 2. and 3. without dividing train_data into parts and then train a meta-model. That is, first get out-of-fold predictions train_meta for the train_data using models. Then train models on train_data, predict for test_data, getting test_meta. Train meta-model on the train_meta and predict for test_meta.

e) KFold scheme with OOF meta-features

    1. To validate the model we basically do d.1 -- d.4 but we divide train data into parts partA and partB M times using KFold strategy with M folds.
    2. When the meta-model is validated do d.5.

##### Validation in presence of time component

f) KFold scheme in time series

In a time-series task we usually have a fixed period of time we are asked to predict, such as a day, a week, a month or an arbitrary period of duration T.

    1. Split the train data into chunks of duration T. Select first M chunks.
    2. Fit N diverse models on those M chunks and predict for the chunk M+1. Then fit those models on first M+1 chunks and predict for chunk M+2 and so on, until you hit the end. After that use all train data to fit models and get predictions for test. Now we will have meta-features for the chunks starting from number M+1 as well as meta-features for the test.
    3. Now we can use meta-features from first K chunks [M+1,M+2,..,M+K] to fit level 2 models and validate them on chunk M+K+1. Essentially we are back to step 1. with fewer chunks and meta-features instead of features.
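
Scheme f) can be sketched with an expanding window over chunks. The data, the chunk size and the value of `M` below are made up, and chunks are numbered from 0 in the code, so "first M chunks" means chunks `0..M-1`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Hypothetical time-ordered data: 8 chunks of duration T, 30 rows each.
rng = np.random.default_rng(0)
n_chunks, chunk_size = 8, 30
X = rng.normal(size=(n_chunks * chunk_size, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=len(X))
chunk_id = np.repeat(np.arange(n_chunks), chunk_size)

M = 3  # the first M chunks are only ever used for training
models = [Ridge(), DecisionTreeRegressor(max_depth=3, random_state=0)]

# Step 2: fit on all chunks before the target chunk, predict the target chunk.
meta = np.full((len(X), len(models)), np.nan)
for target_chunk in range(M, n_chunks):
    past = chunk_id < target_chunk
    current = chunk_id == target_chunk
    for col, m in enumerate(models):
        meta[current, col] = m.fit(X[past], y[past]).predict(X[current])

# Step 3: fit the level 2 model on the earlier chunks that have meta-features
# and validate it on the last chunk, which lies in their future.
fit_rows = (chunk_id >= M) & (chunk_id < n_chunks - 1)
valid_rows = chunk_id == n_chunks - 1
meta_model = Ridge().fit(meta[fit_rows], y[fit_rows])
valid_mse = np.mean((meta_model.predict(meta[valid_rows]) - y[valid_rows]) ** 2)
print(valid_mse)
```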

g) KFold scheme in time series with limited amount of data

We may often encounter a situation where scheme f) is not applicable, especially with a limited amount of data. For example, when we have only years 2014, 2015, 2016 in train and we need to predict for the whole year 2017 in test. In such cases scheme c) could be of help, but with one constraint: the KFold split should be done with respect to the time component. For example, in case of data with several years we would treat each year as a fold.
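
One way to treat each year as a fold is scikit-learn's `LeaveOneGroupOut` with the year as the group label. This is a sketch on made-up data; the choice of splitter and of a single Ridge base model are assumptions, not something prescribed by the scheme.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

# Hypothetical data: 50 rows for each of the years 2014, 2015, 2016.
rng = np.random.default_rng(0)
years = np.repeat([2014, 2015, 2016], 50)
X = rng.normal(size=(len(years), 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=len(years))

# Each year is held out once while the model is fit on the other years.
folds = LeaveOneGroupOut()
n_folds = folds.get_n_splits(groups=years)

# Out-of-fold meta-feature for one base model, split by year.
train_meta = cross_val_predict(Ridge(), X, y, cv=folds, groups=years)
print(n_folds, train_meta.shape)
```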

# Implementation

Check out the implementation in the Week 4 assignment here: https://github.com/Brandon-HY-Lin/coursera_How-to-Win-a-Data-Science-Competition-Learn-from-Top-Kagglers



# Reading

Kaggle ensembling guide at MLWave.com (overview of approaches): https://mlwave.com/kaggle-ensembling-guide/