[Please follow this.](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2)

A collection of several models working together on the same dataset is called an ensemble.

# Voting

1. use several classification or regression models to make predictions, then select the final output with some voting scheme

## Hard Voting
1. the final prediction is the class that receives the most votes from the models in the ensemble (a simple majority vote).

## Soft Voting
1. averages the class probabilities calculated by the individual models and predicts the class with the highest average probability.

2. can only be done when all your classifiers can calculate probabilities for the outcomes.
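The difference between the two schemes shows up when one model is far more confident than the others. The sketch below uses made-up probabilities for three hypothetical classifiers (`probas` and all numbers are illustrative assumptions, not outputs of real models).

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for ONE sample.
# Each inner list is [P(class 0), P(class 1)]; the numbers are made up.
probas = [
    [0.45, 0.55],  # classifier A leans slightly towards class 1
    [0.40, 0.60],  # classifier B leans slightly towards class 1
    [0.95, 0.05],  # classifier C is very confident in class 0
]
n_classes = len(probas[0])

# Hard voting: each classifier casts one vote for its most likely class.
votes = [max(range(n_classes), key=lambda c: p[c]) for p in probas]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then pick the largest.
avg_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = max(range(n_classes), key=lambda c: avg_proba[c])

print("votes:", votes)
print("hard voting predicts class", hard_prediction)
print("averaged probabilities:", [round(a, 3) for a in avg_proba])
print("soft voting predicts class", soft_prediction)
```

Here two of the three models vote for class 1, so hard voting picks class 1, while the averaged probabilities favour class 0 because the third model is very confident.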

An analogy

The picture below shows a group of blindfolded children playing the game of “Touch and tell” while examining an elephant, which none of them has ever seen before. Each of them will have a different idea of what an elephant looks like, because each one touches a different part of the animal. If we ask them to submit a report describing the elephant, each individual report will describe only one part accurately, but together they can combine their observations into a very accurate description.
Similarly, ensemble learning methods employ a group of models whose combined result is often better, in terms of prediction accuracy, than that of any single model.

<img src="elephant-ensemble.png" />

In [1]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
	estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc',svm_clf)],
	voting = 'hard')

* The accuracy of the `VotingClassifier` is often higher than that of the individual classifiers, though this is not guaranteed.

* Make sure to include diverse classifiers: models that fall prey to similar types of errors will reinforce those errors instead of cancelling them out.

# Simple Ensemble Techniques

## Taking MODE of results
1. most frequently occurring output from all regressors/classifiers

2. can be used for both regression and classification problems, but is more suitable for the latter.

## Taking MEAN of results
1. take an average of predictions from all the models and use it to make the final prediction.

2. Naturally possible for regression-type problems.

## Taking WEIGHTED MEAN of the results

1. All models are assigned different weights defining the importance of each model for prediction.
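A minimal sketch of the three simple techniques, using made-up predictions from three hypothetical models for a single sample (the values and the `weights` are assumptions chosen for illustration):

```python
from statistics import mean, mode

# Hypothetical outputs of three models for a single sample.
class_predictions = ["cat", "dog", "cat"]  # from three classifiers
value_predictions = [10.0, 12.0, 17.0]     # from three regressors
weights = [0.5, 0.3, 0.2]                  # assumed importance of each regressor

# MODE: the most frequently predicted label.
mode_result = mode(class_predictions)

# MEAN: plain average of the numeric predictions.
mean_result = mean(value_predictions)

# WEIGHTED MEAN: each prediction contributes in proportion to its weight.
weighted_mean_result = (
    sum(w * v for w, v in zip(weights, value_predictions)) / sum(weights)
)

print("mode:", mode_result)
print("mean:", mean_result)
print("weighted mean:", round(weighted_mean_result, 3))
```

With equal weights the weighted mean reduces to the plain mean; here the first model is trusted most, so the result is pulled towards its prediction.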

# Advanced Ensemble Techniques

## Bagging
1. bootstrap aggregation

2. First, we create random samples of the training data set with replacement (we *aren't creating new samples*, we are rather bucketing samples from the main training dataset into smaller subsets).
    
    * since this is with replacement, different classifiers may end up being trained on overlapping training samples.

3. Then, we build a model (classifier or Decision tree) for each sample. 

4. Finally, results of these multiple models are combined using average or majority voting.

5. Because each model is exposed to a **different subset of data** and we use their collective output at the end, the ensemble does not cling too closely to the training data set, which helps against overfitting.

6. Thus, Bagging helps us to reduce the variance error.

7. Combining multiple models decreases variance, especially in the case of unstable models, and may produce a more reliable prediction than a single model.
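The variance argument can be checked with a small simulation. This is a sketch under a strong assumption: each "model" is just the true value plus independent Gaussian noise (`noisy_model_prediction` is a hypothetical stand-in, not a trained model). Real bagged models are trained on overlapping samples, so their errors are correlated and the reduction is smaller than shown here.

```python
import random
from statistics import mean, pvariance

random.seed(0)

TRUE_VALUE = 5.0
N_MODELS = 25    # size of the ensemble
N_TRIALS = 2000  # how many times the experiment is repeated


def noisy_model_prediction():
    # Stand-in for one unstable model: the truth plus unit-variance noise.
    return TRUE_VALUE + random.gauss(0.0, 1.0)


# Predictions of a single model across many repetitions.
single_predictions = [noisy_model_prediction() for _ in range(N_TRIALS)]

# Predictions of an ensemble that averages N_MODELS independent models.
ensemble_predictions = [
    mean(noisy_model_prediction() for _ in range(N_MODELS))
    for _ in range(N_TRIALS)
]

var_single = pvariance(single_predictions)
var_ensemble = pvariance(ensemble_predictions)

print("variance of a single model:", round(var_single, 3))
print("variance of the averaged ensemble:", round(var_ensemble, 3))
```

Under the independence assumption the variance of the average is roughly the single-model variance divided by the number of models.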


## Boosting

1. iterative technique which adjusts the weight of an **observation** based on the last classification.

2. Boosting is a sequential technique in which the first algorithm is trained on the entire data set and each subsequent algorithm is built by fitting the residuals of the previous one (the samples for which the predictions of the current model were inaccurate), thus giving higher weight to those observations that were poorly predicted by the previous model.

3. If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa.

4. in general decreases the bias error and builds strong predictive models.

5. has often shown **<font color="blue">better predictive accuracy than bagging</font>**, but it also **tends to over-fit** the training data.

6. It relies on creating a series of weak learners each of which might not be good for the entire data set but is good for some part of the data set. 

7. Thus, each model actually boosts the performance of the ensemble.

8. each new model is influenced by the prediction performance of those built previously


both of these ensemble techniques use **voting** and combine models of the same type

The aggregate result of multiple models is typically less noisy than the individual models.

Ensemble models can be used to capture the linear as well as the non-linear relationships in the data. This can be accomplished by using 2 different models and forming an ensemble of the two.
 
 ## Disadvantages
 
1. Reduction in model interpretability
    1. Using ensemble methods reduces model interpretability due to increased complexity, which makes it very difficult to draw any crucial business insights at the end.

2. Computation and design time is high
    1. It is not good for real time applications.

3. The selection of models for creating an ensemble is an art which is really hard to master.

In [2]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    max_samples=100,
    bootstrap=True, # usage of Bagging, for pasting change this to False
    n_jobs=2,
    random_state=42
)

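`BaggingClassifier` automatically performs **soft voting** if the base classifier can calculate the probabilities for its predictions (i.e. it has a `predict_proba()` method).

Bagging often results in better models than Pasting, though this is not guaranteed; when compute allows, it is worth comparing both with cross-validation.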

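* When Bagging draws as many instances as the training set contains (with replacement), each predictor sees only about 63% of the distinct training instances on average; the remaining 37% or so are *out-of-bag* (OOB) instances that the predictor has not seen during training.
    * These can be used for evaluation without a separate validation set, much like Cross-Validation.
    
    * To use this functionality, simply pass `oob_score=True` to the `BaggingClassifier`; after fitting, the estimate is available as `oob_score_`.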
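The 63% figure comes from sampling with replacement: the chance that a given instance is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.37 as n grows. A small sketch that checks this empirically (the sample size of 10,000 is an arbitrary choice for illustration):

```python
import math
import random

random.seed(42)

n_samples = 10_000

# Draw n_samples indices with replacement, as one bagged predictor would
# when its bootstrap sample is as large as the training set.
bootstrap_indices = [random.randrange(n_samples) for _ in range(n_samples)]

# Fraction of distinct training instances that made it into the sample.
in_bag_fraction = len(set(bootstrap_indices)) / n_samples
oob_fraction = 1.0 - in_bag_fraction

# Theoretical value: 1 - (1 - 1/n)^n, which tends to 1 - 1/e.
theoretical_in_bag = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples

print("in-bag fraction (simulated):", round(in_bag_fraction, 3))
print("out-of-bag fraction (simulated):", round(oob_fraction, 3))
print("in-bag fraction (theory):", round(theoretical_in_bag, 3))
print("limit 1 - 1/e:", round(1.0 - math.exp(-1.0), 3))
```

If `max_samples` is set lower than the training-set size, each predictor sees a smaller fraction and the out-of-bag share grows accordingly.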

# Random patches and random subspaces

1. All ensemble techniques up until now sampled only the training instances, but kept all the features.

2. Random Patches samples both training instances and features (out of d features, k are chosen at random, just as the training instances are chosen at random)

3. Random Subspaces keeps all the instances but samples features.

In [3]:
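# Random Patches: sample both the training instances and the features
patchedBagClf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    max_samples=0.6,          # each predictor sees 60% of the instances
    bootstrap=True,           # usage of Bagging, for pasting change this to False
    max_features=0.6,         # each predictor sees 60% of the features
    bootstrap_features=True,  # sample features with replacement
    n_jobs=2,
    random_state=42
)

# Random Subspaces: keep all training instances, sample only the features
subspaceBagClf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    max_samples=1.0,          # keep every training instance
    bootstrap=False,          # no resampling of instances
    max_features=0.6,
    bootstrap_features=True,
    n_jobs=2,
    random_state=42
)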

# Random Forest

1. all models in the ensemble are decision trees

2. if the trees are created to their complete depth
    1. low bias (each individual tree fits its training sample very closely)
    
    2. low variance for the ensemble (each tree on its own is overfit due to complete depth and therefore has high variance, but each decision tree is itself an expert w.r.t. a different subset of the training dataset, so aggregating many such trees averages out their individual errors)

# AdaBoost

1. an ensemble technique based on boosting(ADAptive BOOSTing)
    1. it adjusts adaptively to the errors of the weak hypotheses

2. at the first iteration, all samples are assigned equal weights = $\frac{1}{N}$

3. we usually use decision trees as the constituents of the ensemble

4. contrary to how **full-depth** decision trees are used in random forest, in this technique they are grown only to depth 1

    1. such trees are called **stumps**
    
    2. a stump is created for **each feature**
    
    3. the stump with the **minimum gini index/entropy** will be picked as being the **1$^{\textrm{st}}$ base-learner** for this AdaBoost technique
    
    4. such one-level decision trees act as the **weak-learners** in a typical boosting ensemble technique.
    
5. for each i$^{\textrm{th}}$ base-learner, total error is calculated as the weighted average of errors for all samples (error = 0 for correctly classified, 1 for misclassified), using the current sample weights (for the first base-learner, the weights assigned in step 2).

6. performance of the stump of the current base-learner is calculated as

    P = $\frac{1}{2}\log_e\left(\frac{1-\textrm{TE}}{\textrm{TE}}\right)$, where TE = total error; P is also called the *stage value*

    
7. increase weights for incorrectly classified samples and **decrease them for correctly classified samples**:

    1. for **incorrectly classified points**, W$_{\textrm{new}}$ = W$_{\textrm{old}} e^{P}$
    
    2. for correctly classified points, W$_{\textrm{new}}$ = W$_{\textrm{old}} e^{-P}$ 
    
8. after the above weight-update step, normalize the weights so that they sum up to 1, by dividing each updated weight by the sum of all updated weights.

9. <font color="red">termination: </font>The process continues until a pre-set number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.

10. the algorithm needs no prior knowledge of the accuracies of the weak hypotheses. Rather, it adapts to these accuracies and generates a weighted majority hypothesis in which the weight of each weak hypothesis is a function of its accuracy.

11. The number of trees added to the model often needs to be high for the model to work well: frequently hundreds, if not thousands.
    1. hence the parameter `n_estimators` is often set to a value in the hundreds, such as 500
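One round of the weight update described above can be traced by hand. The sketch below assumes five samples of which the current stump misclassifies exactly one; the `correct` pattern is made up for illustration.

```python
import math

# True means the current stump classified that sample correctly.
# This outcome pattern is an assumption made for the example.
correct = [True, True, False, True, True]
n = len(correct)

# Initial weights: every sample gets 1/N.
weights = [1.0 / n] * n

# Total error: weighted sum of the 0/1 errors (weights already sum to 1).
total_error = sum(w for w, ok in zip(weights, correct) if not ok)

# Stage value: P = 1/2 * ln((1 - TE) / TE).
stage_value = 0.5 * math.log((1.0 - total_error) / total_error)

# Increase weights of misclassified samples, decrease the others.
updated = [
    w * math.exp(-stage_value if ok else stage_value)
    for w, ok in zip(weights, correct)
]

# Normalize so that the weights sum to 1 again.
total = sum(updated)
new_weights = [w / total for w in updated]

print("total error:", round(total_error, 3))
print("stage value:", round(stage_value, 4))
print("new weights:", [round(w, 3) for w in new_weights])
```

With TE = 0.2 the stage value is ln 2, so the misclassified sample's weight doubles and the others halve before normalization; afterwards the misclassified sample carries half of the total weight.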


## Data Preparation for AdaBoost
some heuristics for best preparing your data for AdaBoost.

### Quality Data: 
Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of high quality.


### Outliers
* Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. 

* These could be removed from the training dataset.

### Noisy Data

* Noisy data, specifically noise in the output variable can be problematic. 

* If possible, attempt to isolate and clean these from your training dataset.

# Gradient Boosting

1. compute a base model whose output is the average of all output quantities (this is for regression); it predicts this same value for every sample

2. compute residuals, i.e. errors, of the base model; in general these are *pseudo-residuals*, the negative gradient of the chosen loss function w.r.t. the current prediction
    
    1. for the squared-error loss, error for each sample = actual - prediction of this base model (basically actual - mean)

3. construct a decision tree, with the dependent feature being this residual error (for each sample)

    1. hence a variable which represents some kind of residual will be predicted by this decision tree
    
    2. label this as DT1
    
4. add the residual value as predicted by DT1 to the value output by the base model
    
    1. it usually turns out that this sum is very close to the actual dependent feature value of each training sample (original dataset)
    
    2. this is a typical scenario of overfitting
    
    3. at this point the combination of base model and DT1 yields a net ML model of low bias but <font color="red">high variance</font>: it reproduces the training data almost exactly, so its predictions on unseen test data tend to be poor.
    
5. hence add the value output by the base model to the product of a learning rate $\alpha$ and the residual predicted by DT1: $\bar{y} + \alpha R_{\textrm{pred}}$

6. add a new decision tree, DT2, whose dependent feature is the new residual, i.e. the actual value minus the current prediction $\bar{y} + \alpha R_{\textrm{pred}}$
    
    1. hence F(x) = h$_0$(x) + $\alpha_1$h$_1$(x) + $\alpha_2$h$_2$(x) + ...
    
7. the residuals usually keep decreasing, so the model converges after enough decision trees have been added

8. since each new tree is added iteratively to correct the errors of the current ensemble, the technique is called boosting.


## Pseudo algorithm

1. initialize model with constant value
    F$_0$(x) = argmin<font size="4">$_{\gamma}\sum\limits_{i=1}^{n}L(y_i, \gamma)$</font>
        
    1. this corresponds to the creation of a base model

    2. $\gamma$ = the value predicted by the base model; since the loss has to be minimized, we minimize it w.r.t. this predicted value

    3. here we do not simply declare the mean of all outputs to be the output of the base model; whatever value of $\gamma$ we choose, it is going to be the same for all samples.

    4. the loss function generally used is the squared error (the basis of MSE, mean squared error) \
        L(y, $\gamma = \hat{y}$) = $\sum\limits_{i=1}^{n}\frac{1}{2}\left(y_i-\hat{y}\right)^{2}$
        
    5. differentiating this w.r.t. $\gamma$ gives $-\sum\limits_{i=1}^{n}(y_i - \gamma)$; equating it to 0 shows that the optimal value of $\gamma$ is the mean over all actual values of the dependent feature.
        
2. iterate m = 1 to M, where M is the total number of trees

    1. Compute pseudo-residuals
        $r_{i,m} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$, where $L(y_i, F(x_i)) = \frac{1}{2}\left(y_i - F(x_i)\right)^{2}$ and $F(x_i) = \hat{y}_i$ is the current prediction, so that $\frac{\partial L}{\partial F(x_i)} = -(y_i - F(x_i))$
    
    2. for this loss the pseudo-residual is simply $r_{i,m} = y_i - F_{m-1}(x_i)$, i.e. the current prediction subtracted from each dependent feature value
    
    3. fit a decision tree, h$_m$(x), with the dependent feature being this difference.
    
    4. since the pseudo-residuals are the **negative gradient of the loss w.r.t. the current predictions**, it's called Gradient boosting.
    
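3. $\gamma_m$ = argmin$_{\gamma}\sum\limits_{i=1}^{n}L(y_i, F_{m-1}(x_i)+\gamma)$, where F$_{m-1}(x_i)$ is the prediction of the ensemble built so far and $\gamma$ is the correction added to it; with a decision tree this minimization is typically carried out separately for each leaf, over the samples that fall into it.

4. updating the model
    1. F$_m$(x) = F$_{m-1}$(x) + $\alpha$ h$_m$(x), where $\alpha$ is the learning rate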
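The pseudo algorithm can be sketched in plain Python for the squared-error case. This is a minimal illustration rather than a reference implementation: it assumes 1-D inputs, uses depth-1 regression trees as h$_m$, and the helper names (`fit_stump`, `gradient_boost`) and the toy data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on 1-D inputs by minimising squared error."""
    best = None
    sorted_xs = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct inputs.
    for left, right in zip(sorted_xs, sorted_xs[1:]):
        threshold = (left + right) / 2.0
        left_vals = [r for x, r in zip(xs, residuals) if x <= threshold]
        right_vals = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left_vals), mean(right_vals)
        sse = sum((r - left_mean) ** 2 for r in left_vals)
        sse += sum((r - right_mean) ** 2 for r in right_vals)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    base = mean(ys)                  # F_0: constant minimising squared error
    predictions = [base] * len(xs)   # F_{m-1}(x_i) for every training sample
    stumps = []
    for _ in range(n_trees):
        # Pseudo-residuals for squared error: y_i - F_{m-1}(x_i).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Update: F_m(x) = F_{m-1}(x) + alpha * h_m(x).
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]

    def model(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return model


# Toy data, roughly y = x with a little noise (made up for this example).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

model = gradient_boost(xs, ys)


def mse(predict):
    return mean((y - predict(x)) ** 2 for x, y in zip(xs, ys))


baseline_mse = mse(lambda x: mean(ys))  # error of the base model alone
boosted_mse = mse(model)

print("training MSE of base model:", round(baseline_mse, 4))
print("training MSE after boosting:", round(boosted_mse, 4))
```

A smaller `learning_rate` makes each tree's correction more cautious and usually calls for more trees; the errors reported here are training errors only, so they say nothing about overfitting.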

# XGBoost Classifier

1. eXtreme Gradient Boosting

2. The two reasons to use XGBoost are also the two goals of the project:
    1. Execution Speed.
    2. Model Performance.
    
3. This approach supports both regression and classification predictive modeling problems.

4. XGBoost improves upon the base GBM (gradient boosting machine) framework through systems optimization and algorithmic enhancements.