## Definition:

Ensemble methods are machine learning techniques that combine several base models in order to produce one stronger predictive model. There are many ensemble methods, but we will look at the common ones.
 
## Source:
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/ - has hyperparameter tuning for ensemble methods as well

## 1. Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80% accuracy.
You may have a Logistic Regression classifier, an SVM classifier, a Random Forest
classifier, a K-Nearest Neighbors classifier.

### Hard Voting: 
A very simple way to create an even better classifier is to aggregate the predictions of
each classifier and predict the class that gets the most votes. This majority-vote classifier
is called a hard voting classifier.

This voting classifier often achieves a higher accuracy than the
best classifier in the ensemble. In fact, even if each classifier is a <b>weak learner</b> (meaning
it does only slightly better than random guessing), the ensemble can still be a
<b>strong learner</b> (achieving high accuracy), provided there are a sufficient number of
weak learners and they are sufficiently diverse.

<img src="https://imgur.com/eAruUj3.png" width=500>
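The claim that many weak learners can form a strong learner can be checked with a little arithmetic. The sketch below assumes a binary problem and perfectly independent classifiers that are each correct with the same probability (an idealization, and the function name is made up for illustration). It computes the probability that a strict majority votes for the right class.

```python
import math

def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of independent classifiers is right."""
    log_p, log_q = math.log(p_correct), math.log(1.0 - p_correct)
    needed = n_classifiers // 2 + 1  # smallest strict majority
    total = 0.0
    for k in range(needed, n_classifiers + 1):
        # Binomial pmf evaluated in log space to avoid overflow for large n
        log_pmf = (math.lgamma(n_classifiers + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_classifiers - k + 1)
                   + k * log_p + (n_classifiers - k) * log_q)
        total += math.exp(log_pmf)
    return total

# Each classifier is only slightly better than a coin flip (51%)
for n in (1, 11, 101, 1001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The ensemble accuracy grows with the number of voters, but only because the votes were assumed to be independent.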

### Limitation of Voting classifiers:
However, the improvement described above is
only guaranteed if all classifiers are perfectly independent, making uncorrelated errors,
which is clearly not the case since they are trained on the same data. They are likely to
make the same types of errors, so there will be many majority votes for the wrong
class, reducing the ensemble’s accuracy.

Ensemble methods work best when the <b>predictors are as independent</b>
from one another as possible. One way to get diverse classifiers
is to train them using very different algorithms. This increases the
chance that they will make very different types of errors, improving
the ensemble’s accuracy.

### Soft Voting: 
If all classifiers are able to estimate class probabilities (i.e., they have a
predict_proba() method), then you can tell Scikit-Learn to predict the class with the
highest class probability, averaged over all the individual classifiers. This is called soft
voting. It often achieves higher performance than hard voting because it gives more
weight to highly confident votes.

<img src="https://iq.opengenus.org/content/images/2020/01/ud382N9.png" width=500>
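The difference between hard and soft voting is easiest to see on a single instance. The following is a minimal plain-Python sketch with made-up probabilities; the helper names are hypothetical. In scikit-learn, the same choice is exposed through the `voting="hard"` or `voting="soft"` argument of `VotingClassifier`.

```python
from collections import Counter

# Hypothetical predicted probabilities [P(class 0), P(class 1)] from three
# classifiers for one instance (made-up numbers, for illustration only).
probas = [
    [0.45, 0.55],
    [0.40, 0.60],
    [0.95, 0.05],
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities, then pick the most likely class."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

print("hard vote:", hard_vote(probas))  # two of three classifiers prefer class 1
print("soft vote:", soft_vote(probas))  # the confident third classifier tips the average to class 0
```

Here the two schemes disagree: the third classifier is very confident, and soft voting lets that confidence outweigh two lukewarm votes.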

## General Bagging and Boosting Algorithms:

Source: https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/

1. Bagging algorithms:
    - Bagging meta-estimator
    - Random forest


2. Boosting algorithms:
    - AdaBoost
    - GBM
    - XGBoost
    - LightGBM
    - CatBoost


## 2. Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms,
as just discussed. Another approach is to use the same training algorithm for every
predictor, but to train them on different random subsets of the training set. When
sampling is performed with replacement, this method is called <b>bagging</b> (short for
bootstrap aggregating). When sampling is performed without replacement, it is called
<b>pasting</b>.


## 2.1 Bagging (Bootstrap aggregating)

<b> What is a bootstrap sample? </b>

A bootstrap sample is a sample of a dataset with replacement. Replacement means that a sample drawn from the dataset is replaced, allowing it to be selected again and perhaps multiple times in the new sample. This means that the sample may have duplicate examples from the original dataset.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRrFzT1XgY4ejMjo1VPelqCwiVNA9jPBxgPn84GmbXB-IED_2YbT3V8vRF5u9kXmJtSmMvuB9c3aCY1z496P8inPEiWRsIDgaRdYA&usqp=CAU&ec=45699845" width=600>

You can see in the above image that the original dataset had only 2 purple circles, but Bootstrap sample 1 has 3 instances of purple circles. This is possible due to replacement: after a purple circle is picked, it is put back in the training set, so it can be picked again (since selection is random). Hence, a sample may contain duplicate examples from the original dataset.
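Drawing a bootstrap sample takes only a few lines of code. This is a small sketch using Python's standard library; the dataset of ten integers and the function name are placeholders for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(42)       # fixed seed so the run is repeatable
dataset = list(range(10))     # stand-in for ten training instances
sample = bootstrap_sample(dataset, rng)

print("sample:      ", sorted(sample))
print("never picked:", sorted(set(dataset) - set(sample)))
```

Because every draw is made from the full dataset, some items typically appear more than once while others are never picked.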

<b> Why is it called bootstrap aggregating?</b>

The first step is to create bootstrap samples. Then you pass each sample to its own predictor (classifier or regressor). Finally, you aggregate the predictions of the individual predictors to get the final result.

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/Bagging.png" width=500>

<b> The Aggregating Step:</b>

Once all predictors are trained, the ensemble can make a prediction for a new
instance by simply aggregating the predictions of all predictors. The aggregation
function is typically the statistical mode (i.e., the most frequent prediction, just like a
hard voting classifier) for classification, or the average for regression. Each individual
predictor has a higher bias than if it were trained on the original training set, but
aggregation reduces both bias and variance. 

The BaggingClassifier automatically performs soft voting
instead of hard voting if the base classifier can estimate class probabilities
(i.e., if it has a predict_proba() method).

### Bagging or Pasting?

Bootstrapping introduces a bit more diversity in the subsets that each predictor is
trained on, so bagging ends up with a slightly higher bias than pasting, but this also
means that predictors end up being less correlated so the ensemble’s variance is
reduced. Overall, bagging often results in better models, which explains why it is generally
preferred. However, if you have spare time and CPU power you can use cross-validation
to evaluate both bagging and pasting and select the one that works best.
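A comparison of the two along these lines might look like the sketch below. The synthetic moons dataset, the ensemble size, and `max_samples=0.5` are arbitrary choices made for illustration; `bootstrap=True` gives bagging and `bootstrap=False` gives pasting.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example is self-contained
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

def make_ensemble(bootstrap):
    # bootstrap=True samples with replacement (bagging),
    # bootstrap=False samples without replacement (pasting)
    return BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        max_samples=0.5,      # each tree sees half of the training set
        bootstrap=bootstrap,
        random_state=42,
    )

scores = {
    name: cross_val_score(make_ensemble(bootstrap), X, y, cv=5).mean()
    for name, bootstrap in [("bagging", True), ("pasting", False)]
}
print(scores)
```

Which variant scores higher depends on the dataset, so the result on this toy data should not be read as a general ranking.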

### Out of Bag (OOB) Evaluation:

With bagging, some instances may be sampled several times for any given predictor,
while others may not be sampled at all. By default a BaggingClassifier samples m
training instances with replacement (bootstrap=True), where m is the size of the
training set. This means that only about 63% of the training instances are sampled on
average for each predictor. The remaining 37% of the training instances that are not
sampled are called out-of-bag (oob) instances. Note that they are not the same 37%
for all predictors.
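The 63% / 37% split follows from the sampling itself: the chance that a given instance is missed by all m draws is (1 - 1/m)^m, which approaches 1/e ≈ 0.368 as m grows. A quick check, using only the standard library (function names are illustrative):

```python
import math
import random

def expected_oob_fraction(m):
    """Chance that one instance is never picked in m draws with replacement."""
    return (1.0 - 1.0 / m) ** m

def simulated_oob_fraction(m, rng):
    """Draw one bootstrap sample of size m and measure the left-out share."""
    drawn = {rng.randrange(m) for _ in range(m)}
    return 1.0 - len(drawn) / m

rng = random.Random(0)
for m in (10, 100, 10_000):
    print(m, round(expected_oob_fraction(m), 4), round(simulated_oob_fraction(m, rng), 4))
print("limit 1/e:", round(math.exp(-1), 4))
```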

Since a predictor never sees the oob instances during training, it can be evaluated on
these instances, without the need for a separate validation set or cross-validation. You
can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.
In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier or even for RandomForest to
request an automatic oob evaluation after training.
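As a rough illustration of oob evaluation, the sketch below trains a bagging ensemble on a synthetic dataset (an assumption made to keep the example self-contained) and compares the oob estimate with the accuracy on a held-out test set.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=200,
    bootstrap=True,     # oob evaluation requires sampling with replacement
    oob_score=True,     # request the out-of-bag estimate
    random_state=42,
)
bag_clf.fit(X_train, y_train)

print("oob accuracy: ", bag_clf.oob_score_)
print("test accuracy:", bag_clf.score(X_test, y_test))
```

The two numbers are usually close, which is what makes the oob score a convenient stand-in for a separate validation set.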

## 2.2 Random Forest

Random Forest is another ensemble machine learning algorithm that follows the <b>bagging</b> technique. It is an extension of the bagging estimator algorithm. The base estimators in random forest are decision trees. Unlike bagging meta estimator, random forest randomly selects a set of features which are used to decide the best split at each node of the decision tree.

<b> Steps:</b>

1. Random subsets are created from the original dataset (bootstrapping).
2. At each node in the decision tree, only a random set of features are considered to decide the best split.
3. A decision tree model is fitted on each of the subsets.
4. The final prediction is calculated by aggregating the predictions from all decision trees: averaging for regression, and voting (or averaging class probabilities) for classification.

The Random Forest algorithm introduces extra randomness when growing trees;
instead of searching for the very best feature when splitting a node, it
searches for the best feature among a random subset of features. This results in a
greater tree diversity, which (once again) trades a higher bias for a lower variance,
generally yielding an overall better model.
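To make the link to bagging concrete, the sketch below compares a `RandomForestClassifier` with a `BaggingClassifier` built from decision trees that also consider a random subset of features at each split. The dataset and settings are arbitrary illustrations; the two ensembles are roughly equivalent in spirit, not identical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

# Random forest: bootstrap samples plus a random feature subset at each split
rnd_clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                 random_state=0)

# Roughly the same idea assembled by hand from a bagging meta-estimator
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", random_state=0),
    n_estimators=100,
    random_state=0,
)

rf_score = cross_val_score(rnd_clf, X, y, cv=5).mean()
bag_score = cross_val_score(bag_clf, X, y, cv=5).mean()
print("random forest:  ", rf_score)
print("bagged trees:   ", bag_score)
```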

### Cost Function:

Since it is an ensemble of decision trees, the split criterion for classification is Gini impurity by default. You can choose entropy as well. (Regression forests use the squared error by default.)

### Assumptions of Random Forest:

There are no formal distributional assumptions. Random forests are non-parametric and can thus handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.

### Data preprocessing:

Ideally, it needs very little data preprocessing; for example, feature scaling is not required.

<b> Don't one-hot encode for high cardinality datasets:</b>

Decision tree models can, in principle, handle categorical variables without one-hot encoding them (scikit-learn's implementation still expects numeric input, so categories must be encoded in some way, e.g. as ordinal codes). One-hot encoding categorical variables with high cardinality can make tree-based ensembles inefficient: the algorithm gives continuous variables more importance than the many sparse dummy variables, which distorts the feature importance ranking and can result in poorer performance.

<b> Feature selection will help for high dimensional data:</b>

If you have high-dimensional data, the way to start is to train a model with all the features and rank them according to feature importance by mean decrease in impurity.

Then, you can start removing features from the bottom of the list and evaluate the impact on performance.

Since the random forest algorithm randomly selects features for its trees, having many irrelevant features will simply serve as noise and reduce performance. Normally, the model will perform better as more of them are removed.

Continue the process until performance no longer improves, and you have your feature set.
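The procedure above can be sketched as follows. The dataset is synthetic, the helper `cv_score` is made up for this example, and the stopping rule (stop as soon as the cross-validated score drops) is one possible reading of "until performance no longer improves".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 3 informative features, 9 pure-noise features
X, y = make_classification(n_samples=400, n_features=12, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

def cv_score(columns):
    """Cross-validated accuracy of a forest trained on the given columns."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=3).mean()

# Rank once with all features, most important first (mean decrease in impurity)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranked = list(np.argsort(forest.feature_importances_)[::-1])

kept = ranked
best = cv_score(kept)
while len(kept) > 1:
    candidate = kept[:-1]      # drop the least important remaining feature
    score = cv_score(candidate)
    if score < best:           # stop as soon as performance gets worse
        break
    kept, best = candidate, score

print("kept features:", sorted(int(i) for i in kept), "score:", round(best, 3))
```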

<b> Transformation helps:</b>

Log-transformations can improve accuracy, especially in case of very skewed data (with very long tails). See for example "Forecasting Bike Sharing Demand" by Jayant Malani et al. (pdf), and [this kaggle submission](https://www.kaggle.com/ademyttenaere/0-2748-with-rf-and-log-transformation/output).

### Hyperparameter optimization:

1. n_estimators:
    - It defines the number of decision trees to be created in a random forest.
    - Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.

2. criterion:
    - It defines the function that is to be used for splitting.
    - The function measures the quality of a split for each feature and chooses the best split.

3. max_features :
    - It defines the maximum number of features considered when looking for the best split at each node.
    - Increasing max_features usually improves the performance of individual trees, but a very high number decreases the diversity among the trees.

4. max_depth:
    - Random forest has multiple decision trees. This parameter defines the maximum depth of the trees.

5. min_samples_split:
    - Used to define the minimum number of samples a node must contain before a split is attempted.
    - If the number of samples is less than the required number, the node is not split.

6. min_samples_leaf:
    - This defines the minimum number of samples required to be at a leaf node.
    - Smaller leaf size makes the model more prone to capturing noise in train data.

7. max_leaf_nodes:
    - This parameter specifies the maximum number of leaf nodes for each tree.
    - The tree stops splitting when the number of leaf nodes becomes equal to the max leaf node.

8. n_jobs:
    - This indicates the number of jobs to run in parallel.
    - Set value to -1 if you want it to run on all cores in the system.

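9. random_state:
    - This parameter seeds the random number generator that controls the bootstrap sampling and the random feature selection.
    - Fixing it makes results reproducible, which allows a fair comparison between various models.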
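One way to tune these hyperparameters together is a randomized search. The sketch below is only an illustration: the dataset is synthetic and the candidate values are arbitrary, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values for the hyperparameters listed above (arbitrary choices)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", 0.5, None],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_leaf_nodes": [None, 20, 50],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=0),  # use all cores per forest
    param_distributions,
    n_iter=10,          # number of random combinations to try
    cv=3,
    random_state=0,     # makes the sampled combinations repeatable
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```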


### Solution to Overfitting:

Source: https://stats.stackexchange.com/questions/460921/given-a-dataset-how-to-test-your-model-against-the-test-set-if-you-used-stratif

1. n_estimators: In general the more trees the less likely the algorithm is to overfit. So try increasing this. The lower this number, the closer the model is to a decision tree, with a restricted feature set.
2. max_features: try reducing this number (try 30-50% of the number of features). This determines how many features each tree is randomly assigned. The smaller, the less likely to overfit, but too small will start to introduce under fitting.
3. max_depth: Experiment with this. Limiting the depth reduces the complexity of the learned models, lowering the risk of overfitting. Try starting small, say 5-10, and increase it until you get the best result.
4. min_samples_leaf: Try setting this to values greater than one. This has a similar effect to the max_depth parameter, it means the branch will stop splitting once the leaves have that number of samples each.

### Pros vs Cons:

Source: https://www.quora.com/What-are-the-advantages-and-disadvantages-for-a-random-forest-algorithm

#### Pros:

1. Random forest can solve both types of problems, that is, classification and regression, and gives decent estimates on both fronts.
2. One of the benefits of Random Forest which excites me most is its power to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is considered one of the dimensionality reduction methods. Further, the model outputs the importance of each variable, which can be a very handy feature.
3. Good performance on many problems including non linear.
4. Random Forest works well with both categorical and continuous variables.
5. Random Forest is usually robust to outliers and can handle them automatically.

#### Cons:

1. No interpretability
2. Overfitting can easily occur
3. Complexity: Random Forest creates a lot of trees (unlike only one tree in case of a decision tree) and combines their outputs. By default, it creates 100 trees in the Python sklearn library. To do so, this algorithm requires much more computational power and resources. On the other hand, a decision tree is simple and does not require so many computational resources.
4. Longer Training Period: Random Forest requires much more time to train compared to a decision tree, as it generates a lot of trees (instead of one tree in case of a decision tree) and makes its decision on the majority of votes.

## 3. Boosting

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model.

<img src="https://quantdare.com/wp-content/uploads/2016/04/bb3-800x307.png">

<b> Steps:</b>
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/dd1-e1526989432375.png">
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted, are given higher weights.
(Here, the three misclassified blue-plus points will be given higher weights)
7. Another model is created and predictions are made on the dataset.
(This model tries to correct the errors from the previous model)
<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/dd2-e1526989487878.png">
8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).
<img src="https://www.analyticsvidhya.com/wp-content/uploads/2015/11/boosting10.png">

## 3.1 AdaBoost

Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Usually, decision trees are used for modelling. Multiple sequential models are created, each correcting the errors from the last model. AdaBoost assigns weights to the observations which are incorrectly predicted and the subsequent model works to predict these values correctly.

### References:

- https://youtu.be/LsK-xG1cLYA - Statsquest explains it well
- https://youtu.be/NLRO1-jp5F8 - Explained with an example and calculations

<b> Steps:</b>
1. Initially, all observations in the dataset are given equal weights.
2. A model is built on a subset of data.
3. Using this model, predictions are made on the whole dataset.
4. Errors are calculated by comparing the predictions and actual values.
5. While creating the next model, higher weights are given to the data points which were predicted incorrectly.
6. Weights can be determined using the error value. For instance, the higher the error, the larger the weight assigned to the observation.
7. This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.
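The reweighting in steps 4 to 6 can be made concrete with one round of arithmetic. The sketch below uses the classic discrete AdaBoost update for a two-class problem, which is an assumption on my part since the steps above do not spell out a formula; the function name and the ten-point example are illustrative.

```python
import math

def adaboost_round(weights, is_wrong):
    """One AdaBoost reweighting step (classic two-class formulation).

    weights:  current sample weights, summing to 1
    is_wrong: parallel list of booleans, True where the model misclassified
    Returns (model_weight, new_weights). Requires 0 < weighted error < 1.
    """
    error = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    alpha = 0.5 * math.log((1.0 - error) / error)  # the model's say in the final vote
    # Misclassified points are scaled up, correctly classified points down
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, is_wrong)]
    total = sum(updated)
    return alpha, [w / total for w in updated]     # renormalize to sum to 1

n = 10
weights = [1.0 / n] * n                  # step 1: equal weights
is_wrong = [True] * 3 + [False] * 7      # three misclassified points
alpha, new_weights = adaboost_round(weights, is_wrong)

print("model weight:", round(alpha, 4))
print("new weight of a misclassified point:", round(new_weights[0], 4))
print("new weight of a correct point:      ", round(new_weights[-1], 4))
```

After the update, the misclassified points together carry half of the total weight, so the next model cannot do well without getting them right.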

### Hyperparameter optimization:

1. base_estimator (renamed to estimator in scikit-learn 1.2):
    - It helps to specify the type of base estimator, that is, the machine learning algorithm to be used as base learner.
    - SVMs are generally not good base predictors for AdaBoost, because they
are slow and tend to be unstable with AdaBoost.

2. n_estimators:
    - It defines the number of base estimators.
    - The default value in scikit-learn is 50; a higher value may give better performance, at the cost of longer training.

3. learning_rate:
    - This parameter controls the contribution of the estimators in the final combination.
    - There is a trade-off between learning_rate and n_estimators.

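4. max_depth:
    - Defines the maximum depth of the individual estimator. It is a parameter of the base decision tree, not of AdaBoost itself.
    - Tune this parameter for best performance.

5. n_jobs
    - Specifies the number of processors it is allowed to use; -1 means all available processors.
    - Note that scikit-learn's AdaBoostClassifier does not accept n_jobs: each model depends on the previous one, so the estimators cannot be trained in parallel.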

6. random_state:
    - An integer value that seeds the random number generator used during training.
    - A definite value of random_state will always produce the same results if given the same parameters and training data.
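A minimal usage sketch with scikit-learn is shown below. The dataset is synthetic and the hyperparameter values are arbitrary. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stump as the base learner
    n_estimators=200,
    learning_rate=0.5,                    # lower rate usually needs more estimators
    random_state=42,
)
ada_clf.fit(X_train, y_train)
test_accuracy = ada_clf.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```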
    
### Solutions to overfitting:

If your AdaBoost ensemble is overfitting the training set, you can
try reducing the number of estimators or more strongly regularizing
the base estimator.

1. n_estimators: Unlike in Random Forest, adding more estimators to a boosting ensemble lets it fit the training set more closely. If the model overfits, try reducing this number (or lowering learning_rate).

2. max_depth: Regularize the base estimator by limiting its depth. This reduces the complexity of the learned models, lowering the risk of overfitting. Start small (AdaBoost is commonly used with depth-1 trees, i.e. decision stumps) and increase it until you get the best result.

### Pros vs Cons:

#### Pros:
- Very simple to implement
- Does feature selection, resulting in a relatively simple classifier
- Fairly good generalization

#### Cons:
- Suboptimal solution
- Sensitive to noisy data and outliers which can cause overfitting

## 3.2 Gradient Boosting

Just like AdaBoost,
Gradient Boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. However, instead of tweaking the instance weights at every
iteration like AdaBoost does, this method tries to fit the new predictor to the residual
errors made by the previous predictor.
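Fitting each new predictor to the residuals of its predecessors can be done by hand with a few regression trees. The sketch below uses noisy quadratic data made up for illustration, three trees, and no shrinkage (i.e. a learning rate of 1).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic data, made up for illustration
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

trees, residual = [], y.copy()
for _ in range(3):
    # Each tree is trained on what the ensemble so far still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual)
    residual = residual - tree.predict(X)
    trees.append(tree)

def ensemble_predict(X_new):
    """The ensemble's prediction is the sum of all trees' predictions."""
    return sum(tree.predict(X_new) for tree in trees)

mse_one = np.mean((y - trees[0].predict(X)) ** 2)
mse_all = np.mean((y - ensemble_predict(X)) ** 2)
print("training MSE with 1 tree: ", mse_one)
print("training MSE with 3 trees:", mse_all)
```

On the training set, each added tree can only keep or lower the squared error, which is why some form of regularization or early stopping is needed in practice.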

### References
- https://youtu.be/Oo9q6YtGzvc - Math behind it

### Hyperparameter Optimization:

1. learning_rate:

    - The learning_rate hyperparameter scales the contribution of each tree. If you set it
to a low value, such as 0.1, you will need more trees in the ensemble to fit the training
set, but the predictions will usually generalize better. This is a regularization technique
called <b>shrinkage.</b>

2. n_estimators:

    - In order to find the optimal number of trees, you can use early stopping. A simple way to implement this is to use the <b>staged_predict()</b> method: it
returns an iterator over the predictions made by the ensemble at each stage of training
(with one tree, two trees, etc.). The following code trains a GBRT ensemble with
120 trees, then measures the validation error at each stage of training to find the optimal
number of trees, and finally trains another GBRT ensemble using the optimal
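number of trees.

    - Code (assuming X and y already hold the features and targets):

          import numpy as np
          from sklearn.ensemble import GradientBoostingRegressor
          from sklearn.metrics import mean_squared_error
          from sklearn.model_selection import train_test_split

          # Hold out a validation set to measure the error at each stage
          X_train, X_val, y_train, y_val = train_test_split(X, y)

          # Train the full ensemble of 120 trees
          gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
          gbrt.fit(X_train, y_train)

          # Validation error after 1 tree, 2 trees, ..., 120 trees
          errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
          # argmin is 0-based, so add 1 to get the number of trees
          bst_n_estimators = int(np.argmin(errors)) + 1

          # Retrain using only the optimal number of trees
          gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
          gbrt_best.fit(X_train, y_train)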
    
3. min_samples_split:

    - Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
    - Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
    
4. max_depth:

    - The maximum depth of a tree.
    - Used to control over-fitting as higher depth will allow the model to learn relations very specific to a particular sample.
    
5. max_features:

    - The number of features to consider while searching for the best split. These will be randomly selected.
    - As a thumb-rule, the square root of the total number of features works great but we should check up to 30-40% of the total number of features.
    - Higher values can lead to over-fitting but it generally depends on a case to case scenario.
    
6. subsample:

    - The GradientBoostingRegressor class also supports a subsample hyperparameter,
which specifies the fraction of training instances to be used for training each tree. For
example, if subsample=0.25, then each tree is trained on 25% of the training instances,
selected randomly. As you can probably guess by now, this trades a higher bias
for a lower variance. It also speeds up training considerably. This technique is called
<b>Stochastic Gradient Boosting.</b>

### Solution to overfitting:

Source: https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

General approach-

- Choose a relatively high learning rate. Generally the default value of 0.1 works but somewhere between 0.05 to 0.2 should work for different problems
- Determine the optimum number of trees for this learning rate. This should range around 40-70. Remember to choose a value on which your system can work fairly fast. This is because it will be used for testing various scenarios and determining the tree parameters.
- Tune tree-specific parameters for the chosen learning rate and number of trees. Note that we can choose different parameters to define a tree.
- Lower the learning rate and increase the estimators proportionally to get more robust models.

### Pros vs Cons:

#### Pros:

- Often provides predictive accuracy that cannot be beat.
- Lots of flexibility - can optimize on different loss functions and provides several hyperparameter tuning options that make the function fit very flexible.
- No data pre-processing required - often works great with categorical and numerical values as is.
- Handles missing data - imputation not required.

#### Cons:

- GBMs will continue improving to minimize all errors. This can overemphasize outliers and cause <b>overfitting</b>. Must use cross-validation to neutralize.
- Computationally expensive - GBMs often require many trees (>1000) which can be time and memory exhaustive.
- The high flexibility results in many parameters that interact and influence heavily the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning.
- Less interpretable although this is easily addressed with various tools (variable importance, partial dependence plots, LIME, etc.).

## 3.3 XGBoost

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. XGBoost has proved to be a highly effective ML algorithm, extensively used in machine learning competitions and hackathons. It has high predictive power and is reported to be almost 10 times faster than other gradient boosting implementations. It also includes a variety of regularization techniques, which reduce overfitting and improve overall performance. Hence it is also known as a 'regularized boosting' technique.

Let us see how XGBoost is comparatively better than other techniques:

1. Regularization:
    - Standard GBM implementation has no regularisation like XGBoost.
    - Thus XGBoost also helps to reduce overfitting.

2. Parallel Processing:
    - XGBoost implements parallel processing and is faster than GBM.
    - XGBoost also supports implementation on Hadoop.

3. High Flexibility:
    - XGBoost allows users to define custom optimization objectives and evaluation criteria adding a whole new dimension to the model.

4. Handling Missing Values:
    - XGBoost has an in-built routine to handle missing values.

5. Tree Pruning:
    - XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards and removes splits beyond which there is no positive gain.

6. Built-in Cross-Validation:
    - XGBoost allows a user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.
    
### Hyperparameter Optimization:

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

1. min_child_weight:
    - Defines the minimum sum of weights of all observations required in a child.
    - Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.

2. max_depth
    - It is used to define the maximum depth.
    - Higher depth will allow the model to learn relations very specific to a particular sample.

3. max_leaf_nodes (the corresponding XGBoost parameter is max_leaves)
    - The maximum number of terminal nodes or leaves in a tree.
    - Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
    - If this is defined, GBM will ignore max_depth.

4. gamma
    - A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
    - Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.

5. subsample
    - Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
    - Lower values make the algorithm more conservative and prevent overfitting but values that are too small might lead to under-fitting.
    
### Solution to Overfitting:

Source: https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

There are in general two ways that you can control overfitting in XGBoost:

1. The first way is to directly control model complexity.

    - This includes max_depth, min_child_weight and gamma.

2. The second way is to add randomness to make training robust to noise.

    - This includes subsample and colsample_bytree.

    - You can also reduce stepsize eta. Remember to increase num_round when you do so.
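Put together, the two groups of controls might look like the parameter dictionary below. The values are illustrative starting points chosen for this example, not tuned recommendations, and should be validated on your own data.

```python
# 1. Direct control of model complexity
complexity_params = {
    "max_depth": 4,          # shallower trees
    "min_child_weight": 5,   # require more weight in each child
    "gamma": 1.0,            # minimum loss reduction needed to split
}

# 2. Randomness to make training robust to noise
randomness_params = {
    "subsample": 0.8,         # fraction of rows sampled per tree
    "colsample_bytree": 0.8,  # fraction of columns sampled per tree
}

# Smaller step size, compensated by more boosting rounds
params = {**complexity_params, **randomness_params, "eta": 0.05}
num_round = 500

print(params, num_round)
```

With the xgboost package installed, a dictionary like this would typically be passed to `xgboost.train(params, dtrain, num_boost_round=num_round)`, where `dtrain` is a `DMatrix` built from your training data.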
    
### Pros vs Cons:

Source: https://towardsdatascience.com/pros-and-cons-of-various-classification-ml-algorithms-3b5bfb3c87d6

#### Pros:
1. Less feature engineering required (No need for scaling, normalizing data, can also handle missing values well)
2. Feature importance can be found out (it outputs the importance of each feature, which can be used for feature selection)
3. Fast to interpret
4. Outliers have minimal impact.
5. Handles large sized datasets well.
6. Good Execution speed
7. Good model performance (wins most of the Kaggle competitions)
8. Less prone to overfitting

#### Cons:
1. Difficult to interpret and hard to visualize
2. Overfitting possible if parameters not tuned properly.
3. Harder to tune as there are too many hyperparameters.