## Bagging

The idea behind bagging is to combine the results of multiple models (for instance, several decision trees) to get a more generalized result.  
If all the models are trained on the same data, combining them adds little, because models given the same input are likely to produce the same result. One technique to overcome this problem is bootstrapping.

Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. In classical bootstrapping, each subset has the same size as the original set.
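As a minimal sketch of sampling with replacement, the snippet below draws one bootstrap sample from a toy list of observation ids using only the standard library. The helper name `bootstrap_sample` and the toy data are assumptions made for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)           # fixed seed so the run is repeatable
data = list(range(10))           # toy "dataset" of 10 observation ids
sample = bootstrap_sample(data, rng)

# Observations that were never drawn are "out-of-bag" for this sample.
out_of_bag = sorted(set(data) - set(sample))

print("sample:    ", sample)
print("out-of-bag:", out_of_bag)
```

Because the draw is with replacement, some observations usually appear more than once and others not at all, which is what makes each subset different.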

Bagging (or Bootstrap Aggregating) uses these subsets (bags) to get a fair idea of the distribution of the complete set. In bagging, the subsets may also be smaller than the original set.  
1.Multiple subsets are created from the original dataset, selecting observations with replacement.   
2.A base model (weak model) is created on each of these subsets.  
3.The models run in parallel and are independent of each other.  
4.The final predictions are determined by combining the predictions from all the models.  
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/Screenshot-from-2018-05-08-13-11-49.png" alt="Drawing" style="height: 350px;"/>
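The four steps above can be sketched from scratch. This is an illustrative toy, not a production implementation: the base model is a one-dimensional threshold classifier, and the names `fit_stump`, `bagging_fit` and `bagging_predict` as well as the data are assumptions.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D threshold classifier: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_models, rng):
    """Steps 1-2: draw a bootstrap sample per model and fit a stump on it."""
    return [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]

def bagging_predict(thresholds, x):
    """Step 4: combine the independent models by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
# Toy data: label is 1 when x > 5, with one noisy point at x = 3.
points = [(x, int(x > 5)) for x in range(11)]
points[3] = (3, 1)

thresholds = bagging_fit(points, n_models=25, rng=rng)
print(bagging_predict(thresholds, 1), bagging_predict(thresholds, 9))
```

Each stump is fitted without reference to the others (step 3), so the list comprehension in `bagging_fit` could be run in parallel.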

## Boosting

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are therefore dependent on the previous ones. The steps below show how boosting works.  
1.A subset is created from the original dataset.  
2.Initially, all data points are given equal weights.  
3.A base model is created on this subset.  
4.This model is used to make predictions on the whole dataset.
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2015/11/dd1-e1526989432375.png" alt="Drawing" style="height: 150px;"/>
5.Errors are calculated using the actual values and predicted values.  
6.The observations which are incorrectly predicted are given higher weights.  
 (Here, the three misclassified blue-plus points will be given higher weights)  
7.Another model is created and predictions are made on the dataset.  
 (This model tries to correct the errors from the previous model)
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2015/11/dd2-e1526989487878.png" alt="Drawing" style="height: 150px;"/>
8.Similarly, multiple models are created, each correcting the errors of the previous model.  
9.The final model (strong learner) is the weighted mean of all the models (weak learners).

Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but each works well for some part of it, so each model boosts the performance of the ensemble.
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2015/11/dd4-e1526551014644.png" alt="Drawing" style="height: 150px;"/>

### Algorithms based on Bagging and Boosting

#### Bagging algorithm:
•[Random forest](https://github.com/ebi-byte/kt/blob/master/trees/Random%20Forest%20.ipynb)

Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The base estimators in random forest are decision trees. At each node of each tree, random forest considers only a randomly selected subset of the features when deciding the best split.
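A minimal sketch with scikit-learn, assuming synthetic stand-in data from `make_classification` rather than the diabetes dataset used later in this notebook. The parameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the dataset used later).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_features="sqrt": each split considers a random subset of the 8 features.
# bootstrap=True: each tree is trained on a bootstrap sample (the bagging part).
model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               bootstrap=True, random_state=1)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```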

#### Boosting algorithms:
•AdaBoost  
•Gradient boosting  
•XGBoost

##### AdaBoost
Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Usually, decision trees are used for modelling. Multiple sequential models are created, each correcting the errors from the last model. AdaBoost assigns higher weights to the observations which are incorrectly predicted, and the subsequent model works to predict these values correctly.

Below are the steps for performing the AdaBoost algorithm:  
1.Initially, all observations in the dataset are given equal weights.  
2.A model is built on a subset of data.  
3.Using this model, predictions are made on the whole dataset.  
4.Errors are calculated by comparing the predictions and actual values.  
5.While creating the next model, higher weights are given to the data points which were predicted incorrectly.  
6.Weights can be determined using the error value. For instance, the higher the error, the larger the weight assigned to the observation.  
7.This process is repeated until the error function stops changing, or the maximum number of estimators is reached.
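One reweighting round (steps 4 to 6) can be sketched as follows. This uses the classic discrete AdaBoost update for labels in {-1, +1}, which is one common formulation and an assumption here, since the text does not give a formula. The function name `adaboost_round` is hypothetical, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting round for labels in {-1, +1}.

    Returns the model's vote weight (alpha) and the renormalized
    observation weights for the next model.
    """
    # Weighted error: total weight on the misclassified observations.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Vote weight of this model: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase weights of mistakes, decrease weights of correct predictions.
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [+1, +1, +1, -1, -1, -1, -1, -1, -1, -1]
y_pred = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]   # first model misses three points
weights = [1 / len(y_true)] * len(y_true)            # step 1: equal weights

alpha, weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in weights])
```

After the update, the three misclassified points together carry half of the total weight, so the next model cannot do well without getting them right.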

#### Parameters

•**base_estimator** (named estimator in scikit-learn 1.2 and later):  
◦It helps to specify the type of base estimator, that is, the machine learning algorithm to be used as base learner.

•**n_estimators**:  
◦It defines the number of base estimators.  
◦The default value in scikit-learn's AdaBoostClassifier is 50; a higher value can improve performance at the cost of longer training.

•**learning_rate**:   
◦This parameter controls the contribution of the estimators in the final combination.  
◦There is a trade-off between learning_rate and n_estimators.

•**max_depth**:  
◦Defines the maximum depth of the individual estimator. In scikit-learn it is set on the base decision tree rather than on AdaBoostClassifier itself.  
◦Tune this parameter for best performance.

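•**n_jobs**:  
◦Specifies the number of processors the ensemble is allowed to use.  
◦Set value to -1 to use all available processors.  
◦Applies to ensembles whose models can be trained in parallel, such as bagging and random forest; scikit-learn's AdaBoostClassifier does not accept it, because boosting trains its models one after another.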

•**random_state**:  
◦An integer seed for the random number generator used during training.  
◦A fixed value of random_state will always produce the same results when given the same parameters and training data.
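The sketch below shows how these parameters might be passed in scikit-learn. It assumes synthetic data from `make_classification`, and the values are illustrative rather than recommended. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the dataset used below).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth belongs to the base decision tree, not to AdaBoostClassifier.
base = DecisionTreeClassifier(max_depth=1)

# A lower learning_rate shrinks each model's contribution, so it is
# usually paired with a higher n_estimators.
model = AdaBoostClassifier(base, n_estimators=100, learning_rate=0.5,
                           random_state=1)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```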

In [2]:
# Load libraries
import pandas as pd
import numpy as np

# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None)

#split dataset in features and target variable
X = pima.iloc[:, 0:8].values
y = pima.iloc[:, 8].values

#Test-train split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [3]:
#Ada Boost
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(random_state=1)
model.fit(X_train, y_train)
model.score(X_test,y_test)

0.7489177489177489

##### Gradient Boosting (GBM)
Gradient Boosting or GBM is another ensemble machine learning algorithm that works for both regression and classification problems. GBM uses the boosting technique, combining a number of weak learners to form a strong learner. Regression trees are used as the base learner, and each subsequent tree in the series is built on the errors calculated by the previous tree.

We will use a simple example to understand the GBM algorithm. We have to predict the age of a group of people using the below data:
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/image-17.png" alt="Drawing" style="height: 150px;"/>
1.The mean age is assumed to be the predicted value for all observations in the dataset.  
2.The errors are calculated using this mean prediction and actual values of age.
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/image-18.png" alt="Drawing" style="height: 150px;"/>
3.A tree model is created using the errors calculated above as target variable. Our objective is to find the best split to minimize the error.  
4.The predictions by this model are combined with the predictions from step 1.
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/06/gbm2.png" alt="Drawing" style="height: 150px;"/>
5.This value calculated above is the new prediction.  
6.New errors are calculated using this predicted value and actual value.
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/06/gbm3.png" alt="Drawing" style="height: 120px;"/>
7.Steps 2 to 6 are repeated until the maximum number of iterations is reached (or the error function stops changing).
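The loop described in these steps can be sketched in plain Python. The data below is a hypothetical toy set, not the table shown in the images. The base learner is a single-split regression tree (`fit_stump`, a hypothetical helper), and the `learning_rate` shrinkage is an added assumption that is not part of the steps above.

```python
def fit_stump(xs, residuals):
    """Find the threshold on xs that best splits residuals (least squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

# Hypothetical toy data: one feature and the age to predict.
xs = [1, 2, 3, 4, 5, 6]
ages = [13, 14, 15, 25, 35, 49]
learning_rate = 0.5

# Step 1: start from the mean age for every observation.
predictions = [sum(ages) / len(ages)] * len(ages)
history = []
for _ in range(20):
    # Steps 2 and 6: errors (residuals) of the current predictions.
    residuals = [a - p for a, p in zip(ages, predictions)]
    history.append(sum(r ** 2 for r in residuals) / len(residuals))
    # Step 3: fit a small tree with the residuals as target.
    stump = fit_stump(xs, residuals)
    # Steps 4-5: add the (shrunken) tree output to the current predictions.
    predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

print(history[0], history[-1])
```

`history` records the mean squared error before each new tree is added, so it should shrink as trees accumulate.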


#### Parameters

•**min_samples_split**:  
◦Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.  
◦Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.

•**min_samples_leaf**:  
◦Defines the minimum samples required in a terminal or leaf node.  
◦Generally, lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in the majority will be very small.


•**min_weight_fraction_leaf**:  
◦Similar to min_samples_leaf but defined as a fraction of the total number of observations instead of an integer.

•**max_depth**:  
◦The maximum depth of a tree.  
◦Used to control over-fitting as higher depth will allow the model to learn relations very specific to a particular sample.

•**max_leaf_nodes**:  
◦The maximum number of terminal nodes or leaves in a tree.  
◦Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.  
◦If this is defined, GBM will ignore max_depth.

•**max_features**:  
◦The number of features to consider while searching for the best split. These will be randomly selected.  
◦As a thumb-rule, the square root of the total number of features works great but we should check up to 30-40% of the total number of features.  
◦Higher values can lead to over-fitting but it generally depends on a case to case scenario.
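As a rough illustration of where these parameters go, the sketch below sets several of them on scikit-learn's `GradientBoostingClassifier`. The data is synthetic and the values are illustrative starting points, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the dataset used below).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,             # limits how specific each tree can get
    min_samples_split=20,    # a node needs at least 20 samples to be split
    min_samples_leaf=5,      # every leaf keeps at least 5 samples
    max_features="sqrt",     # random subset of features per split
    random_state=1,
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```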

In [4]:
#GBM
from sklearn.ensemble import GradientBoostingClassifier
model= GradientBoostingClassifier(learning_rate=0.01,random_state=1)
model.fit(X_train, y_train)
model.score(X_test,y_test)

0.7662337662337663

##### XGBoost
XGBoost was built to push the limit of computational resources for boosted trees. It is an implementation of GBM with major improvements. Like GBM, XGBoost adds trees sequentially, but it parallelizes the construction of each tree (for example, the search for the best split across features). This makes XGBoost faster.

**_Features of XGBoost_**:

1.Split finding algorithms: approximate algorithm:

To find the best split over a continuous feature, data needs to be sorted and fit entirely into memory. This may be a problem in case of large datasets.

An approximate algorithm is used for this. Candidate split points are proposed based on the percentiles of feature distribution. The continuous features are binned into buckets that are split based on the candidate split points. The best solution for candidate split points is chosen from the aggregated statistics on the buckets.
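The idea can be sketched with the standard library: propose candidate split points at percentiles, then aggregate a per-observation statistic into the buckets between them. This is a simplified illustration and not XGBoost's actual implementation. The function names and the `gradients` values are hypothetical.

```python
import statistics
from bisect import bisect_left

def candidate_splits(values, n_buckets):
    """Propose split points at evenly spaced percentiles of the feature."""
    return statistics.quantiles(values, n=n_buckets)

def bucket_sums(values, gradients, candidates):
    """Aggregate a per-observation statistic into the buckets between candidates."""
    sums = [0.0] * (len(candidates) + 1)
    for value, grad in zip(values, gradients):
        sums[bisect_left(candidates, value)] += grad
    return sums

values = list(range(1, 101))                             # one continuous feature
gradients = [1.0 if v > 60 else -1.0 for v in values]    # hypothetical statistics

candidates = candidate_splits(values, n_buckets=4)
sums = bucket_sums(values, gradients, candidates)
print(candidates, sums)
```

Only the three candidate points need to be evaluated as splits, using the four bucket totals, instead of all 99 possible thresholds on the raw values.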

2.Column block for parallel learning:

Sorting the data is the most time-consuming aspect of tree learning. To reduce sorting costs, data is stored in in-memory units called ‘blocks’. Each block has data columns sorted by the corresponding feature value. This computation needs to be done only once before training and can be reused later.

Sorting of blocks can be done independently and can be divided between parallel threads of the CPU. The split finding can be parallelized as the collection of statistics for each column is done in parallel.

3.Weighted quantile sketch for approximate tree learning:

To propose candidate split points among weighted datasets, the Weighted Quantile Sketch algorithm is used. It carries out merge and prune operations on quantile summaries over the data.

4.Sparsity-aware algorithm:

Input may be sparse due to reasons such as one-hot encoding, missing values and zero entries. XGBoost is aware of the sparsity pattern in the data: each node learns a default direction for missing entries, so only the non-missing entries need to be visited when searching for a split.

5.Out-of-core computation:

For data that does not fit into main memory, XGBoost divides the data into multiple blocks and stores each block on disk. Each block is compressed by columns and decompressed on the fly by an independent thread while it is read from disk.

6.Regularized Learning Objective:

To measure the performance of a model given a certain set of parameters, we need to define an objective function. In XGBoost, the objective function contains two parts: training loss and regularization. The regularization term penalizes the complexity of the model.


$$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)$$


where Ω is the regularization term, which many boosting implementations leave out of the objective function. XGBoost includes regularization explicitly, thus controlling the complexity of the model and helping to prevent overfitting.
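For reference, the two parts can be written out in the form used in the XGBoost paper. The symbols below are not defined elsewhere in this notebook: $l$ is the per-observation loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are penalty coefficients.

```latex
\mathrm{Obj}(\Theta) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```

The $\gamma$ here is the same quantity as the gamma parameter in the parameter list for XGBoost: adding a leaf must reduce the loss by more than $\gamma$ to be worthwhile.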

The above six features may be individually present in some algorithms, but XGBoost combines these techniques to make an end-to-end system that provides scalability and effective resource utilization.

#### Parameters

•**nthread**:  
◦This is used for parallel processing and the number of cores in the system should be entered.  
◦If you wish to run on all cores, do not input this value. The algorithm will detect it automatically.

•**eta**:  
◦Analogous to learning rate in GBM.  
◦Makes the model more robust by shrinking the weights on each step.

•**min_child_weight**:  
◦Defines the minimum sum of weights of all observations required in a child.  
◦Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.

•**max_depth**:  
◦It is used to define the maximum depth.  
◦Higher depth will allow the model to learn relations very specific to a particular sample.

•**max_leaf_nodes**:  
◦The maximum number of terminal nodes or leaves in a tree. The XGBoost library exposes this limit as max_leaves.  
◦Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.  
◦If this is defined, XGB will ignore max_depth.

•**gamma**:  
◦A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.  
◦Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.

•**subsample**:  
◦Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.  
◦Lower values make the algorithm more conservative and prevent overfitting but values that are too small might lead to under-fitting.

•**colsample_bytree**:  
◦It is similar to max_features in GBM.  
◦Denotes the fraction of columns to be randomly sampled for each tree.
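These parameters are typically collected in a dictionary. The sketch below is a configuration fragment only: the values are hypothetical starting points, and the `objective` entry assumes a binary classification task such as the diabetes data used earlier.

```python
# Hypothetical starting values; tune them for your own data.
params = {
    "objective": "binary:logistic",  # assumption: binary classification
    "nthread": 4,                    # omit to let XGBoost use all cores
    "eta": 0.1,                      # learning rate (shrinkage per step)
    "min_child_weight": 1,           # minimum sum of weights in a child
    "max_depth": 6,                  # maximum depth of each tree
    "gamma": 0.0,                    # minimum loss reduction to split
    "subsample": 0.8,                # fraction of rows sampled per tree
    "colsample_bytree": 0.8,         # fraction of columns sampled per tree
}

for name, value in params.items():
    print(f"{name} = {value}")
```

With the xgboost package installed, a dictionary like this is passed as the first argument to `xgboost.train`, together with the training data wrapped in an `xgboost.DMatrix`.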

#### Questionnaire

##### What are ensemble methods in tree-based modeling?
##### How do Bagging and Boosting get N learners?
##### Which is better, Bagging or Boosting?

[Solution](https://github.com/ebi-byte/kt/blob/master/trees/Trees%20Questionnaire.ipynb)