# Ensemble Techniques

Ensemble models in machine learning operate on the idea of combining the decisions from multiple models to improve the overall performance.

# Max Voting

The max voting method is generally used for classification problems. In this technique, multiple models are used to make predictions for each data point. The predictions by each model are considered as a ‘vote’. The predictions which we get from the majority of the models are used as the final prediction.

Here's how to do it:

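```python
import numpy as np
from statistics import mode
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# x_train, y_train and x_test are assumed to be defined already
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)

# take the most common label among the three models for each data point
final_pred = np.array([mode(votes) for votes in zip(pred1, pred2, pred3)])
```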

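Alternatively, you can use the “VotingClassifier” class in sklearn as follows:

```python
from sklearn import tree
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

# x_train, y_train, x_test and y_test are assumed to be defined already
model1 = LogisticRegression(random_state=1)
model2 = tree.DecisionTreeClassifier(random_state=1)
# voting='hard' uses the majority class label as the final prediction
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train, y_train)
model.score(x_test, y_test)
```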

# Averaging

Similar to the max voting technique, multiple predictions are made for each data point in averaging. In this method, we take an average of predictions from all the models and use it to make the final prediction. Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.

Here's a generic example: 

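```python
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# x_train, y_train and x_test are assumed to be defined already
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

# class probabilities predicted by each model
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

# simple average of the predicted probabilities
finalpred = (pred1 + pred2 + pred3) / 3
```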

# Weighted Average

This is an extension of the averaging method. All models are assigned different weights defining the importance of each model for prediction.

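```python
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# x_train, y_train and x_test are assumed to be defined already
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

# class probabilities predicted by each model
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

# weighted average; the weights sum to 1 so the result is still a probability
finalpred = pred1 * 0.3 + pred2 * 0.3 + pred3 * 0.4
```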

# Stacking

Stacking addresses the question:

*Given multiple machine learning models that are skillful on a problem, but in different ways, how do you choose which model to use (trust)?*

The approach to this question is to use another machine learning model that learns when to use or trust each model in the ensemble.

* Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (e.g. instead of samples of the training dataset).
* Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (e.g. instead of a sequence of models that correct the predictions of prior models).

The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.

* Level-0 Models (Base-Models): Models fit on the training data and whose predictions are compiled.
* Level-1 Model (Meta-Model): Model that learns how to best combine the predictions of the base models.

The meta-model is trained on the predictions made by base models on out-of-sample data. That is, data not used to train the base models is fed to the base models, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model.

The outputs from the base models used as input to the meta-model may be real values in the case of regression, and probability values, probability-like values, or class labels in the case of classification.

The most common approach to preparing the training dataset for the meta-model is via k-fold cross-validation of the base models, where the out-of-fold predictions are used as the basis for the training dataset for the meta-model.

The training data for the meta-model may also include the inputs to the base models, e.g. input elements of the training data. This can provide additional context to the meta-model as to how to best combine the predictions from the base models.

Once the training dataset is prepared for the meta-model, the meta-model can be trained in isolation on this dataset, and the base-models can be trained on the entire original training dataset.
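
As a rough sketch of the procedure described above, the out-of-fold predictions can be gathered with scikit-learn's `cross_val_predict`. The synthetic dataset, the two base models, and the variable names (`base_models`, `meta_X`, `meta_model`) are assumptions made for illustration, not a prescribed setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# two level-0 models chosen arbitrarily for the sketch
base_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('cart', DecisionTreeClassifier(random_state=1)),
]

# collect the out-of-fold probability of the positive class from each base model
meta_features = []
for name, model in base_models:
    oof = cross_val_predict(model, X, y, cv=5, method='predict_proba')
    meta_features.append(oof[:, 1])
meta_X = np.column_stack(meta_features)

# the meta-model is fit in isolation on the out-of-fold predictions
meta_model = LogisticRegression()
meta_model.fit(meta_X, y)

# the base models are then refit on the entire training dataset
for name, model in base_models:
    model.fit(X, y)
```

Each row of `meta_X` holds predictions made by models that never saw that row during fitting, which is what keeps the meta-model's training data out-of-sample.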

Stacking is **appropriate when multiple different machine learning models have skill on a dataset, but have skill in different ways.** Another way to say this is that the predictions made by the models or the errors in predictions made by the models are uncorrelated or have a low correlation.

Base-models are often complex and diverse. As such, it is often a **good idea to use a range of models that make very different assumptions** about how to solve the predictive modeling task, such as linear models, decision trees, support vector machines, neural networks, and more. Other ensemble algorithms may also be used as base-models, such as random forests.

The meta-model is often **simple, providing a smooth interpretation** of the predictions made by the base models. As such, linear models are often used as the meta-model, such as linear regression for regression tasks (predicting a numeric value) and logistic regression for classification tasks (predicting a class label). Although this is common, **it is not required**; the choice depends on the data and the problem. What matters is having "diverse" models as base learners, so "weak" linear models can work alongside "diverse" boosted or tree-based models (e.g. boosted models trained with different objectives). The choice of **the meta learner still depends on the structure of the data after stacking. You could check some quick linear models first.**

* Regression Meta-Model: Linear Regression.
* Classification Meta-Model: Logistic Regression.

Stacking is designed to improve modeling performance, although it is not guaranteed to result in an improvement in all cases.

Achieving an improvement in performance depends on the complexity of the problem and whether it is **sufficiently well represented by the training data** and complex enough that there is **more to learn by combining predictions.** It is also dependent upon the choice of base models and whether they are sufficiently skillful and sufficiently uncorrelated in their predictions (or errors).

If a base-model performs as well as or better than the stacking ensemble, the base model should be used instead, given its lower complexity (e.g. it’s simpler to describe, train and maintain).

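Here's how to do it in Python:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# define the base (level-0) models
level0 = list()
level0.append(('lr', LogisticRegression()))
level0.append(('knn', KNeighborsClassifier()))
level0.append(('cart', DecisionTreeClassifier()))
level0.append(('svm', SVC()))
level0.append(('bayes', GaussianNB()))
# define meta learner model
level1 = LogisticRegression()
# define the stacking ensemble; cv=5 builds the meta-model's training data from 5-fold out-of-fold predictions
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
```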

# Blending

Blending can be a colloquial term for ensemble learning with a stacking-type architecture model. It is rarely, if ever, used in textbooks or academic papers, other than those related to competitive machine learning.

Most commonly, blending is used to describe the specific application of stacking where the meta-model is trained on the predictions made by base-models on **a hold-out validation dataset.** In this context, stacking is reserved for a meta-model that is trained on out-of-fold predictions during a cross-validation procedure.

* Blending: Stacking-type ensemble where the meta-model is trained on predictions made on a holdout dataset.
* Stacking: Stacking-type ensemble where the meta-model is trained on out-of-fold predictions made during k-fold cross-validation.

Its strengths are said to be that it is a bit simpler and has **less risk of an information leak.**

## Implementation in Python from ground up

First, we can enumerate the list of models and fit each in turn on the training dataset. Also in this loop, we can use the fit model to make a prediction on the hold out (validation) dataset and store the predictions for later.

```python
from numpy import hstack
from sklearn.linear_model import LogisticRegression

# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for name, model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender
```

Do note that the *input data* for prediction needs to be in the form of “meta_X”.

The next step is to use the blending ensemble to make predictions on new data.

This is a two-step process. The first step is to use each base model to make a prediction. These predictions are then gathered together and used as input to the blending model to make the final prediction.

We can use the same looping structure as we did when training the model. That is, we can collect the predictions from each base model, stack the predictions together, and call predict() on the blender model with this meta-level dataset.

The predict_ensemble() function below implements this. Given the list of fit base models, the fit blender ensemble, and a dataset (such as a test dataset or new data), it will return a set of predictions for the dataset.

```python
from numpy import hstack

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test):
    # make predictions with base models
    meta_X = list()
    for name, model in models:
        # predict with base model
        yhat = model.predict(X_test)
        # reshape predictions into a matrix with one column
        yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)
```

We now have all of the elements required to implement a blending ensemble for classification or regression predictive modeling problems.

# Bagging

Bootstrap Aggregation, or Bagging for short, is an ensemble machine learning algorithm.

Specifically, it is an ensemble of decision tree models, although the bagging technique can also be used to combine the predictions of other types of models.

As its name suggests, bootstrap aggregation is based on the idea of the “bootstrap” sample.

A bootstrap sample is a sample of a dataset with replacement. Replacement means that a sample drawn from the dataset is replaced, allowing it to be selected again and perhaps multiple times in the new sample. This means that the sample may have duplicate examples from the original dataset.

The bootstrap sampling technique is used to estimate a population statistic from a small data sample. This is achieved by drawing multiple bootstrap samples, calculating the statistic on each, and reporting the mean statistic across all samples.
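
The idea can be sketched with NumPy alone. The synthetic data, the choice of the mean as the statistic, and the number of resamples (1000) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# a small data sample (synthetic, for illustration only)
data = rng.normal(loc=50.0, scale=10.0, size=30)

n_bootstraps = 1000
boot_means = np.empty(n_bootstraps)
for i in range(n_bootstraps):
    # draw a sample of the same size with replacement, so rows can repeat
    sample = rng.choice(data, size=len(data), replace=True)
    # calculate the statistic on this bootstrap sample
    boot_means[i] = sample.mean()

# report the mean statistic across all bootstrap samples
estimate = boot_means.mean()
print('Bootstrap estimate of the mean: %.3f' % estimate)
```

The spread of `boot_means` (for example, its standard deviation) also gives a sense of how uncertain the estimate is.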

Predictions are made for **regression problems by averaging** the prediction across the decision trees. Predictions are made for **classification problems by taking the majority vote** prediction for the classes from across the predictions made by the decision trees.

The algorithm used in ensemble should have a moderate variance, meaning it is moderately dependent upon the specific training data.

The decision tree is the default model to use because it works well in practice. Other algorithms can be used as long as they are configured to have a moderate variance.

The chosen **algorithm to ensemble should be moderately stable**, not unstable like a decision stump and not very stable like a pruned decision tree; typically, an unpruned decision tree is used. Stability relates to the ability to generalize: a learning algorithm is said to be stable if the learned model doesn't change much when the training dataset is modified. Note the word “much” in this definition. Any model changes somewhat when you change the training set, so stability is about putting an upper bound on that change.

The bagged decision trees are effective because each decision tree is fit on a slightly different training dataset, which in turn allows each **tree to have minor differences and make slightly different skillful predictions.**

Technically, we say that the method is effective because the trees have a low correlation between predictions and, in turn, prediction errors.

Decision trees, specifically unpruned decision trees, are used as they slightly overfit the training data and have a high variance. Other high-variance machine learning algorithms can be used, such as a k-nearest neighbors algorithm with a low k value, although decision trees have proven to be the most effective.

Bagging does not always offer an improvement. For low-variance models that already perform well, bagging can result in a decrease in model performance.

The performance of the model will **converge with the increase of the number of decision trees to a point then remain level.**

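Here's how to do it in Python:

```python
from sklearn.ensemble import BaggingClassifier

# X_train, y_train and X_test are assumed to be defined already
# define the model (bagged decision trees by default)
model = BaggingClassifier()
# fit the model on the training dataset
model.fit(X_train, y_train)
# make a prediction
yhat = model.predict(X_test)
```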

## Hyperparameters

The Bagging Classifier has a number of important hyperparameters to tune. They are: 

### Number of Trees

An important hyperparameter for the Bagging algorithm is the number of decision trees used in the ensemble.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Bagging and related ensemble of decision trees algorithms (like random forest) appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.

The number of trees can be set via the “n_estimators” argument and defaults to 10 for the Bagging Classifier.

### Number of Samples

The size of the bootstrap sample can also be varied.

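The default is to create a bootstrap sample that has the same number of examples as the original dataset. Using a smaller sample can increase the variance of the resulting decision trees and could result in better overall performance. The sample size is set via the “max_samples” argument, either as a number of rows or as a fraction of the training dataset size.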

### Alternate Algorithm

Decision trees are the most common algorithm used in a bagging ensemble.

The reason for this is that they are easy to configure to have a high variance and because they perform well in general.

Other algorithms can be used with bagging and must be configured to have a modestly high variance. One example is the k-nearest neighbors algorithm where the k value can be set to a low value.

The algorithm used in the ensemble is specified via the “base_estimator” argument (renamed “estimator” in scikit-learn 1.2 and later) and must be set to an instance of the algorithm and algorithm configuration to use.

It's also good practice to try different hyperparameters for the model passed as the "base_estimator".

## Extensions: Other Ensemble Techniques based on Bagging

There are various extensions of the Bagging algorithm that are worthwhile to know.

### Pasting Ensemble

The Pasting Ensemble is an extension to bagging that involves fitting ensemble members based on **random samples of the training dataset** instead of bootstrap samples.

The approach is designed to use smaller sample sizes than the training dataset in cases where the training dataset does not fit into memory.

With the Bagging Classifier, a Pasting ensemble is configured by setting the “bootstrap” argument to “False” and setting the number of samples used in the training dataset via “max_samples” to a modest value, for example 0.5, or 50 percent of the training dataset size.
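
A minimal sketch of this configuration follows. The synthetic dataset from `make_classification` and the `random_state` values are assumptions added so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# synthetic dataset with 200 rows (illustrative only)
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

# bootstrap=False draws rows without replacement;
# max_samples=0.5 gives each ensemble member half of the rows
model = BaggingClassifier(bootstrap=False, max_samples=0.5, random_state=1)
model.fit(X, y)

# row indices used to fit the first ensemble member
rows = model.estimators_samples_[0]
print(len(rows), len(set(rows)))
```

Because sampling is done without replacement, no row index appears twice within a single member's sample.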

### Random Subspaces Ensemble

A Random Subspace Ensemble is an extension to bagging that involves fitting ensemble members based on datasets constructed from **random subsets of the features** in the training dataset.

It is similar to the random forest except the data samples are random rather than a bootstrap sample and the **subset of features is selected** for the entire decision tree rather than at each split point in the tree.

With the Bagging Classifier, a Random Subspace ensemble is configured by setting the “bootstrap” argument to “False” and setting the number of features used in the training dataset via “max_features” to a modest value, for example 10 features.
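
A minimal sketch of this configuration follows. The synthetic 20-feature dataset and the `random_state` values are assumptions added so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# synthetic dataset with 20 input features (illustrative only)
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

# bootstrap=False disables bootstrap sampling of rows;
# max_features=10 gives each ensemble member a random subset of 10 columns
model = BaggingClassifier(bootstrap=False, max_features=10, random_state=1)
model.fit(X, y)

# column indices used by the first ensemble member
cols = model.estimators_features_[0]
print(sorted(cols))
```

Each member keeps its feature subset for the whole tree, which is the difference from the per-split feature sampling of a random forest.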

### Random Patches Ensemble

The Random Patches Ensemble is an extension to bagging that involves fitting ensemble members based on datasets constructed from random subsets of rows (samples) and columns (features) of the training dataset.

It does not use bootstrap samples and might be considered an ensemble that combines both the random sampling of the dataset of the Pasting ensemble and the random sampling of features of the Random Subspace ensemble.
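
One way to sketch Random Patches with the Bagging Classifier is to combine the two settings described above. The specific values (half of the rows, 10 of 20 features) and the synthetic dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# synthetic dataset with 200 rows and 20 features (illustrative only)
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

# no bootstrap sampling; each member sees a random subset of rows AND columns
model = BaggingClassifier(
    bootstrap=False,
    max_samples=0.5,   # random half of the rows, without replacement
    max_features=10,   # random 10 of the 20 features
    random_state=1,
)
model.fit(X, y)

rows = model.estimators_samples_[0]
cols = model.estimators_features_[0]
print(len(rows), len(cols))
```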

# Random Forest

Random Forest deserves its own section, although it is an extension of the Bagging algorithm (and is closely related to the Random Subspace Ensemble).

It can be used for classification and regression problems.

Random Forest is an effective ensemble algorithm as each decision tree is fit on a slightly different training dataset, and in turn, has a slightly different performance. Unlike normal decision tree models, such as classification and regression trees (CART), trees used in the ensemble are **unpruned, making them slightly overfit to the training dataset.** This is desirable as it helps to make each tree more different and have less correlated predictions or prediction errors.

Predictions are averaged across all decision trees for regression, or combined by taking the most common class for classification, resulting in better performance than any single tree in the model.

Random forest also involves **selecting a subset of input features (columns or variables)** at each split point in the construction of trees. Typically, constructing a decision tree involves evaluating the value for each input variable in the data in order to select a split point. By reducing the features to a random subset that may be considered at each split point, it forces each decision tree in the ensemble to be more different.

The effect is that the predictions, and in turn, prediction errors, made by each tree in the ensemble are more different or less correlated. When the predictions from these less correlated trees are averaged to make a prediction, it often results in better performance than bagged decision trees.

Perhaps the most important hyperparameter to tune for the random forest is the number of random features to consider at each split point.

**A good heuristic for regression is to set this hyperparameter to 1/3 the number of input features.**
* num_features_for_split = total_input_features / 3

**A good heuristic for classification is to set this hyperparameter to the square root of the number of input features.**
* num_features_for_split = sqrt(total_input_features)
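
The two heuristics can be sketched as follows. The feature count of 20 is an assumed example value, and passing the result through the “max_features” argument of scikit-learn's random forest classes is one possible way to apply it.

```python
import math
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# assumed number of input features for the example
total_input_features = 20

# regression heuristic: one third of the input features
reg_features = max(1, total_input_features // 3)
# classification heuristic: square root of the number of input features
clf_features = max(1, int(math.sqrt(total_input_features)))

# max_features controls how many features are considered at each split point
reg_model = RandomForestRegressor(max_features=reg_features)
clf_model = RandomForestClassifier(max_features=clf_features)
print(reg_features, clf_features)
```

scikit-learn also accepts `max_features='sqrt'`, which applies the square-root rule without computing it by hand.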

Another important hyperparameter to tune is the depth of the decision trees. **Deeper trees are often more overfit** to the training data, but also less correlated, which in turn may improve the performance of the ensemble. Depths from 1 to 10 levels may be effective.

The number of decision trees in the ensemble can be set. Often, this **number of decision trees is increased until no further improvement is seen.**

Finally, treat these heuristics as only a rough initial guide, and experiment with other values if you can.

Here's how to do it in Python:

```python
from sklearn.ensemble import RandomForestClassifier

# define the model
model = RandomForestClassifier()
# fit the model on the whole dataset (X and y are assumed to be defined)
model.fit(X, y)
```

# Boosting

Boosting operates in a similar manner to Bagging. Multiple models (commonly decision trees) are fit on different versions of the training dataset and the predictions from the trees are combined using simple voting for classification or averaging for regression to result in a better prediction than fitting a single decision tree.

There are some important differences; they are:

* Instances in the training set are assigned a weight based on difficulty.
* Learning algorithms must pay attention to instance weights.
* Ensemble members are added sequentially.

The first difference is that the **same training dataset is used to train each decision tree.** No sampling of the training dataset is performed. 

Instead, each example in the training dataset **(each row of data) is assigned a weight based on how easy or difficult the ensemble finds that example to predict.** This means that rows that are easy to predict using the ensemble have a small weight and rows that are difficult to predict correctly will have a much larger weight.

The second difference from bagging is that the base learning algorithm, e.g. the decision tree, must **pay attention to the weightings of the training dataset.** 

Finally, the boosting ensemble is constructed slowly. Ensemble **members are added sequentially, one, then another, and so on until the ensemble has the desired number of members.**

Importantly, the **weighting of the training dataset is updated based on the capability of the entire ensemble** after each ensemble member is added. This ensures that each member that is subsequently added works hard to correct errors made by the whole model on the training dataset.

The contribution of each model to the **final prediction is weighted** by the performance of that model, e.g. a weighted average or weighted vote.
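
The reweighting described above can be sketched with the update rule of discrete AdaBoost, which is one specific boosting algorithm. The labels and predictions below are made up, and the names `alpha` and `weights` are illustrative.

```python
import numpy as np

# made-up labels and predictions from the newest ensemble member
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the second example is misclassified

# every example starts with the same weight
weights = np.full(len(y_true), 1.0 / len(y_true))

# weighted error of the member and its contribution to the final vote
miss = y_pred != y_true
error = weights[miss].sum()
alpha = 0.5 * np.log((1 - error) / error)

# increase the weight of misclassified rows, decrease the rest, then renormalize
weights = weights * np.exp(np.where(miss, alpha, -alpha))
weights = weights / weights.sum()
print(weights)
```

After the update, the misclassified example carries half of the total weight, so the next member is pushed to get it right. The same `alpha` is the weight that member receives in the final vote.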

This incremental addition of ensemble members to correct errors on the training dataset sounds like it would eventually overfit the training dataset. In practice, boosting ensembles **can overfit the training dataset, but often, the effect is subtle and overfitting is not a major problem.**

**The number of base models can range from hundreds to thousands**. It's often a good idea to try several values and compare the results.

## Choosing the Base Model

It's recommended that the base model be **a weak model.** Some of the reasons include:
* **Training time and computational resources.** Weak models are usually fast to train and can still become good enough once combined.
* The performance gain is **not much for strong models,** so an ensemble might not be needed in the first place.
* Making the classifier too strong may disrupt convergence (i.e. a lucky prediction may cause the next iteration to fit pure noise and thus decrease performance), although this is usually repaired in subsequent iterations.
* Strong learners may lead to **overfitting the data.**
* If we choose a simple or trivial base model, such as a decision stump, we can use the **number of iterations to control the model complexity** easily, and by increasing the number of iterations we can still obtain a sufficiently complex model.

However, it's **perfectly okay to use strong models.** Just keep these precautions above in mind.

# AdaBoost

AdaBoost (typically) combines the predictions from short one-level decision trees, called decision stumps, although other algorithms can also be used. Decision stump algorithms are used as the AdaBoost algorithm seeks to use many weak models and correct their predictions by adding additional weak models.

The training algorithm involves starting with one decision tree, finding those examples in the training dataset that were misclassified, and adding more weight to those examples. Another tree is trained on the same data, although now weighted by the misclassification errors. This process is repeated until a desired number of trees are added.

The algorithm was developed for classification and involves combining the predictions made by all decision trees in the ensemble. A similar approach was also developed for regression problems where predictions are made by using the average of the decision trees. The contribution of each model to the ensemble prediction is weighted based on the performance of the model on the training dataset.
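
A minimal usage sketch with scikit-learn's `AdaBoostClassifier` is shown below. The synthetic dataset, the train/test split, and the `random_state` values are assumptions added so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# define the model; decision stumps are the default weak learner
model = AdaBoostClassifier(n_estimators=50, random_state=1)
# fit the ensemble sequentially on the training data
model.fit(X_train, y_train)
# make predictions and measure accuracy on the held-out data
yhat = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print('Accuracy: %.3f' % accuracy)
```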

## Hyperparameters

AdaBoost has a number of hyperparameters to tune and customize. The important ones are:

### Number of Estimators

An important hyperparameter for the AdaBoost algorithm is the number of estimators (here, trees) used in the ensemble.

Recall that each decision tree used in the ensemble is usually designed to be a weak learner. That is, it has skill over random prediction, but is not highly skillful. As such, one-level decision trees are used, called decision stumps.

The number of trees added to the model must be high for the model to work well, often hundreds, if not thousands.

The number of trees can be set via the “n_estimators” argument and defaults to 50.

We can usually see that performance **improves as the number of trees increases up to a point and declines after that.** This might be a sign of the ensemble overfitting the training dataset after additional trees are added.

### Strength of the Learner

A decision tree with one level is used as the weak learner by default.

We can make the models used in the ensemble less weak (more skillful) by increasing the depth of the decision tree.

We can **sometimes see that performance improves on a dataset as the tree depth increases up to a point and declines after that.**

This highly depends on the base model. 

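For this, you choose your model beforehand and pass it to the "base_estimator" argument (renamed "estimator" in scikit-learn 1.2 and later). For example:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

models = dict()
for i in range(1, 11):  # tree depths 1 to 10 are an illustrative range
    # define base model
    base = DecisionTreeClassifier(max_depth=i)
    # define ensemble model (use base_estimator=base on scikit-learn < 1.2)
    models[str(i)] = AdaBoostClassifier(estimator=base)
```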

### Learning Rate

AdaBoost also supports a learning rate that controls the contribution of each model to the ensemble prediction.

This is controlled by the “learning_rate” argument and by default is set to 1.0 or full contribution. Smaller or larger values might be appropriate depending on the number of models used in the ensemble. There is a **balance between the contribution of the models and the number of trees in the ensemble.**

**More trees may require a smaller learning rate; fewer trees may require a larger learning rate.** It is common to use values between 0 and 1 and sometimes **very small values to avoid overfitting** such as 0.1, 0.01 or 0.001.

### Alternate Base Models

The default algorithm used in the ensemble is a decision tree, although other algorithms can be used.

The intent is to use very simple models, called weak learners. Also, the scikit-learn implementation requires that any models used must also support weighted samples, as they are how the ensemble is created by fitting models based on a weighted version of the training dataset.

The base model can be specified via the “base_estimator” argument (renamed “estimator” in scikit-learn 1.2 and later).

# Gradient Boosting Machine

Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines, or GBM for short.

Gradient boosting is a generalization of AdaBoost, improving the performance of the approach and introducing ideas from bootstrap aggregation to further improve the models, such as randomly sampling the samples and features when fitting ensemble members.

Models are fit using any arbitrary **differentiable loss function and gradient descent optimization algorithm**. This gives the technique its name, “gradient boosting,” as the loss is minimized by following its gradient as the model is fit, much like a neural network.
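
To make the role of the gradient concrete, here is a bare-bones sketch for the squared-error loss, where the negative gradient is simply the residual. The synthetic data, the tree depth, the learning rate, and the number of trees are assumptions; library implementations add many refinements on top of this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# synthetic one-dimensional regression problem (illustrative only)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 50

# start from a constant prediction; the mean minimizes squared error
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    # negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
    residual = y - prediction
    # fit a small tree to the negative gradient
    tree = DecisionTreeRegressor(max_depth=2, random_state=1)
    tree.fit(X, residual)
    # take a small step in the direction suggested by the tree
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print('MSE before: %.4f, after: %.4f' % (initial_mse, final_mse))
```

Swapping in a different differentiable loss only changes how `residual` is computed, which is what makes the approach general.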

**Naive gradient boosting is a greedy algorithm and can overfit the training dataset** quickly.

It can **benefit from regularization methods that penalize** various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.

There are three types of enhancements to basic gradient boosting that can improve performance:

* Tree Constraints: such as the depth of the trees and the number of trees used in the ensemble.
* Weighted Updates: such as a learning rate used to limit how much each tree contributes to the ensemble.
* Random sampling: such as fitting trees on random subsets of features and samples.

The use of random sampling often leads to a change in the name of the algorithm to “stochastic gradient boosting.”
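
The three kinds of enhancement map onto arguments of scikit-learn's `GradientBoostingClassifier`, sketched below. The specific values and the synthetic dataset are assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

model = GradientBoostingClassifier(
    n_estimators=100,    # tree constraint: number of trees
    max_depth=3,         # tree constraint: depth of each tree
    learning_rate=0.1,   # weighted updates: shrink each tree's contribution
    subsample=0.7,       # random sampling: fraction of rows per tree
    max_features=0.5,    # random sampling: fraction of features per split
    random_state=1,
)
model.fit(X, y)
print(model.estimators_.shape)
```

Setting `subsample` below 1.0 is what turns this into stochastic gradient boosting.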

**Randomness is used in the construction of the model.** This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to **evaluate them by averaging their performance across multiple runs or repeats of cross-validation**. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Here's how to use it as a classifier: 

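```python
from numpy import mean, std
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# X and y are assumed to be defined already
# define the model
model = GradientBoostingClassifier()
# define the evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model on the dataset
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```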

## Hyperparameters

### Number of Trees

An important hyperparameter for the Gradient Boosting ensemble algorithm is the number of estimators (each of which is a tree) used in the ensemble.

Recall that decision trees are added to the model sequentially in an effort to correct and improve upon the predictions made by prior trees. As such, **more trees is often better. The number of trees must also be balanced with the learning rate**, e.g. more trees may require a smaller learning rate, fewer trees may require a larger learning rate.

The number of trees can be set via the “n_estimators” argument and defaults to 100.

### Number of Samples

The number of samples used to fit each tree can be varied. This means that each tree is fit on a randomly selected subset of the training dataset.

Using **fewer samples introduces more variance for each tree, and could improve the overall performance** of the model.

The number of samples used to fit each tree is specified by the “subsample” argument and can be set to a fraction of the training dataset size. By default, it is set to 1.0 to use the entire training dataset.

### Number of Features

The number of features used to fit each decision tree can be varied.

Like changing the number of samples, changing the number of features **introduces additional variance into the model, which may improve performance, although it might require an increase in the number of trees.**

The number of features used by each tree is taken as a random sample and is specified by the “max_features” argument and defaults to all features in the training dataset.

### Learning Rate or Shrinkage

Learning rate controls the amount of contribution that each model has on the ensemble prediction.

**Smaller rates may require more decision trees in the ensemble, whereas larger rates may require an ensemble with fewer trees.** It is common to explore learning rate values on a log scale, such as between a very small value like 0.0001 and 1.0.

The **learning rate, also called shrinkage, can be set to smaller values to reduce overfitting**. This slows down the rate of learning as more models are added to the ensemble and, in turn, reduces the effect of overfitting.

The learning rate can be controlled via the “learning_rate” argument and defaults to 0.1.
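Exploring the learning rate on a log scale might look like the sketch below. The number of trees is left at its default, so the very small rates are expected to underfit here; the dataset and the candidate rates are illustrative assumptions.

```python
# sketch: explore the learning rate on a log scale
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# small synthetic dataset (sizes are illustrative assumptions)
X, y = make_classification(n_samples=500, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# learning rates spaced by factors of ten
rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
rate_results = {}
for rate in rates:
    model = GradientBoostingClassifier(learning_rate=rate, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=3)
    rate_results[rate] = mean(scores)
    print('learning_rate=%.4f: %.3f' % (rate, rate_results[rate]))
```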

### General Tips

In the 1999 paper “Greedy Function Approximation: A Gradient Boosting Machine”, Jerome Friedman offers the following advice:

**First set a large value for the number of trees, then tune the shrinkage parameter to achieve the best results.** Studies in the paper preferred a shrinkage value of 0.1, a number of trees in the range 100 to 500 and the number of terminal nodes in a tree between 2 and 8. 

Friedman introduces and empirically investigates stochastic gradient boosting (row-based sub-sampling). He finds that almost all **subsampling percentages are better than deterministic boosting** and perhaps 30%-to-50% is a good value to choose on some problems and 50%-to-80% on others.

He also studied the effect of the number of terminal nodes (or leaf nodes) in trees, finding that small values like 3 and 6 worked better than larger values like 11, 21 and 41.
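One way these tips could be applied with scikit-learn is sketched below: fix a fairly large number of small, subsampled trees, then tune only the shrinkage. The specific values (300 trees, 6 leaf nodes, 50% subsampling, the grid of rates) are assumptions picked from within the ranges quoted above, not values taken from the paper.

```python
# sketch: fix many small trees, then tune only the learning rate
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# small synthetic dataset (sizes are illustrative assumptions)
X, y = make_classification(n_samples=500, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# large ensemble of small trees with row subsampling
model = GradientBoostingClassifier(n_estimators=300, max_leaf_nodes=6, subsample=0.5, random_state=1)
# candidate shrinkage values to search over
grid = {'learning_rate': [0.01, 0.05, 0.1, 0.5]}
search = GridSearchCV(model, grid, scoring='accuracy', cv=3)
search.fit(X, y)
print('Best learning rate: %s' % search.best_params_['learning_rate'])
print('Best accuracy: %.3f' % search.best_score_)
```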

# XGBoost

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm.

The two main reasons to use XGBoost are execution speed and model performance.

Generally, XGBoost is fast when compared to other implementations of gradient boosting.

XGBoost isn't part of the sklearn library. However, the xgboost library provides a scikit-learn compatible API, so its models can be used alongside sklearn tools.

We will use the method via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

Both models operate the same way and take the same arguments that influence how the decision trees are created and added to the ensemble.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation.

Here's how to use it in Python:

```python
# evaluate xgboost algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier  # make sure a stable version of xgboost is installed
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = XGBClassifier()
# evaluate the model with repeated stratified 10-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report mean and standard deviation of the accuracy
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```

## Hyperparameters

### Number of Trees

An important hyperparameter for the XGBoost ensemble algorithm is the number of decision trees used in the ensemble.

Recall that decision trees are added to the model sequentially in an effort to correct and improve upon the predictions made by prior trees. As such, more trees is often better, although too many will eventually degrade performance because of overfitting.

The number of trees can be set via the “n_estimators” argument and defaults to 100.

### Tree Depth

Varying the depth of each tree added to the ensemble is another important hyperparameter for gradient boosting.

The tree depth controls how specialized each tree is to the training dataset: how general or overfit it might be. Trees are preferred that are not too shallow and general (like AdaBoost) and not too deep and specialized (like bootstrap aggregation).

Gradient boosting generally performs well with trees that have a modest depth, finding a balance between skill and generality.

Tree depth is controlled via the “max_depth” argument and defaults to 6. As with most hyperparameters, performance improves up to a point and then declines.

### Learning Rate

Learning rate controls the amount of contribution that each model has on the ensemble prediction.

Smaller rates may require more decision trees in the ensemble.

The learning rate can be controlled via the “eta” argument (exposed as “learning_rate” in the scikit-learn wrapper classes) and defaults to 0.3.

Good values for the learning rate tend to be between 0.0001 and 1.0.

### Number of Samples

The number of samples used to fit each tree can be varied. This means that each tree is fit on a randomly selected subset of the training dataset.

Using **fewer samples introduces more variance for each tree, and might improve the overall performance of the model.**

The number of samples used to fit each tree is specified by the “subsample” argument and can be set to a fraction of the training dataset size. By default, it is set to 1.0 to use the entire training dataset.

### Number of Features

The number of features used to fit each decision tree can be varied.

Like changing the number of samples, **reducing the number of features introduces additional variance into the model, which may improve performance, although it might require an increase in the number of trees.**

The number of features used by each tree is taken as a random sample and is specified by the “colsample_bytree” argument and defaults to all features in the training dataset, e.g. 100 percent or a value of 1.0. You can also sample columns for each depth level of a tree (“colsample_bylevel”) or for each split (“colsample_bynode”), but we will not look at these hyperparameters here.

# LightGBM

LightGBM is another optimized Gradient Boosting Machine. It was developed by Microsoft. 

LightGBM tends to be extremely fast and good at predictions. It's sometimes outperformed in speed by CatBoost, and may be outperformed on evaluation metrics (such as CV score) by XGBoost or CatBoost.

A good approach is to try it on your problem and compare the results.

Here's how to do it in Python:

```python
# lightgbm for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model
model = LGBMRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# scores are negated MAE values, so flip the sign of the mean for reporting
print('MAE: %.3f (%.3f)' % (-mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = LGBMRegressor()
model.fit(X, y)
# make a single prediction
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])
```

The hyperparameters are similar to those of the other boosting algorithms. Check the LightGBM documentation for the specifics.

# CatBoost

CatBoost is another optimized Gradient Boosting Machine. It was developed by Yandex. 

CatBoost tends to be extremely fast and good at predictions. It's often outperformed in speed by LightGBM, and sometimes outperformed on evaluation metrics (such as CV score) by XGBoost or LightGBM.

A good approach is to try it on your problem and compare the results.

Here's how to do it in Python:

```python
# catboost for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = CatBoostClassifier(verbose=0, n_estimators=100)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model = CatBoostClassifier(verbose=0, n_estimators=100)
model.fit(X, y)
# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])
```

The hyperparameters are similar to those of the other boosting algorithms. Check the CatBoost documentation for the specifics.