# Concepts

1. Ensemble learning
1. Bootstrap aggregation
1. Gradient descent
1. Boosting, Boosting Trees
1. Gradient Boosting, gradient boosted decision trees
1. XGBoost


# Resources


1. [A Beginner’s guide to XGBoost](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7)
1. [XGBoost: Everything You Need to Know](https://neptune.ai/blog/xgboost-everything-you-need-to-know#:~:text=XGBoost%20is%20a%20popular%20gradient,it's%20very%20easy%20to%20use.)
1. [Ensemble methods: bagging, boosting and stacking](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)

# Ensemble algorithms


## Introduction

“Unity is strength”. This old saying expresses pretty well the underlying idea behind the very powerful “ensemble methods” in machine learning. Roughly, ensemble learning methods, which often top the rankings of machine learning competitions (including Kaggle’s competitions), are based on the hypothesis that combining multiple models can often produce a much more powerful model.


**Ensemble learning** combines several learners (models) to improve overall performance, increasing predictiveness and accuracy in machine learning and predictive modeling.

Technically speaking, the power of ensemble models is simple: they can combine thousands of smaller learners trained on subsets of the original data. This leads to some interesting observations:

- The variance of the general model decreases significantly thanks to **bagging**
- The bias also decreases due to **boosting** 
- And overall predictive power improves because of **stacking**  

In the first section of this post we will present the notions of weak and strong learners and introduce three main ensemble learning methods: bagging, boosting and stacking. Then, in the second section, we will focus on bagging and discuss notions such as bootstrapping, bagging and random forests. In the third section, we will present boosting and, in particular, its two most popular variants: adaptive boosting (AdaBoost) and gradient boosting. Finally, in the fourth section, we will give an overview of stacking.


## What are ensemble methods?


Ensemble learning is a machine learning paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.

### Single weak learner

In machine learning, no matter if we are facing a classification or a regression problem, the choice of the model is extremely important to have any chance to obtain good results. This choice can depend on many variables of the problem: quantity of data, dimensionality of the space, distribution hypothesis…

A low bias and a low variance, although they most often vary in opposite directions, are the two most fundamental features expected of a model. Indeed, to be able to “solve” a problem, we want our model to have enough degrees of freedom to resolve the underlying complexity of the data we are working with, but we also want it to have not too many degrees of freedom, to avoid high variance and be more robust. This is the well-known **bias-variance tradeoff.**

<img width="763" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/bb83ff9c-234d-4320-9494-c5b255ad3d2c">


In ensemble learning theory, we call **weak learners** (or **base models**) models that can be used as building blocks for designing more complex models by combining several of them. Most of the time, these basic models do not perform well by themselves, either because they have a high bias (low degree of freedom models, for example) or because they have too much variance to be robust (high degree of freedom models, for example). The idea of ensemble methods is then to reduce the bias and/or variance of such weak learners by combining several of them in order to create a **strong learner** (or **ensemble model**) that achieves better performance.

### Combine weak learners

In order to set up an ensemble learning method, we first need to select the base models to be aggregated. Most of the time (including in the well-known bagging and boosting methods) a **single base learning algorithm** is used, so that we have homogeneous weak learners that are trained in different ways. The ensemble model we obtain is then said to be **“homogeneous”**. However, there also exist methods that use different types of base learning algorithms: heterogeneous weak learners are then combined into a “heterogeneous ensemble model”.

One important point is that ***our choice of weak learners should be coherent with the way we aggregate these models. If we choose base models with low bias but high variance, it should be with an aggregating method that tends to reduce variance whereas if we choose base models with low variance but high bias, it should be with an aggregating method that tends to reduce bias.***


This brings us to the question of how to combine these models. We can mention three major kinds of meta-algorithms that aim at combining weak learners:

1. **bagging,** which often considers homogeneous weak learners, learns them independently from each other in parallel and combines them following some kind of deterministic averaging process

1. **boosting,** which often considers homogeneous weak learners, learns them sequentially in a very adaptive way (a base model depends on the previous ones) and combines them following a deterministic strategy

1. **stacking,** which often considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model to output a prediction based on the predictions of the different weak models

**Very roughly, we can say that bagging mainly focuses on getting an ensemble model with less variance than its components, whereas boosting and stacking mainly try to produce strong models that are less biased than their components (even if variance can also be reduced).**

In the following sections, we will present bagging and boosting in detail (they are a bit more widely used than stacking and will allow us to discuss some key notions of ensemble learning) before giving a brief overview of stacking.

<img width="782" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/2383051e-d9d8-4e8b-9cf6-e84d89780100">

**Weak learners can be combined to get a model with better performance. The way to combine base models should be adapted to their types. Low bias and high variance weak models should be combined in a way that makes the strong model more robust, whereas low variance and high bias base models should be combined in a way that makes the ensemble model less biased.**


## Types of ensemble methods

Ensemble methods can be classified into two groups based on how the sub-learners are generated:

1. **Sequential ensemble methods** – learners are generated sequentially. These methods exploit the dependency between base learners: each learner influences the next one, typically by passing on information about the errors made so far. A popular example of sequential ensemble algorithms is **AdaBoost**.

1. **Parallel ensemble methods** – learners are generated in parallel. The base learners are created independently to study and exploit the effects related to their independence and reduce error by averaging the results. An example implementing this approach is **Random Forest.**


## Homogeneous and heterogeneous ML algorithms

- Ensemble methods can use **homogeneous learners** (learners from the same family) or **heterogeneous learners** (learners of several different types, as accurate and diverse as possible).

Generally speaking, homogeneous ensemble methods rely on a single type of base learning algorithm. Diversity comes from the training data, for example by resampling it or by assigning different weights to training samples.

Heterogeneous ensembles on the other hand consist of members having different base learning algorithms which can be combined and used simultaneously to form the predictive model. 

A general rule of thumb: 

- **Homogeneous ensembles** use the same feature selection with a variety of data and distribute the dataset over several nodes. Examples of homogeneous ensembles:
    - Bagged decision tree classifiers
    - Random Forests, Randomized Decision Trees

- **Heterogeneous ensembles** use different feature selection methods with the same data. Examples of models that can be combined in a heterogeneous ensemble:

    - Support Vector Machines (SVM)
    - Artificial Neural Networks (ANN)
    - Memory-Based Learning methods
    - Bagged and Boosted decision Trees like XGBoost (each of which is itself a homogeneous ensemble)


## Bagging

**Decrease overall variance** by averaging the predictions of multiple estimators. Several subsets of the original dataset, drawn randomly with replacement, are used to train different learners, and their outputs are then aggregated; this is the core idea of bootstrap aggregation. Bagging normally uses a voting mechanism for **classification** (as in Random Forest) and averaging for **regression**.

<img width="977" alt="image" src="https://github.com/eraikakou/ml-theory/assets/28102493/99049aed-daef-463c-898a-e872c372b296">


- **Note:** Remember that some learners are stable and less sensitive to training perturbations. Such learners, when combined, don’t help the general model to improve generalization performance.


## Boosting

This technique fits weak learners (learners that have poor predictive power and do only slightly better than random guessing) to successively reweighted versions of the original dataset. Higher weights are given to observations that were misclassified earlier.

Learner predictions are then combined with a weighted vote for classification or a weighted sum for regression.


<img width="932" alt="image" src="https://github.com/eraikakou/ml-theory/assets/28102493/72e432c5-90a3-4b44-bcb0-9e8212cd2717">


# Bagging

In **parallel methods** we fit the different learners independently from each other, so it is possible to train them concurrently. The most famous such approach is “bagging” (standing for “bootstrap aggregating”), which aims at producing an ensemble model that is more robust than the individual models composing it.


## Bootstrapping

Let’s begin by defining bootstrapping. This statistical technique consists in generating samples of size B (called bootstrap samples) from an initial dataset of size N by randomly drawing B observations with replacement.

<img width="675" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/79a2249d-48bf-4105-8047-4db2c39f997c">

Under some assumptions, these samples have pretty **good statistical properties**: to a first approximation, they can be seen as being drawn both directly from the true underlying (and often unknown) data distribution and independently from each other. So, they can be considered as representative and independent samples of the true data distribution (almost i.i.d. samples). Two hypotheses have to hold for this approximation to be valid. First, the size N of the initial dataset should be large enough to capture most of the complexity of the underlying distribution, so that sampling from the dataset is a good approximation of sampling from the real distribution (**representativity**). Second, the size N of the dataset should be large enough compared to the size B of the bootstrap samples, so that the samples are not too correlated (**independence**). Notice that in the following we will sometimes refer to these properties (representativity and independence) of bootstrap samples: the reader should always keep in mind that **this is only an approximation.**

**Bootstrap samples are often used, for example, to evaluate the variance or confidence intervals of a statistical estimator.** By definition, a statistical estimator is a function of some observations and is therefore a random variable whose variance comes from these observations. In order to estimate the variance of such an estimator, we need to evaluate it on several independent samples drawn from the distribution of interest. In most cases, considering truly independent samples would require too much data compared to the amount really available. We can then use bootstrapping to generate several bootstrap samples that can be considered as being **“almost-representative”** and **“almost-independent”** (**almost i.i.d. samples**). These bootstrap samples allow us to approximate the variance of the estimator by evaluating it on each of them.


<img width="1207" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/6e3a519f-dd3f-463f-96fe-4f317c6def56">

***Bootstrapping is often used to evaluate the variance or confidence intervals of statistical estimators.***
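
As a rough illustration of the idea above, here is a minimal sketch that estimates the standard error of the sample mean from bootstrap replicates and compares it with the textbook formula. The synthetic dataset, the helper names `bootstrap_sample` and `bootstrap_std_error`, and the choice B = N are assumptions made for this example only.

```python
import random
import statistics

def bootstrap_sample(data, size, rng):
    """Draw `size` observations from `data` uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(size)]

def bootstrap_std_error(data, estimator, n_samples=1000, seed=0):
    """Approximate the standard error of `estimator` from bootstrap replicates."""
    rng = random.Random(seed)
    estimates = [estimator(bootstrap_sample(data, len(data), rng))
                 for _ in range(n_samples)]
    return statistics.stdev(estimates)

# Synthetic dataset of size N = 200 (illustrative only).
rng = random.Random(42)
data = [rng.gauss(10.0, 2.0) for _ in range(200)]

se_boot = bootstrap_std_error(data, statistics.mean)
se_formula = statistics.stdev(data) / len(data) ** 0.5   # textbook s / sqrt(N)
print(f"bootstrap SE of the mean: {se_boot:.3f}   formula: {se_formula:.3f}")
```

For the mean a closed-form answer exists, which makes it a convenient sanity check; the same function can be called with `statistics.median` or any other estimator for which no simple formula is available.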

## Bagging in Detail

When training a model, whether we are dealing with a classification or a regression problem, we obtain a function that takes an input, returns an output and is defined with respect to the training dataset. Because the training dataset is itself random (recall that a dataset is an observed sample coming from a true unknown underlying distribution), the fitted model is also subject to variability: **if another dataset had been observed, we would have obtained a different model.**

*The idea of bagging is then simple: we want to fit several independent models and “average” their predictions in order to obtain a model with a lower variance. However, we can’t, in practice, fit fully independent models because it would require too much data. So, we rely on the good “approximate properties” of bootstrap samples (representativity and independence) to fit models that are almost independent.*


First, we create multiple bootstrap samples so that each new bootstrap sample acts as another (almost) independent dataset drawn from the true distribution. Then, we can **fit a weak learner for each of these samples and finally aggregate them such that we kind of “average” their outputs and so obtain an ensemble model with less variance than its components. Roughly speaking, as the bootstrap samples are approximately independent and identically distributed (i.i.d.), so are the learned base models.** Then, “averaging” the weak learners’ outputs does not change the expected answer but reduces its variance (just like averaging i.i.d. random variables preserves the expected value but reduces the variance).
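
The variance argument can be checked numerically. The sketch below is an idealised simulation, not a bagging implementation: each "model output" is assumed to be the true value plus independent unit-variance noise, so the variance of an average of L outputs should be close to 1/L. Real bagged models are only approximately independent, so the reduction seen in practice is smaller.

```python
import random
import statistics

rng = random.Random(0)
L = 10            # number of averaged models
trials = 5000     # number of simulated repetitions

# One model: true value (0 here) plus independent unit-variance noise.
single = [rng.gauss(0.0, 1.0) for _ in range(trials)]

# Ensemble: the mean of L such independent outputs.
averaged = [statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(L)])
            for _ in range(trials)]

var_single = statistics.pvariance(single)
var_avg = statistics.pvariance(averaged)
print(f"variance of one model:        {var_single:.3f}")
print(f"variance of the average of {L}: {var_avg:.3f}   (theory: {1 / L:.3f})")
```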

<img width="1498" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/05ac7b00-06b4-469c-ab89-91a52c988c43">

There are several possible ways to aggregate the multiple models fitted in parallel. For a regression problem, the outputs of individual models can literally be averaged to obtain the output of the ensemble model. For a classification problem, the class output by each model can be seen as a vote, and the class that receives the majority of the votes is returned by the ensemble model (this is called **hard-voting**). Still for a classification problem, we can also consider the probabilities of each class returned by all the models, average these probabilities and keep the class with the highest average probability (this is called **soft-voting**). Averages or votes can be either simple or weighted if any relevant weights can be used.
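
The three aggregation rules can be written in a few lines. This is a minimal sketch with unweighted votes; the function names and the three hypothetical model outputs are made up for the example, which is chosen so that hard and soft voting disagree.

```python
from collections import Counter

def average(predictions):
    """Regression: plain mean of the individual model outputs."""
    return sum(predictions) / len(predictions)

def hard_vote(labels):
    """Classification: return the class predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Classification: average per-class probabilities, return the top class."""
    classes = probabilities[0].keys()
    mean_proba = {c: sum(p[c] for p in probabilities) / len(probabilities)
                  for c in classes}
    return max(mean_proba, key=mean_proba.get)

# Three hypothetical models scoring the same observation.
probas = [
    {"cat": 0.55, "dog": 0.45},
    {"cat": 0.60, "dog": 0.40},
    {"cat": 0.05, "dog": 0.95},
]
labels = [max(p, key=p.get) for p in probas]   # each model's own top class

print(hard_vote(labels))          # cat: two of the three models vote for it
print(soft_vote(probas))          # dog: its mean probability is 0.60 against 0.40
print(average([2.0, 3.0, 4.0]))   # 3.0
```

Two models lean weakly towards "cat" while one is very confident about "dog", so the majority vote and the averaged probabilities pick different classes.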

Finally, one of the big advantages of bagging is that it can be parallelised. As the different models are fitted independently from each other, intensive parallelisation techniques can be used if required.


<img width="1322" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/91a4fa5b-491e-440a-afba-97515d2d403a">

**Bagging consists in fitting several base models on different bootstrap samples and building an ensemble model that “averages” the results of these weak learners.**
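
Putting the pieces together, here is a small from-scratch sketch of bagging for regression. The base model is a 1-nearest-neighbour regressor, picked only because it is short to write and has low bias but high variance (on a single feature it behaves much like a fully grown regression tree). That choice, the synthetic `sin(x)` data and all helper names are assumptions of this example.

```python
import math
import random

def fit_1nn(sample):
    """Base model: 1-nearest-neighbour regressor (low bias, high variance)."""
    def predict(x):
        return min(sample, key=lambda point: abs(point[0] - x))[1]
    return predict

def fit_bagging(data, n_models, rng):
    """Fit one base model per bootstrap sample and average their outputs."""
    models = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in range(len(data))]   # draw with replacement
        models.append(fit_1nn(boot))
    return lambda x: sum(m(x) for m in models) / len(models)

rng = random.Random(0)

def make_point():
    x = rng.uniform(0.0, 6.0)
    return x, math.sin(x) + rng.gauss(0.0, 0.3)   # noisy observation of sin(x)

train = [make_point() for _ in range(150)]
grid = [i * 0.06 for i in range(100)]             # evaluation inputs

def mse(model):
    """Mean squared error against the noise-free target sin(x)."""
    return sum((model(x) - math.sin(x)) ** 2 for x in grid) / len(grid)

single_mse = mse(fit_1nn(train))
bagged_mse = mse(fit_bagging(train, n_models=25, rng=rng))
print(f"single model MSE: {single_mse:.4f}   bagged MSE: {bagged_mse:.4f}")
```

Each of the 25 models could be fitted on a separate worker, since none of them depends on the others.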

## Random forests

Decision trees are very popular base models for ensemble methods. Strong learners composed of multiple trees can be called “forests”. The trees that compose a forest can be chosen to be either shallow (a few levels deep) or deep (many levels, if not fully grown). Shallow trees have less variance but higher bias and are therefore a better choice for the sequential methods that we will describe later. Deep trees, on the other hand, have low bias but high variance and so are relevant choices for bagging, which is mainly focused on reducing variance.

The **random forest** approach is a bagging method where **deep trees**, fitted on bootstrap samples, are combined to produce an output with lower variance. However, random forests also use another trick to make the multiple fitted trees a bit less correlated with each other: when growing each tree, instead of only sampling over the observations in the dataset to generate a bootstrap sample, we also **sample over features** and keep only a random subset of them to build the tree (many implementations draw this random subset afresh at each split rather than once per tree).

Sampling over features indeed has the effect that the trees do not all look at the exact same information to make their decisions, and so it reduces the correlation between the different returned outputs. Another advantage of sampling over the features is that **it makes the decision making process more robust to missing data:** observations (from the training dataset or not) with missing data can still be regressed or classified based on the trees that take into account only features where data are not missing. Thus, the random forest algorithm combines the concepts of bagging and random feature subspace selection to create more robust models.

<img width="1388" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/8cebc7d6-f837-4c89-962c-fec3c3849435">


***The random forest method is a bagging method with trees as weak learners. Each tree is fitted on a bootstrap sample considering only a randomly chosen subset of variables.***
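
The double sampling described above can be sketched as follows. This only shows how the rows and features handed to each tree might be drawn, following the per-tree description in the text; the function name `draw_tree_inputs` and the sizes are hypothetical, and no tree is actually grown.

```python
import random

def draw_tree_inputs(n_rows, n_features, n_sub_features, rng):
    """Return (row indices, feature indices) used to grow one tree."""
    # Bootstrap over observations: same size as the dataset, with replacement.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random feature subset: drawn without replacement.
    features = sorted(rng.sample(range(n_features), n_sub_features))
    return rows, features

rng = random.Random(0)
n_rows, n_features = 8, 6
for tree_id in range(3):
    rows, features = draw_tree_inputs(n_rows, n_features, n_sub_features=2, rng=rng)
    print(f"tree {tree_id}: rows={rows} features={features}")
```

Because each tree sees different rows and different columns, two trees rarely make exactly the same splits, which is what lowers the correlation between their outputs.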

# Boosting, Boosting Trees

## Introduction

With a regular machine learning model, like a decision tree, we’d simply train a single model on our dataset and use that for prediction. We might play around with the parameters for a bit or augment the data, but in the end we are still using a single model. Even if we build an ensemble, all of the models are trained and applied to our data separately.

Boosting, on the other hand, takes a more iterative approach. It’s still technically an ensemble technique, in that many models are combined to form the final one, but it takes a more clever approach.

***Rather than training all of the models in isolation of one another, boosting trains models in succession, with each new model being trained to correct the errors made by the previous ones.*** Models are added sequentially until no further improvements can be made.

The advantage of this iterative approach is that the new models being added are focused on correcting the mistakes which were caused by other models. **In a standard ensemble method where models are trained in isolation, all of the models might simply end up making the same mistakes!**


**Focus on boosting**

In **sequential methods** the different combined weak models are no longer fitted independently from each other. The idea is to fit models iteratively, such that the training of the model at a given step depends on the models fitted at the previous steps. “Boosting” is the most famous of these approaches, and it produces an ensemble model that is in general less biased than the weak learners that compose it.

## Boosting in Detail

Boosting methods work in the same spirit as bagging methods: we build a family of models that are aggregated to obtain a strong learner that performs better. However, unlike bagging, which mainly aims at reducing variance, boosting is a technique that consists in fitting multiple weak learners sequentially in a very adaptive way: each model in the sequence is fitted giving more importance to observations in the dataset that were badly handled by the previous models in the sequence. Intuitively, each new model **focuses its efforts on the observations that have been the most difficult to fit so far,** so that we obtain, at the end of the process, a strong learner with lower bias (even if boosting can also have the effect of reducing variance). Boosting, like bagging, can be used **for regression as well as for classification problems**.

Being **mainly focused on reducing bias**, the base models that are often considered for boosting are models with low variance but high bias. For example, if we want to use trees as our base models, we will most of the time choose shallow decision trees that are only a few levels deep. Another important reason that motivates the use of low variance but high bias models as weak learners for boosting is that these models are in general less computationally expensive to fit (few degrees of freedom when parametrised). Indeed, as the computations to fit the different models **can’t be done in parallel** (unlike bagging), it could become too expensive to fit several complex models sequentially.


Once the weak learners have been chosen, we still need to define:

1. how they will be sequentially fitted (what information from previous models do we take into account when fitting current model?) and

1. how they will be aggregated (how do we aggregate the current model to the previous ones?). 

We will discuss these questions in the two following subsections, describing in particular **two important boosting algorithms: AdaBoost and gradient boosting.**

In a nutshell, these two meta-algorithms differ on how they create and aggregate the weak learners during the sequential process. **Adaptive boosting updates the weights attached to each of the training dataset observations whereas gradient boosting updates the value of these observations.** This main difference comes from the way **both methods try to solve the optimisation problem of finding the best model that can be written as a weighted sum of weak learners.**


<img width="1356" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/49eefd19-3014-4ebe-8d30-0c181c9b5df0">


***Boosting consists in iteratively fitting a weak learner, aggregating it into the ensemble model and “updating” the training dataset to better take into account the strengths and weaknesses of the current ensemble model when fitting the next base model.***


## Well-known boosting algorithms 


### AdaBoost

AdaBoost stands for **Adaptive Boosting**. The logic implemented in the algorithm is: 

1. In the first round, the classifier (learner) is trained with equal weights on all data points,

1. In subsequent boosting rounds, the adaptive process increases the weights of data points that were misclassified by the learners in previous rounds and decreases the weights of correctly classified ones.

If you’re curious about the algorithm’s description, take a look at this:

<img width="723" alt="image" src="https://github.com/eraikakou/ml-theory/assets/28102493/525b6992-4b71-4de7-9ff2-86ec96077054">


In adaptive boosting (often called “AdaBoost”), we try to define our ensemble model as a weighted sum of L weak learners

<img width="779" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/0db229cd-6c85-4ee5-a690-e055d01b348f">

Finding the best ensemble model with this form is a difficult optimisation problem. So, instead of trying to solve it in one single shot (finding all the coefficients and weak learners that give the best overall additive model), we make use of an iterative optimisation process that is much more tractable, even if it can lead to a sub-optimal solution. More specifically, we add the weak learners one by one, looking at each iteration for the best possible pair (coefficient, weak learner) to add to the current ensemble model. In other words, we define the (s_l)’s recursively such that


<img width="696" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/f4f16536-ae2d-4b74-853f-cdab38ff208a">

where c_l and w_l are chosen such that s_l is the model that best fits the training data and is therefore the best possible improvement over s_(l-1). We can then write

<img width="848" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/280a4912-d2f9-4de1-8e67-a5c9796ec8ac">

where E(.) is the fitting error of the given model and e(.,.) is the loss/error function. Thus, **instead of optimising “globally” over all the L models in the sum, we approximate the optimum by optimising “locally” building and adding the weak learners to the strong model one by one.**

More specifically, when considering binary classification, we can show that the AdaBoost algorithm can be rewritten as a process that proceeds as follows.

1. First, it **updates the observation weights** in the dataset and trains a new weak learner with a special focus on the observations misclassified by the current ensemble model.

1. Second, it **adds the weak learner to the weighted sum according to an update coefficient that expresses the performance of this weak model:** the better a weak learner performs, the more it contributes to the strong learner.

So, assume that we are facing a binary classification problem, with N observations in our dataset and we want to use adaboost algorithm with a given family of weak models. At the very beginning of the algorithm (first model of the sequence), all the observations have the same weights 1/N. Then, we repeat L times (for the L learners in the sequence) the following steps:

- fit the best possible weak model with the current observation weights

- compute the value of the update coefficient, a scalar evaluation metric of the weak learner that indicates how much this weak learner should count in the ensemble model

- update the strong learner by adding the new weak learner multiplied by its update coefficient

- compute new observation weights that express which observations we would like to focus on at the next iteration (weights of observations wrongly predicted by the aggregated model increase and weights of the correctly predicted observations decrease)

**Repeating these steps, we sequentially build our L models and aggregate them into a simple linear combination weighted by coefficients expressing the performance of each learner.** Notice that there exist variants of the initial AdaBoost algorithm, such as **LogitBoost (classification)** or **L2Boost (regression)**, that mainly differ in their choice of **loss function**.
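
The loop described above can be sketched from scratch. This is one common formulation (discrete AdaBoost with labels in {-1, +1}, decision stumps as weak learners and update coefficient 0.5·ln((1 − err)/err)); the helper names and the tiny dataset are assumptions of this example rather than anything prescribed by the text.

```python
import math

def fit_stump(X, y, w):
    """Return the stump with the lowest weighted error: (error, feature, threshold, sign)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # The stump predicts `sign` when row[j] <= t and `-sign` otherwise.
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] <= t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def stump_predict(stump, row):
    _, j, t, sign = stump
    return sign if row[j] <= t else -sign

def adaboost(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                      # every observation starts with weight 1/N
    ensemble = []                          # list of (update coefficient, stump)
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)       # better stump -> larger coefficient
        ensemble.append((alpha, stump))
        # Raise the weights of misclassified points, lower the others, renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D dataset: no single threshold separates the two classes.
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    model = adaboost(X, y, rounds)
    correct = sum(predict(model, row) == yi for row, yi in zip(X, y))
    print(f"{rounds} round(s): {correct}/{len(y)} training points correct")
```

A single stump cannot classify every point of this toy dataset, while the weighted sum of three stumps can, which is the bias reduction that boosting is meant to deliver.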

<img width="1288" alt="image" src="https://github.com/eraikakou/LLMs-News/assets/28102493/60b9e337-8723-4080-b1ef-fcb2e63c38ac">

***AdaBoost updates the weights of the observations at each iteration. Weights of well classified observations decrease relative to weights of misclassified observations. Models that perform better have higher weights in the final ensemble model.***

### Gradient Boosting methods

Gradient Boosting specifically is an approach where new models are trained to predict the residuals (i.e., errors) of prior models. I’ve outlined the approach in the diagram below.


<img width="787" alt="image" src="https://github.com/eraikakou/ml-theory/assets/28102493/20f21c25-597e-4170-9fcd-0d7a0817696e">

**Gradient Boosting relies on a differentiable loss function. At each boosting stage, a new weak learner is fitted to reduce that loss given the current model**, typically by fitting the negative gradient of the loss with respect to the current predictions. Boosting algorithms can be used either for **classification or regression**.
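
For squared-error loss the negative gradient is simply the residual, which makes the idea easy to sketch. The code below is a minimal illustration, assuming a single feature, regression stumps as weak learners and a fixed learning rate; the helper names and the `sin(x)` data are made up for the example and this is not how any particular library implements it.

```python
import math

def fit_regression_stump(x, residuals):
    """Best single split under squared error: (threshold, left mean, right mean)."""
    best = None
    for t in sorted(set(x))[:-1]:          # dropping the largest value keeps both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def gradient_boost(x, y, n_rounds, learning_rate):
    base = sum(y) / len(y)                 # initial model: the mean of the targets
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of 0.5 * (y - F(x))**2 with respect to F(x).
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_regression_stump(x, residuals)
        stumps.append((t, lm, rm))
        # Add the new stump with reduced weight (shrinkage).
        pred = [pi + learning_rate * (lm if xi <= t else rm)
                for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return (base, learning_rate, stumps), history

def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if xi <= t else rm for t, lm, rm in stumps)

x = [i / 10 for i in range(60)]
y = [math.sin(xi) for xi in x]
model, history = gradient_boost(x, y, n_rounds=50, learning_rate=0.3)
print(f"training MSE after 1 round: {history[0]:.4f}, after 50 rounds: {history[-1]:.4f}")
```

Each round fits a stump to what the current ensemble still gets wrong, so the training error shrinks round after round; the learning rate controls how much of each correction is applied.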


<img width="550" alt="image" src="https://github.com/eraikakou/ml-theory/assets/28102493/86fd9b7a-5594-4c02-b32b-88c878ba407d">


#### XGBoost

XGBoost stands for Extreme Gradient Boosting. It’s a parallelized and carefully optimized version of the gradient boosting algorithm. Trees are still added one after another, but the work of building each tree is parallelized, which greatly reduces training time.

Instead of training one best-possible model on the data (as in traditional methods), we train many small trees, optionally on random subsets of the rows and columns of the training dataset, and sum their predictions to form the final model.

For many cases, XGBoost is better than usual gradient boosting algorithms. The Python implementation gives access to a vast number of inner parameters to tweak for better precision and accuracy.

Some important features of XGBoost are:

- **Parallelization:** The model is implemented to train with multiple CPU cores.

- **Regularization:** XGBoost includes different regularization penalties to avoid overfitting. These penalties keep the individual trees simple so that the model generalizes better.

- **Non-linearity:** XGBoost can detect and learn from non-linear data patterns.

- **Cross-validation:** Built-in and comes out-of-the-box.

- **Scalability:** XGBoost can run distributed on clusters managed by frameworks such as Hadoop and Spark, so you can process enormous amounts of data. It’s also available for many programming languages like C++, Java, Python, and Julia.


#### Gradient Boosting Machine (GBM)

**GBM** combines predictions from multiple decision trees, and all the weak learners are decision trees. As it’s a Boosting algorithm, each new tree learns from the errors made by the previous ones. Some implementations can additionally consider a different random subset of features at each node when selecting the best split.


#### Light Gradient Boosting Machine (LightGBM)

**LightGBM** can handle huge amounts of data. It’s one of the fastest algorithms for both training and prediction. It generalizes well, meaning that it can be used to solve similar problems. It scales well to large numbers of cores and is open source, so you can use it in your projects for free.

#### Categorical Boosting (CatBoost)

This particular set of Gradient Boosting variants has specific abilities to handle categorical variables and data in general. The **CatBoost** object can handle categorical variables or numeric variables, as well as datasets with mixed types. That’s not all. It can also use unlabelled examples and explore the effect of kernel size on speed during training.

# XGBoost

## Introduction

XGBoost is an open source library providing a high-performance implementation of **gradient boosted decision trees**. An underlying **C++ codebase combined with a Python interface** sitting on top makes for an extremely powerful yet easy-to-use package. XGBoost is a popular gradient-boosting library that supports GPU training, distributed computing, and parallelization. It’s precise, it adapts well to many types of data and problems, it has excellent documentation, and overall it’s very easy to use.

The performance of XGBoost is no joke — it’s become the go-to library for winning many Kaggle competitions. Its gradient boosting implementation is second to none and there’s only more to come as the library continues to garner praise.

At the moment it’s widely treated as a de facto standard for getting accurate results from predictive modeling with machine learning, and it’s among the fastest gradient-boosting libraries available for R, Python, and C++.


## Getting started with XGBoost


1. **Step 1:** In order for XGBoost to be able to use our data, we’ll need to transform it into a specific format that XGBoost can handle. That format is called `DMatrix`. It’s a very simple one-liner to transform a numpy array of data to DMatrix format (assuming `import xgboost as xgb`):
    - `D_train = xgb.DMatrix(X_train, label=Y_train)`
    - `D_test = xgb.DMatrix(X_test, label=Y_test)`

1. **Step 2 - Defining an XGBoost model:** Now that our data is all loaded up, we can define the parameters of our gradient boosting ensemble. Some of the most important ones are listed below to get us started. For more complicated tasks and models, the full list of possible parameters is available on the official XGBoost website. The simplest parameters are:

    - `max_depth`: the maximum depth of the decision trees being trained
    - `objective`: the loss function being used
    - `num_class`: the number of classes in the dataset (only needed for multiclass objectives)
    - The `eta` parameter requires special attention. From our theory, Gradient Boosting involves creating and adding decision trees to an ensemble model sequentially. New trees are created to correct the residual errors in the predictions from the existing ensemble. **Because an ensemble puts several models together to form what is essentially one very large, complicated model, this technique is prone to overfitting. The eta parameter gives us a chance to prevent this overfitting.** The eta can be thought of more intuitively as a learning rate. Rather than adding the predictions of each new tree to the ensemble with full weight, the new tree’s contribution is multiplied by eta to reduce its weight. This limits how much any single tree can change the model. It is common to use small values in the range of 0.1 to 0.3. The smaller weighting of each tree will still help us train a powerful model, but won’t let that model run away into deep complexity where overfitting is more likely to happen.
    - `steps`: the number of training iterations (boosting rounds). This is not an entry in the parameter dictionary; it is passed to `xgb.train` separately.
    - Some further parameters help you get the most out of your models. The `gamma` parameter can also help with controlling overfitting. It specifies the minimum reduction in the loss required to make a further partition on a leaf node of the tree, i.e. if a split doesn’t reduce the loss by at least that amount, it is not made.
    - The `booster` parameter allows you to set the type of model you will use when building the ensemble. The default is `gbtree`, which builds an ensemble of decision trees. If your data isn’t too complicated, you can go with the faster and simpler `gblinear` option, which builds an ensemble of linear models.

1. **Step 3 - Grid Search:** Setting the optimal hyperparameters of any ML model can be a challenge. So why not let scikit-learn do it for you? We can combine scikit-learn’s grid search with an XGBoost classifier quite easily. **Only do that on a big dataset if you have time to kill: a grid search trains an ensemble of decision trees many times over!**

1. **Step 4 - Training and Testing:** We can finally train our model, much as we would with scikit-learn: `model = xgb.train(param, D_train, steps)`. Predictions for the test set then come from `model.predict(D_test)`.


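```python
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# X_train and Y_train are assumed to be the training arrays from Step 1.
clf = xgb.XGBClassifier()

# Candidate values for each hyperparameter. Every combination is tried:
# 6 * 8 * 4 * 5 * 4 = 3840 settings, each fitted cv=3 times.
parameters = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],  # scikit-learn wrapper name for eta
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# Score each setting by negative log loss with 3-fold cross-validation,
# using up to 4 parallel jobs.
grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X_train, Y_train)
```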

## How does the XGBoost algorithm work?

- Consider an initial function or estimate. Starting from it, we build a sequence of estimates derived from the gradients of the loss. The equation below models a particular form of gradient descent. The gradient is taken of the loss function we want to minimize, so it gives the direction in which the function decreases. The step-size term is the rate of change fitted to the loss function; it’s equivalent to the learning rate in gradient descent. Each estimate in the sequence is expected to approximate the behaviour of the loss suitably.
    
    <img width="424" alt="image" src="https://github.com/eraikakou/ml-theory/assets/28102493/8d409265-2140-4ccb-8152-198ba869da7f">
    
- To iterate over the model and find the optimal definition, we need to express the whole formula as a sequence and find an effective function whose minimum that sequence converges to. This function will serve as an error measure to help us decrease the loss and keep the performance over time. The notation below defines the error function that applies when evaluating a gradient boosting regressor.
    
    <img width="530" alt="image" src="https://github.com/eraikakou/ml-theory/assets/28102493/5a6a2353-333c-42ba-9d99-93c71873d3c4">
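
The gradient-descent view above can be sketched in plain NumPy. This is a simplified illustration of ordinary gradient boosting with a squared-error loss and one-split trees (stumps), not XGBoost's actual algorithm, which adds regularization and second-order information. The helper names `fit_stump`, `predict_stump` and `boost`, and the synthetic data, are hypothetical and exist only for this example.

```python
import numpy as np

def fit_stump(x, targets):
    """Find the single split on x that best fits targets in squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = targets[x <= threshold]
        right = targets[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, eta=0.1):
    """Build a sequence of estimates, each one a small step along the negative gradient."""
    prediction = np.full_like(y, y.mean(), dtype=float)  # initial estimate
    losses = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - F)^2, the negative gradient w.r.t. F is the residual.
        negative_gradient = y - prediction
        stump = fit_stump(x, negative_gradient)
        # eta plays the role of the learning rate in gradient descent.
        prediction = prediction + eta * predict_stump(stump, x)
        losses.append(0.5 * np.mean((y - prediction) ** 2))
    return prediction, losses

# Synthetic 1-D regression data (an assumption for illustration).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)

prediction, losses = boost(x, y)
print(f"loss after round 1: {losses[0]:.4f}, after round {len(losses)}: {losses[-1]:.4f}")
```

Each round fits a new weak learner to the negative gradient of the loss and adds a shrunken copy of it to the running estimate, so the training loss never increases from one round to the next in this setting.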


