<a href="https://colab.research.google.com/github/ajrianop/ML/blob/main/05_XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Ensemble learning**

Ensemble learning combines several models to obtain better results than a single model would, which makes it a powerful tool in machine learning.
Random Forest is an example of ensemble learning: many decision trees are combined to improve the model's performance and reduce overfitting.
The same idea applies to K-Means clustering: we can fit several K-Means models initialized with different random centroids and then aggregate their results through voting to determine the final outcome.

Random Forest uses a technique called **bagging**, which stands for bootstrap aggregating. Bagging draws random samples of the training data with replacement (bootstrap samples) and trains a separate copy of the same model on each sample. These models then vote to determine the final result. In Random Forest, many decision trees are trained on such random samples of the training data, and their combined vote improves the model's performance.
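As a rough illustration of bagging, the sketch below trains 25 decision trees on bootstrap samples using scikit-learn's `BaggingClassifier`. The dataset (Iris), the number of trees, and the 70/30 split are assumptions chosen only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 25 decision trees, each fitted on a bootstrap sample (drawn with
# replacement) of the training set; their predictions are aggregated.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X_train, y_train)

bagging_accuracy = bagging.score(X_test, y_test)
print(f"Bagging test accuracy: {bagging_accuracy:.3f}")
```

Random Forest follows the same pattern and additionally restricts each split to a random subset of the features.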

**Boosting** is an alternative approach in which models are trained one after another and each new model focuses on the instances that the previous models got wrong. The idea is to start with a base model and then give more weight to the misclassified instances (or fit the next model to the remaining errors). By iteratively training and adjusting models based on the weak points of the previous ones, boosting aims to improve the overall accuracy.
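The loop below is a minimal sketch of one boosting variant, gradient boosting with squared error, where each new tree is fitted to the residuals left by the ensemble so far. The synthetic data, the tree depth, the learning rate, and the number of rounds are all assumptions made for illustration; this is not XGBoost's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # base model: always predict the mean
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)      # the next model targets those residuals
    prediction += learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print(f"Training MSE: {errors[0]:.4f} -> {errors[-1]:.4f}")
```

The training error shrinks from round to round because every tree corrects part of what the previous trees missed.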

**Bucket of models** involves training multiple models on the same training data and selecting the best-performing model as the final choice.
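A bucket of models can be sketched as follows. The three candidate models and the use of 5-fold cross-validation to pick the winner are assumptions for this example; any other set of models or selection criterion would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical bucket of candidate models.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Score every candidate with 5-fold cross-validation and keep only the winner.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)

print(scores)
print(f"Selected model: {best_name}")
```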

**Stacking**, on the other hand, runs multiple models and combines their results. However, unlike the bucket of models approach where only the winning model is selected, stacking combines the results of all the models to arrive at the final output. This combination of multiple models' predictions can provide improved performance and better decision-making.
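A minimal stacking sketch with scikit-learn's `StackingClassifier` is shown below. The choice of base models (k-nearest neighbors and a decision tree) and of logistic regression as the final combiner are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The base models are all kept; a final estimator learns how to combine
# their cross-validated predictions into one output.
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stacking_accuracy = stack.score(X_test, y_test)
print(f"Stacking test accuracy: {stacking_accuracy:.3f}")
```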

Advanced ensemble learning:

* Bayes optimal classifier: theoretically the best, but impractical.

* Bayesian Parameter Averaging: susceptible to overfitting and often outperformed by the simpler bagging approach.

* Bayesian Model Combination: tries to solve the shortcomings of the Bayes optimal classifier and Bayesian Parameter Averaging.

In general, these approaches are difficult to apply in practice.

# **XGBoost**

XGBoost stands for eXtreme Gradient Boosted trees. Recall that boosting is an ensemble method in which each new model focuses on the instances that the previous models handled poorly.
XGBoost has several useful features:

* Regularized boosting, which helps prevent overfitting.
* Missing values are handled automatically, so no separate imputation step is needed.
* Training runs in parallel.
* Cross-validation can be run at each iteration.
* Incremental training is supported: we can stop training an XGBoost model, save it, and come back later to pick it up again.
* We can plug in our own optimization objectives.
* Tree pruning: recall that pruning reduces the size of a decision tree by removing unnecessary branches and nodes. It is a technique used to prevent overfitting and improve the generalization ability of the tree.


Hyperparameter tuning:
* Choose the booster:
 - gbtree
 - gblinear
* Choose the objective of the model, for example:
 - multi:softmax
 - multi:softprob
* `eta`: the biggest knob that we have in XGBoost. This is the learning rate, which shrinks the weights added at each boosting step. The default value is 0.3; lower values make training more conservative and often give better results, usually at the cost of more boosting rounds.
* `max_depth`: the maximum depth of each tree. If the depth is not large enough, the model cannot be very accurate, but if it is too large, the model will probably overfit.
* `min_child_weight`: used to control overfitting.
* See the XGBoost documentation for more hyperparameters.
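The hyperparameters above are usually collected in a dictionary. The values below are illustrative assumptions for a three-class problem: `eta`, `max_depth`, and `min_child_weight` are set to XGBoost's documented defaults, and `num_class` is required by the `multi:` objectives.

```python
# Hypothetical parameter set for a 3-class classification problem.
params = {
    "booster": "gbtree",           # tree-based booster ("gblinear" is the linear one)
    "objective": "multi:softmax",  # predicts a class label; "multi:softprob" returns probabilities
    "num_class": 3,                # number of classes, needed by the multi: objectives
    "eta": 0.3,                    # learning rate (default value)
    "max_depth": 6,                # maximum tree depth (default value)
    "min_child_weight": 1,         # larger values make the model more conservative (default value)
}

print(params)
```

Such a dictionary is what the native API expects as the first argument of `xgboost.train`, together with the training data wrapped in a `DMatrix`.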


## **Example**

If we run the code locally, we first have to install xgboost using `pip install xgboost`.
We are going to use the Iris dataset from sklearn, which includes the width and length of the petals and sepals of many Iris flowers.


In [10]:
from sklearn.datasets import load_iris

iris = load_iris()
type_iris = type(iris)
type_iris

sklearn.utils._bunch.Bunch

In [11]:
import pandas as pd
# Create a DataFrame from the iris dataset
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris

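| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |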


In [12]:
number_samples , number_features = iris.data.shape
print(f'The number of samples in the array iris is {number_samples} and the amount of features is {number_features}')
print(f'The information in the target is: {list(iris.target_names)}')

The number of samples in the array iris is 150 and the amount of features is 4
The information in the target is: ['setosa', 'versicolor', 'virginica']


We can observe several properties of the Iris dataset. It consists of an array with 150 samples, each representing an iris flower, and four features that describe each sample. Moreover, the dataset is already labeled, which means target values are available. These target values are what we need to train and evaluate our model.
The kinds of iris flowers in the dataset are:
* setosa
* versicolor
* virginica

Now, the objective is to train a model using the available data, specifically employing XGBoost. To begin the training process, it is essential to split the dataset into two distinct sets: a training set and a test set. The allocation of data between these sets can vary. One common approach is to assign 70% of the data to the training set and the remaining 30% to the test set. Alternatively, an 80% to 20% split can also be considered, where 80% of the data is utilized for training, and the remaining 20% is set aside for testing the model's performance. The choice of the specific split ratio depends on various factors, including the dataset's size and the desired evaluation strategy.

In [14]:
from sklearn.model_selection import train_test_split

# Splitting dataset: train = 70 % and test = 30 %
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)
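
As a quick sanity check, the sketch below repeats the split and prints the resulting shapes. The second call, which passes `stratify`, is an optional variation that is not part of the original cell; it keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same 70/30 split as in the cell above.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)

# Optional variation (an assumption, not in the original cell):
# stratify so that each class is equally represented in the test set.
_, _, _, y_test_strat = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)
print(np.bincount(y_test_strat))  # [15 15 15]
```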