# Ensemble Learning and Random Forests

An ensemble is a group of models whose predictions are aggregated, for example by voting or averaging. This approach often produces more accurate
predictions than the best individual model, but training multiple models is computationally heavy. A random forest is one such ensemble: a collection of decision trees.
In this chapter we will examine random forests and other ensemble methods as well.

## Voting Classifier

The concept is simple. We train a set of predictors, and the class that receives the most votes is the ensemble's prediction (this is called a
_hard voting classifier_). Note that to make this method reliable, the models should be diverse and as independent from one another as possible, so that
they make different kinds of errors. Let's look at an example using the moons dataset:

In [1]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(estimators=[
    ("lr", LogisticRegression(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svc", SVC(random_state=42))
])

voting_clf.fit(X_train, y_train)
# Let's look at its performance
print(voting_clf.score(X_test, y_test))

0.912
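
To make the voting rule concrete, here is a minimal sketch of hard voting done by hand with NumPy. The `preds` matrix and the `hard_vote` helper are hypothetical illustrations, not output or code from the classifiers above. Ties go to the lowest class label in this sketch.

```python
import numpy as np

# Hypothetical predictions: one row per classifier, one column per instance
preds = np.array([
    [0, 1, 1, 0],  # classifier A
    [0, 1, 0, 0],  # classifier B
    [1, 1, 0, 0],  # classifier C
])

def hard_vote(predictions):
    """Return the most frequent class label in each column."""
    n_classes = predictions.max() + 1
    # Count the votes per class for every instance (result shape: n_classes x n_instances)
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return votes.argmax(axis=0)

majority = hard_vote(preds)
print(majority)
```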


If all the classifiers can estimate class probabilities, we can ask scikit-learn to predict the class with the highest probability, averaged over all the
individual classifiers (this is called _soft voting_). Here is the implementation of such a classifier:

In [2]:
voting_clf.voting = "soft"
# SVC only exposes predict_proba() when probability=True
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)  # Slightly better than hard voting on this test set

0.92
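
The difference between the two voting modes can be seen on a single instance. The sketch below uses made-up probabilities (the `probas` array is hypothetical) to show a case where one very confident classifier changes the outcome under soft voting.

```python
import numpy as np

# Hypothetical class probabilities for ONE instance: rows = classifiers, columns = classes
probas = np.array([
    [0.55, 0.45],  # classifier A leans towards class 0
    [0.60, 0.40],  # classifier B leans towards class 0
    [0.10, 0.90],  # classifier C is very confident in class 1
])

# Hard voting: each classifier votes for its most likely class, majority wins
hard_choice = np.bincount(probas.argmax(axis=1)).argmax()
# Soft voting: average the probabilities, then pick the most likely class
soft_choice = probas.mean(axis=0).argmax()

print(hard_choice, soft_choice)
```

Soft voting gives more weight to confident votes, which is one plausible reason it can outperform hard voting.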

## Bagging and pasting

Another way to build an ensemble is to use the same type of model for every predictor and train each one on a different random subset of the training set. When
the subsets are sampled with replacement this is called _bagging_ (short for bootstrap aggregating); when they are sampled without replacement it is called _pasting_.
This method scales well because the models can be trained in parallel on different CPU cores. Here is an implementation of bagging:

In [3]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, n_jobs=-1, random_state=42)
# If the base model can output probabilities, BaggingClassifier automatically uses soft voting
bag_clf.fit(X_train, y_train)
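
The two sampling schemes can be sketched with NumPy alone. The toy set size and subset size below are arbitrary choices for illustration. In scikit-learn, passing `bootstrap=False` to `BaggingClassifier` switches from bagging to pasting.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10  # size of a toy training set; instances are identified by index 0..9

bagging_idx = rng.choice(m, size=m, replace=True)   # bagging: sampling with replacement
pasting_idx = rng.choice(m, size=6, replace=False)  # pasting: sampling without replacement

print(sorted(bagging_idx.tolist()))  # the same index may appear several times
print(sorted(pasting_idx.tolist()))  # every index appears at most once
```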

## Out-of-bag evaluation

With bagging, each predictor is trained on a bootstrap sample that contains on average only about 63% of the distinct training instances; the remaining 37% are
not seen by that predictor (these instances differ from one predictor to the next, of course). They are called out-of-bag (OOB) instances. We can use them to
evaluate the ensemble, eliminating the need for a separate validation set. In scikit-learn we can do this by setting _oob\_score=True_.

In [5]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)  # The OOB estimate suggests about 89.6% accuracy on unseen data
# Let's verify this on the test set
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))  # Reasonably close to the OOB estimate

0.896
0.92
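
The share of training instances that a single bootstrap sample leaves out can be checked numerically. This sketch assumes a bootstrap sample as large as the training set (the default `max_samples=1.0`) and uses m = 375, the size of the training split above with the default 25% test size.

```python
import numpy as np

m = 375  # number of training instances (75% of the 500 moons samples)

# Probability that a given instance is never picked in m draws with replacement
p_oob = (1 - 1 / m) ** m
print(round(p_oob, 3), round(float(np.exp(-1)), 3))  # approaches exp(-1) as m grows

# Empirical check: draw one bootstrap sample and count the instances left out
rng = np.random.default_rng(42)
sample = rng.integers(0, m, size=m)
oob_fraction = 1 - len(np.unique(sample)) / m
print(round(oob_fraction, 3))
```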


## Random Forests

A random forest is an ensemble of decision trees, generally trained using bagging (or sometimes pasting).

In [7]:
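from sklearn.ensemble import RandomForestRegressor

# Note: this is a regressor, so the 0/1 moon labels are treated as numeric targets and
# each prediction is the average of the trees' outputs (a value between 0 and 1).
# Use RandomForestClassifier instead to predict class labels directly.
rnd_clf = RandomForestRegressor(n_estimators=5, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
print(y_pred_rf[0])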

0.4258333333333334
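
As a rough illustration of the link between random forests and bagging, the sketch below trains a `RandomForestClassifier` next to a `BaggingClassifier` that wraps decision trees considering a random subset of features at each split. The hyperparameters (100 trees, `max_leaf_nodes=16`) are arbitrary choices for this example, and the two models are only expected to behave similarly, not identically.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, random_state=42)

rf_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42)

# Bagging of trees that sample features at each split, which is roughly what a forest does
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100, random_state=42)

rf_score = rf_clf.fit(X_tr, y_tr).score(X_te, y_te)
bag_score = bag_of_trees.fit(X_tr, y_tr).score(X_te, y_te)
print(rf_score, bag_score)
```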


## AdaBoost (Adaptive Boosting)

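Another way to train an ensemble is to train a base classifier, increase the weights of the training instances it misclassified, train a second classifier
on the reweighted instances, and so on. Because each predictor depends on the errors of the previous one, the models cannot be trained in parallel. Once
all the predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that the predictors have different weights.
With the instance weights normalized so that they sum to 1, the weighted error rate of the $j^{th}$ predictor is:
$$r_j = \sum_{i=1}^{m} w^{(i)} \, \mathbb{1}\left[\hat{y}_j^{(i)} \neq y^{(i)}\right]$$
- $w^{(i)}$ is the weight of the $i^{th}$ instance (set initially to $1/m$)
- $\hat{y}_j^{(i)}$ is the $j^{th}$ predictor's prediction for the $i^{th}$ instance, and the bracket equals 1 when the prediction is wrong and 0 otherwise

The more accurate a predictor is, the higher its weight. The weight of a predictor is calculated as follows:
$$\alpha_j = \eta \log \frac{1 - r_j}{r_j}$$
where $\eta$ is the learning rate. Finally, to make a prediction, AdaBoost computes the predictions of all the predictors and weighs them using the predictor
weights $\alpha_j$; the predicted class is the one that receives the majority of the weighted votes.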

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=30,
    learning_rate=0.5,
    random_state=42
)
ada_clf.fit(X_train, y_train)
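
The weighted error rate and predictor weight formulas above can be traced by hand on a tiny example. The labels and predictions below are made up, and the weight update shown (multiply the misclassified instances' weights by $\exp(\alpha_j)$, then renormalize) is the standard AdaBoost rule, stated here as an assumption since the text only says the weights are increased.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0])  # hypothetical predictions of predictor j (one mistake)
eta = 0.5                          # learning rate

m = len(y_true)
w = np.full(m, 1 / m)              # initial instance weights, all equal to 1/m

miss = y_hat != y_true
r_j = w[miss].sum()                      # weighted error rate of predictor j
alpha_j = eta * np.log((1 - r_j) / r_j)  # predictor weight

# Boost the weights of the misclassified instances, then renormalize to sum to 1
w_next = np.where(miss, w * np.exp(alpha_j), w)
w_next /= w_next.sum()

print(r_j, alpha_j)
print(w_next)
```

The misclassified instance ends up with a larger weight than the others, so the next predictor focuses on it more.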

To regularize the model, one approach is to reduce the number of estimators.

## Gradient boosting

This works like AdaBoost in that predictors are added sequentially, each one correcting its predecessor. However, instead of reweighting the training
instances at every iteration, this method fits each new model to the residual errors made by the previous one. Here is an example using regression:

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

# And here is a second regressor that is trained on the residual errors of the first
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)

# And a third
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)

# Now that we have an ensemble of 3 trees, we make new predictions by simply adding up their predictions
X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

It is possible to do the same thing using the _GradientBoostingRegressor()_ class:

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

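# learning_rate scales each tree's contribution; the manual ensemble above adds the trees
# unscaled, which is what learning_rate=1.0 would do
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=0.5, random_state=42)
gbr.fit(X, y)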
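
The three-tree example generalizes to a loop. The sketch below is a simplified illustration, not how `GradientBoostingRegressor` is implemented internally; the helper name `fit_residual_trees` and the `X_demo`/`y_demo` data are hypothetical. It records the training mean squared error after each tree to show that every residual fit reduces it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.random((100, 1)) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

def fit_residual_trees(X, y, n_trees=3, max_depth=2):
    """Fit each tree to the residuals left by the trees before it."""
    trees, residual, mse_history = [], y.copy(), []
    for i in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42 + i)
        tree.fit(X, residual)
        residual = residual - tree.predict(X)  # what is still unexplained
        trees.append(tree)
        mse_history.append(float(np.mean(residual ** 2)))
    return trees, mse_history

trees, mse_history = fit_residual_trees(X_demo, y_demo)

# The ensemble prediction is the sum of the individual trees' predictions
prediction = sum(tree.predict(np.array([[0.0]])) for tree in trees)
print(mse_history)
```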

## Histogram-Based Gradient Boosting

Histogram-Based Gradient Boosting (HGB) is an optimization of the standard Gradient Boosting approach. The key idea behind HGB is to make the training process faster and more memory-efficient by using histograms to approximate the data distribution during tree building. Here are the key concepts:

- Binning:

 Instead of using the raw continuous feature values directly, HGB first groups (or "bins") these values into discrete intervals (bins). For example, if a feature's values range from 0 to 100, they might be grouped into 10 bins (e.g., 0-10, 10-20, etc.).
 Each feature's values are then replaced by their corresponding bin index, reducing the number of unique values and simplifying the process of finding the best splits.
- Histogram Construction:

 For each feature, a histogram is constructed by counting the number of instances that fall into each bin. This histogram provides a summarized view of the data distribution.
 During tree building, the algorithm uses these histograms to evaluate potential splits by examining the sum of gradients and Hessians (second-order derivatives of the loss function) within each bin.
- Finding Splits:

 The algorithm selects the best split points based on the histogram bins rather than the exact values, which significantly reduces the computational cost.
 After determining the best bin to split, the model can use the bin’s boundary to define the actual split point in the tree.
- Efficiency:

 Memory Efficiency: By representing features with bins, the algorithm reduces the memory footprint, as it no longer needs to store all individual feature values.
 Computation Efficiency: The process of finding splits is much faster because the algorithm needs to evaluate only a small number of bins instead of potentially thousands or millions of unique values.
- Parallelism:

 The binning process allows for better parallelism, as different features or even different parts of the data can be binned simultaneously, making the algorithm well-suited for distributed computing environments.
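
The binning and histogram steps listed above can be sketched with NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the feature values and gradients are random stand-ins, and quantile-based bin edges are just one possible choice.

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.uniform(0, 100, size=1_000)  # one continuous feature
gradients = rng.standard_normal(1_000)     # stand-in per-instance gradients

n_bins = 10
# Interior bin edges taken at quantiles, so bins hold similar numbers of instances
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.searchsorted(edges, feature)  # integer bin index in [0, n_bins - 1]

# Histograms: instance count and gradient sum per bin
counts = np.bincount(bin_idx, minlength=n_bins)
grad_sums = np.bincount(bin_idx, weights=gradients, minlength=n_bins)

# Candidate splits are evaluated only at the n_bins - 1 bin boundaries:
# left_grad[k] is the gradient sum of all instances in bins 0..k
left_grad = np.cumsum(grad_sums)[:-1]
print(counts)
```

With 1,000 values but only 10 bins, a split search looks at 9 candidate thresholds instead of up to 999.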

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder


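# Assumption: X is a pandas DataFrame that contains an "ocean_proximity" categorical column
# (as in the California housing data) and y holds the regression targets; the NumPy arrays
# from the gradient boosting section above will not work here.
hgb_reg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]), remainder="passthrough"),
    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)
)
hgb_reg.fit(X, y)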

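No imputer, scaler, or one-hot encoder is needed: _HistGradientBoostingRegressor_ handles missing values and categorical features natively, and tree-based models do not require feature scaling.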

## Stacking

Stacking, also known as stacked generalization, is an ensemble learning technique that combines multiple machine learning models to improve overall performance. Unlike bagging or boosting, where predictions are combined with a fixed rule such as voting, averaging, or a weighted sum, stacking trains a "meta-model" to aggregate the predictions of several "base models." The meta-model is trained on out-of-sample predictions of the base models, which scikit-learn obtains through cross-validation (the _cv_ parameter).

In [None]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5  # Base-model predictions for the meta-model come from 5-fold cross-validation
)
stacking_clf.fit(X_train, y_train)

If we evaluate this stacking model on the test set, we would find 92.8% accuracy, which is a bit better than the voting classifier using soft voting, 
which got 92%.
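
To show what the meta-model is trained on, here is a minimal sketch of stacking done by hand with `cross_val_predict`. It is a simplified stand-in for `StackingClassifier`, not its actual implementation: it uses only two base models and a logistic regression as the meta-model, both arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, random_state=42)

base_models = [LogisticRegression(random_state=42), RandomForestClassifier(random_state=42)]

# Out-of-fold class-1 probabilities become the meta-model's training features,
# so the meta-model never sees predictions made on data a base model was trained on
meta_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])
meta_model = LogisticRegression().fit(meta_train, y_tr)

# Refit the base models on the full training set before predicting on new data
meta_test = np.column_stack([
    model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for model in base_models
])
manual_score = meta_model.score(meta_test, y_te)
print(manual_score)
```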