# Ensemble Models

-- It's better to use models that are fundamentally different from each other: their errors are less correlated, so the weaknesses of one model are offset by the others.

-- If all models can estimate class probabilities, you can set *voting="soft"*. This often performs better than hard voting because it gives more weight to highly confident votes.

-- Another **way to ensemble models is to train the same model on different random samples of your training data**. Here we have *bagging (bootstrap aggregating, sampling with replacement)* and *pasting (sampling without replacement)*. Bagging often performs better, so it's usually preferred. Both scale well because the predictors can be trained in parallel.

-- **Random Patches Method**: Sample both training instances and features.

-- **Random Subspaces Method**: Sample features but keep all training instances (i.e., bootstrap=False and max_samples=1.0, combined with bootstrap_features=True and/or max_features below 1.0).

-- The aggregation function is typically the mode (the most frequent prediction, as in hard voting) for classification, or the average for regression.

-- Random Forests are great for examining feature importance, especially when you need to perform feature selection.

-- **Boosting** trains several weak learners in sequence, with each new model trying to correct the errors of the ones before it.
It does not scale as well as bagging, because the models have to be trained one after another rather than in parallel. Gradient Boosting is a boosting ensemble that fits each new tree to the residual errors of the trees before it. In contrast, AdaBoost increases the weights of the misclassified instances after every iteration.
A learning rate of 0.1 is low, so more trees are needed to fit the training set, but the ensemble usually generalizes well.

-- A good way to deal with a model that takes a long time to train (especially boosting with a low learning rate) is to use **early stopping**.

-- If you set *warm_start=True*, Scikit-Learn keeps the existing trees when .fit() is called again, allowing incremental training.

-- Use XGBoost for scalability, portability

-- **Stacking** is when we train a model to perform the aggregation, instead of using a fixed function such as voting. Each base model generates a prediction, and a final model (called a blender, or meta learner) takes these predictions as inputs and produces the final prediction.
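
The Gradient Boosting note above (each new tree picks up the residual errors of the trees before it) can be sketched by hand with three small regression trees. This is only an illustration on made-up data, and it assumes a learning rate of 1 (no shrinkage) to keep the arithmetic simple; in practice you would use GradientBoostingRegressor or XGBoost rather than this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy quadratic data (an assumption, for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

trees = []
residual = y.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)                      # fit the errors left so far
    residual = residual - tree.predict(X)      # what is still unexplained
    trees.append(tree)

def ensemble_predict(X_new):
    """Sum the predictions of all trees (learning rate of 1)."""
    return sum(tree.predict(X_new) for tree in trees)

mse_one_tree = np.mean((y - trees[0].predict(X)) ** 2)
mse_three_trees = np.mean((y - ensemble_predict(X)) ** 2)
print(mse_one_tree, mse_three_trees)
```

On the training data, the error with three trees should not be higher than with the first tree alone, since each tree is fitted to what the previous ones missed.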

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

logreg = LogisticRegression()
forest = RandomForestClassifier()
svm = SVC()

voting_classifier = VotingClassifier(
    estimators=[("linear", logreg), ("forest", forest), ("svm", svm)],
    voting="hard"
)

# fit the voting classifier
# (X_train, y_train, X_test and y_test are assumed to come from an earlier train/test split)
voting_classifier.fit(X_train, y_train)

# check each classifier's accuracy on the test set
from sklearn.metrics import accuracy_score

for clf in (logreg, forest, svm, voting_classifier):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # report the classifier's name and its test accuracy
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
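
A soft-voting variant of the same ensemble could look like the sketch below. The moons dataset, the split and the `X_tr`/`X_te` names are assumptions standing in for the notebook's own data; the only changes that matter are `voting="soft"` and `probability=True` on the SVC, so that every estimator can output class probabilities.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data (assumption): two interleaving half circles
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

soft_voting = VotingClassifier(
    estimators=[
        ("linear", LogisticRegression(random_state=42)),
        ("forest", RandomForestClassifier(random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),  # needed for soft voting
    ],
    voting="soft",
)
soft_voting.fit(X_tr, y_tr)

# averaged class probabilities for the first three test instances
proba = soft_voting.predict_proba(X_te[:3])
soft_accuracy = soft_voting.score(X_te, y_te)
print(proba, soft_accuracy)
```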

In [None]:
# Bagging and Pasting in Scikit-Learn

# the code below trains 500 decision trees, each on 100 instances sampled from X_train
# the ensemble will probably generalise better than a single decision tree

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_classifier = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,  # bootstrap=False gives Pasting instead of Bagging (oob_score needs bootstrap=True)
    n_jobs=-1,       # use all the cores available
    oob_score=True,  # evaluate on the "out of bag" instances, which each tree did not see during training
)

bag_classifier.fit(X_train, y_train)

y_pred = bag_classifier.predict(X_test)

# if you set oob_score=True you can get the out-of-bag accuracy with
bag_classifier.oob_score_
# you can also check the out-of-bag decision function
bag_classifier.oob_decision_function_
# it shows, for each training instance, the estimated probability of belonging to each class
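
The Random Subspaces and Random Patches methods mentioned in the notes use the same BaggingClassifier, only with feature sampling switched on. The sketch below assumes a synthetic 10-feature dataset and arbitrary sampling ratios, chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 400 instances, 10 features
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# Random Subspaces: keep every instance, sample half of the features per tree
subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    max_features=0.5, random_state=0,
)
subspaces.fit(X, y)

# Random Patches: sample both instances and features
patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5, random_state=0,
)
patches.fit(X, y)

# feature indices seen by the first tree of each ensemble
print(subspaces.estimators_features_[0], patches.estimators_features_[0])
```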

## REVIEW

-- If you have trained five different models and they all achieve 95% precision, you can try combining them into a voting ensemble, which will often give you even better results. It works better if the models are very different (e.g., an SVM classifier, a Decision Tree classifier, a Logistic Regression classifier, and so on). It is even better if they are trained on different training instances (that's the whole point of bagging and pasting ensembles), but if not this will still be effective as long as the models are very different.

-- **A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability.** This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities (e.g., for the SVM classifiers in Scikit-Learn you must set probability=True).
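
A tiny worked example shows how the two schemes can disagree. The probabilities below are made up for a single instance and three hypothetical classifiers: two of them lean slightly towards class 0, while the third is very confident about class 1.

```python
import numpy as np

# rows = classifiers, columns = [P(class 0), P(class 1)]  (made-up values)
probas = np.array([
    [0.55, 0.45],
    [0.55, 0.45],
    [0.05, 0.95],
])

# hard voting: each classifier votes for its most likely class
votes = probas.argmax(axis=1)
hard_prediction = int(np.bincount(votes, minlength=2).argmax())

# soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = int(mean_proba.argmax())

print(hard_prediction, soft_prediction)
```

Hard voting picks class 0 (two votes against one), while soft voting picks class 1 because the average probability of class 1 is about 0.62.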

-- It is quite possible to **speed up training of a bagging ensemble by distributing it across multiple servers**, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and Random Forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.

-- **With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (they were held out)**. This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.
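
To get a feel for how many instances are held out, the sketch below draws one bootstrap sample and counts the instances that were never picked. It assumes the bootstrap sample has the same size as the training set (m instances drawn with replacement); in that case the chance of an instance being left out is (1 - 1/m)^m, which approaches 1/e, roughly 37%.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000                                   # hypothetical training set size

bootstrap_idx = rng.integers(0, m, size=m)   # sample m indices with replacement
oob_mask = np.ones(m, dtype=bool)
oob_mask[bootstrap_idx] = False              # anything sampled is "in the bag"

oob_fraction = oob_mask.mean()
theoretical = (1 - 1 / m) ** m
print(oob_fraction, theoretical)
```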

-- When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for **Extra-Trees**, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they **use random thresholds for each feature**. This extra randomness acts like a form of regularization: if a Random Forest overfits the training data, Extra-Trees might perform better. Moreover, since Extra-Trees don't search for the best possible thresholds, they are much faster to train than Random Forests. However, they are neither faster nor slower than Random Forests when making predictions.

-- If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate.

-- If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right number of predictors (you probably have too many).
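
Early stopping and *warm_start=True* work well together: add one tree at a time and stop once the validation error has not improved for a few rounds. The sketch below is one possible way to do it, on toy data and with an arbitrary patience of 5 rounds (both are assumptions).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy regression data (assumption, for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.1, warm_start=True, random_state=42
)

patience = 5
best_val_error = float("inf")
best_n_estimators = 0
rounds_without_improvement = 0

for n_estimators in range(1, 301):
    gbrt.set_params(n_estimators=n_estimators)
    gbrt.fit(X_tr, y_tr)   # warm_start=True: existing trees are kept, one is added
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    if val_error < best_val_error:
        best_val_error = val_error
        best_n_estimators = n_estimators
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement == patience:
            break          # stop early; the model now holds a few extra trees

print(best_n_estimators, best_val_error)
```

Scikit-Learn's gradient boosting estimators also accept `n_iter_no_change` (together with `validation_fraction`), which applies a similar stopping rule internally without the manual loop.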

# Ensemble Methods: A Comparative Overview
Ensemble methods combine multiple models to improve overall performance. This overview covers the three main families: Boosting, Bagging and Pasting, and Stacking.

## Boosting
Boosting trains models one after another, with each model focusing on correcting the errors of the previous ones.

<img src="../img/ada_boosting.png" width="60%">

**Key Techniques:**

- **AdaBoost (Adaptive Boosting):** Assigns weights to training samples, giving more weight to misclassified samples in subsequent iterations.
- **Gradient Boosting:** Treats model training as an optimization problem, minimizing a loss function by iteratively adding weak learners.
- **XGBoost (eXtreme Gradient Boosting):** An efficient implementation of gradient boosting with various optimizations for speed and performance.
- **LightGBM:** Another efficient gradient boosting framework that uses tree-based algorithms.
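
One round of the AdaBoost re-weighting can be written out by hand. The sketch below follows one common formulation (predictor weight = learning rate × log((1 - r) / r), where r is the weighted error rate); the labels and the weak learner's predictions are made up, and other AdaBoost variants use slightly different constants.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])    # hypothetical weak learner output

weights = np.full(len(y_true), 1 / len(y_true))  # start with equal weights
learning_rate = 1.0

misclassified = y_pred != y_true
error_rate = weights[misclassified].sum() / weights.sum()       # weighted error r
alpha = learning_rate * np.log((1 - error_rate) / error_rate)   # predictor weight

weights[misclassified] *= np.exp(alpha)   # boost the misclassified instances
weights /= weights.sum()                  # normalise so the weights sum to 1

print(error_rate, alpha, weights)
```

Here two of the eight instances are misclassified, so r = 0.25; after the update each of them carries a weight of 0.25, and together they account for half of the total weight seen by the next learner.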

## Bagging and Pasting
Bagging and Pasting are ensemble methods that involve training multiple models independently on different subsets of the training data.

<img src="../img/bagging_pasting.png" width="60%">

**Key Techniques:**

- **Bagging (Bootstrap Aggregating):** Trains multiple models on different bootstrap samples (randomly sampled with replacement) of the training data. The final prediction is the average or majority vote of the individual models.
- **Pasting:** Similar to bagging, but the samples are drawn without replacement. The predictors tend to be more correlated than with bagging, so the ensemble's variance is usually reduced a little less, which is why bagging is often preferred.
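
The only difference between the two sampling schemes is whether an instance can be drawn more than once for the same predictor. A minimal sketch with NumPy, assuming a hypothetical training set of 1,000 instances and 100 instances per predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances = 1000   # size of the (hypothetical) training set
max_samples = 100    # instances drawn for each predictor

bagging_idx = rng.choice(n_instances, size=max_samples, replace=True)   # with replacement
pasting_idx = rng.choice(n_instances, size=max_samples, replace=False)  # without replacement

# pasting never repeats an instance; bagging may contain duplicates
print(len(np.unique(bagging_idx)), len(np.unique(pasting_idx)))
```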

## Stacking (blender)
Stacking, also known as stacked generalization, involves training a meta-model on the predictions of multiple base models.

<img src="../img/stacking_blender.png" width="60%">

**How it works:**
- Base Models: Train multiple base models on the training data.
- Predictions: Use the base models to make predictions on instances they were not trained on (a hold-out set or out-of-fold predictions), so that the meta-model does not learn from overfitted predictions.
- Meta-Model: Train a meta-model (e.g., logistic regression or a decision tree) using the predictions of the base models as features.
- Final Prediction: The meta-model makes the final prediction.
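
The steps above can be sketched with Scikit-Learn's StackingClassifier, which produces the base models' predictions through cross-validation before training the meta-model. The dataset, the choice of base models and `cv=5` are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data (assumption)
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

stacking = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(random_state=42)),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(),  # the blender / meta-model
    cv=5,                                  # out-of-fold predictions feed the blender
)
stacking.fit(X_tr, y_tr)

# base-model outputs that the meta-model receives as features
meta_features = stacking.transform(X_te[:3])
stacking_accuracy = stacking.score(X_te, y_te)
print(meta_features, stacking_accuracy)
```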

**Stacking vs Voting Classifier**

A Voting Classifier combines the predictions of multiple base models through a simple voting scheme (hard or soft voting). A Stacking Classifier trains a meta-model (a new model) on the predictions of the base models. The predictions of the base models become features for the meta-model.

|  | Boosting | Bagging/Pasting | Stacking |
|---|---|---|---|
| Model Training | Sequential | Parallel | Sequential/Parallel |
| Focus | Correcting errors of previous models | Reducing variance | Combining diverse models |
| Key Technique | Gradient boosting, AdaBoost | Bootstrap sampling | Meta-model |