ml ensemble method
In machine learning, an ensemble method is a model that aggregates the predictions of individual models. Because an ensemble combines the results of multiple models, the errors of any single model tend to be outvoted or averaged out, so ensembles often perform better than their individual members.
Ensemble methods are generally classified into two types. The first type combines different machine learning models chosen by the user, as scikit-learn's VotingClassifier does. The second type combines many versions of the same model, as is the case with XGBoost and random forests. Random forests are among the most popular and widespread of all ensemble methods.
Random forests use majority rules for classifiers and the average of all models for regressors. They also use a special method called bagging, short for bootstrap aggregation, to build the individual trees.
Bootstrapping means sampling with replacement.

When it comes to random forests, bootstrapping works under the hood. The bootstrapping occurs when each decision tree is made. If the decision trees were all built from the same samples, they would give similar predictions, making the aggregate result similar to that of any individual tree. Instead, with random forests, each tree is built from a bootstrap sample, usually with the same number of samples as the original dataset. Mathematically, about 63% (roughly two-thirds) of the original samples are expected to appear in each tree's bootstrap sample; the remaining draws, roughly one-third, are duplicates.
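The two-thirds estimate can be checked with a short simulation. The following is a minimal sketch using only Python's standard library; the helper name `bootstrap_indices` and the sample size are illustrative assumptions, not part of any library.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples = 10_000      # hypothetical dataset size
indices = bootstrap_indices(n_samples, rng)

# Fraction of the original rows that appear at least once in the sample.
# For large datasets this is expected to be close to 1 - 1/e, about 0.632.
unique_fraction = len(set(indices)) / n_samples
print(f"unique rows in bootstrap sample: {unique_fraction:.3f}")
```

The rows that never get drawn for a given tree are the ones usually called out-of-bag samples.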
After the bootstrapping phase of the model-build, each decision tree makes its own individual predictions. The result is a forest of trees whose predictions are aggregated into one final prediction using majority rules for classifiers and the average for regressors.
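The aggregation step can be sketched as follows. This is a simplified illustration with made-up predictions; the function names are hypothetical, and library implementations may differ in detail (for example, by averaging predicted class probabilities instead of counting votes).

```python
from collections import Counter
from statistics import mean

def aggregate_classifier(tree_predictions):
    """Majority rules: return the most common class label among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regressor(tree_predictions):
    """Average the numeric predictions of all trees."""
    return mean(tree_predictions)

# Made-up predictions from individual trees for a single sample.
class_votes = ["spam", "ham", "spam", "spam", "ham"]
value_votes = [3.0, 2.5, 3.5, 3.0]

print(aggregate_classifier(class_votes))  # spam
print(aggregate_regressor(value_votes))   # 3.0
```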
Ultimately, a random forest is limited by its individual trees. If all trees make the same mistake, the random forest makes that mistake too. There are scenarios, as is revealed in this case study before the data was shuffled, where random forests are unable to significantly improve upon errors due to challenges within the data that individual trees are unable to address.
An ensemble method capable of improving upon initial shortcomings, one that learns in later rounds from the mistakes of earlier trees, could be advantageous. Boosting was designed for this purpose.
The general idea of boosting is to adjust new trees based on the errors of previous trees.
In boosting, correcting errors for each new tree is a distinct approach from bagging. In a bagging model, new trees pay no attention to previous trees. Also, new trees are built from scratch using bootstrapping, and the final model aggregates all individual trees. In boosting, however, each new tree is built from the previous tree. The trees do not operate in isolation; instead, they are built on top of one another.
AdaBoost is one of the earliest and most popular boosting models. In AdaBoost, each new tree adjusts its weights based on the errors of the previous trees. More attention is paid to predictions that went wrong by increasing the weights of the samples that were predicted incorrectly. By learning from its mistakes, AdaBoost can transform weak learners into strong learners. A weak learner is a machine learning algorithm that barely performs better than chance. By contrast, a strong learner has learned a considerable amount from data and performs quite well.
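One way to picture the reweighting is the sketch below. It assumes the classic discrete AdaBoost update for binary classification, which the text above does not spell out, so the exact formula is an assumption here; the function name `adaboost_reweight` and the toy data are hypothetical.

```python
import math

def adaboost_reweight(weights, is_correct):
    """One round of sample reweighting (classic discrete AdaBoost, assumed).

    Requires a weighted error strictly between 0 and 1.
    """
    # Weighted error rate of the current weak learner.
    error = sum(w for w, ok in zip(weights, is_correct) if not ok) / sum(weights)
    # The learner's say in the final vote; larger when the error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weights of misclassified samples and lower the rest.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

# Five samples with equal weight; the weak learner gets the last one wrong.
weights = [0.2] * 5
is_correct = [True, True, True, True, False]
new_weights, alpha = adaboost_reweight(weights, is_correct)
print(new_weights)  # the misclassified sample now carries the largest weight
```

The next weak learner is then trained against the new weights, so the sample that was missed has more influence on where it splits.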
The general idea behind boosting algorithms is to transform weak learners into strong learners. A weak learner is hardly better than random guessing. But there is a purpose behind the weak start. Building on this general idea, boosting works by focusing on iterative error correction, not by establishing a strong baseline model. If the base model is too strong, the learning process is necessarily limited, thereby undermining the general strategy behind boosting models.
Weak learners are transformed into strong learners through hundreds of iterations. In this sense, a small edge goes a long way.
Gradient boosting uses a different approach than AdaBoost. While gradient boosting also adjusts based on incorrect predictions, it takes this idea one step further: gradient boosting fits each new tree entirely based on the errors of the previous tree's predictions. That is, for each new tree, gradient boosting looks at the mistakes and then builds a new tree completely around these mistakes. The new tree doesn't care about the predictions that are already correct. Building a machine learning algorithm that solely focuses on the errors requires a comprehensive method that sums errors to make accurate final predictions. This method leverages residuals, the difference between the model's predictions and actual values.
The residuals are the differences between the actual values and the predictions of a given model. In statistics, residuals are commonly analyzed to determine how well a given linear regression model fits the data. The goal of linear regression is to minimize the sum of the squared residuals.
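As a small worked example, residuals and their sum of squares can be computed directly. The numbers below are made up for illustration, and the sign convention (actual minus predicted) is the usual one but is an assumption here.

```python
# Made-up target values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 10.0]

# Residual = actual value minus predicted value, one per sample.
residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]

# The quantity that least-squares linear regression minimizes.
sum_squared_residuals = sum(r ** 2 for r in residuals)

print(residuals)              # [0.5, -0.5, 0.0, -1.0]
print(sum_squared_residuals)  # 1.5
```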
The initial decision tree in the gradient boosting regressor is called the base learner because it's at the base of the ensemble. It's the first learner in the process. The term learner here is indicative of a weak learner transforming into a strong learner.
XGBoost is an advanced version of gradient boosting with the same general structure, meaning that it transforms weak learners into strong learners by summing the predictions of trees that are each trained on the residuals of the trees before them.
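The residual-fitting loop described above can be sketched in plain Python. This is a toy illustration under several assumptions, not how XGBoost or scikit-learn implement it: it uses one feature, one-split trees (stumps), squared error, a constant mean prediction as a stand-in for the first learner, and a hypothetical learning rate of 0.5. All function names are made up for the example.

```python
def fit_stump(x, targets):
    """Fit a one-split regression tree (stump) by minimizing squared error."""
    best = None
    # Try every split point that leaves at least one sample on each side.
    for threshold in sorted(set(x))[:-1]:
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Return a predict function built by repeatedly fitting stumps to residuals."""
    base = sum(y) / len(y)  # simplified starting prediction: the mean of y
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Each new stump is trained on what the ensemble still gets wrong.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]

    def predict(xi):
        # Final prediction: starting value plus the scaled sum of all stumps.
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict

def mean_squared_error(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

# Made-up, roughly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]

model = gradient_boost(x, y)
baseline_mse = mean_squared_error(y, [sum(y) / len(y)] * len(y))
boosted_mse = mean_squared_error(y, [model(xi) for xi in x])
print(baseline_mse, boosted_mse)  # the boosted error should be much smaller
```

Each stump on its own is a weak learner; the accuracy comes from summing many of them, each one correcting what the previous rounds left over.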