ml ensemble method
In machine learning, an ensemble method is a model that aggregates the predictions of individual models. Because an ensemble combines the results of multiple models, the errors of any single model tend to be offset by the others, so ensembles are less prone to error and tend to perform better than their individual members.
Ensemble methods are generally classified into two types. The first type combines different machine learning models chosen by the user, as scikit-learn's VotingClassifier does. The second type combines many versions of the same model, as is the case with XGBoost and random forests. Random forests are among the most popular and widespread of all ensemble methods.
Random forests aggregate their trees using majority rules for classifiers and the average of all models for regressors. They also use a special method called bagging, short for bootstrap aggregation, to build the individual trees.
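As a rough illustration of the two types, the sketch below fits a VotingClassifier built from three different models and a random forest built from many trees. The synthetic dataset, the choice of base models, and every parameter value are assumptions made for the example, not choices taken from this text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example is self-contained (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Type 1: different models chosen by the user, combined by hard (majority) voting.
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
voting.fit(X_train, y_train)

# Type 2: many versions of the same model (decision trees).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

voting_accuracy = voting.score(X_test, y_test)
forest_accuracy = forest.score(X_test, y_test)
print(f"voting: {voting_accuracy:.3f}, forest: {forest_accuracy:.3f}")
```

The scores depend on the data and random seeds, so they should not be read as a comparison of the two approaches in general.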
Bootstrapping means sampling with replacement.

When it comes to random forests, bootstrapping works under the hood. The bootstrapping occurs when each decision tree is made. If the decision trees were all built from the same samples, they would give similar predictions, making the aggregate result similar to that of an individual tree. Instead, with random forests, each tree is built from a bootstrap sample, usually with the same number of samples as the original dataset. Because the sampling is done with replacement, roughly two-thirds of the original samples (about 63% for large datasets) appear in each tree's sample, and the remaining slots are filled by duplicates.
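The two-thirds figure can be checked with a small simulation. The sketch below uses only the standard library; the helper name bootstrap_indices and the sample size are hypothetical and chosen for illustration. For n samples, the expected fraction of distinct originals in a bootstrap sample is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 10_000      # illustrative size, not taken from the text

indices = bootstrap_indices(n_samples, rng)

# Fraction of the original rows that appear at least once in the sample.
unique_fraction = len(set(indices)) / n_samples

# Theoretical expectation for this n.
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples

print(f"observed: {unique_fraction:.3f}, expected: {expected_fraction:.3f}")
```

The rows that never appear in a given sample are the ones that tree does not see during training.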
After the bootstrapping phase of the model-build, each decision tree makes its own individual predictions. The result is a forest of trees whose predictions are aggregated into one final prediction using majority rules for classifiers and the average for regressors.
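The aggregation step can be sketched in a few lines of standard-library Python. The function names and the sample predictions are hypothetical and stand in for the outputs of individual trees.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Return the class label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(predictions)


# Hypothetical outputs from five classifier trees for one sample.
tree_votes = ["spam", "ham", "spam", "spam", "ham"]
# Hypothetical outputs from three regressor trees for one sample.
tree_estimates = [2.0, 3.0, 4.0]

final_class = majority_vote(tree_votes)              # "spam" (3 votes to 2)
final_value = average_prediction(tree_estimates)     # 3.0
print(final_class, final_value)
```

This shows the idea only. scikit-learn's RandomForestClassifier, for instance, averages the trees' predicted class probabilities rather than counting hard votes, which usually but not always gives the same answer.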
At the end of the day, the random forest is limited by its individual trees. If all trees make the same mistake, the random forest makes this mistake. There are scenarios, as revealed in this case study before the data was shuffled, where random forests are unable to significantly improve upon errors because of challenges within the data that individual trees cannot address.
An ensemble method capable of improving upon these initial shortcomings, one that learns in later rounds from the mistakes of earlier trees, could be advantageous.
Boosting was designed for this purpose. In contrast to bagging, it learns from the mistakes of individual trees: the general idea is to adjust new trees based on the errors of previous trees.
Correcting errors with each new tree makes boosting a distinct approach from bagging. In a bagging model, new trees pay no attention to previous trees: each is built from scratch using bootstrapping, and the final model aggregates all individual trees. In boosting, however, each new tree is built from the previous tree. The trees do not operate in isolation; instead, they are built on top of one another.
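One common way to build trees on top of one another is to fit each new tree to the residuals (the remaining errors) of the ensemble so far. The sketch below does this for squared-error regression with one-split trees (stumps) in plain Python. It is a simplified illustration under those assumptions, not the algorithm used by XGBoost or any other library, and the names fit_stump and boost, the toy data, and the learning rate are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree: pick the threshold that minimizes squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    predictions = [base] * len(xs)
    stumps, errors = [], []
    for _ in range(n_rounds):
        # The new tree targets what the previous trees still get wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, errors


# Toy data: y = x squared on ten points (illustrative only).
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]

predict, errors = boost(xs, ys)
print(f"training MSE after round 1: {errors[0]:.2f}, after round 20: {errors[-1]:.2f}")
```

Unlike in bagging, the order of the trees matters here: each stump is only meaningful relative to the predictions accumulated before it, and the training error shrinks round by round.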