Methods for increasing generalization ability based on different ways of building ensembles
For more details, read the thesis here or open the files Thesis.pdf and Presentation.pdf.
The aim of this project is the development and study of a new ensemble method based on decision trees that are maximally distant from each other. Below is a comparison of the method presented in this project with other well-known ensemble models: Random Forest and Adaptive Boosting.
There are two factors that influence ensemble quality: the quality of each individual estimator and the "difference" (diversity) between the estimators. The correctness of this statement is supported by several error decompositions, which can be found in [1].
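One well-known decomposition of this kind is the ambiguity decomposition for averaging regression ensembles: the squared error of the ensemble mean equals the average squared error of the members minus their average "ambiguity" (disagreement with the ensemble mean). A quick numerical check of this identity on synthetic data (illustrative only, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)                           # true targets
preds = y + rng.normal(scale=0.5, size=(5, 100))   # 5 noisy ensemble members

f_bar = preds.mean(axis=0)                         # averaging ensemble
ensemble_err = np.mean((f_bar - y) ** 2)
avg_member_err = np.mean((preds - y) ** 2)
avg_ambiguity = np.mean((preds - f_bar) ** 2)      # disagreement term

# Identity: ensemble error = average member error - average ambiguity,
# so more diverse members (higher ambiguity) help, all else being equal.
assert np.isclose(ensemble_err, avg_member_err - avg_ambiguity)
```

The identity holds exactly for a uniform average, which is why both member quality and member diversity matter.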
- y(x) is the true label of object x.
- K is the number of classes.
- Node is the set of objects in the current node, for which a split feature and threshold are being searched.
- Leaf(x) is the set of objects placed in the same leaf node as object x.
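With these definitions, the entropy of the class distribution inside a node can be sketched as follows (the function name is illustrative, not from the thesis):

```python
import math
from collections import Counter

def node_entropy(labels):
    """Shannon entropy (base 2) of the class distribution over the
    objects in a node; `labels` are the y(x) values for that node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

For a pure node this is 0; for a node split evenly over K classes it is log2(K).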
Below is the general formula for building a decision tree:
Below is the formula that defines H(s) specifically for the method considered in this project:
The general idea is to build trees that differ from each other: using the ensemble built at the previous step, maximize the entropy of its predictions while minimizing the entropy of the true labels.
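A minimal sketch of one plausible reading of this idea, under the assumption that the split criterion rewards the entropy of the current ensemble's predictions inside a node (so the new tree disagrees with the ensemble) and penalizes the entropy of the true labels (so splits stay informative). All names and the exact combination of terms are assumptions, not the thesis formula:

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy (base 2) of a discrete label list.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def split_score(y_true, y_ens):
    # Hypothetical H(s): reward disagreement with the current ensemble's
    # predictions, penalize impurity of the true labels.
    return entropy(y_ens) - entropy(y_true)

def best_split(x_col, y_true, y_ens):
    """Greedy search over thresholds of one feature: pick the threshold
    maximizing the size-weighted score of the two child nodes."""
    best_t, best_s = None, -math.inf
    n = len(x_col)
    for t in sorted(set(x_col))[:-1]:
        left = [i for i in range(n) if x_col[i] <= t]
        right = [i for i in range(n) if x_col[i] > t]
        s = (len(left) / n) * split_score([y_true[i] for i in left],
                                          [y_ens[i] for i in left]) \
          + (len(right) / n) * split_score([y_true[i] for i in right],
                                           [y_ens[i] for i in right])
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s
```

When the ensemble's predictions are constant, the criterion degenerates to ordinary label-entropy minimization, so the threshold separating the classes wins.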
In the experiments below, the method implemented in this project is compared with Random Forest [2] and Adaptive Boosting [3], as well as with combinations of different pairs of these methods. All data can be found in the UCI Machine Learning Repository [4]. Each step of an experiment (the x axis) corresponds to adding one new tree to each algorithm involved in the comparison.
Datasets
Each dataset was randomly split into 2 equal parts 5 times. For each split, one part was used for training and the other for testing, and then vice versa. The resulting 10 quality measures were averaged. Results for each step of the algorithms are shown in the plots below.
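The evaluation protocol described above (5 random 50/50 splits, each half used for training and testing in turn, 10 scores averaged) can be sketched as follows; `fit` and `score` are hypothetical callables supplied by the caller, not part of the project:

```python
import random

def five_by_two_cv(X, y, fit, score, seed=0):
    """5 random 50/50 splits; train on each half, test on the other,
    and average the resulting 10 scores."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    scores = []
    for _ in range(5):
        rng.shuffle(idx)
        half = len(idx) // 2
        a, b = idx[:half], idx[half:]
        for train, test in ((a, b), (b, a)):
            model = fit([X[i] for i in train], [y[i] for i in train])
            scores.append(score(model, [X[i] for i in test],
                                 [y[i] for i in test]))
    return sum(scores) / len(scores)

# Example: a trivial majority-class "model" just to exercise the protocol.
majority = lambda X_tr, y_tr: max(set(y_tr), key=y_tr.count)
accuracy = lambda m, X_te, y_te: sum(int(t == m) for t in y_te) / len(y_te)
```

This repeated 2-fold scheme keeps train and test sets the same size, which matches the equal-halves description above.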
| Classification task | Objects | Features | Classes |
|---|---|---|---|
| Optical Recognition of Handwritten Digits Data Set | 5620 | 64 | 10 |
| Credit scoring | 1000 | 24 | 2 |
| Glass Identification Data Set | 214 | 9 | 6 |
| Connectionist Bench (Sonar, Mines vs. Rocks) Data Set | 208 | 60 | 2 |
| Vehicle silhouettes | 846 | 18 | 4 |
*(Plots: Accuracy at each step for every dataset, plus ROC-AUC for the two binary tasks, Credit scoring and Sonar.)*
[1] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. — Chapman and Hall/CRC, 2012.
[2] Random Forest Classifier
[3] AdaBoost Classifier
[4] UCI Machine Learning Repository