Methods for increasing generalization ability based on different ways of building ensembles
For more details, read the thesis here or open the files Thesis.pdf and Presentation.pdf.
The aim of this project is the development and study of a new ensemble method based on decision trees that are maximally distant from each other. Below is a comparison of the method presented in this project with other well-known ensemble models: Random Forest and Adaptive Boosting.
There are two factors that influence ensemble quality: the quality of each individual estimator and the "difference" (diversity) between the estimators. The correctness of this statement is supported by several error decompositions, which can be found in [1].
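One well-known decomposition of this kind is the ambiguity decomposition for averaging regression ensembles: the squared error of the ensemble mean equals the average squared error of the members minus their average "ambiguity" (disagreement with the ensemble mean). A quick numerical check of this identity on synthetic data (illustrative only, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)                           # true targets
preds = y + rng.normal(scale=0.5, size=(5, 100))   # 5 noisy ensemble members

f_bar = preds.mean(axis=0)                         # averaging ensemble
ensemble_err = np.mean((f_bar - y) ** 2)
avg_member_err = np.mean((preds - y) ** 2)
avg_ambiguity = np.mean((preds - f_bar) ** 2)      # disagreement term

# Identity: ensemble error = average member error - average ambiguity,
# so more diverse members (higher ambiguity) help, all else being equal.
assert np.isclose(ensemble_err, avg_member_err - avg_ambiguity)
```

The identity holds exactly for a uniform average, which is why both member quality and member diversity matter.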
- y(x) is the true label of object x.
- K is the number of classes.
- Node is the set of objects in the current node, for which a split feature and threshold are being searched.
- Leaf(x) is the set of objects placed in the same leaf node as object x.
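With these definitions, the entropy of the class distribution inside a node can be sketched as follows (the function name is illustrative, not from the thesis):

```python
import math
from collections import Counter

def node_entropy(labels):
    """Shannon entropy (base 2) of the class distribution over the
    objects in a node; `labels` are the y(x) values for that node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

For a pure node this is 0; for a node split evenly over K classes it is log2(K).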
Below is the general formula for building a decision tree:
Below is the formula that defines H(s) specifically for the method considered in this project:
The general idea is to build trees that differ from each other: using the ensemble built at the previous step, maximize the entropy of its predictions while minimizing the entropy of the true labels.
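A minimal sketch of one plausible reading of this idea, under the assumption that the split criterion rewards the entropy of the current ensemble's predictions inside a node (so the new tree disagrees with the ensemble) and penalizes the entropy of the true labels (so splits stay informative). All names and the exact combination of terms are assumptions, not the thesis formula:

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy (base 2) of a discrete label list.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def split_score(y_true, y_ens):
    # Hypothetical H(s): reward disagreement with the current ensemble's
    # predictions, penalize impurity of the true labels.
    return entropy(y_ens) - entropy(y_true)

def best_split(x_col, y_true, y_ens):
    """Greedy search over thresholds of one feature: pick the threshold
    maximizing the size-weighted score of the two child nodes."""
    best_t, best_s = None, -math.inf
    n = len(x_col)
    for t in sorted(set(x_col))[:-1]:
        left = [i for i in range(n) if x_col[i] <= t]
        right = [i for i in range(n) if x_col[i] > t]
        s = (len(left) / n) * split_score([y_true[i] for i in left],
                                          [y_ens[i] for i in left]) \
          + (len(right) / n) * split_score([y_true[i] for i in right],
                                           [y_ens[i] for i in right])
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s
```

When the ensemble's predictions are constant, the criterion degenerates to ordinary label-entropy minimization, so the threshold separating the classes wins.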
In the experiments below, the method implemented in this project is compared with Random Forest [2] and Adaptive Boosting [3], as well as with combinations of different pairs of these methods. All data can be found in the UCI Machine Learning Repository [4]. Each step of an experiment (the x axis) corresponds to adding one new tree to each algorithm involved in the comparison.
Datasets
Each dataset was randomly split into 2 equal parts 5 times. For each split, one part was used for training and the other for testing, and then vice versa. The resulting 10 quality measures were averaged. Results for each step of the algorithms are shown in the plots below.
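The evaluation protocol described above (5 random 50/50 splits, each half used for training and testing in turn, 10 scores averaged) can be sketched as follows; `fit` and `score` are hypothetical callables supplied by the caller, not part of the project:

```python
import random

def five_by_two_cv(X, y, fit, score, seed=0):
    """5 random 50/50 splits; train on each half, test on the other,
    and average the resulting 10 scores."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    scores = []
    for _ in range(5):
        rng.shuffle(idx)
        half = len(idx) // 2
        a, b = idx[:half], idx[half:]
        for train, test in ((a, b), (b, a)):
            model = fit([X[i] for i in train], [y[i] for i in train])
            scores.append(score(model, [X[i] for i in test],
                                 [y[i] for i in test]))
    return sum(scores) / len(scores)

# Example: a trivial majority-class "model" just to exercise the protocol.
majority = lambda X_tr, y_tr: max(set(y_tr), key=y_tr.count)
accuracy = lambda m, X_te, y_te: sum(int(t == m) for t in y_te) / len(y_te)
```

This repeated 2-fold scheme keeps train and test sets the same size, which matches the equal-halves description above.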
| Classification task | Objects | Features | Classes |
|---|---|---|---|
| Optical Recognition of Handwritten Digits Data Set | 5620 | 64 | 10 |
| Credit scoring | 1000 | 24 | 2 |
| Glass Identification Data Set | 214 | 9 | 6 |
| Connectionist Bench (Sonar, Mines vs. Rocks) Data Set | 208 | 60 | 2 |
| Vehicle silhouettes | 846 | 18 | 4 |
*(Plots: Accuracy at each step for every dataset, plus ROC-AUC for the two binary tasks, Credit scoring and Sonar.)*
[1] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. — Chapman and Hall/CRC, 2012.
[2] Random Forest Classifier
[3] AdaBoost Classifier
[4] UCI Machine Learning Repository