Given a training data set, it constructs a decision tree for classification or regression in a single batch or incrementally.
It loads data from CSV files. The first row of the CSV must be a header in which each element follows the pattern "name:type:mode". Type identifies the attribute as continuous, discrete, or nominal. Mode is optional and, when present, marks the attribute as the class attribute.
The module is loosely based on code published by Christopher Roach in his article Building Decision Trees in Python. I refactored his code to be more object-oriented, and extended it to support basic regression.
The class attribute can be continuous, discrete, or nominal, but all other attributes must be discrete or nominal.
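As a rough illustration of the header convention and the attribute-type restriction described above, the sketch below parses a header row using only the standard library. It is not the module's own loader: the function name parse_header, the sample column names, and the exact token spellings ("continuous", "discrete", "nominal", and "class" for the mode) are assumptions made for this example.

```python
import csv
import io

# Assumed token spellings; the text names the categories but not the
# exact strings the real loader accepts.
TYPES = {"continuous", "discrete", "nominal"}
CLASS_MODE = "class"


def parse_header(cells):
    """Split each 'name:type[:mode]' cell into (name, type, is_class)."""
    attrs = []
    for cell in cells:
        parts = cell.strip().split(":")
        if len(parts) not in (2, 3):
            raise ValueError("expected name:type[:mode], got %r" % cell)
        name, type_ = parts[0], parts[1]
        if type_ not in TYPES:
            raise ValueError("unknown attribute type %r" % type_)
        if len(parts) == 3 and parts[2] != CLASS_MODE:
            raise ValueError("unknown mode %r" % parts[2])
        is_class = len(parts) == 3
        # Only the class attribute may be continuous.
        if type_ == "continuous" and not is_class:
            raise ValueError("attribute %r: only the class attribute "
                             "may be continuous" % name)
        attrs.append((name, type_, is_class))
    return attrs


# Hypothetical two-line CSV: a header row followed by one data row.
sample = (
    "feature1:discrete,feature2:nominal,price:continuous:class\n"
    "3,abc,12.5\n"
)
reader = csv.reader(io.StringIO(sample))
header = parse_header(next(reader))
```

Here header holds one tuple per column, and a header such as "size:continuous" with no mode would be rejected because a non-class attribute cannot be continuous.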
Download the code and then run:
python setup.py build
sudo python setup.py install
You can also install from PyPI using pip via:
sudo pip install dtree
Or upgrade from an earlier version via:
sudo pip install --upgrade dtree
Classification and regression are handled through the same interface, and differ only in the object returned by the predict() method and how the result from test() is interpreted.
With classification, this object will always be a DDist instance, representing a probability distribution over a set of discrete or nominal classes. In this case, the result from test() will be a CDist instance representing the classification accuracy.
With regression, this object will always be a CDist instance, representing a mean and variance. In this case, the result from test() will be a CDist instance representing the mean absolute error.
from dtree import Tree, Data

# Classification: predict() returns a DDist, test() returns a CDist of accuracy.
tree = Tree.build(Data('classification-training.csv'))
result = tree.test(Data('classification-testing.csv'))
print('Accuracy: %s' % result.mean)
prediction = tree.predict(dict(feature1=123, feature2='abc', feature3='hot'))
print('best: %s' % prediction.best)
print('probs: %s' % prediction.probs)

# Regression: predict() returns a CDist, test() returns a CDist of absolute error.
tree = Tree.build(Data('regression-training.csv'))
result = tree.test(Data('regression-testing.csv'))
print('MAE: %s' % result.mean)
prediction = tree.predict(dict(feature1=123, feature2='abc', feature3='hot'))
print('mean: %s' % prediction.mean)
print('variance: %s' % prediction.variance)
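To make the two test() summaries more concrete, the following standalone sketch computes classification accuracy and mean absolute error by hand. It does not use dtree: the helper names and sample values are made up for illustration, and how the module aggregates these figures internally is an assumption.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true class label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / float(len(actual))


def mean_absolute_error(predicted, actual):
    """Average of |prediction - truth| over all test records."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - a) for p, a in zip(predicted, actual))
    return total / float(len(actual))


# Hypothetical predictions and ground truth.
acc = accuracy(["hot", "cold", "hot", "mild"],
               ["hot", "cold", "mild", "mild"])
mae = mean_absolute_error([10.0, 12.5, 7.0], [11.0, 12.5, 5.0])
```

A higher accuracy is better, while a lower mean absolute error is better, which is why the same result.mean field has to be read differently for the two tree types.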
Currently supports:
- building a classification or regression tree using batch or incremental/online methods
Does not yet support:
- sparse training data
- sparse query vector
0.1.0 - 2012.01.24 Initial development.
0.2.0 - 2012.02.08 Refactored to support incremental/online tree construction and forests.