KoalaTrees.jl

Decision tree machine learning algorithms for use with the Koala machine learning environment.

Basic usage

Load some data and rows for the train/test sets:

    julia> using Koala
    julia> X, y = load_ames()
    julia> train, test = split(eachindex(y), 0.8); # 80:20 split

This data consists of a mix of numerical and categorical features. By convention, any column of X whose eltype is a subtype of AbstractFloat is treated as numerical; all other columns are treated as categorical (including integer columns, and columns with missing data, whose eltypes are Union types).
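
As a minimal sketch of how this convention plays out (assuming here that X is a DataFrame, so that names(X) lists the column names and X[name] retrieves a column, per the DataFrames API of the time):

    julia> for name in names(X)
               # a column is numerical iff its eltype is a subtype of AbstractFloat
               kind = eltype(X[name]) <: AbstractFloat ? "numerical" : "categorical"
               println(name, " => ", kind)
           end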

Let us instantiate a tree model:

    julia> using KoalaTrees
    julia> tree = TreeRegressor(regularization=0.5)
    TreeRegressor@...095

    julia> showall(tree)
    TreeRegressor@...095

    key                     | value
    ------------------------|------------------------
    extreme                 |false
    max_features            |0
    max_height              |1000
    min_patterns_split      |2
    penalty                 |0.0
    regularization          |0.5

Here max_features=0 means that all features are considered when computing splits at a node. We change this as follows:

    tree.max_features = 3

Now we build and train a machine. The machine essentially wraps the model tree together with the learning data supplied to the Machine constructor, transformed into a form appropriate for the tree-building algorithm (with only the train rows used to compute the transformation parameters):

    julia> treeM = Machine(tree, X, y, train)
    julia> fit!(treeM, train)
    julia> showall(treeM)
    
    SupervisedMachine{TreeRegressor@...095}@...548

    key                     | value
    ------------------------|------------------------
    Xt                      |DataTableaux.DataTableau of shape (1456, 12)
    metadata                |Object of type Array{Symbol,1}
    model                   |TreeRegressor@...095
    n_iter                  |1
    predictor               |Node{KoalaTrees.NodeData}@...3798
    scheme_X                |FrameToTableauScheme@...8626
    scheme_y                |nothing
    yt                      |Array{Float64,1} of shape (1456,)

    Model detail:
    TreeRegressor@...095

    key                     | value
    ------------------------|------------------------
    extreme                 |false
    max_features            |3
    max_height              |1000
    min_patterns_split      |2
    penalty                 |0.0
    regularization          |0.5

                        Feature importance at penalty=0.0:
                    ┌────────────────────────────────────────┐ 
          GrLivArea │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.225 │ 
       Neighborhood │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.153            │ 
        OverallQual │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.141             │ 
         BsmtFinSF1 │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.108                  │ 
         GarageArea │▪▪▪▪▪▪▪▪▪▪▪▪ 0.085                      │ 
        TotalBsmtSF │▪▪▪▪▪▪▪▪▪ 0.062                         │ 
            LotArea │▪▪▪▪▪▪▪▪ 0.056                          │ 
         MSSubClass │▪▪▪▪▪▪▪▪ 0.055                          │ 
          YearBuilt │▪▪▪▪▪▪ 0.041                            │ 
       YearRemodAdd │▪▪▪▪▪▪ 0.038                            │ 
          x1stFlrSF │▪▪▪▪ 0.026                              │ 
         GarageCars │▪▪ 0.012                                │ 
                    └────────────────────────────────────────┘ 

Compute the RMS error on the test set:

    julia> err(treeM, test)
    42581.70098526429
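
Here err reports the root-mean-square error on the specified rows. As a sketch of the quantity being computed (a hypothetical helper, not Koala's actual implementation):

    julia> rms(yhat, y) = sqrt(mean((yhat .- y).^2))   # RMS deviation of predictions from targets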

Tune the regularization parameter:

    julia> u, v = @curve r logspace(-3,2,100) begin
               tree.regularization = r
               fit!(treeM, train)
               err(treeM, test)
           end

    julia> tree.regularization = u[indmin(v)]
    0.8497534359086443

    julia> fit!(treeM, train)
    SupervisedMachine{TreeRegressor@...095}@...548

    julia> err(treeM, test)
    39313.459637964435
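
To extract the predictions themselves one calls predict on the machine. The signature below mirrors the fit!/err calls above but is an assumption, not documented API:

    julia> yhat = predict(treeM, X, test);   # hypothetical signature: (machine, data, rows)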

Model parameters

  • max_features=0: Number of features randomly selected at each node when determining the splitting criterion (integer). If 0 (the default), all features are used.

  • min_patterns_split=2: Minimum number of patterns at a node for a split to be considered (integer).

  • penalty=0.0 (range: [0,1]): The gain afforded by new features is penalized by multiplying by the factor 1 - penalty before being compared with the gain afforded by previously selected features. Useful for feature selection, as introduced in "Feature Selection via Regularized Trees", H. Deng and G. Runger, International Joint Conference on Neural Networks (IJCNN), IEEE, 2012.

  • extreme=false: If true, the split of each feature considered is uniformly random rather than optimal. Mainly used to build extreme random forests with KoalaEnsembles.

  • regularization=0.0 (range: [0,1)): Degree of regularization, whereby predictions are a weighted sum of the predictions at a leaf and at its "nearest" neighboring leaves. For details, see this post.

  • max_height=1000 (range: [0, Inf]): How high in the tree predictors look for "nearby" leaves when making regularized predictions.

  • max_bin=0 (range: any non-negative integer except one; values are effectively rounded down to a power of two): Number of bins for histogram-based splitting, which is active whenever max_bin is non-zero.

  • bin_factor=90 (range: [1, ∞)): When the number of patterns at a node falls below bin_factor*max_bin, exact splitting replaces histogram splitting; see the sketch below.
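
As a sketch of the switchover just described (that TreeRegressor accepts max_bin and bin_factor as keyword arguments, like the parameters above, is an assumption here, and the numbers are purely illustrative):

    julia> tree = TreeRegressor(max_bin=32, bin_factor=90);  # assumed keyword arguments

    julia> tree.bin_factor*tree.max_bin   # below this many patterns, exact splitting takes over
    2880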
