KoalaTrees.jl

Decision tree machine learning algorithms for use with the Koala machine learning environment.

Basic usage

Load some data and rows for the train/test sets:

    julia> using Koala
    julia> X, y = load_ames()
    julia> train, test = split(eachindex(y), 0.8); # 80:20 split

This data consists of a mix of numerical and categorical features. By convention, any column of X whose eltype is a subtype of AbstractFloat is treated as numerical; all other columns are treated as categorical (including integer columns, and columns with missing data, whose eltypes are Union types).
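
As a minimal sketch of how this convention plays out (assuming here that X is a DataFrame, so that names(X) lists the column names and X[name] retrieves a column, per the DataFrames API of the time):

    julia> for name in names(X)
               # a column is numerical iff its eltype is a subtype of AbstractFloat
               kind = eltype(X[name]) <: AbstractFloat ? "numerical" : "categorical"
               println(name, " => ", kind)
           end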

Let us instantiate a tree model:

    julia> using KoalaTrees
    julia> tree = TreeRegressor(regularization=0.5)
    TreeRegressor@...095

    julia> showall(tree)
    TreeRegressor@...095

    key                     | value
    ------------------------|------------------------
    extreme                 |false
    max_features            |0
    max_height              |1000
    min_patterns_split      |2
    penalty                 |0.0
    regularization          |0.5

Here max_features=0 means that all features are considered when computing splits at a node. We change this as follows:

    tree.max_features = 3

Now we build and train a machine. The machine essentially wraps the model tree together with the learning data supplied to the Machine constructor, transformed into a form appropriate for the tree-building algorithm (with only the train rows used to compute the transformation parameters):

    julia> treeM = Machine(tree, X, y, train)
    julia> fit!(treeM, train)
    julia> showall(treeM)
    
    SupervisedMachine{TreeRegressor@...095}@...548

    key                     | value
    ------------------------|------------------------
    Xt                      |DataTableaux.DataTableau of shape (1456, 12)
    metadata                |Object of type Array{Symbol,1}
    model                   |TreeRegressor@...095
    n_iter                  |1
    predictor               |Node{KoalaTrees.NodeData}@...3798
    scheme_X                |FrameToTableauScheme@...8626
    scheme_y                |nothing
    yt                      |Array{Float64,1} of shape (1456,)

    Model detail:
    TreeRegressor@...095

    key                     | value
    ------------------------|------------------------
    extreme                 |false
    max_features            |3
    max_height              |1000
    min_patterns_split      |2
    penalty                 |0.0
    regularization          |0.5

                        Feature importance at penalty=0.0:
                    ┌────────────────────────────────────────┐ 
          GrLivArea │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.225 │ 
       Neighborhood │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.153            │ 
        OverallQual │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.141             │ 
         BsmtFinSF1 │▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ 0.108                  │ 
         GarageArea │▪▪▪▪▪▪▪▪▪▪▪▪ 0.085                      │ 
        TotalBsmtSF │▪▪▪▪▪▪▪▪▪ 0.062                         │ 
            LotArea │▪▪▪▪▪▪▪▪ 0.056                          │ 
         MSSubClass │▪▪▪▪▪▪▪▪ 0.055                          │ 
          YearBuilt │▪▪▪▪▪▪ 0.041                            │ 
       YearRemodAdd │▪▪▪▪▪▪ 0.038                            │ 
          x1stFlrSF │▪▪▪▪ 0.026                              │ 
         GarageCars │▪▪ 0.012                                │ 
                    └────────────────────────────────────────┘ 

Compute the RMS error on the test set:

    julia> err(treeM, test)
    42581.70098526429
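
Here err reports the root-mean-square error on the specified rows. As a sketch of the quantity being computed (a hypothetical helper, not Koala's actual implementation):

    julia> rms(yhat, y) = sqrt(mean((yhat .- y).^2))   # RMS deviation of predictions from targets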

Tune the regularization parameter:

    julia> u, v = @curve r logspace(-3,2,100) begin
               tree.regularization = r
               fit!(treeM, train)
               err(treeM, test)
           end

    julia> tree.regularization = u[indmin(v)]
    0.8497534359086443

    julia> fit!(treeM, train)
    SupervisedMachine{TreeRegressor@...095}@...548

    julia> err(treeM, test)
    39313.459637964435
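
To extract the predictions themselves one calls predict on the machine. The signature below mirrors the fit!/err calls above but is an assumption, not documented API:

    julia> yhat = predict(treeM, X, test);   # hypothetical signature: (machine, data, rows)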

Model parameters

  • max_features=0: Number of features randomly selected at each node when determining the splitting criterion (integer). If 0 (the default), all features are used.

  • min_patterns_split=2: Minimum number of patterns at a node for a split to be considered (integer).

  • penalty=0.0 (range: [0,1]): The gain afforded by new features is penalized by multiplying by the factor 1 - penalty before being compared with the gain afforded by previously selected features. Useful for feature selection, as introduced in "Feature Selection via Regularized Trees", H. Deng and G. Runger, International Joint Conference on Neural Networks (IJCNN), IEEE, 2012.

  • extreme=false: If true, the split of each feature considered is uniformly random rather than optimal. Mainly used to build extreme random forests with KoalaEnsembles.

  • regularization=0.0 (range: [0,1)): Degree of regularization, whereby predictions are a weighted sum of the predictions at a leaf and at its "nearest" neighboring leaves. For details, see this post.

  • max_height=1000 (range: [0, Inf]): How high in the tree predictors look for "nearby" leaves when making regularized predictions.

  • max_bin=0 (range: any non-negative integer except one; values are effectively rounded down to a power of two): Number of bins for histogram-based splitting, which is active whenever max_bin is non-zero.

  • bin_factor=90 (range: [1, ∞)): When the number of patterns at a node falls below bin_factor*max_bin, exact splitting replaces histogram splitting; see the sketch below.
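
As a sketch of the switchover just described (that TreeRegressor accepts max_bin and bin_factor as keyword arguments, like the parameters above, is an assumption here, and the numbers are purely illustrative):

    julia> tree = TreeRegressor(max_bin=32, bin_factor=90);  # assumed keyword arguments

    julia> tree.bin_factor*tree.max_bin   # below this many patterns, exact splitting takes over
    2880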
