Skip to content
Branch: master
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.



- choose m = sqrt(p) features at random to split bc reducing m reduces correlation
- implement out of bag error OOR: For each observation zi = (xi,yi), construct its random
forest predictor by averaging only those trees corresponding to boot- strap samples in 
which zi did not appear. An oob error estimate is almost identical to that obtained by N-
fold cross- validation

1. Introduction

3 parts: traditional information-theoretic decision tree, evolutionary decision
tree derived from a forest of randomly trained trees, and lastly a random forest
of information theoretic decision trees that classify on majority vote of all
trees in the forest. The random forest has low variance and performs better 
than the others on most occasions.

About Random Forests:

From “Elements of Statistical Learning” by Hastie, Tibshirani, Friedman: “Bagging or 
bootstrap aggregation (section 8.7) is a technique for reducing the variance of an 
estimated prediction function. Bagging seems to work especially well for high-variance, 
low-bias procedures, such as trees. For regression, we simply fit the same regression tree 
many times to bootstrap- sampled versions of the training data, and average the result. 
For classification, a committee of trees each cast a vote for the predicted class”

2. Usage

All code resides in the src/ subdirectory, which also contains a Makefile that
can build and run the code.  As a convenience, the top level directory also
contains shellscripts which call `make`.  The and scripts also pass any command line arguments given to it 
along to the Java program. change to the rosset.corbin directory, then to build:


And to run the traditional learning algorithm type:

./ <IG or GR> <whichDataSet> <whichMonk>

Where “whichDataSet” is an integer input parameter (0=Congressional, 1=MONK, 
2=Mushroom). "IG or GR" is an integer, 0 for information gain selection strategy,
and 1 for gain ratio selection strategy. "whichMonk" is only applicable if
"whichDataSet" = 1, and then since there are 3 monk sets, "whichMonk" can be
either 1, 2, or 3. 

And to run the genetic algorithm, type:

./ <whichDataSet> <whichMonk> 

where "whichDataSet" and "whichMonk" input arguments are the same as for the

Please see the outpu/ forlder for the for the sample runs. Sometimes, there is 
more than one run for a given data set if I felt there was variablility in
performance. I give no analysis here explaining such performance or my choices
of parameters, but there are some comments in the code. 

The random forest can be run in the same manner as the traditional:

./ <IG or GR> <whichDataSet> <whichMonk>


to run traditional Gain Ratio on Monk-2:

./run_traditional 1 1 2

to run traditional information gain on mushroom:

./run_traditional 0 2

to run genetic algorithm on congressional data set:

./run_evolutionary 0

To generate documentation: 

cd to the src/ folder and type:

make docs

3. Structure

There is also a data/ directory which holds the example data for the three data
sets: monk, mushroom, and congressional. 

typing ls -R on the code/ directory in the src/ file:

Folders that represent packages: 

--------------------------------------------------------------------------------	Parser	Testing
Evolutionary					Traditional
Examples					Trees

files contained in each folder:


CongressionalExample.class	Example.class




You can’t perform that action at this time.