A library for training and evaluating random forests.
- `cpp`: the C++ implementation (multithreaded and distributed variants).
- `utils`: MATLAB and Python scripts to convert data and forests between different formats.
- `python`: an older Python implementation using Cython. Its testing code could still be useful for forests trained with the C++ code.
- `data`: MATLAB scripts to generate synthetic data and to convert the MATLAB data format to the CSV data format. Check it out for details on the data formats; to some extent it can also be used for testing the code.
This code has been tested and no major bugs have been found. Nevertheless, this software is provided "as is", without warranty of any kind.
## Dependencies

The library uses boost (tested with 1.59.0 and 1.60.0), Eigen (tested with 3.2.0 and 3.2.8), CImg (included), TCLAP (included) and cereal (included). The code also uses RapidJSON, but this is already included with cereal. The distributed code additionally requires boost-mpi for communication.
## Building

### Linux and OS X
```shell
mkdir cpp/build
cd cpp/build
cmake ..
# Optionally modify the CMake configuration to enable/disable multi-threading, MPI, Matlab, HDF5 support etc.
# On a Linux system with MPI you might typically do something like this:
#cmake -DWITH_MPI=TRUE ..
# ... or this if you want Matlab support:
#cmake -DWITH_MPI=TRUE -DWITH_MATLAB=TRUE -DMATLAB_INCLUDE_DIRS=/usr/local/Matlab/R2015a/extern/include/ -DMATLAB_LIB_DIR=/usr/local/Matlab/R2015a/bin/glnxa64/ ..
make -j4
```
### Windows

The code compiles with Visual Studio 2013 and Visual Studio 2015. You need to compile boost (tested with 1.59.0), which requires libpng and zlib. I would recommend compiling everything with 64-bit support.
## Tools

- `depth_forest_trainer`: Trains a new forest depth-first.
- `level_forest_trainer`: Trains a new forest breadth-first.
- `dist_forest_trainer`: Trains a new forest in a distributed manner using MPI.
- `forest_predictor`: Predicts labels for a dataset. Can also be used to evaluate a dataset with ground truth.
- `forest_converter`: Converts a forest in JSON or binary format to MATLAB format.
## Data input file format
CSV format: Data and label images are given as individual image files. A `.csv` file contains the paths to the data and label images.
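The exact layout of this `.csv` file is defined by the C++ reader; as a hedged sketch, assuming one line per image pair with the data image path followed by the label image path (all file names below are hypothetical), it could look like this:

```
images/scene_0001.png,labels/scene_0001.png
images/scene_0002.png,labels/scene_0002.png
```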
MATLAB format: The data is given as a MATLAB `.mat` file with two fields, `data` and `labels` (default names, could be changed). Both fields contain a 3-dimensional array of size `N x W x H`, where

- `N` is the number of images,
- `W` is the width and
- `H` is the height of the images.

The arrays should be of type `double`. Labels should be from `0` to `C - 1`, where `C` is the number of classes. Negative labels are considered as background pixels and will be ignored.
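For illustration, a minimal MATLAB snippet that writes a toy dataset in this layout might look as follows. The sizes and class count are arbitrary, and the field names `data` and `labels` are the assumed defaults; the `data` folder of the repository contains the actual generation scripts.

```matlab
% Toy dataset: N images of size W x H with C classes (values chosen arbitrarily).
N = 10; W = 16; H = 16; C = 3;

% Both fields are 3-dimensional double arrays of size N x W x H.
data = rand(N, W, H);
labels = double(randi(C, N, W, H) - 1);   % class labels in 0 .. C-1

% Mark a random subset of pixels as background (negative labels are ignored).
labels(rand(N, W, H) < 0.1) = -1;

% Save under the default field names 'data' and 'labels'.
save('toy_data.mat', 'data', 'labels');
```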
## Data output file format
HDF5 and MATLAB format: Predictions can be output as an HDF5 or a MATLAB file. In both cases the file contains a dataset/matrix with the predicted labels for each input image (in the same order as the input images).
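As an illustration, the MATLAB output could be inspected like this. The file name and the variable name inside the file are assumptions for the sketch, not documented names; check the output of `forest_predictor` for the actual names.

```matlab
% Load a prediction file written by forest_predictor
% (file and variable names are hypothetical).
out = load('predictions.mat');
predicted_labels = out.predictions;

% One set of predicted labels per input image, in input order.
disp(size(predicted_labels));
```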
This is research code, so please excuse the lack of testing. I will try to add coverage in the future but this is currently low on my priority list.
## Scaling of distributed implementation
The distributed trainer works, but its scaling is suboptimal because the tree is trained level-wise, so communication between nodes grows quickly with tree depth. I will improve the scaling as soon as possible by switching to depth-first training once the current tree level has at least as many nodes as there are workers.
Checkpointing of forests is implemented for the level-based training code, but it is not well tested. It will be improved together with the distributed code.