# About

This project implements the streaming, one-pass histograms described in Ben-Haim's
[Streaming Parallel Decision Trees](http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html). It
includes the extension from Tyree's
[Parallel Boosted Regression Trees](http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf),
which allows the histogram to include numeric targets (useful for regression trees). The histogram
follows a similar approach to support categorical targets (useful for classification trees).

The histograms approximate the underlying dataset. They can be used for learning, visualization,
discretization, or analysis, including estimating the median or any other percentile in one pass.
Histograms may be built independently and then merged, which is convenient for parallel and
distributed algorithms.

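To make the build-independently-then-merge idea concrete, here is a minimal, self-contained sketch of a streaming histogram in the style of the Ben-Haim paper. It is not this project's `Histogram` class: the class name `StreamingHistogramSketch` and all of its methods are made up for illustration, it needs Java 8 or newer, and it finds the closest pair of bins with a plain linear scan. Edge handling in `sum` is deliberately crude.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Illustrative streaming histogram; NOT this project's Histogram class. */
final class StreamingHistogramSketch {
    private final int maxBins;
    // bin center -> number of points represented by that bin
    private final TreeMap<Double, Double> bins = new TreeMap<>();

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 2) throw new IllegalArgumentException("maxBins must be at least 2");
        this.maxBins = maxBins;
    }

    /** Adds one point as its own bin, then shrinks back to maxBins. */
    void insert(double point) {
        bins.merge(point, 1.0, Double::sum);
        shrink();
    }

    /** Folds another histogram's bins into this one, then shrinks. */
    void merge(StreamingHistogramSketch other) {
        for (Map.Entry<Double, Double> bin : other.bins.entrySet()) {
            bins.merge(bin.getKey(), bin.getValue(), Double::sum);
        }
        shrink();
    }

    double total() {
        double n = 0.0;
        for (double count : bins.values()) n += count;
        return n;
    }

    /** Estimated number of points <= p, interpolating between neighboring bin centers. */
    double sum(double p) {
        if (bins.isEmpty() || p < bins.firstKey()) return 0.0;
        if (p >= bins.lastKey()) return total();
        double left = bins.floorKey(p), right = bins.higherKey(p);
        double leftCount = bins.get(left), rightCount = bins.get(right);
        double below = 0.0;
        for (double count : bins.headMap(left).values()) below += count;
        double frac = (p - left) / (right - left);
        double countAtP = leftCount + (rightCount - leftCount) * frac;
        return below + leftCount / 2 + (leftCount + countAtP) / 2 * frac;
    }

    /** Approximate q-quantile (0 <= q <= 1), found by bisection on sum(). */
    double quantile(double q) {
        double lo = bins.firstKey(), hi = bins.lastKey(), target = q * total();
        for (int i = 0; i < 60; i++) {
            double mid = (lo + hi) / 2;
            if (sum(mid) < target) lo = mid; else hi = mid;
        }
        return (lo + hi) / 2;
    }

    /** Repeatedly replaces the two closest bins with their weighted mean. */
    private void shrink() {
        while (bins.size() > maxBins) {
            Double prev = null, left = null;
            double bestGap = Double.POSITIVE_INFINITY;
            for (Double center : bins.keySet()) {      // linear scan, kept simple on purpose
                if (prev != null && center - prev < bestGap) {
                    bestGap = center - prev;
                    left = prev;
                }
                prev = center;
            }
            double right = bins.higherKey(left);
            double leftCount = bins.remove(left), rightCount = bins.remove(right);
            double merged = (left * leftCount + right * rightCount) / (leftCount + rightCount);
            bins.merge(merged, leftCount + rightCount, Double::sum);
        }
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        StreamingHistogramSketch first = new StreamingHistogramSketch(100);
        StreamingHistogramSketch second = new StreamingHistogramSketch(100);
        for (int i = 0; i < 50000; i++) {   // two histograms built independently
            first.insert(random.nextGaussian());
            second.insert(random.nextGaussian());
        }
        first.merge(second);                // combined into one approximation
        System.out.println("sum(0)  = " + first.sum(0));
        System.out.println("median = " + first.quantile(0.5));
    }
}
```

For standard normal input, `sum(0)` should land near half the total point count and the median near 0, although the exact values depend on the seed and the number of bins.
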
# Building

1. Install [Maven](http://maven.apache.org/)
2. Make sure you have Java 1.6 installed
3. Check out the histogram project using Git
4. Run `mvn clean install`

# Example

```java
import java.util.Random;
import com.bigml.histogram.Histogram;

long pointCount = 100000;
int histogramBins = 100;
Random random = new Random();
Histogram hist = new Histogram(histogramBins);

// Stream standard normal samples into the histogram
for (long i = 0; i < pointCount; i++) {
    hist.insert(random.nextGaussian());
}

// The sum at 0 (the estimated number of points <= 0) should be about 50000
double sum = hist.sum(0);

// The split point between two uniform (equal-population) bins should be about 0;
// this is an approximate median
double split = hist.uniform(2).get(0);
```

The `extendedSum` method is available for histograms that include categorical or numeric targets.
Examples for the extended histograms may be found in the test `com.bigml.histogram.HistogramTest.java`.

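As a rough illustration of what carrying a target alongside each bin can look like, the sketch below keeps a running target sum (numeric case) and per-category counts (categorical case) in a bin, and combines both when two bins are merged. Everything here is hypothetical: `TargetBinSketch` and its methods are invented for this example, and it does not show the signature or behavior of `extendedSum`. Holding both kinds of target in one bin is a simplification to keep the example short. It needs Java 8 or newer.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical histogram bin that also summarizes the targets of its points. */
final class TargetBinSketch {
    final double mean;                      // bin center
    final double count;                     // number of points in the bin
    final double targetSum;                 // sum of numeric targets
    final Map<String, Double> classCounts;  // points per categorical target

    TargetBinSketch(double mean, double count, double targetSum,
                    Map<String, Double> classCounts) {
        this.mean = mean;
        this.count = count;
        this.targetSum = targetSum;
        this.classCounts = classCounts;
    }

    /** A bin holding a single observation. */
    static TargetBinSketch of(double point, double numericTarget, String category) {
        Map<String, Double> counts = new HashMap<>();
        counts.put(category, 1.0);
        return new TargetBinSketch(point, 1.0, numericTarget, counts);
    }

    /** Merging bins adds counts and target summaries; the center is a weighted mean. */
    TargetBinSketch merge(TargetBinSketch other) {
        double total = count + other.count;
        Map<String, Double> merged = new HashMap<>(classCounts);
        other.classCounts.forEach((category, c) -> merged.merge(category, c, Double::sum));
        return new TargetBinSketch(
            (mean * count + other.mean * other.count) / total,
            total,
            targetSum + other.targetSum,
            merged);
    }

    /** Average numeric target of the points summarized by this bin. */
    double averageTarget() {
        return targetSum / count;
    }

    public static void main(String[] args) {
        TargetBinSketch bin = of(1.0, 10.0, "red")
            .merge(of(3.0, 20.0, "blue"))
            .merge(of(2.0, 30.0, "red"));
        System.out.println("center=" + bin.mean + " count=" + bin.count
            + " avgTarget=" + bin.averageTarget() + " classes=" + bin.classCounts);
    }
}
```

Because the target summaries are plain sums, they survive bin merges exactly, which is what lets a merged histogram still answer target questions for regression or classification.
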
# Performance

Runtime scales as `O(log(n))` with respect to the number of bins `n` in the histogram.

![timing chart](https://docs.google.com/spreadsheet/oimg?key=0Ah2oAcudnjP4dG1CLUluRS1rcHVqU05DQ2Z4UVZnbmc&oid=2&zx=mppmmoe214jm)
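
One way a logarithmic cost per insert can be reached is to keep both the bins and the gaps between neighboring bins in sorted structures, so that finding a point's neighbors and finding the narrowest gap are each `O(log n)`. The sketch below shows that idea. It is an assumption about how such a bound can be achieved, not a description of this project's internals, and `GapIndexedHistogramSketch` and its methods are hypothetical names. It needs Java 8 or newer.

```java
import java.util.Comparator;
import java.util.Random;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: bins and gaps are both kept sorted so each insert is O(log n). */
final class GapIndexedHistogramSketch {
    /** Gap between the bin centered at `left` and its right-hand neighbor. */
    private static final class Gap {
        final double left, width;
        Gap(double left, double right) { this.left = left; this.width = right - left; }
    }

    private final int maxBins;
    private final TreeMap<Double, Double> bins = new TreeMap<>();   // center -> count
    private final TreeSet<Gap> gaps = new TreeSet<>(                // narrowest gap first
        Comparator.comparingDouble((Gap g) -> g.width).thenComparingDouble(g -> g.left));

    GapIndexedHistogramSketch(int maxBins) {
        if (maxBins < 1) throw new IllegalArgumentException("maxBins must be at least 1");
        this.maxBins = maxBins;
    }

    void insert(double point) {
        Double count = bins.get(point);
        if (count != null) {                 // exact duplicate: just bump the count
            bins.put(point, count + 1.0);
            return;
        }
        link(point, 1.0);
        if (bins.size() > maxBins) mergeClosest();
    }

    /** Adds a new bin and rewires the gaps around it. */
    private void link(double center, double count) {
        Double before = bins.lowerKey(center), after = bins.higherKey(center);
        if (before != null && after != null) gaps.remove(new Gap(before, after));
        bins.put(center, count);
        if (before != null) gaps.add(new Gap(before, center));
        if (after != null) gaps.add(new Gap(center, after));
    }

    /** Removes a bin, rewires the gaps around it, and returns its count. */
    private double unlink(double center) {
        Double before = bins.lowerKey(center), after = bins.higherKey(center);
        if (before != null) gaps.remove(new Gap(before, center));
        if (after != null) gaps.remove(new Gap(center, after));
        if (before != null && after != null) gaps.add(new Gap(before, after));
        return bins.remove(center);
    }

    /** Replaces the two bins on either side of the narrowest gap with their weighted mean. */
    private void mergeClosest() {
        double left = gaps.first().left;
        double right = bins.higherKey(left);
        double leftCount = unlink(left), rightCount = unlink(right);
        double total = leftCount + rightCount;
        link((left * leftCount + right * rightCount) / total, total);
    }

    int binCount() { return bins.size(); }

    int gapCount() { return gaps.size(); }

    double total() {
        double n = 0.0;
        for (double count : bins.values()) n += count;
        return n;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        GapIndexedHistogramSketch hist = new GapIndexedHistogramSketch(100);
        for (int i = 0; i < 100000; i++) hist.insert(random.nextGaussian());
        System.out.println("bins=" + hist.binCount() + " points=" + hist.total());
    }
}
```

Every step of `insert` is a constant number of `TreeMap` or `TreeSet` operations, so the cost per point grows with the logarithm of the bin count rather than linearly.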