# About

This project is an implementation of the streaming, one-pass histograms described in Ben-Haim's [Streaming Parallel Decision Trees](http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html). It also includes the extensions added by Tyree's [Parallel Boosted Regression Trees](http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf).

The histograms act as an approximation of the underlying dataset. They can be used for visualization, discretization, or analysis. This includes finding the median or any other percentile in one pass. The histograms may be built independently and combined, making them a good fit for map-reduce algorithms.

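To make the idea of building histograms independently and then combining them concrete, here is a minimal, self-contained sketch of the insert, merge, and sum procedures from the Ben-Haim paper. It is not this project's implementation: the class `StreamingHistogramSketch` and all of its method names are hypothetical, the extensions from Tyree's paper are left out, and queries outside the range of the bins are handled with a simplification (0 below the first bin, the total count at or above the last bin).

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical illustration only; not the Histogram class from this project.
public class StreamingHistogramSketch {
    private final int maxBins;
    // Maps each bin centroid to the number of points that bin represents.
    private final TreeMap<Double, Double> bins = new TreeMap<>();

    public StreamingHistogramSketch(int maxBins) {
        if (maxBins < 2) {
            throw new IllegalArgumentException("maxBins must be at least 2");
        }
        this.maxBins = maxBins;
    }

    // Add one point as a new bin of count 1, then shrink back to maxBins.
    public void insert(double point) {
        bins.merge(point, 1.0, Double::sum);
        trim();
    }

    // Combine another histogram into this one: union the bins, then shrink.
    public void merge(StreamingHistogramSketch other) {
        for (Map.Entry<Double, Double> e : other.bins.entrySet()) {
            bins.merge(e.getKey(), e.getValue(), Double::sum);
        }
        trim();
    }

    // Repeatedly replace the two closest bins with their weighted average.
    private void trim() {
        while (bins.size() > maxBins) {
            Double prev = null;
            Double left = null;
            double bestGap = Double.POSITIVE_INFINITY;
            for (Double key : bins.keySet()) {
                if (prev != null && key - prev < bestGap) {
                    bestGap = key - prev;
                    left = prev;
                }
                prev = key;
            }
            double right = bins.higherKey(left);
            double leftCount = bins.remove(left);
            double rightCount = bins.remove(right);
            double centroid = (left * leftCount + right * rightCount) / (leftCount + rightCount);
            bins.merge(centroid, leftCount + rightCount, Double::sum);
        }
    }

    // Total number of points inserted so far.
    public double total() {
        double t = 0.0;
        for (double count : bins.values()) {
            t += count;
        }
        return t;
    }

    // Estimate how many points are <= b by trapezoid interpolation between bins.
    public double sum(double b) {
        Map.Entry<Double, Double> lo = bins.floorEntry(b);
        Map.Entry<Double, Double> hi = bins.higherEntry(b);
        if (lo == null) {
            return 0.0;      // simplification: b lies below every bin
        }
        if (hi == null) {
            return total();  // simplification: b lies at or above the last bin
        }
        double below = 0.0;
        for (double count : bins.headMap(lo.getKey()).values()) {
            below += count;
        }
        double frac = (b - lo.getKey()) / (hi.getKey() - lo.getKey());
        double countAtB = lo.getValue() + (hi.getValue() - lo.getValue()) * frac;
        return below + lo.getValue() / 2.0 + (lo.getValue() + countAtB) / 2.0 * frac;
    }

    public static void main(String[] args) {
        Random random = new Random();
        StreamingHistogramSketch first = new StreamingHistogramSketch(100);
        StreamingHistogramSketch second = new StreamingHistogramSketch(100);
        for (int i = 0; i < 50000; i++) {
            first.insert(random.nextGaussian());
            second.insert(random.nextGaussian());
        }
        first.merge(second);
        // Roughly half of the 100000 points should fall below 0.
        System.out.println("points below 0 (approx): " + first.sum(0.0));
    }
}
```

In this sketch the two histograms play the role of two map tasks that each see half of the data, and `merge` plays the role of the reduce step. Keeping the bins in a `TreeMap` keeps the code short, at the cost of a linear scan for the closest pair on every insert.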
# Building

1. Install [Maven](http://maven.apache.org/)
2. Make sure you have Java 1.6
3. Check out the histogram project using Git
4. Run `mvn clean install`

# Example

```java
import java.util.Random;

long pointCount = 100000;
int histogramBins = 100;
Random random = new Random();
Histogram hist = new Histogram(histogramBins);

// insert points drawn from a standard normal distribution
for (long i = 0; i < pointCount; i++) {
    hist.insert(random.nextGaussian());
}

// the sum at 0 should be about 50000 (roughly half the points fall below 0)
double sum = hist.sum(0);

// the split point between two uniform (by population) bins should be about 0
// this is an approximate median
double split = hist.uniform(2).get(0);
```

# Performance

![timing chart](https://docs.google.com/spreadsheet/oimg?key=0Ah2oAcudnjP4dG1CLUluRS1rcHVqU05DQ2Z4UVZnbmc&oid=2&zx=mppmmoe214jm)