# About

This project is an implementation of the streaming, one-pass
histograms described in Ben-Haim's [Streaming Parallel Decision
Trees](http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html). The
histogram includes the extension added by Tyree's [Parallel Boosted
Regression Trees](http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf),
which allows the histogram to include numeric targets (useful for
regression trees). The histogram follows a similar approach to support
categorical targets (useful for classification trees).

The histograms act as an approximation of the underlying dataset.
They can be used for learning, visualization, discretization, or
analysis. This includes finding the median or any other percentile in
one pass. The histograms may be built independently and merged,
which is convenient for parallel and distributed algorithms.
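
To make the build-then-merge idea concrete, here is a minimal sketch of the bin-merging scheme from the Ben-Haim paper. `SketchHistogram` and its methods are hypothetical names chosen for illustration, not this project's API. The sketch uses a simple linear scan and ignores targets, so it should not be read as the project's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not the project's Histogram.
class SketchHistogram {
    // Each bin is {mean, count}; the list is kept sorted by mean.
    private final List<double[]> bins = new ArrayList<>();
    private final int maxBins;

    SketchHistogram(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    // Add one point as a bin of count 1, then shrink back to maxBins.
    void insert(double value) {
        insertBin(value, 1.0);
        shrink();
    }

    // Fold another histogram's bins into this one, then shrink.
    // Assumes `other` is a different object than `this`.
    void merge(SketchHistogram other) {
        for (double[] b : other.bins) {
            insertBin(b[0], b[1]);
        }
        shrink();
    }

    double totalCount() {
        double total = 0;
        for (double[] b : bins) {
            total += b[1];
        }
        return total;
    }

    int binCount() {
        return bins.size();
    }

    // Place a bin at its sorted position, or add to an existing bin with the same mean.
    private void insertBin(double mean, double count) {
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < mean) {
            i++;
        }
        if (i < bins.size() && bins.get(i)[0] == mean) {
            bins.get(i)[1] += count;
        } else {
            bins.add(i, new double[] {mean, count});
        }
    }

    // Repeatedly merge the two adjacent bins whose means are closest.
    private void shrink() {
        while (bins.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1)[0] - bins.get(i)[0];
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            double[] a = bins.get(best);
            double[] b = bins.remove(best + 1);
            double count = a[1] + b[1];
            a[0] = (a[0] * a[1] + b[0] * b[1]) / count; // count-weighted mean
            a[1] = count;
        }
    }
}
```

Two such histograms can be filled on separate workers and then combined with `merge`. The result stays within `maxBins` bins and keeps the total point count.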

# Building

1. Make sure you have Java 1.6
2. Install [leiningen](https://github.com/technomancy/leiningen)
3. Check out the histogram project with git
4. Run `lein jar`

# Example

```java
import java.util.Random;

long pointCount = 100000;
int histogramBins = 100;
Random random = new Random();
Histogram hist = new Histogram(histogramBins);

for (long i = 0; i < pointCount; i++) {
  hist.insert(random.nextGaussian());
}

// The sum at 0 (the approximate count of points <= 0) should be about 50000
double sum = hist.sum(0);

// The split point between two uniform (by population) bins should be about 0;
// this is an approximate median
double split = hist.uniform(2).get(0);
```

```clojure
(let [data (repeatedly 100000 #(rand))
      hist (reduce insert! (create) data)]
  (median hist))
```
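
For intuition about what a call like `sum` estimates, the following sketch follows the trapezoid-style sum procedure from the Ben-Haim paper. `SumSketch` and its array-based signature are hypothetical and only for illustration. The clamping at both ends is a simplifying assumption, and the project's own implementation may treat the edges differently.

```java
// Hypothetical helper, not part of the project's API.
final class SumSketch {
    // Estimates how many points are <= b, given bin means (sorted ascending)
    // and the matching bin counts.
    static double sum(double[] means, double[] counts, double b) {
        int n = means.length;
        // Simplifying assumption: clamp to 0 below the first bin mean.
        if (n == 0 || b < means[0]) {
            return 0.0;
        }
        // Simplifying assumption: return the full count at or above the last bin mean.
        if (b >= means[n - 1]) {
            double total = 0.0;
            for (double c : counts) {
                total += c;
            }
            return total;
        }
        // Find i such that means[i] <= b < means[i + 1].
        int i = 0;
        while (means[i + 1] <= b) {
            i++;
        }
        // All bins left of i count fully; bin i contributes half of its points.
        double s = counts[i] / 2.0;
        for (int j = 0; j < i; j++) {
            s += counts[j];
        }
        // Interpolate the bin height at b and add the trapezoid area up to b.
        double frac = (b - means[i]) / (means[i + 1] - means[i]);
        double heightAtB = counts[i] + (counts[i + 1] - counts[i]) * frac;
        s += (counts[i] + heightAtB) / 2.0 * frac;
        return s;
    }
}
```

With two bins at means 0 and 2 holding two points each, `SumSketch.sum(new double[] {0, 2}, new double[] {2, 2}, 1)` gives 2.0, half of the four points.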

# Performance

Insert time scales as `log(n)` with respect to the number of bins in the histogram.

![timing chart](https://docs.google.com/spreadsheet/oimg?key=0Ah2oAcudnjP4dG1CLUluRS1rcHVqU05DQ2Z4UVZnbmc&oid=2&zx=mppmmoe214jm)