# Overview

This project is an implementation of the streaming, one-pass
histograms described in Ben-Haim's [Streaming Parallel Decision
Trees](http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html). Inspired
by Tyree's [Parallel Boosted Regression
Trees](http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf),
the histograms are extended to track multiple values.

The histograms act as an approximation of the underlying dataset. They
can be used for learning, visualization, discretization, or analysis.
The histograms may be built independently and merged, which is
convenient for parallel and distributed algorithms.

# Building

1. Make sure you have Java 1.6 or newer
2. Install [leiningen](https://github.com/technomancy/leiningen)
3. Check out the histogram project with git
4. Run `lein jar`

# Basics

In the following examples we use [Incanter](http://incanter.org/) to
generate data and for charting.

The simplest way to use a histogram is to `create` one and then
`insert!` points. In the example below, `ex/normal-data` refers to a
sequence of 100K samples from a normal distribution (mean 0, variance
1).

```clojure
user> (ns examples
        (:require (histogram [core :as hst])
                  (histogram.test [examples :as ex])))
examples> (def hist (reduce hst/insert! (hst/create) ex/normal-data))
```

You can use the `sum` fn to find the approximate number of points less
than a given threshold:

```clojure
examples> (hst/sum hist 0)
50044.02331806754
```

The `density` fn gives us an estimate of the point density at the
given location:

```clojure
examples> (hst/density hist 0)
39687.562791977114
```

The `uniform` fn returns a list of points that separate the
distribution into equal population areas. Here's an example that
produces quartiles:

```clojure
examples> (hst/uniform hist 4)
(-0.6723425970050285 -0.0011145378611749357 0.6713314937601746)
```
63
64 We can plot the sums and density estimates as functions. The red line
65 represents the sum, the blue line represents the density. If we
66 normalized the values (dividing by 100K), these lines approximate the
67 [cumulative distribution
68 function](http://en.wikipedia.org/wiki/Cumulative_distribution_function)
69 and the [probability distribution
70 function](http://en.wikipedia.org/wiki/Probability_density_function)
71 for the normal distribution.
72
73 ```clojure
74 examples> (ex/sum-density-chart hist)
75 ```
76 ![Histogram from normal distribution]
77 (https://img.skitch.com/20120427-jhrhpshfm6pppu3t3bu4kt9g7e.png)
78
79 The histogram approximates distributions using a constant number of
80 bins. This bin limit is a parameter when creating a histogram
81 (`:bins`, defaults to 64). A bin contains a `:count` of the points
82 within the bin along with the `:mean` for the values in the bin. The
83 edges of the bin aren't captured. Instead the histogram assumes that
84 points are distributed evenly with half the points less than the mean
85 and half greater. This explains the fraction sum in the example below:
86
87 ```clojure
88 examples> (def hist (-> (hst/create :bins 3)
89 (hst/insert! 1)
90 (hst/insert! 2)
91 (hst/insert! 3)))
92 examples> (hst/bins hist)
93 ({:mean 1.0, :count 1} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
94 examples> (hst/sum hist 2)
95 1.5
96 ```
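
The fractional result can be reproduced by hand. The sketch below
follows the trapezoid-style interpolation of the sum procedure in the
Ben-Haim paper, operating on plain bin maps like those returned by
`hst/bins`. `approx-sum` is a hypothetical name, it only handles points
between the first and last bin means, and the library's own
implementation may differ in its details.

```clojure
;; Hypothetical sketch (not the library's code).
(defn approx-sum
  "Approximate number of points <= p for bins sorted by :mean.
   Returns nil when p is not between the first and last bin means."
  [bins p]
  (let [pairs (map vector (range) bins (rest bins))]
    ;; Find the adjacent bins whose means bracket p.
    (when-let [[i lo hi] (first (filter (fn [[_ a b]]
                                          (and (<= (:mean a) p)
                                               (< p (:mean b))))
                                        pairs))]
      (let [below (reduce + (map :count (take i bins)))
            ;; Relative position of p between the two means (0 to 1).
            frac  (/ (- p (:mean lo))
                     (double (- (:mean hi) (:mean lo))))
            m-lo  (:count lo)
            ;; Interpolated bin height at p.
            m-p   (+ m-lo (* frac (- (:count hi) m-lo)))]
        ;; Everything below, half of the lower bin, plus the trapezoid
        ;; area between the lower bin's mean and p.
        (+ below (/ m-lo 2.0) (* frac (/ (+ m-lo m-p) 2.0)))))))

(def bins [{:mean 1.0 :count 1} {:mean 2.0 :count 1} {:mean 3.0 :count 1}])

(approx-sum bins 2)   ;; => 1.5
(approx-sum bins 2.5) ;; => 2.0
```

At `p = 2` the bin with mean 1.0 counts fully and the bin with mean 2.0
contributes half of its single point, giving 1.5.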

As mentioned earlier, the bin limit constrains the number of unique
bins a histogram can use to capture a distribution. The histogram
above was created with a limit of just three bins. When we add a
fourth unique value, the histogram creates a fourth bin and then
merges the two closest bins.

```clojure
examples> (hst/bins (hst/insert! hist 0.5))
({:mean 0.75, :count 2} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
```
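
The merge step can be sketched on plain bin maps. This is a minimal
illustration that assumes the merged bin takes the count-weighted mean
of the two bins it replaces, which is consistent with the output above.
`merge-bins` and `merge-closest` are hypothetical names, not library
functions.

```clojure
;; Hypothetical sketch (not the library's code).
(defn merge-bins
  "Combines two bins into one with a count-weighted mean."
  [a b]
  (let [n (+ (:count a) (:count b))]
    {:mean  (/ (+ (* (:mean a) (:count a))
                  (* (:mean b) (:count b)))
               (double n))
     :count n}))

(defn merge-closest
  "Merges the two adjacent bins whose means are closest together.
   Expects at least two bins, sorted by :mean."
  [bins]
  (let [bins (vec bins)
        gap  (fn [i] (- (:mean (bins (inc i))) (:mean (bins i))))
        i    (apply min-key gap (range (dec (count bins))))]
    (concat (take i bins)
            [(merge-bins (bins i) (bins (inc i)))]
            (drop (+ i 2) bins))))

(merge-closest [{:mean 0.5 :count 1} {:mean 1.0 :count 1}
                {:mean 2.0 :count 1} {:mean 3.0 :count 1}])
;; => ({:mean 0.75, :count 2} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
```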

A larger bin limit means a higher quality picture of the distribution,
but it also means a larger memory footprint. In the chart below, the
red line represents a histogram with 16 bins and the blue line
represents 64 bins.

```clojure
examples> (ex/multi-density-chart
           [(reduce hst/insert! (hst/create :bins 16) ex/normal-data)
            (reduce hst/insert! (hst/create :bins 64) ex/normal-data)])
```
![16 and 64 bins histograms](https://img.skitch.com/20120427-1x2fdrd7k5ks4rr9w59wkks7g.png)

Another option when creating a histogram is to use *gap
weighting*. When `:gap-weighted?` is true, the histogram is encouraged
to spend more of its bins capturing the densest areas of the
distribution. For the normal distribution that means better resolution
near the mean and less resolution near the tails. The chart below
shows a histogram without gap weighting in blue and with gap weighting
in red. Near the center of the distribution, red uses five bins in
roughly the same space that blue uses three.

```clojure
examples> (ex/multi-density-chart
           [(reduce hst/insert! (hst/create :bins 16 :gap-weighted? true)
                    ex/normal-data)
            (reduce hst/insert! (hst/create :bins 16 :gap-weighted? false)
                    ex/normal-data)])
```
![Gap weighting vs. No gap weighting](https://img.skitch.com/20120427-x7591npy3393iqs2k2cqfrr5hn.png)

# Merging

A strength of the histograms is their ability to merge with one
another. Histograms can be built on separate data streams and then
combined to give a better overall picture.

```clojure
examples> (let [samples (partition 1000 ex/normal-data)
                hist1 (reduce hst/insert! (hst/create :bins 16) (first samples))
                hist2 (reduce hst/insert! (hst/create :bins 16) (second samples))
                merged (-> (hst/create :bins 16)
                           (hst/merge! hist1)
                           (hst/merge! hist2))]
            (ex/multi-density-chart [hist1 hist2 merged]))
```
![Merged histograms](https://img.skitch.com/20120427-18ndb278u2bmep8aqq9bc3m7qk.png)
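
As a rough picture of what merging involves, the sketch below follows
the merge procedure from the Ben-Haim paper on plain bin maps: pool the
bins from both histograms, sort them by mean, and repeatedly combine the
closest pair until the bin limit is met. All names here are
hypothetical, and the library's `merge!` may work differently
internally (for example when gap weighting is enabled).

```clojure
;; Hypothetical sketch (not the library's code).
(defn- combine [a b]
  (let [n (+ (:count a) (:count b))]
    {:mean  (/ (+ (* (:mean a) (:count a)) (* (:mean b) (:count b)))
               (double n))
     :count n}))

(defn- squeeze
  "Replaces the closest adjacent pair in a sorted bin vector with one bin."
  [bins]
  (let [gaps (map (fn [a b] (- (:mean b) (:mean a))) bins (rest bins))
        i    (first (apply min-key second (map-indexed vector gaps)))]
    (vec (concat (subvec bins 0 i)
                 [(combine (bins i) (bins (inc i)))]
                 (subvec bins (+ i 2))))))

(defn merge-bin-lists
  "Pools two bin lists and shrinks the result to at most `limit` bins."
  [limit xs ys]
  (loop [bins (vec (sort-by :mean (concat xs ys)))]
    (if (<= (count bins) limit)
      bins
      (recur (squeeze bins)))))

(merge-bin-lists 3
                 [{:mean 1.0 :count 1} {:mean 3.0 :count 1}]
                 [{:mean 1.5 :count 1} {:mean 10.0 :count 2}])
;; => [{:mean 1.25, :count 2} {:mean 3.0, :count 1} {:mean 10.0, :count 2}]
```

The total count (5 in this example) is preserved, which is why
histograms built on separate streams can be combined without
revisiting the data.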

# Targets

While a simple histogram is nice for capturing the distribution of a
single variable, it's often important to capture the correlation
between variables. To that end, the histograms can track a second
variable called the *target*.

The target may be either numeric or categorical. The `insert!` fn is
overloaded to accept either type of target. Each histogram bin will
contain information summarizing the target. For numeric targets, the
target sums are tracked. For categorical targets, a map of counts is
maintained.

```clojure
examples> (-> (hst/create)
              (hst/insert! 1 9)
              (hst/insert! 2 8)
              (hst/insert! 3 7)
              (hst/insert! 3 6)
              (hst/bins))
({:target {:sum 9.0, :missing-count 0.0}, :mean 1.0, :count 1}
 {:target {:sum 8.0, :missing-count 0.0}, :mean 2.0, :count 1}
 {:target {:sum 13.0, :missing-count 0.0}, :mean 3.0, :count 2})
examples> (-> (hst/create)
              (hst/insert! 1 :a)
              (hst/insert! 2 :b)
              (hst/insert! 3 :c)
              (hst/insert! 3 :d)
              (hst/bins))
({:target {:counts {:a 1.0}, :missing-count 0.0}, :mean 1.0, :count 1}
 {:target {:counts {:b 1.0}, :missing-count 0.0}, :mean 2.0, :count 1}
 {:target {:counts {:d 1.0, :c 1.0}, :missing-count 0.0}, :mean 3.0, :count 2})
```
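
The bins with mean 3.0 above show two target summaries being folded
together. A minimal sketch of that combination on plain maps, assuming
numeric sums and per-category counts are simply added (`combine-targets`
is a hypothetical name, not a library function):

```clojure
;; Hypothetical sketch (not the library's code).
(defn combine-targets
  "Adds two target summaries of the same type (numeric or categorical)."
  [a b]
  (let [missing (+ (:missing-count a) (:missing-count b))]
    (if (contains? a :sum)
      ;; Numeric targets: sums add.
      {:sum (+ (:sum a) (:sum b)) :missing-count missing}
      ;; Categorical targets: counts add per category.
      {:counts (merge-with + (:counts a) (:counts b))
       :missing-count missing})))

(combine-targets {:sum 7.0 :missing-count 0.0}
                 {:sum 6.0 :missing-count 0.0})
;; => {:sum 13.0, :missing-count 0.0}

(combine-targets {:counts {:c 1.0} :missing-count 0.0}
                 {:counts {:d 1.0} :missing-count 0.0})
;; => {:counts {:c 1.0, :d 1.0}, :missing-count 0.0}
```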

Mixing target types isn't allowed:

```clojure
examples> (-> (hst/create)
              (hst/insert! 1 :a)
              (hst/insert! 2 999))
Can't mix insert types
  [Thrown class com.bigml.histogram.MixedInsertException]
```

`insert-numeric!` and `insert-categorical!` allow target types to be
set explicitly:

```clojure
examples> (-> (hst/create)
              (hst/insert-categorical! 1 1)
              (hst/insert-categorical! 1 2)
              (hst/bins))
({:target {:counts {2 1.0, 1 1.0}, :missing-count 0.0}, :mean 1.0, :count 2})
```

The `extended-sum` fn works similarly to `sum`, but returns a result
that includes the target information:

```clojure
examples> (-> (hst/create)
              (hst/insert! 1 :a)
              (hst/insert! 2 :b)
              (hst/insert! 3 :c)
              (hst/extended-sum 2))
{:sum 1.5, :target {:counts {:c 0.0, :b 0.5, :a 1.0}, :missing-count 0.0}}
```
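
The fractional category counts follow the same half-and-half assumption
as `sum`. The sketch below reproduces the result above for the special
case where the query point equals a bin mean: bins below the point
count fully, the bin at the point counts half, and bins above count
zero. `extended-sum-at-mean` is a hypothetical helper; it does not
interpolate between bins and omits `:missing-count`, so it is only an
illustration of the weighting.

```clojure
;; Hypothetical sketch (not the library's code).
(defn- scale-counts [counts w]
  (into {} (for [[k v] counts] [k (* w v)])))

(defn extended-sum-at-mean
  "Weighted point and category counts for a point p that equals a bin mean."
  [bins p]
  (let [weight (fn [bin]
                 (cond (< (:mean bin) p)  1.0
                       (== (:mean bin) p) 0.5
                       :else              0.0))]
    {:sum    (reduce + (map #(* (weight %) (:count %)) bins))
     :target {:counts (apply merge-with +
                             (map #(scale-counts (get-in % [:target :counts])
                                                 (weight %))
                                  bins))}}))

(extended-sum-at-mean
 [{:mean 1.0 :count 1 :target {:counts {:a 1.0}}}
  {:mean 2.0 :count 1 :target {:counts {:b 1.0}}}
  {:mean 3.0 :count 1 :target {:counts {:c 1.0}}}]
 2)
;; => {:sum 1.5, :target {:counts {:a 1.0, :b 0.5, :c 0.0}}}
```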

The `average-target` fn returns the average target value given a
point. To illustrate, the following histogram captures a dataset where
the input field is a sample from the normal distribution while the
target value is the sine of the input (but scaled and shifted to make
plotting easier). The density is in red and the average target value
is in blue:

```clojure
examples> (def make-y (fn [x] (+ 10000 (* 10000 (Math/sin x)))))
examples> (def hist (let [target-data (map (fn [x] [x (make-y x)])
                                           ex/normal-data)]
                      (reduce (fn [h [x y]] (hst/insert! h x y))
                              (hst/create)
                              target-data)))
examples> (ex/density-target-chart hist)
```
![Numeric target](https://img.skitch.com/20120427-q2y753qwnt4x1mhbs3ri9ddgt.png)

Continuing with the same histogram, we can see that `average-target`
produces values close to the original target:

```clojure
examples> (def view-target (fn [x] {:actual (make-y x)
                                    :approx (hst/average-target hist x)}))
examples> (view-target 0)
{:actual 10000.0, :approx {:sum 9617.150788081583, :missing-count 0.0}}
examples> (view-target (/ Math/PI 2))
{:actual 20000.0, :approx {:sum 19967.590011881348, :missing-count 0.0}}
examples> (view-target Math/PI)
{:actual 10000.000000000002, :approx {:sum 9823.774137889975, :missing-count 0.0}}
```

# Missing Values

Information about missing values is captured whenever the input field
or the target is `nil`. The `missing-bin` fn retrieves information
summarizing the instances with a missing input. For a basic histogram,
that is simply the count:

```clojure
examples> (-> (hst/create)
              (hst/insert! nil)
              (hst/insert! 7)
              (hst/insert! nil)
              (hst/missing-bin))
{:count 2}
```

For a histogram with a target, the `missing-bin` includes target
information:

```clojure
examples> (-> (hst/create)
              (hst/insert! nil :a)
              (hst/insert! 7 :b)
              (hst/insert! nil :c)
              (hst/missing-bin))
{:target {:counts {:a 1.0, :c 1.0}, :missing-count 0.0}, :count 2}
```

Targets can also be missing, in which case the target `missing-count`
is incremented:

```clojure
examples> (-> (hst/create)
              (hst/insert! nil :a)
              (hst/insert! 7 :b)
              (hst/insert! nil nil)
              (hst/missing-bin))
{:target {:counts {:a 1.0}, :missing-count 1.0}, :count 2}
```

# Array-backed Categorical Targets

By default a histogram with categorical targets stores the category
counts as Java HashMaps. Building and merging HashMaps can be
expensive. Alternatively the category counts can be backed by an
array. This can give better performance but requires the set of
possible categories to be declared when the histogram is created. To
do this, set the `:categories` parameter:

```clojure
examples> (def categories (map (partial str "c") (range 50)))
examples> (def data (vec (repeatedly 100000
                                     #(vector (rand) (str "c" (rand-int 50))))))
examples> (doseq [hist [(hst/create) (hst/create :categories categories)]]
            (time (reduce (fn [h [x y]] (hst/insert! h x y))
                          hist
                          data)))
"Elapsed time: 1295.402 msecs"
"Elapsed time: 516.72 msecs"
```
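
The idea behind array-backed counts can be sketched in a few lines:
fix a category-to-index mapping up front, then tally into a primitive
array. The names below are hypothetical and this is only an
illustration of the general technique, not the library's internal
representation.

```clojure
;; Hypothetical sketch (not the library's code).
(def declared-categories ["c0" "c1" "c2"])

;; Category -> array position, fixed when the categories are declared.
(def category-index (zipmap declared-categories (range)))

(defn count-categories
  "Tallies category occurrences into a primitive double array."
  [ys]
  (let [^doubles counts (double-array (count declared-categories))]
    (doseq [y ys]
      (let [i (int (category-index y))]
        (aset counts i (inc (aget counts i)))))
    counts))

(vec (count-categories ["c0" "c2" "c0"]))
;; => [2.0 0.0 1.0]
```

In this sketch an undeclared category has no index and the lookup
fails, which illustrates why the full set of categories has to be known
when the histogram is created.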

# Group Targets

Group targets allow the histogram to track multiple targets at the
same time. Each bin contains a sequence of target
information. Optionally, the target types in the group can be declared
when creating the histogram. Declaring the types on creation allows
the targets to be missing in the first insert:

```clojure
examples> (-> (hst/create :group-types [:categorical :numeric])
              (hst/insert! 1 [:a nil])
              (hst/insert! 2 [:b 8])
              (hst/insert! 3 [:c 7])
              (hst/insert! 1 [:d 6])
              (hst/bins))
({:target ({:counts {:a 1.0, :d 1.0}, :missing-count 0.0}
           {:sum 6.0, :missing-count 1.0}),
  :mean 1.0, :count 2}
 {:target ({:counts {:b 1.0}, :missing-count 0.0}
           {:sum 8.0, :missing-count 0.0}),
  :mean 2.0, :count 1}
 {:target ({:counts {:c 1.0}, :missing-count 0.0}
           {:sum 7.0, :missing-count 0.0}),
  :mean 3.0, :count 1})
```

# Performance

Insert time scales as `log(n)`, where `n` is the number of bins in the
histogram.

![timing chart](https://docs.google.com/spreadsheet/oimg?key=0Ah2oAcudnjP4dG1CLUluRS1rcHVqU05DQ2Z4UVZnbmc&oid=2&zx=mppmmoe214jm)
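
One common way to get logarithmic inserts is to keep the bins in a
sorted, tree-based map keyed by mean, so that locating the neighbors of
a new point does not require scanning every bin. The sketch below uses
Clojure's `sorted-map` with `subseq` and `rsubseq` to show the lookup.
It is an assumption about the general approach, not a description of
the library's internals, and `neighbors` is a hypothetical helper.

```clojure
;; Hypothetical sketch (not the library's code).
(def sorted-bins
  (sorted-map 1.0 {:count 1}
              2.0 {:count 1}
              3.0 {:count 1}))

(defn neighbors
  "Finds the bins just below and just above x in a sorted map of bins."
  [bins x]
  {:floor   (first (rsubseq bins <= x))   ; largest mean <= x
   :ceiling (first (subseq bins > x))})   ; smallest mean > x

(neighbors sorted-bins 2.4)
;; => {:floor [2.0 {:count 1}], :ceiling [3.0 {:count 1}]}
```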