
Updates the readme and examples with new charts

1 parent d516788 commit 80ac2fbeeab21ce3959df4d0cfded75ac1b5e2e1 @ashenfad ashenfad committed Jan 28, 2013
Showing with 51 additions and 48 deletions.
  1. +37 −31
  2. +14 −17 test/histogram/test/examples.clj
@@ -41,15 +41,15 @@ than a given threshold:
examples> (sum hist 0)
The `density` fn gives us an estimate of the point density at the
given location:
examples> (density hist 0)
The `uniform` fn returns a list of points that separate the
@@ -58,7 +58,13 @@ produces quartiles:
examples> (uniform hist 4)
-(-0.67234 -0.00111 0.67133)
+(-0.66904 0.00229 0.67605)
+Arbitrary percentiles can be found using `percentiles`:
+examples> (percentiles hist 0.5 0.95 0.99)
+{0.5 0.00229, 0.95 1.63853, 0.99 2.31390}
We can plot the sums and density estimates as functions. The red line
@@ -74,15 +80,16 @@ for the normal distribution.
examples> (ex/sum-density-chart hist) ;; also see (ex/cdf-pdf-chart hist)
![Histogram from normal distribution]
The histogram approximates distributions using a constant number of
bins. This bin limit is a parameter when creating a histogram
(`:bins`, defaults to 64). A bin contains a `:count` of the points
within the bin along with the `:mean` for the values in the bin. The
edges of the bin aren't captured. Instead the histogram assumes that
-points are distributed evenly with half the points less than the mean
-and half greater. This explains the fraction sum in the example below:
+points of a bin are distributed with half the points less than the bin
+mean and half greater. This explains the fractional sum in the example
examples> (def hist (-> (create :bins 3)
@@ -108,53 +115,54 @@ examples> (bins (insert! hist 0.5))
A larger bin limit means a higher quality picture of the distribution,
but it also means a larger memory footprint. In the chart below, the
-red line represents a histogram with 16 bins and the blue line
+red line represents a histogram with 8 bins and the blue line
represents 64 bins.
examples> (ex/multi-pdf-chart
- [(reduce insert! (create :bins 16) ex/normal-data)
- (reduce insert! (create :bins 64) ex/normal-data)])
+ [(reduce insert! (create :bins 8) ex/mixed-normal-data)
+ (reduce insert! (create :bins 64) ex/mixed-normal-data)])
-![64 and 32 bins histograms]
+![8 and 64 bins histograms]
Another option when creating a histogram is to use *gap
weighting*. When `:gap-weighted?` is true, the histogram is encouraged
to spend more of its bins capturing the densest areas of the
distribution. For the normal distribution that means better resolution
near the mean and less resolution near the tails. The chart below
shows a histogram without gap weighting in blue and with gap weighting
-in red. Near the center of the distribution, red uses six bins in
-roughly the same space that blue uses three.
+in red. Near the center of the distribution, red uses more bins and
+better captures the gaussian distribution's true curve.
examples> (ex/multi-pdf-chart
- [(reduce insert! (create :bins 16 :gap-weighted? true)
+ [(reduce insert! (create :bins 8 :gap-weighted? true)
- (reduce insert! (create :bins 16 :gap-weighted? false)
+ (reduce insert! (create :bins 8 :gap-weighted? false)
![Gap weighting vs. No gap weighting]
# Merging
A strength of the histograms is their ability to merge with one
another. Histograms can be built on separate data streams and then
combined to give a better overall picture.
+In this example, the blue line shows a density distribution from a
+histogram after merging 300 noisy histograms. The red shows one of the
+original histograms:
-examples> (let [samples (partition 1000 ex/normal-data)
- hist1 (reduce insert! (create :bins 16) (first samples))
- hist2 (reduce insert! (create :bins 16) (second samples))
- merged (-> (create :bins 16)
- (merge! hist1)
- (merge! hist2))]
- (ex/multi-density-chart [hist1 hist2 merged]))
+examples> (let [samples (partition 1000 ex/mixed-normal-data)
+ hists (map #(reduce insert! (create) %) samples)
+ merged (reduce merge! (create) (take 300 hists))]
+ (ex/multi-pdf-chart [(first hists) merged]))
![Merged histograms]
# Targets
@@ -238,9 +246,8 @@ examples> (-> (create)
The `average-target` fn returns the average target value given a
point. To illustrate, the following histogram captures a dataset where
the input field is a sample from the normal distribution while the
-target value is the sine of the input (but scaled and shifted to make
-plotting easier). The density is in red and the average target value
-is in blue:
+target value is the sine of the input. The density is in red and the
+average target value is in blue:
examples> (def make-y (fn [x] (Math/sin x)))
@@ -252,21 +259,20 @@ examples> (def hist (let [target-data (map (fn [x] [x (make-y x)])
examples> (ex/pdf-target-chart hist)
![Numeric target]
Continuing with the same histogram, we can see that `average-target`
produces values close to original target:
examples> (def view-target (fn [x] {:actual (make-y x)
:approx (:sum (average-target hist x))}))
-{:actual 0.0, :approx -0.04261679840707788}
examples> (view-target 0)
-{:actual 0.0, :approx -0.04261679840707788}
+{:actual 0.0, :approx -0.00051}
examples> (view-target (/ Math/PI 2))
{:actual 1.0, :approx 0.9968169965429206}
examples> (view-target Math/PI)
-{:actual 0.0, :approx 0.021364059655214544}
+{:actual 0.0, :approx 0.00463}
# Missing Values
@@ -4,8 +4,15 @@
[charts :as charts]
[distributions :as dst])))
-;; 100K samples from a normal distribution (mean 0 and variance 1)
-(def normal-data (repeatedly 100000 #(dst/draw (dst/normal-distribution))))
+;; Mixed samples from four normal distributions
+(def mixed-normal-data
+ (shuffle (concat (repeatedly 160000 #(dst/draw (dst/normal-distribution 0 0.2)))
+ (repeatedly 80000 #(dst/draw (dst/normal-distribution 1 0.2)))
+ (repeatedly 40000 #(dst/draw (dst/normal-distribution 2 0.2)))
+ (repeatedly 20000 #(dst/draw (dst/normal-distribution 3 0.2))))))
+(def normal-data
+ (repeatedly 200000 #(dst/draw (dst/normal-distribution 0 1))))
(defn multi-pdf-chart [hists]
(let [min (reduce min (map (comp :min hst/bounds) hists))
@@ -16,15 +23,6 @@
(charts/function-plot (hst/pdf (first hists)) min max)
(next hists)))))
-(defn multi-density-chart [hists]
- (let [min (reduce min (map (comp :min hst/bounds) hists))
- max (reduce max (map (comp :max hst/bounds) hists))]
- (core/view
- (reduce (fn [c h]
- (charts/add-function c #(hst/density h %) min max))
- (charts/function-plot #(hst/density (first hists) %) min max)
- (next hists)))))
(defn sum-density-chart [hist]
(let [{:keys [min max]} (hst/bounds hist)]
(core/view (-> (charts/function-plot #(hst/sum hist %) min max)
@@ -43,21 +41,20 @@
;; Builds and charts a histogram for the normal distribution.
(defn- normal-example []
- (let [hist (reduce hst/insert! (hst/create) normal-data)]
+ (let [hist (reduce hst/insert! (hst/create :bins 32) normal-data)]
(println "Total sum of points less than 0:" (hst/sum hist 0))
(println "Quartile splits:" (hst/uniform hist 4))
- (sum-density-chart hist)
(cdf-pdf-chart hist)))
(defn- varying-bins-example []
- [(reduce hst/insert! (hst/create :bins 16) normal-data)
- (reduce hst/insert! (hst/create :bins 64) normal-data)]))
+ [(reduce hst/insert! (hst/create :bins 8) mixed-normal-data)
+ (reduce hst/insert! (hst/create :bins 64) mixed-normal-data)]))
(defn- gap-weighted-example []
- [(reduce hst/insert! (hst/create :bins 16 :gap-weighted? true) normal-data)
- (reduce hst/insert! (hst/create :bins 16 :gap-weighted? false) normal-data)]))
+ [(reduce hst/insert! (hst/create :bins 8 :gap-weighted? true) normal-data)
+ (reduce hst/insert! (hst/create :bins 8 :gap-weighted? false) normal-data)]))
(defn- numeric-target-example []
(let [target-data (map (fn [x] [x (Math/sin x)])
