Prettier formatting

commit 6aa49cda07b2076891048fa093ccd0a6bc979ff3 1 parent 3ed0f66
@ashenfad ashenfad authored
Showing with 51 additions and 36 deletions.
  1. +22 −16 README.md
  2. +29 −20 src/main/java/com/bigml/histogram/Histogram.java
38 README.md
@@ -1,16 +1,20 @@
# About
-This project is an implementation of the streaming, one-pass histograms described in Ben-Haim's
-[Streaming Parallel Decision Trees](http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html). The
-histogram includes the extension added by Tyree's [Parallel Boosted Regression Trees]
-(http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf) which
-allows the histogram to include numeric targets (useful for regression trees). The histogram
-follows a similar approach to support categorical targets (useful for classification trees).
-
-The histograms act as an approximation of the underlying dataset. They can be used for
-learning, visualization, discretization, or analysis. This includes finding the median or any
-other percentile in one pass. The histograms may be built independently and merged, convenient
-for parallel and distributed algorithms.
+This project is an implementation of the streaming, one-pass
+histograms described in Ben-Haim's [Streaming Parallel Decision
+Trees](http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html). The
+histogram includes the extension added by Tyree's [Parallel Boosted
+Regression Trees]
+(http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf)
+which allows the histogram to include numeric targets (useful for
+regression trees). The histogram follows a similar approach to support
+categorical targets (useful for classification trees).
+
+The histograms act as an approximation of the underlying dataset.
+They can be used for learning, visualization, discretization, or
+analysis. This includes finding the median or any other percentile in
+one pass. The histograms may be built independently and merged,
+convenient for parallel and distributed algorithms.
# Building
@@ -39,10 +43,12 @@ double sum = hist.sum(0);
double split = hist.uniform(2).get(0);
```
-The `extendedSum` method is available for histograms that include categorical or numeric targets.
-Examples for the extended histograms may be found in the test `com.bigml.histogram.HistogramTest.java`.
+The `extendedSum` method is available for histograms that include
+categorical or numeric targets. Examples for the extended histograms
+may be found in the test `com.bigml.histogram.HistogramTest.java`.
-# Performance
-Scales `log(n)` with respect to the number of bins in the histogram.
+# Performance Scales `log(n)` with respect to the number of bins in
+the histogram.
-![timing chart](https://docs.google.com/spreadsheet/oimg?key=0Ah2oAcudnjP4dG1CLUluRS1rcHVqU05DQ2Z4UVZnbmc&oid=2&zx=mppmmoe214jm)
+![timing chart]
+(https://docs.google.com/spreadsheet/oimg?key=0Ah2oAcudnjP4dG1CLUluRS1rcHVqU05DQ2Z4UVZnbmc&oid=2&zx=mppmmoe214jm)
49 src/main/java/com/bigml/histogram/Histogram.java
@@ -11,17 +11,19 @@
/**
* Implements a Histogram as defined by the <a
- * href="http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html"> Streaming Parallel Decision Tree
- * (SPDT)</a> algorithm.
- *
- * <p>The Histogram consumes numeric points and maintains a running approximation of the dataset
- * using the given number of bins. The methods <code>insert</code>, <code>sum</code>, and
+ * href="http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html">
+ * Streaming Parallel Decision Tree (SPDT)</a> algorithm. <p>The
+ * Histogram consumes numeric points and maintains a running
+ * approximation of the dataset using the given number of bins. The
+ * methods <code>insert</code>, <code>sum</code>, and
* <code>uniform</code> are described in detail in the SPDT paper.
*
- * <p>The histogram has an <code>insert</code> method which uses two parameters and an
- * <code>extendedSum</code> method which add the capabilities described in <a
- * href="http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf"> Tyree's
- * paper</a>. Along with Tyree's extension this histogram supports inserts with categorical targets.
+ * <p>The histogram has an <code>insert</code> method which uses two
+ * parameters and an <code>extendedSum</code> method which add the
+ * capabilities described in <a
+ * href="http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf">
+ * Tyree's paper</a>. Along with Tyree's extension this histogram
+ * supports inserts with categorical targets.
*
* @author Adam Ashenfelter (ashenfelter@bigml.com)
*/
@@ -43,9 +45,10 @@ public Histogram(int maxBins) {
}
/**
- * Creates a Histogram initialized with the given <code>bins</code>. If the initial number of
- * <code>bins</code> exceeds the <code>maxBins</code> then the bins are merged until the histogram
- * is valid.
+ * Creates a Histogram initialized with the given
+ * <code>bins</code>. If the initial number of <code>bins</code>
+ * exceeds the <code>maxBins</code> then the bins are merged until
+ * the histogram is valid.
*
* @param maxBins the maximum number of bins for this histogram
* @param bins the initial bins for the histogram
@@ -99,7 +102,8 @@ public Histogram(int maxBins, Collection<Bin<T>> bins) throws MixedInsertExcepti
* @param point the new point
* @param target the categorical target
*/
- public Histogram<T> insertCategorical(double point, Object target) throws MixedInsertException {
+ public Histogram<T> insertCategorical(double point, Object target)
+ throws MixedInsertException {
checkType(TargetType.categorical);
insert(new Bin(point, 1, new CategoricalTarget(target)));
return this;
@@ -133,8 +137,9 @@ public double sum(double p_b) throws SumOutOfRangeException {
}
/**
- * Returns a <code>SumResult</code> object which contains the approximate number of points less
- * than <code>p_b</code> along with the sum of their targets.
+ * Returns a <code>SumResult</code> object which contains the
+ * approximate number of points less than <code>p_b</code> along
+ * with the sum of their targets.
*
* @param p_b the sum point
*/
@@ -145,7 +150,8 @@ public double sum(double p_b) throws SumOutOfRangeException {
double max = _bins.lastKey();
if (p_b < min || p_b > max) {
- throw new SumOutOfRangeException("Sum point " + p_b + " should be between " + min + " and " + max);
+ throw new SumOutOfRangeException("Sum point " + p_b + " should be between "
+ + min + " and " + max);
} else if (p_b == max) {
Bin<T> lastBin = _bins.lastEntry().getValue();
@@ -173,7 +179,8 @@ public double sum(double p_b) throws SumOutOfRangeException {
double bDiff = p_b - bin_i.getMean();
double pDiff = bin_i1.getMean() - bin_i.getMean();
double bpRatio = bDiff / pDiff;
- double m_b = bin_i.getCount() + (((bin_i1.getCount() - bin_i.getCount()) / pDiff) * bDiff);
+ double m_b = bin_i.getCount() +
+ (((bin_i1.getCount() - bin_i.getCount()) / pDiff) * bDiff);
double countSum = prevCount
+ (bin_i.getCount() / 2)
@@ -181,8 +188,9 @@ public double sum(double p_b) throws SumOutOfRangeException {
T targetSum_m_b = (T) bin_i1.getTarget().clone().subtractUpdate(bin_i.getTarget())
.multiplyUpdate(bDiff / pDiff).sumUpdate(bin_i.getTarget());
- T targetSum = (T) prevTargetSum.sumUpdate(bin_i.getTarget().clone().multiplyUpdate(0.5))
- .sumUpdate(targetSum_m_b.sumUpdate(bin_i.getTarget()).multiplyUpdate(bpRatio / 2d));
+ T targetSum = (T) prevTargetSum.sumUpdate(bin_i.getTarget().clone()
+ .multiplyUpdate(0.5)).sumUpdate(targetSum_m_b.sumUpdate(bin_i.getTarget())
+ .multiplyUpdate(bpRatio / 2d));
result = new SumResult<T>(countSum, targetSum);
}
@@ -191,7 +199,8 @@ public double sum(double p_b) throws SumOutOfRangeException {
}
/**
- * Returns a list containing split points that form bins with uniform membership
+ * Returns a list containing split points that form bins with
+ * uniform membership
*
* @param numberOfBins the desired number of uniform bins
*/
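
For readers following the `sum` hunk above, the count interpolation can be sketched in isolation. This is a minimal illustration, not the project's code. The class name `SumSketch` and the method `interpolatedCount` are hypothetical. The diff cuts off the final term of `countSum`, so the last term here follows the sum procedure in the Ben-Haim paper (an assumption about what the elided line contains). `prevCount` is assumed to be the total count of all bins before `bin_i`.

```java
// Hypothetical, standalone sketch of the count interpolation used by sum().
class SumSketch {

    /**
     * Approximates the number of points <= b, where b lies between the
     * means of two adjacent bins (meanI <= b <= meanI1).
     */
    static double interpolatedCount(double prevCount,
                                    double meanI, double countI,
                                    double meanI1, double countI1,
                                    double b) {
        double bDiff = b - meanI;       // distance from the left bin mean to b
        double pDiff = meanI1 - meanI;  // distance between the two bin means
        double bpRatio = bDiff / pDiff;

        // Linearly interpolated bin height at b (m_b in the diff).
        double mB = countI + ((countI1 - countI) / pDiff) * bDiff;

        // Everything before bin i, half of bin i, plus the trapezoid between
        // meanI and b. The trapezoid term is taken from the paper, since the
        // diff truncates that line.
        return prevCount + countI / 2 + (countI + mB) * bpRatio / 2;
    }

    public static void main(String[] args) {
        // Two bins: mean 1.0 with 4 points, mean 3.0 with 8 points.
        System.out.println(interpolatedCount(0, 1.0, 4, 3.0, 8, 2.0));
    }
}
```

At `b == meanI` the trapezoid term vanishes and the result is `prevCount` plus half of bin `i`. At `b == meanI1` the trapezoid contributes `(countI + countI1) / 2`.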