[SPARK-5893][ML] Add bucketizer #5980

yinxusen · 2015-05-07T13:10:14Z

JIRA issue here.

One thing to make clear, the buckets parameter, which is an array of Double, performs as split points. Say,

buckets = Array(-0.5, 0.0, 0.5)

splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.

AmplabJenkins · 2015-05-07T13:12:10Z

Merged build triggered.

yinxusen · 2015-05-07T13:12:19Z

@jkbradley

AmplabJenkins · 2015-05-07T13:12:19Z

Merged build started.

SparkQA · 2015-05-07T13:12:32Z

Test build #32110 has started for PR 5980 at commit 998bc87.

yinxusen · 2015-05-07T13:38:22Z

@jkbradley, I will push the refactored #5779 as soon as this PR merged.

AmplabJenkins · 2015-05-07T14:45:55Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-05-07T14:45:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32110/
Test FAILed.

shaneknapp · 2015-05-07T14:46:17Z

i'll retrigger this build once jenkins is back up...

AmplabJenkins · 2015-05-07T14:47:11Z

Merged build triggered.

yinxusen · 2015-05-07T14:49:41Z

I change the Bucketizer from a Transform to a Model, otherwise we can not use it as the output of an Estimator.

AmplabJenkins · 2015-05-07T15:03:57Z

Merged build started.

shaneknapp · 2015-05-07T15:04:05Z

jenkins, test this please

SparkQA · 2015-05-07T15:06:52Z

Test build #32114 has started for PR 5980 at commit 11fb00a.

AmplabJenkins · 2015-05-07T15:07:13Z

Merged build triggered.

AmplabJenkins · 2015-05-07T15:07:20Z

Merged build started.

SparkQA · 2015-05-07T15:09:40Z

Test build #32115 has started for PR 5980 at commit 11fb00a.

SparkQA · 2015-05-07T16:51:21Z

Test build #32114 has finished for PR 5980 at commit 11fb00a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class Bucketizer(override val parent: Estimator[Bucketizer] = null)

AmplabJenkins · 2015-05-07T16:51:27Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-07T16:51:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32114/
Test PASSed.

SparkQA · 2015-05-07T16:55:14Z

Test build #32115 has finished for PR 5980 at commit 11fb00a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class Bucketizer(override val parent: Estimator[Bucketizer] = null)

AmplabJenkins · 2015-05-07T16:55:21Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-07T16:55:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32115/
Test PASSed.

jkbradley · 2015-05-07T18:51:03Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

+ * `Bucketizer` maps a column of continuous features to a column of feature buckets.
+ */
+@AlphaComponent
+final class Bucketizer(override val parent: Estimator[Bucketizer] = null)


This should use 2 constructors (1 with an argument, 1 without) to be Java-friendly. Also, can we make the constructor with 1 argument be private[ml] for now?

AmplabJenkins · 2015-05-08T05:45:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32193/
Test PASSed.

mengxr · 2015-05-08T06:15:08Z

@yinxusen @jkbradley Instead of having three parameters to control the buckets, is it simpler to let the users provide the boundaries directly? For example, [0, 1, 2, ..., 24] gives 24 buckets. If users want to include values out of this range, they should provide [-inf, 0, 1, 2, 3, ..., 24, inf] instead. This would also simplify the implementation.

yinxusen · 2015-05-08T06:31:08Z

@mengxr How to represent -inf and inf? One possible solution is

val inf = Double.MaxValue
val negInf = Double.MinValue
splits = Array(negInf, 0, 1, 2, 3, inf)

Or we can define inf and negInf as double parameters, without setter.

val inf = new DoubleParam(this, "inf", "")
setDefault(inf -> Double.MaxValue)
def getInf = $(inf)
val negInf = new DoubleParam(this, "negInf", "")
setDefault(negInf -> Double.MinValue)
def getNegInf = $(negInf)

However, it would break the meaning of splits, I think.

mengxr · 2015-05-08T07:24:12Z

inf is in the IEEE standard for floating-point numbers:

Java Double.POSITIVE_INFINITY: https://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#POSITIVE_INFINITY
Scala Double.PositiveInfinity: http://www.scala-lang.org/api/current/index.html#scala.Double$
Python float('inf'): https://docs.python.org/2/library/functions.html#float

jkbradley · 2015-05-08T16:13:15Z

@mengxr I thought about this, but here's the comment I posted above:

One issue was brought up: What about ranges of buckets where the ends (-inf, firstSplit] and (lastSplit, inf) don't make sense? E.g., for ages, we don't want a range (-inf, 0]. We could include this in a few ways:

(1) Make users include -inf and/or inf in the splits if they want those ranges.
(2) Have Boolean parameters lowerInclusive, upperInclusive to indicate whether to include those ranges.

I prefer the Boolean parameters with defaults of true. That will make it easy to have the default behavior accept all numerical values. When the range does not to to -inf and inf, then we'll have to throw an error and report the bad value.

mengxr · 2015-05-08T18:31:14Z

Sorry that I didn't see the hidden GitHub comments. I think it depends on the use cases of bucketizer. My guess is that in most cases users use bucketizer on values that they already know the range. For example, time -> hourOfDay, dayOfMonth, or age -> 10-year buckets. In those cases, accepting (-inf, firstSplit] and (lastSplit, inf) by default may result empty buckets. Another issue is what the expected behavior is for values out of range. Should they go to a new bucket, or be merged into the closest bucket? Letting users explicitly specify the buckets would be able to avoid those problems.

jkbradley · 2015-05-08T19:17:08Z

It's hard to say for use cases. It's true empty buckets might be a problem. I'm OK with requiring explicit buckets for -inf, inf; let's just do that, but we'll have to make sure we document that in the Scala doc + examples. For values out of range, there are several behaviors the user might want:

new bucket for all bad values
closest bucket
throw an error (to make user aware of bad data or assumptions)

We could eventually support all of these via a new parameter, but for now, shall we choose 1? I suppose I'd vote for putting values in the closest bucket (which will mean the first and last split values will be changed to -inf, inf under the hood).

mengxr · 2015-05-08T21:02:17Z

Yes, we need document all cases. I kinda like 3 (throw an error) as the default, because it is the most strict. If there is an exception, users should decide to switch to 1 or 2 by changing the bucket boundaries.

jkbradley · 2015-05-08T21:06:24Z

@yinxusen OK, so I think we've converged. Can you please update it to take splits, but no "lowerInclusive, upperInclusive" parameters? Let's start with 3 (throw an error) for handling bad values. It will be great if the driver catches the SparkException and prints a more useful error message, recommending the user set the bin boundaries to -inf, inf.

A later PR can add an extra parameter for supporting all 3 behaviors.

yinxusen · 2015-05-11T02:11:47Z

@mengxr @jkbradley I see. I will be soon, say, in 2 days. It's really a little busy here recently. Sorry for the delay.

jkbradley · 2015-05-11T02:16:47Z

@yinxusen I'm worried about this missing 1.4. Would you mind if I sent a PR to you to update this PR?

yinxusen · 2015-05-11T04:42:33Z

@jkbradley Yes, thanks! Sorry for the delay again.

…ed splits instead.

Removed lowerInclusive, upperInclusive params from Bucketizer

AmplabJenkins · 2015-05-11T22:02:14Z

Merged build triggered.

AmplabJenkins · 2015-05-11T22:02:20Z

Merged build started.

SparkQA · 2015-05-11T22:02:53Z

Test build #32430 has started for PR 5980 at commit dc8c843.

yinxusen · 2015-05-11T22:06:38Z

@jkbradley I have merged your PR. Thanks!

jkbradley · 2015-05-11T23:25:34Z

@yinxusen Thanks! I'll merge once tests pass

SparkQA · 2015-05-11T23:49:00Z

Test build #32430 has finished for PR 5980 at commit dc8c843.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-11T23:49:05Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-11T23:49:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32430/
Test PASSed.

jkbradley · 2015-05-12T01:40:28Z

Merging into master and branch-1.4. @yinxusen Thanks!

JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893). One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say, ```scala buckets = Array(-0.5, 0.0, 0.5) ``` splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3. Author: Xusen Yin <yinxusen@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits: dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893 1ca973a [Joseph K. Bradley] one more bucketizer test 34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead. eacfcfa [Xusen Yin] change ML attribute from splits into buckets c3cc770 [Xusen Yin] add more unit test for binary search 3a16cc2 [Xusen Yin] refine comments and names ac77859 [Xusen Yin] fix style error fb30d79 [Xusen Yin] fix and test binary search 2466322 [Xusen Yin] refactor Bucketizer 11fb00a [Xusen Yin] change it into an Estimator 998bc87 [Xusen Yin] check buckets 4024cf1 [Xusen Yin] add test suite 5fe190e [Xusen Yin] add bucketizer (cherry picked from commit 35fb42a) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893). One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say, ```scala buckets = Array(-0.5, 0.0, 0.5) ``` splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3. Author: Xusen Yin <yinxusen@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#5980 from yinxusen/SPARK-5893 and squashes the following commits: dc8c843 [Xusen Yin] Merge pull request apache#4 from jkbradley/yinxusen-SPARK-5893 1ca973a [Joseph K. Bradley] one more bucketizer test 34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead. eacfcfa [Xusen Yin] change ML attribute from splits into buckets c3cc770 [Xusen Yin] add more unit test for binary search 3a16cc2 [Xusen Yin] refine comments and names ac77859 [Xusen Yin] fix style error fb30d79 [Xusen Yin] fix and test binary search 2466322 [Xusen Yin] refactor Bucketizer 11fb00a [Xusen Yin] change it into an Estimator 998bc87 [Xusen Yin] check buckets 4024cf1 [Xusen Yin] add test suite 5fe190e [Xusen Yin] add bucketizer

yinxusen added 3 commits May 7, 2015 16:09

add bucketizer

5fe190e

add test suite

4024cf1

check buckets

998bc87

yinxusen changed the title ~~[SPARK-5893]~~ [SPARK-5893][ML] Add bucketizer May 7, 2015

change it into an Estimator

11fb00a

jkbradley reviewed May 7, 2015
View reviewed changes

jkbradley and others added 3 commits May 11, 2015 12:10

Removed lowerInclusive, upperInclusive params from Bucketizer, and us…

34f124a

…ed splits instead.

one more bucketizer test

1ca973a

Merge pull request #4 from jkbradley/yinxusen-SPARK-5893

dc8c843

Removed lowerInclusive, upperInclusive params from Bucketizer

asfgit closed this in 35fb42a May 12, 2015

[SPARK-5893][ML] Add bucketizer #5980

[SPARK-5893][ML] Add bucketizer #5980

Conversation

yinxusen commented May 7, 2015

AmplabJenkins commented May 7, 2015

yinxusen commented May 7, 2015

AmplabJenkins commented May 7, 2015

SparkQA commented May 7, 2015

yinxusen commented May 7, 2015

AmplabJenkins commented May 7, 2015

AmplabJenkins commented May 7, 2015

shaneknapp commented May 7, 2015

AmplabJenkins commented May 7, 2015

yinxusen commented May 7, 2015

AmplabJenkins commented May 7, 2015

shaneknapp commented May 7, 2015

SparkQA commented May 7, 2015

AmplabJenkins commented May 7, 2015

AmplabJenkins commented May 7, 2015

SparkQA commented May 7, 2015

SparkQA commented May 7, 2015

AmplabJenkins commented May 7, 2015

AmplabJenkins commented May 7, 2015

SparkQA commented May 7, 2015

AmplabJenkins commented May 7, 2015

AmplabJenkins commented May 7, 2015

jkbradley May 7, 2015

Choose a reason for hiding this comment

AmplabJenkins commented May 8, 2015

mengxr commented May 8, 2015

yinxusen commented May 8, 2015

mengxr commented May 8, 2015

jkbradley commented May 8, 2015

mengxr commented May 8, 2015

jkbradley commented May 8, 2015

mengxr commented May 8, 2015

jkbradley commented May 8, 2015

yinxusen commented May 11, 2015

jkbradley commented May 11, 2015

yinxusen commented May 11, 2015

AmplabJenkins commented May 11, 2015

AmplabJenkins commented May 11, 2015

SparkQA commented May 11, 2015

yinxusen commented May 11, 2015

jkbradley commented May 11, 2015

SparkQA commented May 11, 2015

AmplabJenkins commented May 11, 2015

AmplabJenkins commented May 11, 2015

jkbradley commented May 12, 2015