Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5893][ML] Add bucketizer #5980

Closed
wants to merge 13 commits into from
Closed

Conversation

yinxusen
Copy link
Contributor

@yinxusen yinxusen commented May 7, 2015

JIRA issue here.

One thing to make clear, the buckets parameter, which is an array of Double, performs as split points. Say,

buckets = Array(-0.5, 0.0, 0.5)

splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.

@AmplabJenkins
Copy link

Merged build triggered.

@yinxusen yinxusen changed the title [SPARK-5893] [SPARK-5893][ML] Add bucketizer May 7, 2015
@yinxusen
Copy link
Contributor Author

yinxusen commented May 7, 2015

@jkbradley

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 7, 2015

Test build #32110 has started for PR 5980 at commit 998bc87.

@yinxusen
Copy link
Contributor Author

yinxusen commented May 7, 2015

@jkbradley, I will push the refactored #5779 as soon as this PR merged.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32110/
Test FAILed.

@shaneknapp
Copy link
Contributor

i'll retrigger this build once jenkins is back up...

@AmplabJenkins
Copy link

Merged build triggered.

@yinxusen
Copy link
Contributor Author

yinxusen commented May 7, 2015

I change the Bucketizer from a Transform to a Model, otherwise we can not use it as the output of an Estimator.

@AmplabJenkins
Copy link

Merged build started.

@shaneknapp
Copy link
Contributor

jenkins, test this please

@SparkQA
Copy link

SparkQA commented May 7, 2015

Test build #32114 has started for PR 5980 at commit 11fb00a.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 7, 2015

Test build #32115 has started for PR 5980 at commit 11fb00a.

@SparkQA
Copy link

SparkQA commented May 7, 2015

Test build #32114 has finished for PR 5980 at commit 11fb00a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class Bucketizer(override val parent: Estimator[Bucketizer] = null)

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32114/
Test PASSed.

@SparkQA
Copy link

SparkQA commented May 7, 2015

Test build #32115 has finished for PR 5980 at commit 11fb00a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class Bucketizer(override val parent: Estimator[Bucketizer] = null)

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32115/
Test PASSed.

* `Bucketizer` maps a column of continuous features to a column of feature buckets.
*/
@AlphaComponent
final class Bucketizer(override val parent: Estimator[Bucketizer] = null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use 2 constructors (1 with an argument, 1 without) to be Java-friendly. Also, can we make the constructor with 1 argument be private[ml] for now?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32193/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented May 8, 2015

@yinxusen @jkbradley Instead of having three parameters to control the buckets, is it simpler to let the users provide the boundaries directly? For example, [0, 1, 2, ..., 24] gives 24 buckets. If users want to include values out of this range, they should provide [-inf, 0, 1, 2, 3, ..., 24, inf] instead. This would also simplify the implementation.

@yinxusen
Copy link
Contributor Author

yinxusen commented May 8, 2015

@mengxr How to represent -inf and inf? One possible solution is

val inf = Double.MaxValue
val negInf = Double.MinValue
splits = Array(negInf, 0, 1, 2, 3, inf)

Or we can define inf and negInf as double parameters, without setter.

val inf = new DoubleParam(this, "inf", "")
setDefault(inf -> Double.MaxValue)
def getInf = $(inf)
val negInf = new DoubleParam(this, "negInf", "")
setDefault(negInf -> Double.MinValue)
def getNegInf = $(negInf)

However, it would break the meaning of splits, I think.

@mengxr
Copy link
Contributor

mengxr commented May 8, 2015

inf is in the IEEE standard for floating-point numbers:

@jkbradley
Copy link
Member

@mengxr I thought about this, but here's the comment I posted above:

One issue was brought up: What about ranges of buckets where the ends (-inf, firstSplit] and (lastSplit, inf) don't make sense? E.g., for ages, we don't want a range (-inf, 0]. We could include this in a few ways:

(1) Make users include -inf and/or inf in the splits if they want those ranges.
(2) Have Boolean parameters lowerInclusive, upperInclusive to indicate whether to include those ranges.

I prefer the Boolean parameters with defaults of true. That will make it easy to have the default behavior accept all numerical values. When the range does not to to -inf and inf, then we'll have to throw an error and report the bad value.

@mengxr
Copy link
Contributor

mengxr commented May 8, 2015

Sorry that I didn't see the hidden GitHub comments. I think it depends on the use cases of bucketizer. My guess is that in most cases users use bucketizer on values that they already know the range. For example, time -> hourOfDay, dayOfMonth, or age -> 10-year buckets. In those cases, accepting (-inf, firstSplit] and (lastSplit, inf) by default may result empty buckets. Another issue is what the expected behavior is for values out of range. Should they go to a new bucket, or be merged into the closest bucket? Letting users explicitly specify the buckets would be able to avoid those problems.

@jkbradley
Copy link
Member

It's hard to say for use cases. It's true empty buckets might be a problem. I'm OK with requiring explicit buckets for -inf, inf; let's just do that, but we'll have to make sure we document that in the Scala doc + examples. For values out of range, there are several behaviors the user might want:

  • new bucket for all bad values
  • closest bucket
  • throw an error (to make user aware of bad data or assumptions)

We could eventually support all of these via a new parameter, but for now, shall we choose 1? I suppose I'd vote for putting values in the closest bucket (which will mean the first and last split values will be changed to -inf, inf under the hood).

@mengxr
Copy link
Contributor

mengxr commented May 8, 2015

Yes, we need document all cases. I kinda like 3 (throw an error) as the default, because it is the most strict. If there is an exception, users should decide to switch to 1 or 2 by changing the bucket boundaries.

@jkbradley
Copy link
Member

@yinxusen OK, so I think we've converged. Can you please update it to take splits, but no "lowerInclusive, upperInclusive" parameters? Let's start with 3 (throw an error) for handling bad values. It will be great if the driver catches the SparkException and prints a more useful error message, recommending the user set the bin boundaries to -inf, inf.

A later PR can add an extra parameter for supporting all 3 behaviors.

@yinxusen
Copy link
Contributor Author

@mengxr @jkbradley I see. I will be soon, say, in 2 days. It's really a little busy here recently. Sorry for the delay.

@jkbradley
Copy link
Member

@yinxusen I'm worried about this missing 1.4. Would you mind if I sent a PR to you to update this PR?

@yinxusen
Copy link
Contributor Author

@jkbradley Yes, thanks! Sorry for the delay again.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 11, 2015

Test build #32430 has started for PR 5980 at commit dc8c843.

@yinxusen
Copy link
Contributor Author

@jkbradley I have merged your PR. Thanks!

@jkbradley
Copy link
Member

@yinxusen Thanks! I'll merge once tests pass

@SparkQA
Copy link

SparkQA commented May 11, 2015

Test build #32430 has finished for PR 5980 at commit dc8c843.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32430/
Test PASSed.

@jkbradley
Copy link
Member

Merging into master and branch-1.4. @yinxusen Thanks!

asfgit pushed a commit that referenced this pull request May 12, 2015
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).

One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say,

```scala
buckets = Array(-0.5, 0.0, 0.5)
```

splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:

dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
1ca973a [Joseph K. Bradley] one more bucketizer test
34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
eacfcfa [Xusen Yin] change ML attribute from splits into buckets
c3cc770 [Xusen Yin] add more unit test for binary search
3a16cc2 [Xusen Yin] refine comments and names
ac77859 [Xusen Yin] fix style error
fb30d79 [Xusen Yin] fix and test binary search
2466322 [Xusen Yin] refactor Bucketizer
11fb00a [Xusen Yin] change it into an Estimator
998bc87 [Xusen Yin] check buckets
4024cf1 [Xusen Yin] add test suite
5fe190e [Xusen Yin] add bucketizer

(cherry picked from commit 35fb42a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in 35fb42a May 12, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).

One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say,

```scala
buckets = Array(-0.5, 0.0, 0.5)
```

splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5980 from yinxusen/SPARK-5893 and squashes the following commits:

dc8c843 [Xusen Yin] Merge pull request apache#4 from jkbradley/yinxusen-SPARK-5893
1ca973a [Joseph K. Bradley] one more bucketizer test
34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
eacfcfa [Xusen Yin] change ML attribute from splits into buckets
c3cc770 [Xusen Yin] add more unit test for binary search
3a16cc2 [Xusen Yin] refine comments and names
ac77859 [Xusen Yin] fix style error
fb30d79 [Xusen Yin] fix and test binary search
2466322 [Xusen Yin] refactor Bucketizer
11fb00a [Xusen Yin] change it into an Estimator
998bc87 [Xusen Yin] check buckets
4024cf1 [Xusen Yin] add test suite
5fe190e [Xusen Yin] add bucketizer
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).

One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say,

```scala
buckets = Array(-0.5, 0.0, 0.5)
```

splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5980 from yinxusen/SPARK-5893 and squashes the following commits:

dc8c843 [Xusen Yin] Merge pull request apache#4 from jkbradley/yinxusen-SPARK-5893
1ca973a [Joseph K. Bradley] one more bucketizer test
34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
eacfcfa [Xusen Yin] change ML attribute from splits into buckets
c3cc770 [Xusen Yin] add more unit test for binary search
3a16cc2 [Xusen Yin] refine comments and names
ac77859 [Xusen Yin] fix style error
fb30d79 [Xusen Yin] fix and test binary search
2466322 [Xusen Yin] refactor Bucketizer
11fb00a [Xusen Yin] change it into an Estimator
998bc87 [Xusen Yin] check buckets
4024cf1 [Xusen Yin] add test suite
5fe190e [Xusen Yin] add bucketizer
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).

One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say,

```scala
buckets = Array(-0.5, 0.0, 0.5)
```

splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5980 from yinxusen/SPARK-5893 and squashes the following commits:

dc8c843 [Xusen Yin] Merge pull request apache#4 from jkbradley/yinxusen-SPARK-5893
1ca973a [Joseph K. Bradley] one more bucketizer test
34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
eacfcfa [Xusen Yin] change ML attribute from splits into buckets
c3cc770 [Xusen Yin] add more unit test for binary search
3a16cc2 [Xusen Yin] refine comments and names
ac77859 [Xusen Yin] fix style error
fb30d79 [Xusen Yin] fix and test binary search
2466322 [Xusen Yin] refactor Bucketizer
11fb00a [Xusen Yin] change it into an Estimator
998bc87 [Xusen Yin] check buckets
4024cf1 [Xusen Yin] add test suite
5fe190e [Xusen Yin] add bucketizer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants