[SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all cases by srowen · Pull Request #14826 · apache/spark

srowen · 2016-08-26T10:50:42Z

What changes were proposed in this pull request?

Related to #14524 -- just the 'fix' rather than a behavior change.

- PythonMLlibAPI methods that take a seed now always take a java.lang.Long consistently, allowing the Python API to specify "no seed".
- .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it now defaults to None, meaning the Scala implementation picks a seed.
- BisectingKMeansModel arguably should not hard-code a seed, for consistency with .mllib, I think. However, I left it.
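The "optional long seed" convention described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `seedOrRandom` is a hypothetical stand-in for a helper such as `getSeedOrDefault`, whose body is not shown in this thread, and it assumes that a Python `None` arrives on the JVM side as a `null` `java.lang.Long`.

```scala
import java.lang.{Long => JLong}
import scala.util.Random

object OptionalSeedSketch {
  // Hypothetical helper (assumption): a null boxed Long means "no seed given",
  // so the JVM side picks a random one; otherwise the caller's value is used.
  def seedOrRandom(seed: JLong): Long =
    if (seed == null) Random.nextLong() else seed.longValue()
}
```

Using the boxed `java.lang.Long` rather than a primitive `Long` is what makes "no seed" representable at all, since a primitive parameter cannot be null.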

How was this patch tested?

Existing tests


SparkQA · 2016-08-26T11:47:02Z

Test build #64465 has finished for PR 14826 at commit 8ac4a9b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2016-08-26T17:49:25Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

      categoricalFeaturesInfo = categoricalFeaturesInfo.asScala.toMap)
    val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)
+    // Only done because methods below want an int, not an optional Long
+    val intSeed = getSeedOrDefault(seed).toInt


This is kind of odd, but I see why you have it. Maybe it would be a little better to do:

    val intSeed = if (seed == null) Utils.random.nextInt else seed.toInt

That way it would avoid a slightly biased random Int.

I was hoping to be consistent with the other places in the class that needed to maybe get a random seed. You're saying that the lower 32 bits won't necessarily be as random? I would think that theoretically they are. In practice we don't need cryptographic-strength guarantees here anyway.

Yeah, I was trying to say that converting to a smaller number of bits would cause some values to be more likely than others, but I agree that in practice it's not going to make any difference here.
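For reference, the two alternatives discussed in this thread can be put side by side. The sketch below is an illustration under assumptions: the object and method names are hypothetical, and a plain `scala.util.Random` stands in for whatever generator the real code uses. `Long.toInt` keeps the low 32 bits of the value; for an ideally uniform 64-bit value, every 32-bit pattern is the low half of exactly 2^32 longs, so the truncation by itself would not favor any Int.

```scala
import java.lang.{Long => JLong}
import scala.util.Random

object IntSeedSketch {
  // Option A: resolve a Long seed first, then narrow it with toInt
  // (keeps only the low 32 bits).
  def viaLong(seed: JLong, rng: Random): Int = {
    val longSeed = if (seed == null) rng.nextLong() else seed.longValue()
    longSeed.toInt
  }

  // Option B: draw an Int directly when no seed was supplied.
  def viaInt(seed: JLong, rng: Random): Int =
    if (seed == null) rng.nextInt() else seed.longValue().toInt
}
```

Both variants return the same Int whenever a seed is supplied; they differ only in how the random fallback is drawn.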

BryanCutler · 2016-08-26T17:51:04Z

Just a small comment, and I also vote for changing the default MLlib BisectingKMeans to seed=None to be consistent. Other than that, LGTM!

srowen · 2016-08-29T08:54:44Z

@mengxr what do you think about this narrower change? just double-checking.

jkbradley · 2016-08-31T02:13:48Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

      .setK(k)
      .setMaxIterations(maxIterations)
      .setMinDivisibleClusterSize(minDivisibleClusterSize)
      .setSeed(seed)


remove old line to set seed

Ugh, right. I had one thing to get right when I ported this change and ...
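The final diff is not shown in this thread, but the pattern the comment points at can be sketched as follows: once the seed is optional, it should be applied conditionally, and an unconditional `.setSeed(seed)` left in the builder chain would defeat that. Everything here is hypothetical (`ClustererSketch` is not a Spark class, and the default value is arbitrary).

```scala
import java.lang.{Long => JLong}

// Hypothetical estimator with its own default seed (assumption, not Spark code).
class ClustererSketch {
  private var seed: Long = 12345L // arbitrary illustrative default
  def setSeed(value: Long): this.type = { seed = value; this }
  def getSeed: Long = seed
}

object ConfigureSketch {
  def configure(seed: JLong): ClustererSketch = {
    val clusterer = new ClustererSketch
    // Override the default only when the caller actually supplied a seed.
    if (seed != null) clusterer.setSeed(seed.longValue())
    clusterer
  }
}
```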

jkbradley · 2016-08-31T02:20:19Z

@srowen I like your decisions. Just the one issue as far as I can see.

jkbradley · 2016-08-31T02:31:44Z

Actually, was this a problem before? With the current master, I am able to avoid setting a seed in PySpark KMeans, and doing so gives me a different result on each call.

srowen · 2016-08-31T10:07:37Z

Is this bisecting k-means, and the .mllib version? I think that's all this change could affect. All of the .ml classes have their own higher-level seed handling mechanism that would randomly pick a seed and send that through this API.

SparkQA · 2016-08-31T11:09:54Z

Test build #64714 has finished for PR 14826 at commit ae248c2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-03T09:03:00Z

Assuming my interpretation of the last comment was right, I think this is good to go. Even if it somehow 'worked' before, this is a cleaner implementation of the same behavior in that case.

…onal long seeds in all cases

## What changes were proposed in this pull request?

Related to #14524 -- just the 'fix' rather than a behavior change.

- PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed"
- .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed
- BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14826 from srowen/SPARK-16832.2.

Standard Python-Java MLlib API to accept optional long seeds in all c…

8ac4a9b

…ases. Standardize .mllib classes to default to seed=None (except bisecting KMeans)

srowen mentioned this pull request Aug 26, 2016

[SPARK-16832] [ML] [WIP] CrossValidator and TrainValidationSplit are not random without seed #14524

Closed

srowen changed the title ~~[SPARK-16382] [MLLIB] Standard Python-Java MLlib API to accept optional long seeds in all cases~~ [SPARK-16832] [MLLIB] Standard Python-Java MLlib API to accept optional long seeds in all cases Aug 26, 2016

BryanCutler reviewed Aug 26, 2016

srowen changed the title ~~[SPARK-16832] [MLLIB] Standard Python-Java MLlib API to accept optional long seeds in all cases~~ [SPARK-17311] [MLLIB] Standardie Python-Java MLlib API to accept optional long seeds in all cases Aug 30, 2016

srowen changed the title ~~[SPARK-17311] [MLLIB] Standardie Python-Java MLlib API to accept optional long seeds in all cases~~ [SPARK-17311] [MLLIB] Standardie Python-Java MLlib API to accept optional long seeds in all casesz Aug 30, 2016

srowen changed the title ~~[SPARK-17311] [MLLIB] Standardie Python-Java MLlib API to accept optional long seeds in all casesz~~ [SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all casesz Aug 30, 2016

srowen changed the title ~~[SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all casesz~~ [SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all cases Aug 30, 2016

jkbradley reviewed Aug 31, 2016

Remove stale call to setSeed

ae248c2

srowen closed this Sep 4, 2016

srowen deleted the SPARK-16832.2 branch September 4, 2016 11:41
