Skip to content

[SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all cases#14826

Closed
srowen wants to merge 2 commits intoapache:masterfrom
srowen:SPARK-16832.2
Closed

[SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all cases#14826
srowen wants to merge 2 commits intoapache:masterfrom
srowen:SPARK-16832.2

Conversation

@srowen
Copy link
Member

@srowen srowen commented Aug 26, 2016

What changes were proposed in this pull request?

Related to #14524 -- just the 'fix' rather than a behavior change.

  • PythonMLlibAPI methods that take a seed now always take a java.lang.Long consistently, allowing the Python API to specify "no seed"
  • .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed
  • BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it.

How was this patch tested?

Existing tests

…ases. Standardize .mllib classes to deafult to seed=None (except bisecting KMeans)
@srowen srowen changed the title [SPARK-16382] [MLLIB] Standard Python-Java MLlib API to accept optional long seeds in all cases [SPARK-16832] [MLLIB] Standard Python-Java MLlib API to accept optional long seeds in all cases Aug 26, 2016
@SparkQA
Copy link

SparkQA commented Aug 26, 2016

Test build #64465 has finished for PR 14826 at commit 8ac4a9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

categoricalFeaturesInfo = categoricalFeaturesInfo.asScala.toMap)
val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)
// Only done because methods below want an int, not an optional Long
val intSeed = getSeedOrDefault(seed).toInt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of odd, but I see why you have it. Maybe it would be a little better to do
val intSeed = if (seed == null) Utils.random.nextInt else seed.toInt

That way it would avoid a slightly biased random Int

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to be consistent with the other places in the class that needed to maybe get a random seed. You're saying that the lower 32 bits won't necessarily be as random? I would think that theoretically they are. In practice we don't need cryptographic-strength guarantees here anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was trying to say that converting to a smaller number of bits would cause some be more likely than others, but I agree in practice it's not going to make any difference here.

@BryanCutler
Copy link
Member

BryanCutler commented Aug 26, 2016

Just a small comment, and I also vote for changing the the default MLlib BisectingKmeans to seed=None to be consistent. Other than that, LGTM!

@srowen
Copy link
Member Author

srowen commented Aug 29, 2016

@mengxr what do you think about this narrower change? just double-checking.

@srowen srowen changed the title [SPARK-16832] [MLLIB] Standard Python-Java MLlib API to accept optional long seeds in all cases [SPARK-17311] [MLLIB] Standardie Python-Java MLlib API to accept optional long seeds in all cases Aug 30, 2016
@srowen srowen changed the title [SPARK-17311] [MLLIB] Standardie Python-Java MLlib API to accept optional long seeds in all cases [SPARK-17311] [MLLIB] Standardie Python-Java MLlib API to accept optional long seeds in all casesz Aug 30, 2016
@srowen srowen changed the title [SPARK-17311] [MLLIB] Standardie Python-Java MLlib API to accept optional long seeds in all casesz [SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all casesz Aug 30, 2016
@srowen srowen changed the title [SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all casesz [SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all cases Aug 30, 2016
.setK(k)
.setMaxIterations(maxIterations)
.setMinDivisibleClusterSize(minDivisibleClusterSize)
.setSeed(seed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove old line to set seed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, right. I had one thing to get right when I ported this change and ...

@jkbradley
Copy link
Member

@srowen I like your decisions. Just the one issue as far as I can see.

@jkbradley
Copy link
Member

Actually, was this a problem before? With the current master, I am able to avoid setting a seed in PySpark KMeans, and doing so gives me a different result on each call.

@srowen
Copy link
Member Author

srowen commented Aug 31, 2016

This is bisecting k-means? and the .mllib version? I think that's all this change could affect. All of the .ml classes have their own higher-level seed handling mechanism that would randomly pick a seed and send that through this API.

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64714 has finished for PR 14826 at commit ae248c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Sep 3, 2016

Assuming my interpretation of the last comment was right, I think this is good to go. Even if somehow it 'worked' before this is a cleaner implementation of the same behavior in that case.

@srowen srowen closed this Sep 4, 2016
@srowen srowen deleted the SPARK-16832.2 branch September 4, 2016 11:41
asfgit pushed a commit that referenced this pull request Sep 4, 2016
…onal long seeds in all cases

## What changes were proposed in this pull request?

Related to #14524 -- just the 'fix' rather than a behavior change.

- PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed"
- .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed
- BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14826 from srowen/SPARK-16832.2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants