[SPARK-17311] [MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all cases #14826
srowen wants to merge 2 commits into apache:master
…ases. Standardize .mllib classes to default to seed=None (except bisecting KMeans)
Test build #64465 has finished for PR 14826 at commit
categoricalFeaturesInfo = categoricalFeaturesInfo.asScala.toMap)
val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)
// Only done because methods below want an int, not an optional Long
val intSeed = getSeedOrDefault(seed).toInt
This is kind of odd, but I see why you have it. Maybe it would be a little better to do:
val intSeed = if (seed == null) Utils.random.nextInt else seed.toInt
That way it would avoid a slightly biased random Int.
I was hoping to be consistent with the other places in the class that needed to maybe get a random seed. You're saying that the lower 32 bits won't necessarily be as random? I would think that theoretically they are. In practice we don't need cryptographic-strength guarantees here anyway.
Yeah, I was trying to say that converting to a smaller number of bits would cause some values to be more likely than others, but I agree that in practice it's not going to make any difference here.
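The bias question in this thread can be checked on a small scale. The sketch below is an illustration only, not code from the PR: `to_int32` is a hypothetical helper that emulates what a JVM `Long`-to-`Int` narrowing such as `.toInt` does (keep the low 32 bits, two's complement), and `count_preimages` is a hypothetical helper that runs a scaled-down analogue (8 bits narrowed to 4) to count how many wide values map to each narrow value. It assumes the wide value is uniformly distributed to begin with.

```python
import random


def to_int32(value: int) -> int:
    """Emulate a JVM Long-to-Int narrowing: keep the low 32 bits (two's complement)."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


def count_preimages(wide_bits: int, narrow_bits: int) -> dict:
    """Count how many wide_bits-wide values land on each narrow_bits-wide value."""
    mask = (1 << narrow_bits) - 1
    counts = {}
    for value in range(1 << wide_bits):
        key = value & mask  # drop the high bits, as narrowing does
        counts[key] = counts.get(key, 0) + 1
    return counts


# Scaled-down analogue of 64 -> 32 bits: narrow every 8-bit value to 4 bits.
counts = count_preimages(8, 4)
# Every 4-bit result is produced by the same number of 8-bit inputs.
assert set(counts.values()) == {16}

# Narrowing a random signed 64-bit seed always yields a valid signed 32-bit value.
long_seed = random.getrandbits(64) - (1 << 63)
int_seed = to_int32(long_seed)
assert -(1 << 31) <= int_seed < (1 << 31)
```

Because each narrow value has the same number of preimages, truncating a uniformly random long does not by itself favor any int, which matches the conclusion the thread reaches.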
Just a small comment, and I also vote for changing the default MLlib BisectingKmeans to

@mengxr what do you think about this narrower change? just double-checking.
.setK(k)
.setMaxIterations(maxIterations)
.setMinDivisibleClusterSize(minDivisibleClusterSize)
.setSeed(seed)
Ugh, right. I had one thing to get right when I ported this change and ...
@srowen I like your decisions. Just the one issue as far as I can see.

Actually, was this a problem before? With the current master, I am able to avoid setting a seed in PySpark KMeans, and doing so gives me a different result on each call.

This is bisecting k-means? and the .mllib version? I think that's all this change could affect. All of the .ml classes have their own higher-level seed handling mechanism that would randomly pick a seed and send that through this API.
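For readers following the "no seed" discussion, here is a minimal sketch of the intended semantics, under stated assumptions. Nothing in it is Spark code: `seed_or_default` is a hypothetical stand-in for a JVM-side fallback that picks a seed when none is given, and `train_stub` is a toy function standing in for a training call. The random generator used is Python's, not whatever Spark uses internally.

```python
import random
from typing import Optional


def seed_or_default(seed: Optional[int]) -> int:
    """Hypothetical fallback: None means 'pick a seed for me'."""
    if seed is None:
        # Any fresh value will do here; this is not Spark's actual generator.
        return random.getrandbits(63)
    return seed


def train_stub(data, k: int, seed: Optional[int] = None):
    """Toy 'training' call: choose k initial centers using the resolved seed."""
    rng = random.Random(seed_or_default(seed))
    return sorted(rng.sample(data, k))


points = list(range(100))

# An explicit seed gives the same result on every call.
assert train_stub(points, 3, seed=42) == train_stub(points, 3, seed=42)

# Omitting the seed still works; the result is simply not pinned down.
assert len(train_stub(points, 3)) == 3
```

The check uses `seed is None` rather than truthiness so that an explicit seed of 0 is honored instead of being treated as "no seed".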

Test build #64714 has finished for PR 14826 at commit

Assuming my interpretation of the last comment was right, I think this is good to go. Even if somehow it 'worked' before this is a cleaner implementation of the same behavior in that case.
…onal long seeds in all cases

## What changes were proposed in this pull request?

Related to #14524 -- just the 'fix' rather than a behavior change.

- PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed"
- .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed
- BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14826 from srowen/SPARK-16832.2.