[SPARK-21685][PYTHON][ML] PySpark Params isSet state should not change after transform #18982
Conversation
Test build #80813 has finished for PR 18982 at commit
Thanks for working on this, two quick questions.
python/pyspark/ml/wrapper.py (Outdated)

```python
self._java_obj.set(pair)
if param in self._defaultParamMap:
```
Should this be an else if? No need to transfer the default value if we've explicitly set it to another value.
We usually make the assumption that Python defines the same default values as Java, in Spark ML at least, but given the circumstances of the JIRA - they defined their own Model - it's still possible for `hasDefault` or the default value to return something different from what Python would. So I'm just being overly cautious here, but it's pretty cheap to just transfer the default values anyway, right?
Sounds reasonable.
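For context, the distinction being discussed can be sketched with plain-Python stand-ins (the `MockJavaObj` class and the param names here are hypothetical illustrations, not the actual py4j objects): explicitly set params go through `set`, while defaults go through `setDefault`, so the Java side keeps its set-vs-default bookkeeping intact.

```python
class MockJavaObj:
    """Hypothetical stand-in for the py4j-wrapped Java params object."""

    def __init__(self):
        self.param_map = {}    # explicitly set values
        self.default_map = {}  # default values

    def set(self, name, value):
        self.param_map[name] = value

    def setDefault(self, name, value):
        self.default_map[name] = value

    def isSet(self, name):
        # A param counts as "set" only if it was explicitly set.
        return name in self.param_map


def transfer_params(py_param_map, py_default_map, java_obj):
    # Sketch of the split discussed above: explicit values use set(),
    # defaults use setDefault(), so isSet state is preserved.
    for name, value in py_param_map.items():
        java_obj.set(name, value)
    for name, value in py_default_map.items():
        java_obj.setDefault(name, value)


java_obj = MockJavaObj()
transfer_params({}, {"threshold": 0.0}, java_obj)
assert not java_obj.isSet("threshold")  # still a default, not explicitly set
assert java_obj.default_map["threshold"] == 0.0
```

This mirrors why transferring defaults through `setDefault` is harmless even when Java already has the same defaults: it overwrites a default with a default without flipping the set state.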
python/pyspark/ml/tests.py (Outdated)

```python
def test_preserve_set_state(self):
    model = Binarizer()
    self.assertFalse(model.isSet("threshold"))
    model._transfer_params_to_java()
```
Would it make sense to do an actual transform here instead of the two inner parts of the transform?
Yeah, it would be a little better to call the actual transform, but we would still need to call `_transfer_params_from_java` or check `isSet` with a direct call to Java via py4j. I was going to do this, but the `ParamTests` class doesn't already create a SparkSession - I'm sure it's just a small amount of overhead, but that's why I thought to just use `_transfer_params_to_java`.

Do you think it would be worth it to change `ParamTests` to inherit from `SparkSessionTestCase` so a session is created and I could make a DataFrame to transform?
I think that would be a reasonable thing to do. The slight increase in testing overhead is probably worth it: it keeps us from being too closely tied to the implementation details, and we already use `SparkSessionTestCase` in a lot of places.
Thanks for reviewing @holdenk ! You brought up some good points, let me know if you prefer me to change them.
@BryanCutler sorry for my slowness, I responded with a note on the tests. If you have a chance to update the test and update to master, I'd love to try and get this in.
…to-java-defaults-SPARK-21685
Test build #81484 has finished for PR 18982 at commit
Jenkins retest this please
Test build #81487 has finished for PR 18982 at commit
No problem @holdenk, I updated using
Hmmm, I can repeat the error with Python3, I'll look into it tomorrow
Cool, let me know if you want a hand reproing the error.
Test build #81574 has finished for PR 18982 at commit
@holdenk , the error was because I was calling
Test build #81577 has finished for PR 18982 at commit
retest this please
Test build #81647 has finished for PR 18982 at commit
Test build #83691 has finished for PR 18982 at commit
Can we update this to master?
…to-java-defaults-SPARK-21685
Test build #84035 has finished for PR 18982 at commit
Hi @holdenk , this is updated. Look ok to you?
…to-java-defaults-SPARK-21685
Test build #87228 has finished for PR 18982 at commit
So between #20410 & this one, which path do we want to go down?
LGTM
retest this please
From my rough reading and based upon what I know, this seems fine.
Test build #88369 has finished for PR 18982 at commit
…to-java-defaults-SPARK-21685
Test build #88452 has finished for PR 18982 at commit
cc @viirya too
retest this please
I think I figured out how to use params
Test build #88470 has finished for PR 18982 at commit
Test build #88475 has finished for PR 18982 at commit
Test build #88481 has finished for PR 18982 at commit
Thanks @holdenk and @HyukjinKwon ! I made a small change so that this can call the existing
```python
if len(pair_defaults) > 0:
    sc = SparkContext._active_spark_context
    pair_defaults_seq = sc._jvm.PythonUtils.toSeq(pair_defaults)
    self._java_obj.setDefault(pair_defaults_seq)
```
If the default params on the Java side and the Python side are the same, do we still need to set default params for the Java object? Aren't they already set in the Java object if they are default params?
My take is that while they should be the same, it's still possible they might not be. The user could extend their own classes, and defaults are quite easy to change in Python. Although we don't really support this, if there was a mismatch the user would probably just get bad results and it would be really hard to figure out why: from the Python API it would look like one value was in use, while Scala was actually using another.
If you all think it's overly cautious to do this, I can take it out. I just thought it would be cheap insurance to set these values regardless.
I think this is reasonable, a few extra lines to avoid potential unwanted user surprise is worth it.
LGTM
Merged to master.
Thanks @holdenk @HyukjinKwon and @viirya !
What changes were proposed in this pull request?

Currently, when a PySpark Model is transformed, default params that have not been explicitly set are then set on the Java side on the call to `wrapper._transfer_values_to_java`. This incorrectly changes the state of the Param, as it should still be marked as a default value only.

How was this patch tested?

Added a new test to verify that when transferring Params to Java, default params have their state preserved.
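The behavior being fixed can be illustrated with a plain-Python stand-in (the `MockParams` class and the `threshold` default are hypothetical stand-ins for the real PySpark/py4j objects): before the fix, pushing a default through `set` flips `isSet` to True; routing it through `setDefault` instead leaves the state unchanged.

```python
class MockParams:
    """Hypothetical stand-in for a Params object with one default value."""

    def __init__(self):
        self.param_map = {}                   # explicitly set values
        self.default_map = {"threshold": 0.0}  # default values

    def isSet(self, name):
        return name in self.param_map

    def set(self, name, value):
        self.param_map[name] = value

    def setDefault(self, name, value):
        self.default_map[name] = value


def buggy_transfer(model):
    # Old behavior: defaults were pushed through set(), flipping isSet.
    for name, value in list(model.default_map.items()):
        model.set(name, value)


def fixed_transfer(model):
    # Fixed behavior: defaults go through setDefault(), isSet unchanged.
    for name, value in list(model.default_map.items()):
        model.setDefault(name, value)


m1 = MockParams()
buggy_transfer(m1)
assert m1.isSet("threshold")       # bug: a default now looks explicitly set

m2 = MockParams()
fixed_transfer(m2)
assert not m2.isSet("threshold")   # fix: default-only state is preserved
```

The added test checks exactly this invariant on a real estimator: `isSet` must report the same answer before and after params are transferred to Java.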