[SPARK-21685][PYTHON][ML] PySpark Params isSet state should not change after transform #18982

BryanCutler
Member

What changes were proposed in this pull request?

Currently, when a PySpark Model is transformed, default params that have not been explicitly set are set on the Java side by the call to wrapper._transfer_params_to_java. This incorrectly changes the state of the Param, which should still be marked as holding only a default value.

How was this patch tested?

Added a new test to verify that when transferring Params to Java, default params have their state preserved.
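
To make the reported behavior concrete, here is a minimal pure-Python sketch of the state change. It does not use Spark: `ToyJavaParams`, `ToyPythonParams`, and all of their methods are hypothetical stand-ins that only model a "set" map and a "default" map on each side, under the assumption that the old transfer pushed defaults with a plain set call.

```python
class ToyJavaParams:
    """Hypothetical stand-in for the JVM-side params object."""

    def __init__(self):
        self.param_map = {}    # values that count as explicitly set
        self.default_map = {}  # values that count as defaults only

    def set(self, name, value):
        self.param_map[name] = value

    def set_default(self, name, value):
        self.default_map[name] = value


class ToyPythonParams:
    """Hypothetical stand-in for the Python-side wrapper."""

    def __init__(self, defaults):
        self.param_map = {}
        self.default_map = dict(defaults)
        self.java_obj = ToyJavaParams()

    def is_set(self, name):
        return name in self.param_map

    def transfer_to_java_old(self):
        # Assumed old behavior: every defined value, default or not,
        # is pushed with set(), so defaults become "set" in the JVM.
        for name, default in self.default_map.items():
            self.java_obj.set(name, self.param_map.get(name, default))

    def transfer_to_java_fixed(self):
        # Behavior the PR aims for: set values and defaults travel separately.
        for name, value in self.param_map.items():
            self.java_obj.set(name, value)
        for name, value in self.default_map.items():
            self.java_obj.set_default(name, value)

    def transfer_from_java(self):
        # Copy back whatever the JVM side reports as explicitly set.
        for name, value in self.java_obj.param_map.items():
            self.param_map[name] = value


old = ToyPythonParams({"threshold": 0.0})
old.transfer_to_java_old()
old.transfer_from_java()
print(old.is_set("threshold"))  # True: the default was promoted to "set"

new = ToyPythonParams({"threshold": 0.0})
new.transfer_to_java_fixed()
new.transfer_from_java()
print(new.is_set("threshold"))  # False: the default stays a default
```

The round trip through `transfer_from_java` is what makes the promoted default visible again on the Python side in this toy model.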

@SparkQA

SparkQA commented Aug 18, 2017

Test build #80813 has finished for PR 18982 at commit 1203056.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

cc @jkbradley @holdenk

Contributor

@holdenk holdenk left a comment

Thanks for working on this, two quick questions.

        self._java_obj.set(pair)
    if param in self._defaultParamMap:
Contributor

Should this be an else if? No need to transfer the default value if we've explicitly set it to another value.

Member Author

We usually assume that Python defines the same default values as Java, in Spark ML at least. But given the circumstances of the JIRA, where they defined their own Model, it's still possible for hasDefault or the default value to return something different than Python would. So I'm just being overly cautious here, but it's pretty cheap to transfer the default values anyway, right?

Contributor

Sounds reasonable.
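
The difference between the `if` and the suggested `else if` can be sketched without Spark. In the toy model below, the dict-based "Java object", the `transfer` and `make_java` helpers, and the mismatched default values are all hypothetical and only illustrate the scenario from the JIRA, where a user-defined model has a different default on the Java side.

```python
def transfer(py_set, py_defaults, java, always_send_defaults):
    """Push Python-side param state into a dict-based stand-in for the Java object."""
    for name in set(py_set) | set(py_defaults):
        if name in py_set:
            java["set"][name] = py_set[name]
            if not always_send_defaults:
                continue  # the "else if" variant: skip the default for set params
        if name in py_defaults:
            java["default"][name] = py_defaults[name]


def make_java():
    # Hypothetical Java-side model whose default differs from the Python default.
    return {"set": {}, "default": {"threshold": 0.0}}


py_set = {"threshold": 1.0}       # the user explicitly set the param
py_defaults = {"threshold": 0.5}  # Python-side default

java_elif = make_java()
transfer(py_set, py_defaults, java_elif, always_send_defaults=False)

java_if = make_java()
transfer(py_set, py_defaults, java_if, always_send_defaults=True)

print(java_elif["default"]["threshold"])  # 0.0: the Java default stays out of sync
print(java_if["default"]["threshold"])    # 0.5: the Java default now matches Python
```

In both variants the explicitly set value wins while it is set. The stale default only matters if the set value is later cleared, which is the kind of hard-to-trace mismatch the extra transfer guards against.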

def test_preserve_set_state(self):
    model = Binarizer()
    self.assertFalse(model.isSet("threshold"))
    model._transfer_params_to_java()
Contributor

Would it make sense to do an actual transform here instead of the two inner parts of the transform?

Member Author

Yeah, it would be a little better to call the actual transform, but we would still need to call _transfer_params_from_java or check isSet with a direct call to Java via py4j. I was going to do this, but the ParamTests class doesn't already create a SparkSession. I'm sure it's just a small amount of overhead, but that's why I thought to just use _transfer_params_to_java.

Do you think it would be worth it to change ParamTests to inherit from SparkSessionTestCase so a session is created and I could make a DataFrame to transform?

Contributor

I think that would be a reasonable thing to do. The slight increase in testing overhead is probably worth it: it keeps us from being too closely tied to the implementation details, and we already use SparkSessionTestCase in a lot of places.

@BryanCutler
Member Author

Thanks for reviewing @holdenk ! You brought up some good points, let me know if you prefer me to change them.

@holdenk
Contributor

holdenk commented Sep 6, 2017

@BryanCutler sorry for my slowness, I responded with a note on the tests. If you have a chance to update the test and update to master, I'd love to try and get this in.

@SparkQA

SparkQA commented Sep 7, 2017

Test build #81484 has finished for PR 18982 at commit 482c025.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

Jenkins retest this please

@SparkQA

SparkQA commented Sep 7, 2017

Test build #81487 has finished for PR 18982 at commit 482c025.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

No problem @holdenk, I updated using transform() on the test. See if it looks ok to you now (pending Jenkins). Thanks!

@BryanCutler
Member Author

Hmmm, I can repeat the error with Python3, I'll look into it tomorrow

@holdenk
Contributor

holdenk commented Sep 8, 2017

Cool, let me know if you want a hand reproing the error.

@SparkQA

SparkQA commented Sep 9, 2017

Test build #81574 has finished for PR 18982 at commit 1a9fa46.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

@holdenk , the error was because I was calling setDefault with a separate Param and value. So for something like a seed param, the Long value would get converted to an Int in the exchange to Java if using Python 3 and then cause a cast error. The only solution I could find was to create another method in Scala that takes a single ParamPair value because I just couldn't get the existing setDefault for a list of ParamPairs to work. Let me know if this sounds ok or you have another idea. Thanks!
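
The cast error described above can be approximated with a small pure-Python simulation. It assumes, as the comment describes, that a Python 3 int small enough for 32 bits arrives on the Java side as an Integer rather than a Long. Every name here (`boxed_type_for`, `ToyLongParam`, `set_default_separately`) is hypothetical and only mimics the shape of the problem, not the actual py4j or Spark code.

```python
JAVA_INT_MIN, JAVA_INT_MAX = -2**31, 2**31 - 1


def boxed_type_for(value):
    """Approximate how a bare Python 3 int is boxed when it crosses the bridge."""
    if JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        return "Integer"
    return "Long"


class ToyLongParam:
    """Hypothetical Java-side param that expects a Long value (for example, a seed)."""

    name = "seed"

    def w(self, value):
        # Building the pair through the typed param lets that side coerce
        # the value to the expected type before it is stored.
        return (self.name, "Long", int(value))


def set_default_separately(param, value):
    # Mimics passing the param and the value as two separate arguments:
    # the boxed type is decided by the bridge, not by the param.
    boxed = boxed_type_for(value)
    if boxed != "Long":
        raise TypeError("cannot cast %s to Long" % boxed)
    return (param.name, boxed, value)


seed = 42
try:
    set_default_separately(ToyLongParam(), seed)
    failed = False
except TypeError:
    failed = True

pair = ToyLongParam().w(seed)
print(failed)  # True: the small int was boxed as Integer
print(pair)    # ('seed', 'Long', 42)
```

This is why sending a single pre-built pair avoids the problem in the sketch: the value's type is fixed by the param before the default is registered.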

@SparkQA

SparkQA commented Sep 9, 2017

Test build #81577 has finished for PR 18982 at commit 088ee52.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

retest this please

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81647 has finished for PR 18982 at commit 088ee52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 10, 2017

Test build #83691 has finished for PR 18982 at commit 088ee52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Nov 18, 2017

Can we update this to master?

@SparkQA

SparkQA commented Nov 20, 2017

Test build #84035 has finished for PR 18982 at commit bef3fb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

Hi @holdenk , this is updated. Look ok to you?

@BryanCutler
Member Author

ping @MLnick @holdenk , seems like this issue popped up a couple times recently. Can you please take a look here?

@SparkQA

SparkQA commented Feb 8, 2018

Test build #87228 has finished for PR 18982 at commit 339c793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Mar 16, 2018

So between #20410 & this one which path do we want to go down?

Contributor

@holdenk holdenk left a comment

LGTM

@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

From my rough reading and based upon what I know, this seems fine.

@SparkQA

SparkQA commented Mar 19, 2018

Test build #88369 has finished for PR 18982 at commit 339c793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2018

Test build #88452 has finished for PR 18982 at commit 9162944.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

cc @viirya too

@HyukjinKwon
Member

retest this please

@BryanCutler
Member Author

I think I figured out how to call the Params method setDefault(paramPairs: ParamPair[_]*) from Python, so we won't need to add setDefaultPair and the MiMa exclusion. Let me just do a little more testing to make sure.

@SparkQA

SparkQA commented Mar 21, 2018

Test build #88470 has finished for PR 18982 at commit 9162944.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2018

Test build #88475 has finished for PR 18982 at commit 24a1dbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2018

Test build #88481 has finished for PR 18982 at commit 2eaf1a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

Thanks @holdenk and @HyukjinKwon ! I made a small change so that this can call the existing setDefault method and can avoid adding anything new to the Scala side. It's a little less clean, but I think better not to add anything if it can be avoided for this case. Please take another look when you can.

if len(pair_defaults) > 0:
    sc = SparkContext._active_spark_context
    pair_defaults_seq = sc._jvm.PythonUtils.toSeq(pair_defaults)
    self._java_obj.setDefault(pair_defaults_seq)
Member

@viirya viirya Mar 22, 2018

If the default params are the same on the Java side and the Python side, do we still need to set default params for the Java object? Aren't they already set in the Java object if they are default params?

Member Author

My take is that while they should be the same, it's still possible they might not be. The user could extend their own classes, or it's quite easy to change in Python. Although we don't really support this, if there was a mismatch the user would probably just get bad results and it would be really hard to figure out why. From the Python API, it would look like one value was in use while Scala was actually using another.

If you all think it's overly cautious to do this, I can take it out. I just thought it would be cheap insurance to just set these values regardless.

Contributor

I think this is reasonable, a few extra lines to avoid potential unwanted user surprise is worth it.

Contributor

@holdenk holdenk left a comment

LGTM

@holdenk
Contributor

holdenk commented Mar 23, 2018

Merged to master.

@asfgit asfgit closed this in cb43bbe Mar 23, 2018
@BryanCutler
Member Author

Thanks @holdenk @HyukjinKwon and @viirya !

@BryanCutler BryanCutler deleted the pyspark-ml-param-to-java-defaults-SPARK-21685 branch November 19, 2018 05:47
5 participants