[SPARK-21685][PYTHON][ML] PySpark Params isSet state should not change after transform #18982
Conversation
Test build #80813 has finished for PR 18982 at commit
Thanks for working on this, two quick questions.
python/pyspark/ml/wrapper.py (Outdated)

```python
self._java_obj.set(pair)
if param in self._defaultParamMap:
```
Should this be an else if? No need to transfer the default value if we've explicitly set it to another value.
We usually make the assumption that Python defines the same default values as Java, in Spark ML at least, but given the circumstances of the JIRA - they defined their own Model - it's still possible for `hasDefault` or the default value to return something different from what Python would. So I'm just being overly cautious here, but it's pretty cheap to just transfer the default values anyway, right?
Sounds reasonable.
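For context, the distinction being discussed can be sketched with plain-Python stand-ins (the `MockJavaObj` class and the param names here are hypothetical illustrations, not the actual py4j objects): explicitly set params go through `set`, while defaults go through `setDefault`, so the Java side keeps its set-vs-default bookkeeping intact.

```python
class MockJavaObj:
    """Hypothetical stand-in for the py4j-wrapped Java params object."""

    def __init__(self):
        self.param_map = {}    # explicitly set values
        self.default_map = {}  # default values

    def set(self, name, value):
        self.param_map[name] = value

    def setDefault(self, name, value):
        self.default_map[name] = value

    def isSet(self, name):
        # A param counts as "set" only if it was explicitly set.
        return name in self.param_map


def transfer_params(py_param_map, py_default_map, java_obj):
    # Sketch of the split discussed above: explicit values use set(),
    # defaults use setDefault(), so isSet state is preserved.
    for name, value in py_param_map.items():
        java_obj.set(name, value)
    for name, value in py_default_map.items():
        java_obj.setDefault(name, value)


java_obj = MockJavaObj()
transfer_params({}, {"threshold": 0.0}, java_obj)
assert not java_obj.isSet("threshold")  # still a default, not explicitly set
assert java_obj.default_map["threshold"] == 0.0
```

This mirrors why transferring defaults through `setDefault` is harmless even when Java already has the same defaults: it overwrites a default with a default without flipping the set state.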
python/pyspark/ml/tests.py (Outdated)

```python
def test_preserve_set_state(self):
    model = Binarizer()
    self.assertFalse(model.isSet("threshold"))
    model._transfer_params_to_java()
```
Would it make sense to do an actual transform here instead of the two inner parts of the transform?
Yeah, it would be a little better to call the actual transform, but we would still need to call `_transfer_params_from_java` or check `isSet` with a direct call to Java via py4j. I was going to do this, but the `ParamTests` class doesn't already create a SparkSession - I'm sure it's just a small amount of overhead, but that's why I thought to just use `_transfer_params_to_java`.

Do you think it would be worth it to change `ParamTests` to inherit from `SparkSessionTestCase` so a session is created and I could make a DataFrame to transform?
I think that would be a reasonable thing to do. The slight increase in testing overhead is probably worth it: it keeps us from being too closely tied to the implementation details, and we already use `SparkSessionTestCase` in a lot of places.
Thanks for reviewing @holdenk ! You brought up some good points, let me know if you prefer me to change them.
@BryanCutler sorry for my slowness, I responded with a note on the tests. If you have a chance to update the test and update to master, I'd love to try and get this in.
…to-java-defaults-SPARK-21685
Test build #81484 has finished for PR 18982 at commit
Jenkins retest this please
Test build #81487 has finished for PR 18982 at commit
No problem @holdenk, I updated using
Hmmm, I can repeat the error with Python3, I'll look into it tomorrow
Cool, let me know if you want a hand reproing the error.
Test build #81574 has finished for PR 18982 at commit
@holdenk , the error was because I was calling
Test build #81577 has finished for PR 18982 at commit
retest this please
Test build #81647 has finished for PR 18982 at commit
Test build #83691 has finished for PR 18982 at commit
Can we update this to master?
…to-java-defaults-SPARK-21685
Test build #84035 has finished for PR 18982 at commit
Hi @holdenk , this is updated. Look ok to you?
…to-java-defaults-SPARK-21685
Test build #87228 has finished for PR 18982 at commit
So between #20410 & this one, which path do we want to go down?
LGTM
retest this please
From my rough reading and based upon what I know, this seems fine.
Test build #88369 has finished for PR 18982 at commit
…to-java-defaults-SPARK-21685
Test build #88452 has finished for PR 18982 at commit
cc @viirya too
retest this please
I think I figured out how to use params
Test build #88470 has finished for PR 18982 at commit
Test build #88475 has finished for PR 18982 at commit
Test build #88481 has finished for PR 18982 at commit
Thanks @holdenk and @HyukjinKwon ! I made a small change so that this can call the existing
```python
if len(pair_defaults) > 0:
    sc = SparkContext._active_spark_context
    pair_defaults_seq = sc._jvm.PythonUtils.toSeq(pair_defaults)
    self._java_obj.setDefault(pair_defaults_seq)
```
If the default params on the Java side and the Python side are the same, do we still need to set default params for the Java object? Aren't they already set in the Java object if they are default params?
My take is that while they should be the same, it's still possible they might not be. The user could extend their own classes, and defaults are quite easy to change in Python. Although we don't really support this, if there was a mismatch the user would probably just get bad results and it would be really hard to figure out why: from the Python API it would look like one value was in use, while Scala was actually using another.
If you all think it's overly cautious to do this, I can take it out. I just thought it would be cheap insurance to set these values regardless.
I think this is reasonable, a few extra lines to avoid potential unwanted user surprise is worth it.
LGTM
Merged to master.
Thanks @holdenk @HyukjinKwon and @viirya !
What changes were proposed in this pull request?

Currently, when a PySpark Model is transformed, default params that have not been explicitly set are then set on the Java side on the call to `wrapper._transfer_values_to_java`. This incorrectly changes the state of the Param, as it should still be marked as a default value only.

How was this patch tested?

Added a new test to verify that when transferring Params to Java, default params have their state preserved.
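The behavior being fixed can be illustrated with a plain-Python stand-in (the `MockParams` class and the `threshold` default are hypothetical stand-ins for the real PySpark/py4j objects): before the fix, pushing a default through `set` flips `isSet` to True; routing it through `setDefault` instead leaves the state unchanged.

```python
class MockParams:
    """Hypothetical stand-in for a Params object with one default value."""

    def __init__(self):
        self.param_map = {}                   # explicitly set values
        self.default_map = {"threshold": 0.0}  # default values

    def isSet(self, name):
        return name in self.param_map

    def set(self, name, value):
        self.param_map[name] = value

    def setDefault(self, name, value):
        self.default_map[name] = value


def buggy_transfer(model):
    # Old behavior: defaults were pushed through set(), flipping isSet.
    for name, value in list(model.default_map.items()):
        model.set(name, value)


def fixed_transfer(model):
    # Fixed behavior: defaults go through setDefault(), isSet unchanged.
    for name, value in list(model.default_map.items()):
        model.setDefault(name, value)


m1 = MockParams()
buggy_transfer(m1)
assert m1.isSet("threshold")       # bug: a default now looks explicitly set

m2 = MockParams()
fixed_transfer(m2)
assert not m2.isSet("threshold")   # fix: default-only state is preserved
```

The added test checks exactly this invariant on a real estimator: `isSet` must report the same answer before and after params are transferred to Java.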