[SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe #16782
Conversation
Ping @holdenk @davies . I reproduced the code in the JIRA and found that kwargs from one thread were getting overwritten by another, causing a race condition.
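The failure mode described above can be made concrete with a minimal, deterministic sketch (this is not the actual PySpark code; the `setParams` name and `maxIter` kwarg are illustrative). Storing kwargs on the decorator's wrapper function means every caller shares one slot, so a second call clobbers values the first caller has not yet read:

```python
from functools import wraps

def keyword_only(func):
    # Simplified pre-fix behavior: kwargs land in a single slot on the
    # wrapper function itself, shared by every caller and every thread.
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper._input_kwargs = kwargs
        return func(*args)
    return wrapper

@keyword_only
def setParams(self):
    pass

setParams(None, maxIter=10)      # caller A stores its kwargs
setParams(None, maxIter=99)      # caller B overwrites the shared slot
# If caller A is only now reading the stored kwargs (as happens when a
# thread is preempted between the store and the read), it sees B's values:
print(setParams._input_kwargs)   # {'maxIter': 99}
```

With real threads the interleaving is nondeterministic, but this sequential version shows exactly what a preempted thread would observe.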
Test build #72292 has finished for PR 16782 at commit

Test build #3586 has finished for PR 16782 at commit
Thanks @BryanCutler for the patch! The fix looks reasonable to me, but let me try to check with @davies to confirm. If this is the right approach, then I think we should update the other uses of _input_kwargs in pyspark.ml as well.
python/pyspark/__init__.py (Outdated)

```python
# NOTE - this assumes we are wrapping a method and args[0] will be 'self'
if len(args) > 1:
    raise TypeError("Method %s forces keyword arguments." % func.__name__)
wrapper._input_kwargs = kwargs
```
If the assumption is correct, should we always use 'self' to hold the kwargs (remove this line and update all the functions that use keyword_only)?
Yeah, that is what I was suggesting, only that removing that would require changing everywhere it is used in ml. So I just wanted to check with you guys first.
Thanks @jkbradley and @davies for reviewing. This fix still seems a little hacky to me, and you could still possibly run into trouble if you call a nested wrapped function and don't consume the stored _input_kwargs first.
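That nested-call concern can be sketched as follows (hypothetical class and parameter names; this mimics the per-instance storage approach). Even single-threaded, the stored kwargs are clobbered if another wrapped method runs before they are consumed:

```python
from functools import wraps

def keyword_only(func):
    # Per-instance storage variant: kwargs are stashed on the instance.
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper

class Model:
    @keyword_only
    def __init__(self, a=1):
        self.setParams(b=2)              # nested wrapped call overwrites...
        kwargs = self._input_kwargs      # ...so this now holds {'b': 2}
        self.a = kwargs.get("a", 1)

    @keyword_only
    def setParams(self, b=2):
        self.b = self._input_kwargs.get("b", 2)

m = Model(a=42)
print(m.a)   # 1, not 42: the nested call clobbered the stored kwargs
```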
also, using the
I'm OK with the current solution, though if it's easy to check using 'self' it may be worth doing. If there are cases in which the wrapper is still not thread-safe, then could you please document that in the wrapper? I worry about other parts of Spark adopting keyword_only without recognizing the thread-safety issues.
This patch is not a solution for PySpark users, because all of the ML stages in the pipeline are also not thread-safe in their creation due to this same wrapper. Note that the wrapper does two separate things: it enforces keyword-only arguments, and it passes the kwargs in an unsafe manner outside the call to the wrapped method. We can fix this by simply omitting the wrapper's second (apparently unneeded) feature. Another benefit of this omission is that wrapped functions do not need to be modified to use the wrapper (although the ML methods that have already been modified to depend upon the _input_kwargs introduced by the defective wrapper must be switched back to using named arguments). Note this also would fix the bug in Pipeline where the init method's modifications to stages are lost. To illustrate this approach with minimalist code similar to Pipeline (reconstructed here from the garbled snippet; the class and parameter names are only illustrative):

```python
from functools import wraps

def keyword_only(func):
    # Enforce keyword-only arguments; do NOT stash kwargs anywhere.
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(self, **kwargs)
    return wrapper

class Mytest:
    @keyword_only
    def __init__(self, stages=None):
        print("initParams", stages)
        self.setParams(stages=stages)

    @keyword_only
    def setParams(self, stages=None):
        print("setParams", stages)
        self.stages = stages

if __name__ == "__main__":
    m = Mytest(stages=[1, 2])    # initParams, then setParams
    try:
        m.setParams([3])         # nonKeyword arguments -> TypeError
    except TypeError as e:
        print(e)
    try:
        m.setParams(bogus=1)     # unexpected parameter -> TypeError
    except TypeError as e:
        print(e)
```

The exercised cases correspond to the test labels in the original comment: initParams, setParams, nonKeyword arguments, and initParams with unexpected parameter.
Hi @avi8tr , what exactly about this proposed fix is not thread-safe?
Hi, thanks for explaining that there is a purpose for the retention and passing of the user-supplied arguments outside of the function call (while not changing the public API). This fix enabling storage per instance fits the usage model for threading in Spark -- one thread creates the pipeline and e.g. invokes .fit() -- but it stops short of a fix because it leaves in place the static class variable for all other ML classes that use the wrapper, and those classes continue to use the static class variable. That is the aspect of the patch that is not thread-safe. If this branch is merged, one still cannot reasonably create multiple ML pipelines in a threaded environment, because the elements of the pipeline (its stages) are now known to be subject to the same bug.

(The remaining nit is: what is supposed to happen to arguments, e.g. stages=, that are changed in the bodies of the wrapped methods? Currently, the changes are thrown away. This would seem to deserve at least a comment placed in the dead code.)
I think this was discussed above: this WIP PR currently just changes the usage for Pipeline, but if the fix is OK for Pipeline, then @BryanCutler can update it for all models. Given the OK from @davies, I recommend we proceed with the current fix (but using 'self' to hold the kwargs as mentioned above). With regards to using
That's correct @jkbradley , thanks for clearing that up - I should have been more clear in the description. I'll go ahead and remove the static _input_kwargs.
Test build #73710 has finished for PR 16782 at commit

Test build #73713 has finished for PR 16782 at commit
I think this is ready for a final review @jkbradley @davies - thanks!
Test build #73709 has finished for PR 16782 at commit
@jkbradley I think that last test comment is from an older test that just took a while to finish; Test build #73713 is from the last commit and passed, but I can rerun just in case if you like.
You're right about the test. I'll take a look now.
Clever unit test : ) LGTM. I'll try to backport it to branch-2.1 and branch-2.0 as well.
Well, it merged with master, but it will need some manual backports. @BryanCutler Would you mind sending one for branch-2.1? I'm ambivalent about 2.0; your call (or anyone who's hit this on 2.0). Thank you!
Sure, I can do a backport @jkbradley, will ping you when ready

On Mar 3, 2017 4:46 PM, asfgit wrote: Closed #16782 via 44281ca
What changes were proposed in this pull request?

The @keyword_only decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as _input_kwargs. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See SPARK-19348 for reproduction code.

This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
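A minimal sketch of the member-variable approach described above (illustrative names only, not the actual Spark source): the wrapper assumes it decorates a method, so args[0] is self, and stores the kwargs there instead of on the shared wrapper function:

```python
from functools import wraps

def keyword_only(func):
    # Store kwargs on the instance (self) rather than on the wrapper,
    # so concurrent construction of separate instances cannot race.
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper

class Estimator:
    @keyword_only
    def __init__(self, maxIter=10):
        kwargs = self._input_kwargs      # read back from this instance only
        self.maxIter = kwargs.get("maxIter", 10)

e1 = Estimator(maxIter=5)
e2 = Estimator(maxIter=50)
print(e1.maxIter, e2.maxIter)   # 5 50
```

Because each instance owns its `_input_kwargs` attribute, two instances can no longer overwrite each other's values; only concurrent calls on the same instance remain unsynchronized, as noted above.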
How was this patch tested?

Added new unit tests for using the keyword_only decorator and a regression test that verifies _input_kwargs can be overwritten from different class instances.
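In the spirit of the regression test described above, here is a hedged sketch (illustrative class and helper names, not the actual test code) that constructs the same class concurrently with different kwargs and checks that each instance kept its own values:

```python
import threading
from functools import wraps

def keyword_only(func):
    # Per-instance storage, as in the fix described in this PR (sketch).
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper

class Classifier:
    @keyword_only
    def __init__(self, maxIter=10):
        self.maxIter = self._input_kwargs.get("maxIter", 10)

results = {}

def build(n):
    # Each thread builds its own instance with a distinct kwarg value.
    results[n] = Classifier(maxIter=n).maxIter

threads = [threading.Thread(target=build, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With per-instance storage, no thread can observe another's kwargs.
assert results == {n: n for n in range(8)}
print("per-instance kwargs survived concurrent construction")
```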