Conversation

@BryanCutler
Member

BryanCutler commented Feb 2, 2017

What changes were proposed in this pull request?

The @keyword_only decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as _input_kwargs. If multiple threads construct the same class with different kwargs, there is a race to read the static class variable before another thread overwrites it. See SPARK-19348 for reproduction code.

This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
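To make the race concrete, here is a minimal before/after sketch (simplified, not the patch verbatim; the real decorator lives in pyspark/__init__.py):

```python
from functools import wraps

def keyword_only_old(func):
    """Pre-fix sketch: kwargs are stored on the wrapper function object,
    which is shared by every instance and thread (effectively static)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if len(args) > 1:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        wrapper._input_kwargs = kwargs  # shared storage: threads can race here
        return func(*args, **kwargs)
    return wrapper

def keyword_only_fixed(func):
    """Post-fix sketch: kwargs are stored on the instance, so threads
    constructing separate instances no longer race."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if len(args) > 0:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        self._input_kwargs = kwargs  # per-instance storage
        return func(self, **kwargs)
    return wrapper
```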

How was this patch tested?

Added new unit tests for using the keyword_only decorator and a regression test that verifies _input_kwargs can be overwritten from different class instances.
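A sketch of the kind of regression check described (illustrative names, reusing keyword_only_fixed from the sketch above; the actual test added by the patch is KeywordOnlyTests):

```python
import unittest

class Wrapped(object):
    @keyword_only_fixed
    def set(self, x=None, y=None):
        return self

class KeywordOnlyTests(unittest.TestCase):
    def test_kwargs_are_per_instance(self):
        a, b = Wrapped(), Wrapped()
        a.set(x=1)
        b.set(x=2, y=3)  # must not clobber the kwargs captured by a
        self.assertEqual(a._input_kwargs, {"x": 1})
        self.assertEqual(b._input_kwargs, {"x": 2, "y": 3})
```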

@BryanCutler
Member Author

Ping @holdenk @davies. I ran the reproduction code from the JIRA and found that kwargs from one thread were getting overwritten by another, causing an ml.Pipeline to be constructed with incorrect parameters. This is a bit of a hacked solution; I'm not sure what you would think about removing wrapper._input_kwargs = kwargs entirely instead, since that would require a lot of changes.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72292 has finished for PR 16782 at commit 83bcce0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 26, 2017

Test build #3586 has finished for PR 16782 at commit 83bcce0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks @BryanCutler for the patch! The fix looks reasonable to me, but let me try to check with @davies to confirm.

If this is the right approach, then I think we should update the other uses of _input_kwargs in pyspark.ml as well.

```python
# NOTE - this assumes we are wrapping a method and args[0] will be 'self'
if len(args) > 1:
    raise TypeError("Method %s forces keyword arguments." % func.__name__)
wrapper._input_kwargs = kwargs
```
Contributor

If the assumption is correct, should we always use 'self' to hold the kwargs (i.e., remove this line and update all the functions that use keyword_only)?
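For reference, a sketch of what that would mean at each call site in pyspark.ml (assuming the wrapper writes self._input_kwargs as suggested):

```python
# before: reads the shared attribute off the decorated function object,
# which is effectively a static class variable
kwargs = self.setParams._input_kwargs

# after: reads the per-instance attribute written by the wrapper
kwargs = self._input_kwargs
```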

Member Author

Yeah, that is what I was suggesting; it's only that removing it would require changes everywhere it is used in ml, so I just wanted to check with you guys first.

@BryanCutler
Member Author

BryanCutler commented Feb 27, 2017

Thanks @jkbradley and @davies for reviewing. This fix still seems a little hacky to me: you could still run into trouble if you call a nested wrapped function and don't consume the _input_kwargs right away, but as far as I can tell that is never done in pyspark.ml. It is the best solution I could think of without being overly complicated, and it is a little better than it was before. If you give the go-ahead, I can update the other uses in pyspark.ml and try to add a test as well.
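A sketch of the nested-call hazard described above (using the per-instance decorator sketched earlier; Nested is a hypothetical class):

```python
class Nested(object):
    @keyword_only_fixed
    def __init__(self, a=None, b=None):
        # calling another wrapped method first overwrites self._input_kwargs...
        self.setParams(a=a)
        # ...so reading it here sees {'a': ...}, not the {'a': ..., 'b': ...}
        # that __init__ was actually called with
        kwargs = self._input_kwargs

    @keyword_only_fixed
    def setParams(self, a=None):
        kwargs = self._input_kwargs  # consumed immediately: the safe pattern
        return kwargs
```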

@BryanCutler
Member Author

Also, using the inspect module it would be possible to check whether the wrapped function is a method, so we wouldn't need to rely on that assumption.
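One hedged way such a check could look (an illustrative sketch; inspect.signature is Python 3.3+, which is relevant to the Python 2.6 point below, and 'self' is only a naming convention, not a guarantee):

```python
import inspect

def first_param_is_self(func):
    """Heuristic check that the wrapped function looks like a method:
    its first positional parameter is named 'self'."""
    params = list(inspect.signature(func).parameters)
    return bool(params) and params[0] == "self"
```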

@jkbradley
Member

I'm OK with the current solution, though if it's easy to check using inspection then that seems nice to do.

If there are cases in which the wrapper is still not thread-safe, then could you please document that in the wrapper? I worry about other parts of Spark adopting keyword_only without recognizing the thread safety issues.

@avi8tr

avi8tr commented Feb 28, 2017

This patch is not a solution for PySpark users, because all of the ML stages in a pipeline are also not thread-safe in their creation due to this same wrapper. Note that the wrapper does two separate things: it enforces keyword-only arguments, and it passes the kwargs in an unsafe manner outside the call to the wrapped method. We can fix this by simply omitting the wrapper's second (apparently unneeded) feature. Another benefit of this omission is that wrapped functions do not need to be modified to use the wrapper (although the ML methods that have already been modified to depend on the _input_kwargs introduced by the defective wrapper must be switched back to using named arguments). Note this would also fix the bug in Pipeline where the __init__ method's modifications to stages are lost. To illustrate this approach to a fix, using minimalist code similar to Pipeline:

```python
from functools import wraps

def keyword_only(func):
    """
    A decorator that forces keyword arguments in the wrapped method
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        if len(args) > 1:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(*args, **kwargs)
    return wrapper

class Mytest(object):

    @keyword_only
    def __init__(self, stages=None):
        """
        __init__(self, stages=None)
        """
        self.setParams(stages=stages)

    @keyword_only
    def setParams(self, stages=None):
        """
        setParams(self, stages=None)
        Sets params for Pipeline.
        """
        if stages is None:
            stages = []
        return self._set(stages=stages)

    def _set(self, **kwargs):
        for key, value in kwargs.items():
            print('kwargs contains ' + key + ": " + str(value))

if __name__ == "__main__":
    print()
    print('zero arguments')
    baz = Mytest()
    print()
    print('initParams')
    foo = Mytest(stages='initParams')
    print()
    print('setParams')
    bar = Mytest()
    bar.setParams(stages='setParams')
    print()
    print('nonKeyword arguments')
    try:
        bar = Mytest('nokeywords')
    except Exception as e:
        print('Exception: ' + e.args[0])
    print()
    print('initParams with unexpected parameter')
    try:
        bat = Mytest(stages='initParams', unexpectedParameter='foo')
    except Exception as e:
        print('Exception: ' + e.args[0])
```

the output of which is:

```
zero arguments
kwargs contains stages: []

initParams
kwargs contains stages: initParams

setParams
kwargs contains stages: []
kwargs contains stages: setParams

nonKeyword arguments
Exception: Method __init__ forces keyword arguments.

initParams with unexpected parameter
Exception: __init__() got an unexpected keyword argument 'unexpectedParameter'
```

@BryanCutler
Member Author

Hi @avi8tr, what exactly about this proposed fix is not thread-safe? _input_kwargs also serves another purpose: it contains only the params explicitly set by the user, and these get passed along, separately from the complete set of default values, to the underlying pyspark.ml param maps. What you are proposing does not account for that, and it would require major restructuring, as many of the pyspark.ml classes have huge param lists.
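For reference, a simplified sketch of that pyspark.ml usage pattern (SomeEstimator and its params are hypothetical, and the decorator is the per-instance version sketched earlier; real classes subclass Params and have far larger param lists):

```python
class SomeEstimator(object):
    @keyword_only_fixed
    def setParams(self, maxIter=100, regParam=0.0):
        # _input_kwargs holds only what the caller explicitly passed, so
        # explicitly-set params can be told apart from defaults when the
        # param map is built
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _set(self, **kwargs):
        # stand-in for Params._set, which records user-supplied values
        for k, v in kwargs.items():
            print("user set %s = %r" % (k, v))
        return self
```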

@avi8tr

avi8tr commented Feb 28, 2017

Hi, thanks for explaining that there is a purpose to retaining and passing the user-supplied arguments outside of the function call (while not changing the public API). This fix, enabling per-instance storage, fits the usage model for threading in Spark (one thread creates the pipeline and e.g. invokes .fit()), but it stops short of a full fix because it leaves in place the static class variable for all other ML classes that use the wrapper, and those classes continue to use the static class variable. That is the aspect of the patch that is not thread-safe. If this branch is merged, one still cannot reasonably create multiple ML pipelines in a threaded environment, because the elements of the pipeline (its stages) are subject to the same bug.

(A remaining nit: what is supposed to happen to arguments, e.g. stages=, that are changed in the bodies of the wrapped methods? Currently, those changes are thrown away. This would seem to deserve at least a comment in the dead code.)

@jkbradley
Member

it leaves in place the static class variable for all other ML classes that use the wrapper, and those classes continue to use the static class variable.

I think this was discussed above: This WIP PR currently just changes the usage for Pipeline, but if the fix is OK for Pipeline, then @BryanCutler can update it for all models.

Given the OK from @davies I recommend we proceed with the current fix (but using 'self' to hold the kwargs as mentioned above). With regards to using inspection, I say we just add a note to the keyword_only wrapper about only using it for methods.

@BryanCutler
Member Author

That's correct @jkbradley, thanks for clearing that up - I should have been more clear in the description. I'll go ahead and remove the static _input_kwargs and update the remaining uses. It doesn't look like the inspect functionality we'd need is in Python 2.6 anyway, so I'll put a note in the docstring instead.

BryanCutler changed the title [SPARK-19348][PYTHON][WIP] PySpark keyword_only decorator is not thread-safe → [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe Mar 1, 2017
@SparkQA

SparkQA commented Mar 1, 2017

Test build #73710 has finished for PR 16782 at commit acf9cf4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73713 has finished for PR 16782 at commit e578320.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

I think this is ready for a final review @jkbradley @davies - thanks!

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73709 has finished for PR 16782 at commit 8dafc20.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • class KeywordOnlyTests(unittest.TestCase):
  • class Wrapped(object):
  • class Setter(object):

@BryanCutler
Member Author

@jkbradley I think that last test comment is from an older test that just took a while to finish; Test build #73713 is from the latest commit and passed. I can rerun just in case, if you like.

@jkbradley
Member

You're right about the test. I'll take a look now.

@jkbradley
Member

Clever unit test : )

LGTM
Merging with master

I'll try to backport it to branch-2.1 and branch-2.0 as well.

@jkbradley
Member

Well, it merged with master, but it will need some manual backports. @BryanCutler Would you mind sending one for branch-2.1? I'm ambivalent about 2.0; your call (or anyone who's hit this on 2.0).

Thank you!

asfgit closed this in 44281ca Mar 4, 2017
@BryanCutler
Member Author

BryanCutler commented Mar 6, 2017 via email

BryanCutler deleted the pyspark-keyword_only-threadsafe-SPARK-19348 branch August 2, 2017 23:36