Conversation

@BryanCutler
Member

BryanCutler commented Feb 2, 2017

What changes were proposed in this pull request?

The @keyword_only decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as _input_kwargs. If multiple threads construct the same class with different kwargs, there is a race to read the static class variable before another thread overwrites it. See SPARK-19348 for reproduction code.

This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
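To make the race concrete, here is a minimal before/after sketch (simplified, not the patch verbatim; the real decorator lives in pyspark/__init__.py):

```python
from functools import wraps

def keyword_only_old(func):
    """Pre-fix sketch: kwargs are stored on the wrapper function object,
    which is shared by every instance and thread (effectively static)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if len(args) > 1:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        wrapper._input_kwargs = kwargs  # shared storage: threads can race here
        return func(*args, **kwargs)
    return wrapper

def keyword_only_fixed(func):
    """Post-fix sketch: kwargs are stored on the instance, so threads
    constructing separate instances no longer race."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if len(args) > 0:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        self._input_kwargs = kwargs  # per-instance storage
        return func(self, **kwargs)
    return wrapper
```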

How was this patch tested?

Added new unit tests for using the keyword_only decorator and a regression test that verifies _input_kwargs can be overwritten from different class instances.
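A sketch of the kind of regression check described (illustrative names, reusing keyword_only_fixed from the sketch above; the actual test added by the patch is KeywordOnlyTests):

```python
import unittest

class Wrapped(object):
    @keyword_only_fixed
    def set(self, x=None, y=None):
        return self

class KeywordOnlyTests(unittest.TestCase):
    def test_kwargs_are_per_instance(self):
        a, b = Wrapped(), Wrapped()
        a.set(x=1)
        b.set(x=2, y=3)  # must not clobber the kwargs captured by a
        self.assertEqual(a._input_kwargs, {"x": 1})
        self.assertEqual(b._input_kwargs, {"x": 2, "y": 3})
```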

@BryanCutler
Member Author

Ping @holdenk @davies. I ran the reproduction code from the JIRA and found that kwargs from one thread were getting overwritten by another, causing an ml.Pipeline to be constructed with incorrect parameters. This is a bit of a hacked solution; I'm not sure what you would think about removing wrapper._input_kwargs = kwargs entirely instead, since that would require a lot of changes.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72292 has finished for PR 16782 at commit 83bcce0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 26, 2017

Test build #3586 has finished for PR 16782 at commit 83bcce0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks @BryanCutler for the patch! The fix looks reasonable to me, but let me try to check with @davies to confirm.

If this is the right approach, then I think we should update the other uses of _input_kwargs in pyspark.ml as well.

```python
# NOTE - this assumes we are wrapping a method and args[0] will be 'self'
if len(args) > 1:
    raise TypeError("Method %s forces keyword arguments." % func.__name__)
wrapper._input_kwargs = kwargs
```
Contributor

If the assumption is correct, should we always use 'self' to hold the kwargs (i.e., remove this line and update all the functions that use keyword_only)?
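For reference, a sketch of what that would mean at each call site in pyspark.ml (assuming the wrapper writes self._input_kwargs as suggested):

```python
# before: reads the shared attribute off the decorated function object,
# which is effectively a static class variable
kwargs = self.setParams._input_kwargs

# after: reads the per-instance attribute written by the wrapper
kwargs = self._input_kwargs
```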

Member Author

Yeah, that is what I was suggesting; it's only that removing it would require changes everywhere it is used in ml, so I just wanted to check with you guys first.

@BryanCutler
Member Author

BryanCutler commented Feb 27, 2017

Thanks @jkbradley and @davies for reviewing. This fix still seems a little hacky to me: you could still run into trouble if you call a nested wrapped function and don't consume the _input_kwargs right away, but as far as I can tell that is never done in pyspark.ml. It is the best solution I could think of without being overly complicated, and it is a little better than it was before. If you give the go-ahead, I can update the other uses in pyspark.ml and try to add a test as well.
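A sketch of the nested-call hazard described above (using the per-instance decorator sketched earlier; Nested is a hypothetical class):

```python
class Nested(object):
    @keyword_only_fixed
    def __init__(self, a=None, b=None):
        # calling another wrapped method first overwrites self._input_kwargs...
        self.setParams(a=a)
        # ...so reading it here sees {'a': ...}, not the {'a': ..., 'b': ...}
        # that __init__ was actually called with
        kwargs = self._input_kwargs

    @keyword_only_fixed
    def setParams(self, a=None):
        kwargs = self._input_kwargs  # consumed immediately: the safe pattern
        return kwargs
```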

@BryanCutler
Member Author

Also, using the inspect module it would be possible to check whether the wrapped function is a method, so we wouldn't need to rely on that assumption.
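One hedged way such a check could look (an illustrative sketch; inspect.signature is Python 3.3+, which is relevant to the Python 2.6 point below, and 'self' is only a naming convention, not a guarantee):

```python
import inspect

def first_param_is_self(func):
    """Heuristic check that the wrapped function looks like a method:
    its first positional parameter is named 'self'."""
    params = list(inspect.signature(func).parameters)
    return bool(params) and params[0] == "self"
```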

@jkbradley
Member

I'm OK with the current solution, though if it's easy to check using inspection then that seems nice to do.

If there are cases in which the wrapper is still not thread-safe, then could you please document that in the wrapper? I worry about other parts of Spark adopting keyword_only without recognizing the thread safety issues.

@avi8tr

avi8tr commented Feb 28, 2017

This patch is not a solution for PySpark users, because all of the ML stages in a pipeline are also not thread-safe in their creation due to this same wrapper. Note that the wrapper does two separate things: it enforces keyword-only arguments, and it passes the kwargs in an unsafe manner outside the call to the wrapped method. We can fix this by simply omitting the wrapper's second (apparently unneeded) feature. Another benefit of this omission is that wrapped functions do not need to be modified to use the wrapper (although the ML methods that have already been modified to depend on the _input_kwargs introduced by the defective wrapper must be switched back to using named arguments). Note this would also fix the bug in Pipeline where the __init__ method's modifications to stages are lost. To illustrate this approach to a fix, using minimalist code similar to Pipeline:

```python
from functools import wraps

def keyword_only(func):
    """
    A decorator that forces keyword arguments in the wrapped method
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        if len(args) > 1:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(*args, **kwargs)
    return wrapper

class Mytest(object):

    @keyword_only
    def __init__(self, stages=None):
        """
        __init__(self, stages=None)
        """
        self.setParams(stages=stages)

    @keyword_only
    def setParams(self, stages=None):
        """
        setParams(self, stages=None)
        Sets params for Pipeline.
        """
        if stages is None:
            stages = []
        return self._set(stages=stages)

    def _set(self, **kwargs):
        for key, value in kwargs.items():
            print('kwargs contains ' + key + ": " + str(value))

if __name__ == "__main__":
    print()
    print('zero arguments')
    baz = Mytest()
    print()
    print('initParams')
    foo = Mytest(stages='initParams')
    print()
    print('setParams')
    bar = Mytest()
    bar.setParams(stages='setParams')
    print()
    print('nonKeyword arguments')
    try:
        bar = Mytest('nokeywords')
    except Exception as e:
        print('Exception: ' + e.args[0])
    print()
    print('initParams with unexpected parameter')
    try:
        bat = Mytest(stages='initParams', unexpectedParameter='foo')
    except Exception as e:
        print('Exception: ' + e.args[0])
```

the output of which is:

```
zero arguments
kwargs contains stages: []

initParams
kwargs contains stages: initParams

setParams
kwargs contains stages: []
kwargs contains stages: setParams

nonKeyword arguments
Exception: Method __init__ forces keyword arguments.

initParams with unexpected parameter
Exception: __init__() got an unexpected keyword argument 'unexpectedParameter'
```

@BryanCutler
Member Author

Hi @avi8tr, what exactly about this proposed fix is not thread-safe? _input_kwargs also serves another purpose: it contains only the params explicitly set by the user, and these get passed along, separately from the complete set of default values, to the underlying pyspark.ml param maps. What you are proposing does not account for that, and it would require major restructuring, as many of the pyspark.ml classes have huge param lists.
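For reference, a simplified sketch of that pyspark.ml usage pattern (SomeEstimator and its params are hypothetical, and the decorator is the per-instance version sketched earlier; real classes subclass Params and have far larger param lists):

```python
class SomeEstimator(object):
    @keyword_only_fixed
    def setParams(self, maxIter=100, regParam=0.0):
        # _input_kwargs holds only what the caller explicitly passed, so
        # explicitly-set params can be told apart from defaults when the
        # param map is built
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _set(self, **kwargs):
        # stand-in for Params._set, which records user-supplied values
        for k, v in kwargs.items():
            print("user set %s = %r" % (k, v))
        return self
```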

@avi8tr

avi8tr commented Feb 28, 2017

Hi, thanks for explaining that there is a purpose to retaining and passing the user-supplied arguments outside of the function call (while not changing the public API). This fix, enabling per-instance storage, fits the usage model for threading in Spark (one thread creates the pipeline and e.g. invokes .fit()), but it stops short of a full fix because it leaves in place the static class variable for all other ML classes that use the wrapper, and those classes continue to use the static class variable. That is the aspect of the patch that is not thread-safe. If this branch is merged, one still cannot reasonably create multiple ML pipelines in a threaded environment, because the elements of the pipeline (its stages) are subject to the same bug.

(A remaining nit: what is supposed to happen to arguments, e.g. stages=, that are changed in the bodies of the wrapped methods? Currently, those changes are thrown away. This would seem to deserve at least a comment in the dead code.)

@jkbradley
Member

it leaves in place the static class variable for all other ML classes that use the wrapper, and those classes continue to use the static class variable.

I think this was discussed above: This WIP PR currently just changes the usage for Pipeline, but if the fix is OK for Pipeline, then @BryanCutler can update it for all models.

Given the OK from @davies I recommend we proceed with the current fix (but using 'self' to hold the kwargs as mentioned above). With regards to using inspection, I say we just add a note to the keyword_only wrapper about only using it for methods.

@BryanCutler
Member Author

That's correct @jkbradley, thanks for clearing that up - I should have been more clear in the description. I'll go ahead and remove the static _input_kwargs and update the remaining uses. It doesn't look like the inspect functionality we'd need is in Python 2.6 anyway, so I'll put a note in the docstring instead.

BryanCutler changed the title [SPARK-19348][PYTHON][WIP] PySpark keyword_only decorator is not thread-safe → [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe Mar 1, 2017
@SparkQA

SparkQA commented Mar 1, 2017

Test build #73710 has finished for PR 16782 at commit acf9cf4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73713 has finished for PR 16782 at commit e578320.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

I think this is ready for a final review @jkbradley @davies - thanks!

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73709 has finished for PR 16782 at commit 8dafc20.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • class KeywordOnlyTests(unittest.TestCase):
  • class Wrapped(object):
  • class Setter(object):

@BryanCutler
Member Author

@jkbradley I think that last test comment is from an older test that just took a while to finish; Test build #73713 is from the latest commit and passed. I can rerun just in case, if you like.

@jkbradley
Member

You're right about the test. I'll take a look now.

@jkbradley
Member

Clever unit test : )

LGTM
Merging with master

I'll try to backport it to branch-2.1 and branch-2.0 as well.

@jkbradley
Member

Well, it merged with master, but it will need some manual backports. @BryanCutler Would you mind sending one for branch-2.1? I'm ambivalent about 2.0; your call (or anyone who's hit this on 2.0).

Thank you!

asfgit closed this in 44281ca Mar 4, 2017
@BryanCutler
Member Author

BryanCutler commented Mar 6, 2017 via email

BryanCutler deleted the pyspark-keyword_only-threadsafe-SPARK-19348 branch August 2, 2017 23:36