[SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry #18695
Conversation
Test build #79809 has finished for PR 18695 at commit
python/pyspark/broadcast.py (outdated)

        self._registry = set()
        self._lock = lock

        @property
Would you mind if I ask why this one should be a property?
Sure @HyukjinKwon, it's not really necessary. It's just there to say that the lock should not be changed once this class is instantiated, i.e. to keep it "private" but still allow it to be acquired. Maybe it's overkill here because this is not a widely used class and has a very specific use. I could remove it if that makes things easier.
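The pattern being discussed — exposing a lock through a read-only property so callers can acquire it but not rebind it — can be sketched as follows (a hypothetical, simplified illustration, not the actual PR code; the class and variable names are made up):

```python
import threading

class LockedRegistry:
    """Hypothetical sketch: keep the lock 'private' but allow it to be acquired."""

    def __init__(self, lock):
        self._registry = set()
        self._lock = lock  # set once at construction, never rebound

    @property
    def lock(self):
        # Read-only accessor: callers can acquire the lock,
        # but assigning to .lock raises AttributeError.
        return self._lock

reg = LockedRegistry(threading.RLock())
with reg.lock:  # acquire the shared lock while mutating the registry
    reg._registry.add("bcast_var")
```

Since `lock` has a getter but no setter, `reg.lock = something` fails, which is the "private but acquirable" behavior described above.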
python/pyspark/context.py (outdated)

@@ -195,7 +195,7 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         # This allows other code to determine which Broadcast instances have
         # been pickled, so it can determine which Java broadcast objects to
         # send.
-        self._pickled_broadcast_vars = set()
+        self._pickled_broadcast_registry = BroadcastPickleRegistry(self._lock)
Instead of using a lock, how about using thread-local data? Then we don't block other threads when pickling.
Thanks @viirya, that was a good idea! I updated to use a thread-local object to store the pickled vars.
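The suggestion can be demonstrated with plain `threading.local`: attributes set on a thread-local object in one thread are invisible to other threads, so no lock is needed. A minimal standalone sketch (not Spark code):

```python
import threading

local_data = threading.local()

def worker(name, results):
    # Each thread sees its own independent 'items' attribute.
    local_data.items = set()
    local_data.items.add(name)
    results[name] = set(local_data.items)

results = {}
threads = [threading.Thread(target=worker, args=("t%d" % i, results))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Neither thread saw the other's additions.
assert results["t0"] == {"t0"}
assert results["t1"] == {"t1"}
```

This is exactly the isolation property the registry needs: each thread pickling a command only ever sees the broadcast variables it registered itself.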
Test build #79974 has finished for PR 18695 at commit
The change LGTM. Will it be hard to add a reliable test for this?
Yeah, I think I can add a simple test for this. I'll give it a try.
Test build #80098 has finished for PR 18695 at commit
Test build #80100 has finished for PR 18695 at commit
@viirya @HyukjinKwon, I added a test for this, although it maybe doesn't look as straightforward as I was thinking :) Could you take a look and see if it makes sense? Thanks!
python/pyspark/tests.py (outdated)

        def process_vars(sc):
            broadcast_vars = [x for x in sc._pickled_broadcast_vars]
            num_pickled = len(broadcast_vars)
            sc._pickled_broadcast_vars.clear()
Shall we check if pickled vars are actually cleared?
Yeah, that would be good.
LGTM except for one minor comment.
@@ -139,6 +140,24 @@ def __reduce__(self):
         return _from_id, (self._jbroadcast.id(),)


+class BroadcastPickleRegistry(threading.local):
Hm, actually, I preferred the locking way before. I guess there wouldn't be a big performance difference due to the GIL, and a simple lock was easy to read...
BTW, I am okay with the current way too.
I'm ok for both ways. :)
My only concern with the previous locking way is that we hold the lock while dumping the command. I'm not sure whether dumping can take a long time for a big command, which would prevent other threads from preparing their commands.
Yeah, this solves the issue either way and looks safe from your concern. Probably will make a follow-up after testing it (quite a bit) later.
Using the lock made it a little more obvious what is going on, but it's better not to use a lock in case a large command is being pickled, like @viirya said. Also, this way doesn't need to change any of the pickling code, so I prefer it too.
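A thread-local registry of the kind discussed here can be sketched as follows. This is a reconstruction from the diff fragments above (the `threading.local` subclass and the `add`/`clear`/iteration usage seen in the test), not a verbatim copy of the merged code:

```python
import threading

class BroadcastPickleRegistry(threading.local):
    """Thread-local registry of broadcast variables pickled for the current command.

    Because the class inherits from threading.local, each thread that touches
    the registry gets its own independent `_registry` set, so no lock is needed.
    """

    def __init__(self):
        # threading.local may re-run __init__ in each new thread; setdefault
        # keeps an existing set if one is already there.
        self.__dict__.setdefault("_registry", set())

    def __iter__(self):
        for bcast in self._registry:
            yield bcast

    def add(self, bcast):
        self._registry.add(bcast)

    def clear(self):
        self._registry.clear()
```

Each thread that pickles a command sees only the broadcast variables it added itself, and calling `clear()` after dumping only empties that thread's view, which is why no pickling code had to change.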
Test build #80104 has finished for PR 18695 at commit
Last question to check if I read correctly. So, the problem is around
Thanks @HyukjinKwon and @viirya! I updated the description. To sum up this issue,
Thanks for the clarification. LGTM.
Thanks. Merged to master.
@holdenk, BTW, it looks like I am facing the same issue you met before. It seems I can't trigger the Jenkins build with "ok to test". Do you maybe know who I should ask, or what steps I should take?
@HyukjinKwon that needs to be added separately by someone who has access to Jenkins as admin.
Hmm, I see. Thanks.
What changes were proposed in this pull request?

When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables pickled from one thread are added to the shared `_pickled_broadcast_vars` and become part of the Python command from another thread. This PR introduces a thread-safe pickle registry using thread-local storage, so that when a Python command is pickled (causing the broadcast variables to be pickled and added to the registry), each thread has its own view of the registry from which to retrieve and clear the broadcast variables used.

How was this patch tested?

Added a unit test that triggers this race condition using another thread.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#18695 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717.
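To make the race described above concrete, here is a minimal standalone simulation (not Spark code; all names are illustrative) of what can go wrong with a single shared registry: a slow "pickle" in one thread captures a variable registered by another thread, and the other thread's `clear()` wipes the first thread's entry before it is captured.

```python
import threading
import time

shared_registry = set()   # stands in for the old shared _pickled_broadcast_vars
captured = {}
registered = threading.Event()

def slow_thread():
    shared_registry.add("bcast_a")        # pickling registers this thread's var
    registered.set()
    time.sleep(0.2)                       # simulate pickling a large command
    captured["a"] = set(shared_registry)  # command captures whatever is registered
    shared_registry.clear()

def fast_thread():
    registered.wait()                     # run while the slow thread is "pickling"
    shared_registry.add("bcast_b")
    captured["b"] = set(shared_registry)
    shared_registry.clear()

t1 = threading.Thread(target=slow_thread)
t2 = threading.Thread(target=fast_thread)
t1.start()
t2.start()
t1.join()
t2.join()

# The fast thread's command wrongly captured the slow thread's broadcast var...
assert captured["b"] == {"bcast_a", "bcast_b"}
# ...and its clear() wiped the registry before the slow thread captured anything.
assert captured["a"] == set()
```

With a thread-local registry, each thread would capture exactly the variables it registered itself, and neither failure mode can occur.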