[SPARK-12717][PYSPARK][BRANCH-1.6] Resolving race condition with pyspark broadcasts when using multiple threads #17722

vundela · 2017-04-21T18:45:38Z

What changes were proposed in this pull request?

In PySpark, when multiple threads are used, broadcast variables can be pickled with the wrong PythonRDD wrap function because of a race condition between the threads on the Java side (via py4j). This leads to the following exception:

16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py", line 39, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '6' not loaded!",), <function _from_id at 0xce7a28>, (6L,))

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

This change fixes the race condition in branch-1.6 by making sure that broadcast variables are pickled with the same PythonRDD function.
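For readers unfamiliar with this class of bug, the following is a minimal pure-Python sketch of how per-job state shared between threads can lose a broadcast id. It does not use Spark or py4j and is not the code of this patch. The names `SharedRegistry`, `ThreadLocalRegistry`, and `simulate` are hypothetical, and modeling the problem as a shared set of recorded broadcast ids is an assumption made only for illustration. Events force the unlucky interleaving so that the outcome is deterministic.

```python
import threading


class SharedRegistry:
    """Hypothetical registry: one set of broadcast ids shared by all threads."""

    def __init__(self):
        self.ids = set()


class ThreadLocalRegistry(threading.local):
    """Hypothetical registry: every thread gets its own set of broadcast ids."""

    def __init__(self):
        # threading.local re-runs __init__ in each thread on first access.
        self.ids = set()


def simulate(registry):
    """Force the interleaving: job A records id 6, job B clears, job A reads."""
    added = threading.Event()
    cleared = threading.Event()
    seen = []

    def job_a():
        registry.ids.add(6)               # A records the broadcast it needs
        added.set()
        cleared.wait()                    # B runs its reset in between
        seen.append(sorted(registry.ids)) # what A would ship with its command

    def job_b():
        added.wait()
        registry.ids.clear()              # B resets state before its own job
        cleared.set()

    threads = [threading.Thread(target=job_a), threading.Thread(target=job_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen[0]


print(simulate(SharedRegistry()))       # job A lost its broadcast id
print(simulate(ThreadLocalRegistry()))  # job A keeps its broadcast id
```

With the shared registry, job A ends up with no broadcast ids, which is analogous to a worker being told to use a broadcast variable that was never shipped to it. Isolating the state per thread is one way to avoid that; serializing the critical section with a lock is another.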

How was this patch tested?

Reproduced the issue described in SPARK-12717 by following the instructions in the JIRA ticket, then verified that the issue no longer occurs with these changes.



AmplabJenkins · 2017-04-21T18:47:15Z

Can one of the admins verify this patch?

maver1ck · 2017-07-19T08:01:56Z

Hi,
What is the status of this issue?

HyukjinKwon · 2017-08-07T12:56:29Z

One thing I am sure of is that @BryanCutler proposed, IMHO, a better approach, which was merged into master, 2.2 and 2.1. I guess this PR should be closed at least.

Closes apache#18522 Closes apache#17722 Closes apache#18879 Closes apache#18891 Closes apache#18806 Closes apache#18948 Closes apache#18949 Closes apache#19070 Closes apache#19039 Closes apache#19142 Closes apache#18515 Closes apache#19154 Closes apache#19162 Closes apache#19187

[SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcas…

af8e6d7

…ts when using multiple threads

srowen mentioned this pull request Sep 12, 2017

[BUILD] Close stale PRs #19203

Closed

asfgit closed this in dd88fa3 Sep 13, 2017
