
[SPARK-12717][PYSPARK][BRANCH-1.6] Resolving race condition with pyspark broadcasts when using multiple threads #17722

Closed
wants to merge 1 commit

Conversation

@vundela commented on Apr 21, 2017

What changes were proposed in this pull request?

In PySpark, when multiple driver threads submit jobs concurrently, broadcast variables can be pickled with the wrong PythonRDD wrapper function (because of a race condition between the threads on the Java side via Py4J), which leads to the following exception:

16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py", line 39, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '6' not loaded!",), <function _from_id at 0xce7a28>, (6L,))

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
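
For illustration, a driver program along these lines can hit the race (a minimal sketch only; the thread count and data are made up, not taken from the jira):

```python
# Hypothetical reproduction sketch: several driver threads each create their own
# broadcast variable and submit a job that uses it. Under the race described above,
# a task can receive a command pickled against the wrong wrapper and fail with
# "Broadcast variable '<id>' not loaded!".
import threading
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-race-repro")

def run_job(n):
    bv = sc.broadcast(list(range(n + 1)))  # per-thread broadcast variable
    total = sc.parallelize(range(10)).map(lambda x: x + len(bv.value)).sum()
    print(n, total)

threads = [threading.Thread(target=run_job, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

sc.stop()
```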

This change fixes the race condition in branch-1.6 by making sure that broadcast variables are pickled with the same PythonRDD wrapper function.
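
Conceptually, the problem is driver-side state shared across threads; one way to keep per-thread broadcast bookkeeping separate is thread-local storage, sketched below (a conceptual illustration only, not the code in this PR or in the fix that later landed on master; the class and attribute names are hypothetical):

```python
import threading

# Conceptual sketch (hypothetical names): keep the set of broadcast variables that
# were pickled for a job in thread-local storage, so jobs submitted from different
# driver threads do not see each other's entries.
class PerThreadBroadcastRegistry(threading.local):
    def __init__(self):
        self.broadcasts = set()  # initialized independently in each thread

    def add(self, bcast):
        self.broadcasts.add(bcast)

    def clear(self):
        self.broadcasts.clear()
```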

How was this patch tested?

Reproduced the issue described in SPARK-12717, following the instructions specified in the jira.
Verified that the issue is fixed with these changes.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@AmplabJenkins

Can one of the admins verify this patch?

@maver1ck
Contributor

Hi,
What about this issue?

@HyukjinKwon
Member

One thing I am sure of is that @BryanCutler proposed, IMHO, a better approach, which was merged into master, 2.2 and 2.1. I guess this PR should be closed at least.

srowen added a commit to srowen/spark that referenced this pull request on Sep 12, 2017
@srowen mentioned this pull request on Sep 12, 2017
@asfgit closed this in dd88fa3 on Sep 13, 2017