[SPARK-5361]Multiple Java RDD <-> Python RDD conversions not working correctly #4146
Conversation
Do you mind opening a JIRA issue for this and updating your pull request title to reference that issue (e.g. [SPARK-XXX] My PR title…)?
Can one of the admins verify this patch?
@JoshRosen Updated. Thanks
Jenkins, this is ok to test.
Test build #25925 has started for PR 4146 at commit
Test build #25925 has finished for PR 4146 at commit
Test FAILed.
checking
Oh, this looks like a spurious failure due to some Jenkins flakiness (which I'm going to investigate separately).
Can you add a regression test for this? See
@JoshRosen Sure thing. I did not see any similar regression test case in this file though. Thanks
If it were me, I'd probably put it as a new case in the
Great! I will follow it then.
Found a good way to reproduce it:
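Run in a pyspark shell (where `sc` is the SparkContext):

```
from pyspark.rdd import RDD

dl = [
    (u'2', {u'director': u'David Lean'}),
    (u'7', {u'director': u'Andrew Dominik'})
]

dl_rdd = sc.parallelize(dl)
tmp = dl_rdd._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()

tmp = t._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()  # it blows up here during the 2nd time of conversion
```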
I am going to make a test case from this example.
Test build #25977 has started for PR 4146 at commit
Test build #25977 has finished for PR 4146 at commit
Test FAILed.
@JoshRosen Added in a test case. Thanks
Test build #25978 has started for PR 4146 at commit
Test build #25978 has finished for PR 4146 at commit
Test FAILed.
Test build #25981 has started for PR 4146 at commit
Test build #25981 has finished for PR 4146 at commit
Test PASSed.
@JoshRosen Anything else I can do here?
```diff
@@ -153,7 +153,10 @@ private[spark] object SerDeUtil extends Logging {
     iter.flatMap { row =>
       val obj = unpickle.loads(row)
       if (batched) {
-        obj.asInstanceOf[JArrayList[_]].asScala
+        obj match {
+          case array: Array[Any] => array.toList
```
I think that `array.toList` constructs a Scala `List`, which is basically a linked list. Is there a reason why we can't call `toSeq` instead, or can't simply return the array?
This seems good to me overall; I left one minor comment regarding a small performance optimization. Sorry for the delay in review.
Test build #26123 has started for PR 4146 at commit
@JoshRosen updated to `toSeq`.
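For reference, the batched branch of `SerDeUtil.pythonToJava` after switching to `toSeq` would look roughly like the sketch below (based on the diff above and the follow-up "update to toSeq" commit; a sketch, not necessarily the exact merged code):

```scala
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._

// `iter`, `unpickle`, and `batched` come from the enclosing method.
iter.flatMap { row =>
  val obj = unpickle.loads(row)
  if (batched) {
    obj match {
      // After a JavaRDD -> PythonRDD -> JavaRDD round trip, a batch can be
      // unpickled as a plain Object array instead of a java.util.ArrayList,
      // which is what triggered the ClassCastException before this change.
      case array: Array[Any] => array.toSeq
      // The original path: pickled batches arriving as java.util.ArrayList.
      case _ => obj.asInstanceOf[JArrayList[_]].asScala
    }
  } else {
    Seq(obj)
  }
}
```

Using `toSeq` (or returning the array directly) avoids copying every batch into a linked `List`, which was the performance concern raised above.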
Test build #26123 has finished for PR 4146 at commit
Test PASSed.
@JoshRosen Can we have this in the next release? We will have to use our own fork if it's not in. Thanks
Let me take one final look to see if I can pull this in for 1.2.1 (since we're cutting a new RC tonight). In general, this looks safe since it only adds new code paths in cases where we'd otherwise throw an exception, as opposed to changing the behavior of existing code paths. If things check out, I'll pull it in for both 1.3.0 and 1.2.1.
@wingchen Actually, just to be clear here, is this problem related to tuple handling, or is the actual issue related to multiple Java <-> Python conversions not working correctly? If there's nothing tuple-specific about this, do you mind editing the PR title, description, and JIRA to reflect this?
@JoshRosen I found this one reading an RDD from `sc.newAPIHadoopRDD` and writing it back in pyspark. But it seems that the problem is not specific to tuples; it happens whenever there are multiple Java <-> Python RDD conversions.
@JoshRosen updated. thanks :)
@wingchen Thanks for updating this. The new description + code both look good to me, so I'm going to merge this into `master`.
@JoshRosen Thanks a lot for your help.
I've cherry-picked this into `branch-1.2` so that it will be included in 1.2.1.
[SPARK-5361] Multiple Java RDD <-> Python RDD conversions not working correctly

This is found through reading RDD from `sc.newAPIHadoopRDD` and writing it back using `rdd.saveAsNewAPIHadoopFile` in pyspark.

It turns out that whenever there are multiple RDD conversions from JavaRDD to PythonRDD then back to JavaRDD, the exception below happens:

```
15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
```

The test case code below reproduces it:

```
from pyspark.rdd import RDD

dl = [
    (u'2', {u'director': u'David Lean'}),
    (u'7', {u'director': u'Andrew Dominik'})
]

dl_rdd = sc.parallelize(dl)
tmp = dl_rdd._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()

tmp = t._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()  # it blows up here during the 2nd time of conversion
```

Author: Winston Chen <wchen@quid.com>

Closes #4146 from wingchen/master and squashes the following commits:

903df7d [Winston Chen] SPARK-5361, update to toSeq based on the PR
5d90a83 [Winston Chen] SPARK-5361, make python pretty, so to pass PEP 8 checks
126be6b [Winston Chen] SPARK-5361, add in test case
4cf1187 [Winston Chen] SPARK-5361, add in test case
9f1a097 [Winston Chen] add in tuple handling while converting form python RDD back to JavaRDD

(cherry picked from commit 453d799)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
This is found through reading an RDD from `sc.newAPIHadoopRDD` and writing it back using `rdd.saveAsNewAPIHadoopFile` in pyspark. It turns out that whenever there are multiple RDD conversions from JavaRDD to PythonRDD then back to JavaRDD, the exception below happens:
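```
15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
```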
The test case code below reproduces it:
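```
from pyspark.rdd import RDD

dl = [
    (u'2', {u'director': u'David Lean'}),
    (u'7', {u'director': u'Andrew Dominik'})
]

dl_rdd = sc.parallelize(dl)
tmp = dl_rdd._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()

tmp = t._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()  # it blows up here during the 2nd time of conversion
```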