[SPARK-2737] Add retag() method for changing RDDs' ClassTags. #1639
Conversation
/cc @mengxr @jkbradley @mateiz
QA tests have started for PR 1639. This patch merges cleanly.
QA results for PR 1639:
The Java API's use of fake ClassTags doesn't seem to cause any problems for Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs to Scala code (e.g. in the MLlib Java API wrapper code). If we call collect() on a Scala RDD with an incorrect ClassTag, this causes ClassCastExceptions when we try to allocate an array of the wrong type (for example, see SPARK-2197).

There are a few possible fixes here. An API-breaking fix would be to completely remove the fake ClassTags and require Java API users to pass java.lang.Class instances to all parallelize() calls and add returnClass fields to all Function implementations. This would be extremely verbose.

Instead, this patch adds internal APIs to "repair" a Scala RDD with an incorrect ClassTag by wrapping it and overriding its ClassTag. This should be okay for cases where the Scala code that calls collect() knows what type of array should be allocated, which is the case in the MLlib wrappers.
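The failure mode can be reproduced in plain Scala, without Spark. Below is a minimal sketch; `collectLike` is a hypothetical stand-in for `collect()`, invented here to show how allocating the result array through a fake ClassTag blows up at the call site:

```scala
import scala.reflect.ClassTag

// collectLike is a hypothetical stand-in for RDD.collect(): like Spark, it
// allocates the result array through the ClassTag in scope.
def collectLike[T](data: Seq[T])(implicit ct: ClassTag[T]): Array[T] = {
  val arr = ct.newArray(data.length)
  data.copyToArray(arr)
  arr
}

// The Java API's "fake" ClassTag is ClassTag.AnyRef cast to the right type.
val fake = ClassTag.AnyRef.asInstanceOf[ClassTag[String]]

// With the fake tag, the allocated array is really Array[AnyRef]; the cast
// inserted at the call site fails with a ClassCastException.
val fakeTagFails =
  try { val a: Array[String] = collectLike(Seq("a", "b"))(fake); false }
  catch { case _: ClassCastException => true }

// Supplying the correct ClassTag "repairs" the allocation.
val repaired: Array[String] = collectLike(Seq("a", "b"))(ClassTag(classOf[String]))
```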
QA tests have started for PR 1639. This patch merges cleanly.
QA results for PR 1639:
Another option would be to add
```scala
  override protected def getPartitions: Array[Partition] = oldRDD.getPartitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    oldRDD.compute(split, context)
}
```
You also need to preserve the Partitioner and such. It would be better to do this via this.mapPartitions with the preservesPartitioning option set to true.
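In toy form, the suggestion amounts to the following sketch. MiniRDD is an invented model, not Spark API; it only exists to show that rebuilding through mapPartitions(identity, preservesPartitioning = true) under a corrected ClassTag keeps the partitioner:

```scala
import scala.reflect.ClassTag

// MiniRDD is an invented toy model (not Spark API): a dataset split into
// partitions, plus an optional partitioner label.
class MiniRDD[T](val parts: Seq[Seq[T]], val partitioner: Option[String])
                (implicit val ct: ClassTag[T]) {

  def mapPartitions[U](f: Iterator[T] => Iterator[U],
                       preservesPartitioning: Boolean = false)
                      (implicit uct: ClassTag[U]): MiniRDD[U] =
    new MiniRDD[U](parts.map(p => f(p.iterator).toSeq),
                   if (preservesPartitioning) partitioner else None)(uct)

  // collect() allocates its result array from the ClassTag, as Spark does.
  def collect(): Array[T] = {
    val flat = parts.flatten
    val arr = ct.newArray(flat.length)
    flat.copyToArray(arr)
    arr
  }

  // The reviewer's suggestion, written against this toy API: an identity
  // pass over each partition, keeping the partitioner, under a new tag.
  def retag(cls: Class[T]): MiniRDD[T] =
    mapPartitions(identity, preservesPartitioning = true)(ClassTag(cls))
}

val fake = ClassTag.AnyRef.asInstanceOf[ClassTag[String]]
val withFakeTag = new MiniRDD[String](Seq(Seq("a"), Seq("b")), Some("hash"))(fake)

val fixed = withFakeTag.retag(classOf[String])
val out: Array[String] = fixed.collect() // genuine Array[String] now
```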
Would there be any performance impact of running mapPartitions(identity, preservesPartitioning = true)(classTag)? If we have an RDD that's persisted in a serialized format, wouldn't this extra map force an unnecessary deserialization?
Sure, the fix with just passing the partitioner also works.
Actually, compute() just works at the iterator level, so I don't think mapPartitions would hurt. All you do is pass through the parent's iterator. When you call compute() you're already deserializing the RDD, so this won't create extra work.
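This is easy to check outside Spark: an identity pass-through over an iterator does no work until the result is actually consumed. A minimal sketch:

```scala
var produced = 0

// A lazy source, standing in for the parent RDD's compute() iterator.
val parent: Iterator[Int] = Iterator.tabulate(3) { i => produced += 1; i }

// The pass-through that mapPartitions(identity, ...) would insert.
val child: Iterator[Int] = identity(parent)

val producedBeforeConsumption = produced // still zero: nothing evaluated yet
val result = child.toArray               // evaluation happens only here
val producedAfterConsumption = produced
```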
I'm okay with either this or collectSeq, actually.
I'm going to take another pass on this to see if I can implicitly grab the ClassTag from the caller's scope, so hold off on merging this for a bit.
QA tests have started for PR 1639. This patch merges cleanly.
This method is intended to be called by Scala classes that implement Java-friendly wrappers for the Spark Scala API. For instance, MLlib has APIs that accept RDD[LabeledPoint]. Ideally, the Java wrapper code can simply call the underlying Scala methods without having to worry about how they're implemented. Therefore, I think we should prefer this approach. Since this is a private, internal API, we should be able to revisit this decision if we change our minds later.
My last commit made
Sure, sounds good. Did you see my comments on preserving partitions too, though?
QA results for PR 1639:
QA tests have started for PR 1639. This patch merges cleanly.
QA results for PR 1639:
In case you don't see the hidden comment above: I don't think mapPartitions would hurt performance here. All you do is pass through the parent's iterator. When you call compute() you're already deserializing the RDD, so this won't create extra work in that case.
Basically it's a shorter way of writing what you wrote. Take a look at MapPartitionsRDD.
I've updated this to use mapPartitions().
QA tests have started for PR 1639. This patch merges cleanly.
LGTM, feel free to merge it when it passes tests.
Jenkins, retest this please.
QA tests have started for PR 1639. This patch merges cleanly.
QA results for PR 1639:
Jenkins, retest this please.
QA tests have started for PR 1639. This patch merges cleanly.
QA results for PR 1639:
Alright, I've merged this. Thanks for the review!
Author: Josh Rosen <joshrosen@apache.org>

Closes apache#1639 from JoshRosen/SPARK-2737 and squashes the following commits:

572b4c8 [Josh Rosen] Replace newRDD[T] with mapPartitions().
469d941 [Josh Rosen] Preserve partitioner in retag().
af78816 [Josh Rosen] Allow retag() to get classTag implicitly.
d1d54e6 [Josh Rosen] [SPARK-2737] Add retag() method for changing RDDs' ClassTags.