
[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing #15550

Closed

Conversation

WeichenXu123 (Contributor) commented Oct 19, 2016

What changes were proposed in this pull request?

  • Fix a bug where RDD zipWithIndex generates wrong results when one partition contains more than 2147483647 records.
  • Fix a bug where RDD zipWithUniqueId generates wrong results when one partition contains more than 2147483647 records.

How was this patch tested?

Tests added.
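
For context: Scala's Iterator.zipWithIndex tracks its counter as an Int, so past 2147483647 elements the index wraps to negative values. Below is a minimal sketch of an overflow-safe, Long-indexed replacement in the spirit of this patch (the helper name zipWithLongIndex is illustrative, not necessarily the name used in the merged code):

```scala
// Sketch: zip an iterator with a Long index starting at startIndex.
// Unlike scala.collection.Iterator#zipWithIndex, the counter is a Long,
// so it keeps counting correctly past 2147483647 elements.
def zipWithLongIndex[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
  require(startIndex >= 0, "startIndex should be >= 0")
  new Iterator[(T, Long)] {
    private[this] var index: Long = startIndex - 1L
    override def hasNext: Boolean = iter.hasNext
    override def next(): (T, Long) = {
      index += 1L
      (iter.next(), index)
    }
  }
}
```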

@@ -64,8 +64,14 @@ class ZippedWithIndexRDD[T: ClassTag](prev: RDD[T]) extends RDD[(T, Long)](prev)

    override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
      val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
-     firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
-       (x._1, split.startIndex + x._2)
+     val parentIter = firstParent[T].iterator(split.prev, context)
Contributor (inline comment):
We should add a line of comment saying we don't use Scala's zipWithIndex, to avoid overflow.

-       (x._1, split.startIndex + x._2)
+     val parentIter = firstParent[T].iterator(split.prev, context)
+     new Iterator[(T, Long)] {
+       var idxAcc: Long = -1L
Contributor (inline comment):

private[this]

Contributor (inline comment):

Also, why don't we initialize this to split.startIndex?

I'd also rename this to just index.

Contributor (inline comment):

Or rather split.startIndex - 1, given the current code for idxAcc.
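
Folding these suggestions together (a private[this] field named index, initialized to split.startIndex - 1), the compute body would look roughly like this sketch:

```scala
override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
  val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
  val parentIter = firstParent[T].iterator(split.prev, context)
  // Don't use Scala's zipWithIndex here: its Int counter overflows once a
  // partition holds more than 2147483647 records.
  new Iterator[(T, Long)] {
    private[this] var index: Long = split.startIndex - 1L
    override def hasNext: Boolean = parentIter.hasNext
    override def next(): (T, Long) = {
      index += 1L
      (parentIter.next(), index)
    }
  }
}
```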

@@ -833,6 +833,30 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
      }
    }

+   test("zipWithIndex with partition size exceeding MaxInt") {
+     val result = sc.parallelize(Seq(1), 1).mapPartitions(
Contributor (inline comment):
Doesn't this test case take forever to run? I think the way you want to do this is to create a helper function def zipWithIndex[T](iterator: Iterator[T], startingOffset: Long): Iterator[(T, Long)], and then just use a large starting offset. Then you test only that helper method, without creating an end-to-end test case that loops over 2 billion elements.

WeichenXu123 (Contributor, Author):
Yeah, I ran the test here; the loop is CPU-bound and very fast, so this test case takes about 3s on my machine. Does it still need to be optimized?

rxin (Contributor) commented Oct 19, 2016:
hm that's about 2.99s longer than I'd like for a unit test ...

WeichenXu123 (Contributor, Author):
All right, I'll update the test case.
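
Following that suggestion, a test against the extracted helper can use a starting offset near Int.MaxValue instead of looping over two billion elements. A sketch, reusing the illustrative zipWithLongIndex helper from above:

```scala
test("Long-indexed zip does not overflow past Int.MaxValue") {
  // Start one below Int.MaxValue so the produced indices cross the Int boundary.
  val start = Int.MaxValue.toLong - 1L
  val zipped = zipWithLongIndex(Iterator(0, 1, 2), start)
  assert(zipped.toArray === Array(
    (0, Int.MaxValue.toLong - 1L),
    (1, Int.MaxValue.toLong),
    (2, Int.MaxValue.toLong + 1L)))
}
```

This runs in microseconds while still exercising the exact boundary where an Int counter would wrap.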

tejasapatil (Contributor):

Have you looked for other places in the codebase that would also produce wrong results by using Scala's zipWithIndex()?

SparkQA commented Oct 19, 2016

Test build #67170 has finished for PR 15550 at commit 18c3f49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WeichenXu123 (Contributor, Author):

@tejasapatil I checked the code that references RDD.zipWithIndex; there are currently 7 usages. Because the current test data doesn't generate big enough partitions, they don't trigger the bug. The referencing code is all correct, so none of it needs updating.

tejasapatil (Contributor):

I meant Scala's zipWithIndex(), not RDD.zipWithIndex.

rxin (Contributor) commented Oct 19, 2016:

The current change LGTM (pending Jenkins). It would be great to check the usage as @tejasapatil said.

SparkQA commented Oct 19, 2016

Test build #67202 has finished for PR 15550 at commit c942084.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WeichenXu123 changed the title from "[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex generating wrong result when one partition contains more than 2147483647 records" to "[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing" on Oct 20, 2016
WeichenXu123 (Contributor, Author):

@tejasapatil I checked the other references to Scala's zipWithIndex and fixed a similar case in RDD.zipWithUniqueId, thanks!
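
zipWithUniqueId has the same hazard: the pre-fix implementation builds each id as i * n + k (k = partition index, n = number of partitions) from iter.zipWithIndex, so the Int index i overflows on partitions with more than 2147483647 records. A sketch of the fixed pattern, again using the illustrative zipWithLongIndex helper (the merged code may name things differently):

```scala
def zipWithUniqueId(): RDD[(T, Long)] = withScope {
  val n = this.partitions.length.toLong
  this.mapPartitionsWithIndex { case (k, iter) =>
    // i is a Long here, so i * n + k no longer wraps around an Int counter.
    zipWithLongIndex(iter, 0L).map { case (item, i) =>
      (item, i * n + k)
    }
  }
}
```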

rxin (Contributor) commented Oct 20, 2016:

Cool - LGTM! (I will merge once Jenkins comes back positive)

SparkQA commented Oct 20, 2016

Test build #67228 has finished for PR 15550 at commit b83a606.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rxin (Contributor) commented Oct 20, 2016:

Merging in master / branch-2.0.

asfgit pushed a commit that referenced this pull request Oct 20, 2016
[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing

## What changes were proposed in this pull request?

- Fix bug of RDD `zipWithIndex` generating wrong result when one partition contains more than 2147483647 records.

- Fix bug of RDD `zipWithUniqueId` generating wrong result when one partition contains more than 2147483647 records.

## How was this patch tested?

test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.

(cherry picked from commit 3975516)
Signed-off-by: Reynold Xin <rxin@databricks.com>
asfgit closed this in 3975516 on Oct 20, 2016
zzcclp added a commit to zzcclp/spark that referenced this pull request Oct 20, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
@WeichenXu123 WeichenXu123 deleted the fix_rdd_zipWithIndex_overflow branch April 24, 2019 21:19