
[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing #15550

Closed

Conversation

WeichenXu123 (Contributor) commented Oct 19, 2016

What changes were proposed in this pull request?

  • Fix a bug where RDD zipWithIndex generates wrong results when one partition contains more than 2147483647 records.
  • Fix a bug where RDD zipWithUniqueId generates wrong results when one partition contains more than 2147483647 records.

How was this patch tested?

Tests added.
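
For context: Scala's Iterator.zipWithIndex tracks its counter as an Int, so past 2147483647 elements the index wraps to negative values. Below is a minimal sketch of an overflow-safe, Long-indexed replacement in the spirit of this patch (the helper name zipWithLongIndex is illustrative, not necessarily the name used in the merged code):

```scala
// Sketch: zip an iterator with a Long index starting at startIndex.
// Unlike scala.collection.Iterator#zipWithIndex, the counter is a Long,
// so it keeps counting correctly past 2147483647 elements.
def zipWithLongIndex[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
  require(startIndex >= 0, "startIndex should be >= 0")
  new Iterator[(T, Long)] {
    private[this] var index: Long = startIndex - 1L
    override def hasNext: Boolean = iter.hasNext
    override def next(): (T, Long) = {
      index += 1L
      (iter.next(), index)
    }
  }
}
```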

@@ -64,8 +64,14 @@ class ZippedWithIndexRDD[T: ClassTag](prev: RDD[T]) extends RDD[(T, Long)](prev)

    override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
      val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
-     firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
-       (x._1, split.startIndex + x._2)
+     val parentIter = firstParent[T].iterator(split.prev, context)
Contributor (inline comment):
We should add a line of comment saying we don't use Scala's zipWithIndex, to avoid overflow.

-       (x._1, split.startIndex + x._2)
+     val parentIter = firstParent[T].iterator(split.prev, context)
+     new Iterator[(T, Long)] {
+       var idxAcc: Long = -1L
Contributor (inline comment):

private[this]

Contributor (inline comment):

Also, why don't we initialize this to split.startIndex?

I'd also rename this to just index.

Contributor (inline comment):

Or rather split.startIndex - 1, given the current code for idxAcc.
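
Folding these suggestions together (a private[this] field named index, initialized to split.startIndex - 1), the compute body would look roughly like this sketch:

```scala
override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
  val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
  val parentIter = firstParent[T].iterator(split.prev, context)
  // Don't use Scala's zipWithIndex here: its Int counter overflows once a
  // partition holds more than 2147483647 records.
  new Iterator[(T, Long)] {
    private[this] var index: Long = split.startIndex - 1L
    override def hasNext: Boolean = parentIter.hasNext
    override def next(): (T, Long) = {
      index += 1L
      (parentIter.next(), index)
    }
  }
}
```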

@@ -833,6 +833,30 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
      }
    }

+   test("zipWithIndex with partition size exceeding MaxInt") {
+     val result = sc.parallelize(Seq(1), 1).mapPartitions(
Contributor (inline comment):
Doesn't this test case take forever to run? I think the way you want to do this is to create a helper function def zipWithIndex[T](iterator: Iterator[T], startingOffset: Long): Iterator[(T, Long)], and then just use a large starting offset. Then you test only that helper method, without creating an end-to-end test case that loops over 2 billion elements.

WeichenXu123 (Contributor, Author):
Yeah, I ran the test here; the loop is CPU-bound and very fast, so this test case takes about 3s on my machine. Does it still need to be optimized?

rxin (Contributor) commented Oct 19, 2016:
hm that's about 2.99s longer than I'd like for a unit test ...

WeichenXu123 (Contributor, Author):
All right, I'll update the test case.
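
Following that suggestion, a test against the extracted helper can use a starting offset near Int.MaxValue instead of looping over two billion elements. A sketch, reusing the illustrative zipWithLongIndex helper from above:

```scala
test("Long-indexed zip does not overflow past Int.MaxValue") {
  // Start one below Int.MaxValue so the produced indices cross the Int boundary.
  val start = Int.MaxValue.toLong - 1L
  val zipped = zipWithLongIndex(Iterator(0, 1, 2), start)
  assert(zipped.toArray === Array(
    (0, Int.MaxValue.toLong - 1L),
    (1, Int.MaxValue.toLong),
    (2, Int.MaxValue.toLong + 1L)))
}
```

This runs in microseconds while still exercising the exact boundary where an Int counter would wrap.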

tejasapatil (Contributor):

Have you looked for other places in the codebase that would also produce wrong results by using Scala's zipWithIndex()?

SparkQA commented Oct 19, 2016

Test build #67170 has finished for PR 15550 at commit 18c3f49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WeichenXu123 (Contributor, Author):

@tejasapatil I checked the code that references RDD.zipWithIndex; there are currently 7 usages. Because the current test data doesn't generate big enough partitions, they don't trigger the bug. The referencing code is all correct, so none of it needs updating.

tejasapatil (Contributor):

I meant Scala's zipWithIndex(), not RDD.zipWithIndex.

rxin (Contributor) commented Oct 19, 2016:

The current change LGTM (pending Jenkins). It would be great to check the usage as @tejasapatil said.

SparkQA commented Oct 19, 2016

Test build #67202 has finished for PR 15550 at commit c942084.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WeichenXu123 changed the title from "[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex generating wrong result when one partition contains more than 2147483647 records" to "[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing" on Oct 20, 2016
WeichenXu123 (Contributor, Author):

@tejasapatil I checked the other references to Scala's zipWithIndex and fixed a similar case in RDD.zipWithUniqueId, thanks!
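
zipWithUniqueId has the same hazard: the pre-fix implementation builds each id as i * n + k (k = partition index, n = number of partitions) from iter.zipWithIndex, so the Int index i overflows on partitions with more than 2147483647 records. A sketch of the fixed pattern, again using the illustrative zipWithLongIndex helper (the merged code may name things differently):

```scala
def zipWithUniqueId(): RDD[(T, Long)] = withScope {
  val n = this.partitions.length.toLong
  this.mapPartitionsWithIndex { case (k, iter) =>
    // i is a Long here, so i * n + k no longer wraps around an Int counter.
    zipWithLongIndex(iter, 0L).map { case (item, i) =>
      (item, i * n + k)
    }
  }
}
```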

rxin (Contributor) commented Oct 20, 2016:

Cool - LGTM! (I will merge once Jenkins comes back positive)

SparkQA commented Oct 20, 2016

Test build #67228 has finished for PR 15550 at commit b83a606.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rxin (Contributor) commented Oct 20, 2016:

Merging in master / branch-2.0.

asfgit pushed a commit that referenced this pull request Oct 20, 2016
[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing

## What changes were proposed in this pull request?

- Fix bug of RDD `zipWithIndex` generating wrong result when one partition contains more than 2147483647 records.

- Fix bug of RDD `zipWithUniqueId` generating wrong result when one partition contains more than 2147483647 records.

## How was this patch tested?

test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.

(cherry picked from commit 3975516)
Signed-off-by: Reynold Xin <rxin@databricks.com>
asfgit closed this in 3975516 on Oct 20, 2016
zzcclp added a commit to zzcclp/spark that referenced this pull request Oct 20, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
@WeichenXu123 WeichenXu123 deleted the fix_rdd_zipWithIndex_overflow branch April 24, 2019 21:19