Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32658][CORE] Fix PartitionWriterStream partition length overflow #29474

Closed

Conversation

jiangxb1987
Copy link
Contributor

@jiangxb1987 jiangxb1987 commented Aug 19, 2020

What changes were proposed in this pull request?

The count in PartitionWriterStream should be a long value, instead of int. The issue is introduced by abef84a . When the overflow happens, the shuffle index file would record wrong index of a reduceId, thus lead to FetchFailedException: Stream is corrupted error.

Besides the fix, I also added some debug logs, so in the future it's easier to debug similar issues.

Why are the changes needed?

This is a regression and bug fix.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A Spark user reported this issue when migrating their workload to 3.0. One of the jobs fail deterministically on Spark 3.0 without the patch, and the job succeed after applied the fix.

@jiangxb1987
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Aug 19, 2020

Test build #127626 has finished for PR 29474 at commit dd38f6d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

Copy link
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!! LGTM

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@HyukjinKwon
Copy link
Member

cc @zhengruifeng FYI. This is a blocker

@SparkQA
Copy link

SparkQA commented Aug 19, 2020

Test build #127628 has finished for PR 29474 at commit dd38f6d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Aug 19, 2020

Test build #127641 has finished for PR 29474 at commit dd38f6d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 19, 2020

Test build #127660 has finished for PR 29474 at commit dd38f6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in f793977 Aug 20, 2020
@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Aug 20, 2020
…flow

### What changes were proposed in this pull request?

The `count` in `PartitionWriterStream` should be a long value, instead of int. The issue is introduced by apache/sparkabef84a . When the overflow happens, the shuffle index file would record wrong index of a reduceId, thus lead to `FetchFailedException: Stream is corrupted` error.

Besides the fix, I also added some debug logs, so in the future it's easier to debug similar issues.

### Why are the changes needed?

This is a regression and bug fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A Spark user reported this issue when migrating their workload to 3.0. One of the jobs fail deterministically on Spark 3.0 without the patch, and the job succeed after applied the fix.

Closes #29474 from jiangxb1987/fixPartitionWriteStream.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f793977)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@gatorsmile
Copy link
Member

cc @vanzin @squito @jerryshao who are the major reviewers of the original PR #25007

@mridulm
Copy link
Contributor

mridulm commented Aug 24, 2020

Nice catch @jiangxb1987 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants