Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30228][BUILD][2.4] Update zstd-jni to 1.4.4-3 #31645

Closed
wants to merge 2 commits into from

Conversation

seayoun
Copy link

@seayoun seayoun commented Feb 25, 2021

What changes were proposed in this pull request?

Change zstd-jni version to version 1.4.4-3

Why are the changes needed?

This old zstd-jni(tag 1.3.3-2) has probability to read less data when shuffle read.
The ZstdInputStream in zstd-jni(tag 1.3.3-2) maybe return 0 after a read function call, this doesn't meet the standard of InputStream and the InputStream will not return 0 unless len is 0; Spark will use a BufferedInputStream wrapped to ZstdInputStream, when ZstdInputStream read call return 0, BufferedInputStream will consider the 0 as the end of read and exit, this can lead data loss.
zstd-jni issues:
luben/zstd-jni#159
zstd-jni commits:
luben/zstd-jni@7eec558

Does this PR introduce any user-facing change?

How was this patch tested?

@HyukjinKwon
Copy link
Member

cc @dongjoon-hyun FYI

@HyukjinKwon HyukjinKwon changed the title [SPARK-34536]zstd-jni tag 1.3.2-2 lead read less shuffle data [SPARK-30228][BUILD][2.4] Update zstd-jni to 1.4.4-3 Feb 25, 2021
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also run ./dev/test-dependencies.sh --replace-manifest?

@cloud-fan
Copy link
Contributor

Usually, we don't upgrade dependencies in released branches. But this is a correctness bug and I think we can make an exception here.

@cloud-fan
Copy link
Contributor

ok to test

@seayoun
Copy link
Author

seayoun commented Feb 25, 2021

Can you also run ./dev/test-dependencies.sh --replace-manifest?

OK

@cloud-fan
Copy link
Contributor

cc @srowen as well

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40047/

@HyukjinKwon
Copy link
Member

Yeah this one is fine.

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40047/

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is marked as a correctness issue, do you think we can have a test case, @seayoun ?
Also, I'm wondering why 1.4.4-3? The target commit is in 1.3.5-x, too.

@dongjoon-hyun
Copy link
Member

In the JIRA, I found the following, but the reason for the version choice is missing.

So, I think it's nessary to upgrade zstd-jni version to 1.4.4-3 in spark2.4 for spark2.4 has a wide use in production.

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Test build #135466 has finished for PR 31645 at commit 03990d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Feb 25, 2021

It could be hard to reproduce zstd returning 0 bytes before EOF. If we can we'd mostly be testing what zstd tests already - the stream just never returns 0 bytes before done. If it's pretty hard to test, I think it's OK to merge without a specific test.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

dongjoon-hyun pushed a commit that referenced this pull request Feb 25, 2021
### What changes were proposed in this pull request?

Change zstd-jni version to version 1.4.4-3

### Why are the changes needed?

This old zstd-jni(tag 1.3.3-2) has probability to read less data when shuffle read.
The `ZstdInputStream` in zstd-jni(tag 1.3.3-2) maybe return 0 after a read function call, this doesn't meet the standard of `InputStream` and the `InputStream` will not return 0 unless len is 0; Spark will use a BufferedInputStream wrapped to ZstdInputStream, when ZstdInputStream read call return 0, BufferedInputStream will consider the 0 as the end of read and exit, this can lead data loss.
zstd-jni issues:
luben/zstd-jni#159
zstd-jni commits:
luben/zstd-jni@7eec558

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #31645 from seayoun/yuhaiyang_update_zstd_jni.

Authored-by: yuhaiyang <yuhaiyang@yuhaiyangs-MacBook-Pro.local>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

Merged to branch-2.4.

@seayoun
Copy link
Author

seayoun commented Feb 26, 2021

In the JIRA, I found the following, but the reason for the version choice is missing.

So, I think it's nessary to upgrade zstd-jni version to 1.4.4-3 in spark2.4 for spark2.4 has a wide use in production.

Another issue for ZstdInputStream. It happends in 1.4.4-3
luben/zstd-jni#110

@dongjoon-hyun
Copy link
Member

Thanks, @SeeyouN . You meant that another issue is fixed at 1.4.4-3, right? (Instead of happens in 1.4.4-3)

Another issue for ZstdInputStream. It happends in 1.4.4-3

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
Change zstd-jni version to version 1.4.4-3

This old zstd-jni(tag 1.3.3-2) has probability to read less data when shuffle read.
The `ZstdInputStream` in zstd-jni(tag 1.3.3-2) maybe return 0 after a read function call, this doesn't meet the standard of `InputStream` and the `InputStream` will not return 0 unless len is 0; Spark will use a BufferedInputStream wrapped to ZstdInputStream, when ZstdInputStream read call return 0, BufferedInputStream will consider the 0 as the end of read and exit, this can lead data loss.
zstd-jni issues:
luben/zstd-jni#159
zstd-jni commits:
luben/zstd-jni@7eec558

Closes apache#31645 from seayoun/yuhaiyang_update_zstd_jni.

Authored-by: yuhaiyang <yuhaiyang@yuhaiyangs-MacBook-Pro.local>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 96f5137)

RB=3332514
BUG=LIHADOOP-63952
G=spark-reviewers
R=tgudivad,ekrogen
A=ekrogen
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants