[SPARK-35670][BUILD] Upgrade ZSTD-JNI to 1.5.0-2 #32826
Conversation
Can one of the admins verify this patch?
Thank you. BTW, did you have a chance to test this with Apache Parquet/Avro/Kafka first, project by project? Historically, we have hit several incompatibility issues.
BTW, @dchristle, I quickly ran the micro benchmark on the master branch on my Mac before and after. We also need to check the memory usage.
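For reference, the in-tree micro benchmark can be regenerated from a Spark checkout roughly like this (a sketch following the in-tree benchmark convention; double-check the class name and flags against your branch and the developer docs):

```shell
# Write core/benchmarks/ZStandardBenchmark-*-results.txt instead of
# printing to stdout; run once per zstd-jni version under test.
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt \
  "core/test:runMain org.apache.spark.io.ZStandardBenchmark"
```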
@dongjoon-hyun That's a good suggestion. I have three PRs to update Kafka, Parquet, and ORC; they appear to pass their respective CIs. I have less familiarity with Avro's build chain and codebase, so I have not attempted to test it yet.
Thank you for your efforts. BTW, @dchristle, please note that your ORC PR is not about ZSTD-JNI; it covers the native ZSTD library only. I commented on your ORC PR about the difference. For the following, I saw Kafka failures.
No worry~ For Apache Avro, they have a dependency bot. I guess they will catch up soon. Let's wait and see their activity.
In addition, all libraries should be synced inside Apache Spark because Apache Spark uses all of them.
Thank you for your message. The Kafka PR failures seem unrelated to the Zstd change; the tests appear to be flaky, as many other recent PRs are also failing. I looked over the Zstd release notes but did not catch any obvious big changes that could trigger an incompatibility. However, my plan is to wait until the Kafka PR can pass the CI, and shepherd the change through.

Regarding Spark: is it necessary to have all dependencies upgraded to Zstd 1.5.x before merging if the Spark CI/dependency tests appear to pass? For instance, the move to 1.5.0-1 is scheduled for Kafka 3.0 (unless there is a back-port), but I imagine that release will be some time from now.

Regarding this PR: do we have a good understanding of why the benchmark tests fail? I cannot tell if it is actually related to this code change. Thank you for your guidance with this process.
Yes, for ORC it's the native C library and not Java. I have a tangential question for you: does it make sense to use […]? The reason I ask is because it is conceivable that […].
It's just a historical fact. IMO, I believe that we need to replace it with […].
Yes. BTW, please note that your PR is merged to Apache ORC 1.7, which has no release plan yet. The situation is the same for the other communities: Apache Kafka with ZSTD 1.5? Apache Avro with ZSTD 1.5? Apache Parquet with ZSTD 1.5? Apache Spark should embrace those Apache projects together because our customers are able to use them together in a single app.
Yes. So it seems like, in order to get Spark on […]. So far, it seems like there are no build/CI failures due to […]. But I don't have anywhere close to your experience, and your idea that we should only incorporate specific versions of […]. Given this, it seems the PR's […]. I've pushed for […].
@dchristle . I'm a big supporter of ZStandard and have no doubt that we need to upgrade ZSTD-JNI in the future. Your PR will definitely become a part of Apache Spark; it's a matter of timing. Here, I'm describing what we need for Apache Spark: actual verification by testing, not a hunch. Neither of us wants to break Apache Spark 3.2.0, right? As you can see in SPARK-34651, I synced multiple Apache projects on ZSTD-JNI 1.4.9-1 for Apache Spark 3.2.0 because there were incompatibility issues. For your PR, we can proceed in this way. First, let's make sure that ZSTD-JNI 1.5 passes all UTs of Parquet/Kafka/Avro at least. Second, let's merge your Apache Spark PR temporarily for wider Apache Spark community testing. If something broken is found during the Apache Spark 3.2.0 QA period, we can revert it during that period.
I think there is a misunderstanding. My proposal is to backport […]. If getting the change into upcoming dependency minor releases (1.10.x Avro, 2.8.x Kafka, 1.12.x or 1.13.x Parquet) truly does work with no issues, we can synchronize all dependencies in Spark's […].
Well, do you know that the feature freeze of Apache Spark 3.2.0 is July 1st for now? It seems that that's the root cause of the misunderstanding. For the rest of the plan, both of us know that we will have ZStandard 1.5 eventually as a community-blessed version across several Apache projects. There is no argument about that.
Force-pushed from 1e585f0 to ab43055.
Thank you for updating to track the new ZSTD-JNI 1.5.0-2, @dchristle.
BTW, @dchristle . https://github.com/apache/spark/pull/32826/files#r647855915 is added again. :)
And, here is the updated benchmark result.
+1, LGTM (except one comment), @dchristle . Thanks.
As I mentioned before, we will merge this for Apache Spark 3.2.0. However, there is a chance of reverting it from 3.2.0 due to some regressions. Even in that case, we will try this again for Apache Spark 3.3.0 based on your contributions and other Apache projects' releases.
Second, let's merge your Apache Spark PR first temporarily for the wider Apache Spark community testing. If something broken is found during Apache Spark 3.2.0 QA period, we can revert it during that period.
Ur, I'm checking the benchmark result as a final review stage. It seems that there is some regression in the JDK11 environment. I'll check the result on the linux box too. Could you double-check the result of ZStandardBenchmark in your JDK11 environment, @dchristle? You should generate the result twice (first with 1.4.9-1 and second with 1.5.0-2).
```diff
--- a/core/benchmarks/ZStandardBenchmark-jdk11-results.txt
+++ b/core/benchmarks/ZStandardBenchmark-jdk11-results.txt
@@ -6,22 +6,22 @@ OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Mac OS X 11.4
 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
 Benchmark ZStandardCompressionCodec: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 --------------------------------------------------------------------------------------------------------------------------------------
-Compression 10000 times at level 1 without buffer pool 1523 1526 4 0.0 152315.0 1.0X
-Compression 10000 times at level 2 without buffer pool 1227 1229 2 0.0 122734.5 1.2X
-Compression 10000 times at level 3 without buffer pool 1548 1551 4 0.0 154821.8 1.0X
-Compression 10000 times at level 1 with buffer pool 782 793 13 0.0 78221.2 1.9X
-Compression 10000 times at level 2 with buffer pool 1127 1183 79 0.0 112668.4 1.4X
-Compression 10000 times at level 3 with buffer pool 1454 1469 21 0.0 145383.8 1.0X
+Compression 10000 times at level 1 without buffer pool 1451 1455 6 0.0 145071.2 1.0X
+Compression 10000 times at level 2 without buffer pool 447 517 53 0.0 44732.6 3.2X
+Compression 10000 times at level 3 without buffer pool 2287 2314 39 0.0 228662.8 0.6X
+Compression 10000 times at level 1 with buffer pool 1530 1534 6 0.0 153036.3 0.9X
+Compression 10000 times at level 2 with buffer pool 1894 1912 26 0.0 189350.2 0.8X
+Compression 10000 times at level 3 with buffer pool 2150 2218 96 0.0 215042.6 0.7X
 OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Mac OS X 11.4
 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
 Benchmark ZStandardCompressionCodec: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------------------------
-Decompression 10000 times from level 1 without buffer pool 1458 1458 1 0.0 145788.4 1.0X
-Decompression 10000 times from level 2 without buffer pool 1460 1465 7 0.0 145988.5 1.0X
-Decompression 10000 times from level 3 without buffer pool 2223 2261 55 0.0 222258.0 0.7X
-Decompression 10000 times from level 1 with buffer pool 1397 1397 0 0.0 139660.8 1.0X
-Decompression 10000 times from level 2 with buffer pool 1391 1395 5 0.0 139148.7 1.0X
-Decompression 10000 times from level 3 with buffer pool 2249 2315 94 0.0 224883.8 0.6X
+Decompression 10000 times from level 1 without buffer pool 1571 1571 1 0.0 157078.0 1.0X
+Decompression 10000 times from level 2 without buffer pool 1581 1586 7 0.0 158062.5 1.0X
+Decompression 10000 times from level 3 without buffer pool 2439 2514 107 0.0 243850.6 0.6X
+Decompression 10000 times from level 1 with buffer pool 1378 1381 5 0.0 137771.0 1.1X
+Decompression 10000 times from level 2 with buffer pool 1391 1392 2 0.0 139109.7 1.1X
+Decompression 10000 times from level 3 with buffer pool 1940 2106 235 0.0 193981.0 0.8X
```
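As a quick sanity check on tables like the one above, the before/after "Best Time" columns can be compared directly. The helper below is a standalone sketch (not part of the Spark tree), using the decompression-without-buffer-pool rows from the JDK 11 diff:

```python
# Best times (ms) copied from the JDK 11 results above:
# 1.4.9-1 (before) vs 1.5.0-2 (after), decompression without buffer pool.
before = {"level 1": 1458, "level 2": 1460, "level 3": 2223}
after = {"level 1": 1571, "level 2": 1581, "level 3": 2439}

def regression_pct(old_ms: float, new_ms: float) -> float:
    """Percentage change in runtime; positive means the new version is slower."""
    return round((new_ms - old_ms) / old_ms * 100, 1)

for level in before:
    print(f"{level}: {regression_pct(before[level], after[level]):+.1f}%")
```

On these numbers each level decompresses roughly 8-10% slower on this machine, which is the regression being discussed.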
Undo unintended whitespace changes.
For the linux box, it looks reasonable. In this case, we need to add the benchmark result to the PR. Could you follow the instructions at https://spark.apache.org/developer-tools.html
I updated the above result with a clean-build result, @dchristle.
BTW, if you get the benchmark result via GitHub Action, please share the GitHub Action links with us. |
I ran the benchmark on your branches.
Here is the summary.
I made a PR to your branch, @dchristle. Please review and merge it if you think it's okay. And let's finish this PR.
Add benchmark result
Excellent! I had triggered the full benchmark suite, so it took much longer than necessary. Thanks for updating this PR.
+1, LGTM (Pending CIs). Thank you, @dchristle .
BTW, I saw your result. You can use your result here too, @dchristle. I just wanted to help you run the jobs.
Let's use your commit. I didn't know how to specify only the […].
GitHub Action passed. Thank you for your contribution and patience. We are very careful because this is very important for Apache Spark 3.2.0, @dchristle. Merged to master for Apache Spark 3.2.0.
I added you to the Apache Spark contributor group in JIRA and assigned SPARK-35670 to you.
Thank you for your help with this. I'm very glad to contribute. I've already run some pre-production workflows using the […].
Hi @dchristle and @dongjoon-hyun, […]
@mixermt . It's enabled by default since Apache Spark 3.2.0.
Although the ZSTD-JNI implementation has improved a lot recently, you will hit high memory usage if you turn off the configuration.
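For readers wondering which configuration is meant here: assuming this refers to Spark's zstd buffer-pool flag (spark.io.compression.zstd.bufferPool.enabled, enabled by default as of 3.2.0 as far as I can tell), toggling it would look roughly like:

```shell
# Illustrative only; "your-app.jar" is a placeholder. Disabling the shared
# buffer pool makes every compression stream allocate fresh buffers, which
# is the high-memory behavior described above.
spark-submit \
  --conf spark.io.compression.codec=zstd \
  --conf spark.io.compression.zstd.bufferPool.enabled=false \
  your-app.jar
```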
Thanks @dongjoon-hyun for your quick reply 🙏 |
@mixermt . It seems that you are in the wrong community. You had better ask the Apache Parquet community about an Apache Parquet configuration issue instead of here. :)
What changes were proposed in this pull request?
This PR aims to upgrade zstd-jni to 1.5.0-2, which uses zstd version 1.5.0.

Why are the changes needed?
Major improvements to Zstd support are targeted for the upcoming 3.2.0 release of Spark. Zstd 1.5.0 introduces significant compression (+25% to 140%) and decompression (~15%) speed improvements in benchmarks described in more detail on the releases page:
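For context, the change itself is essentially a one-line version bump of the zstd-jni dependency in Spark's build; an illustrative pom.xml fragment (the exact property layout may differ between branches):

```xml
<!-- zstd-jni bundles the native zstd library; artifact 1.5.0-2 wraps zstd 1.5.0 -->
<dependency>
  <groupId>com.github.luben</groupId>
  <artifactId>zstd-jni</artifactId>
  <version>1.5.0-2</version>
</dependency>
```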
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The build passes the build tests, but the benchmark tests seem flaky. I am unsure if this change is responsible. The error is:
https://github.com/dchristle/spark/runs/2776123749?check_suite_focus=true
cc: @dongjoon-hyun