[BEAM-11626] Upgrading Guava to 30.1-jre while keeping 25.1-jre for Hadoop/Cassandra modules #13804
Codecov Report
@@ Coverage Diff @@
## master #13804 +/- ##
==========================================
- Coverage 82.97% 82.95% -0.03%
==========================================
Files 469 469
Lines 58343 58343
==========================================
- Hits 48411 48399 -12
- Misses 9932 9944 +12
Run SQL Postcommit
Run Spark ValidatesRunner
Run Dataflow ValidatesRunner
Run Java HadoopFormatIO Performance Test
Run Java PostCommit
Run Java_Examples_Dataflow PreCommit
34 successful and 1 skipped checks at 15d7a3d.
Run Java PreCommit
I guess this is still a draft, but LGTM while I'm here :)
@ibzib Thanks. I see all checks are green. Moved this PR out of draft.
R: @ibzib @kennknowles @aaltay For #13740 (comment) by @aaltay
Let me know if I need to add more information somewhere.
We can add a note to the release notes about this. I think a more sustainable solution would be to add a page about the Java SDK, its dependencies, and how to address common dependency-related issues. (@kennknowles and @tysonjh - would such a page be helpful to users?)
I think so. There isn't much developer content that would cover this area in the existing website (e.g. https://beam.apache.org/documentation/sdks/java/). It could be a place to host a FAQ or best practices that could include this information, recommending using BOMs where possible, etc. A dependency issue with Guava is sadly a pretty well discussed problem throughout the web, so developers are pretty familiar with the steps for resolution.
Got it. It is your call as you are all more familiar with Java specific concerns.
Run Python PreCommit
@aaltay It's ready for merge; 34 successful checks, including "Java PostCommit".
Yes, I also observe the same. For Beam users, this is just a normal version upgrade of the transitive Guava dependency. I've added an item to the "Breaking Changes" section of CHANGES.md to make it easy to identify when this change happened.
This is probably the best PR description I have ever read. Thank you!
I agree with @suztomo that generally users know about the issues with Guava. While technically a breaking change (as many dep upgrades are), I think this will improve things greatly for users and will fix potentially surprising bugs.
Sorry to hijack this PR, but I wanted to ask @suztomo or someone else to take care of upgrading our vendored gRPC. It is probably a good idea because of the security issue mentioned in https://issues.apache.org/jira/browse/BEAM-11227 Can someone please take a look?
@iemejia Let me create a PR to see what breaks. (I don't know how it's released though)
Thanks @suztomo! If you can get it upgraded and tested I can kick off a release. I don't recall exactly, but I think I can figure it out and document it.
@kennknowles I think I found the document. https://github.com/apache/beam/tree/master/vendor Now the question is how we confirm the CVE is addressed: https://issues.apache.org/jira/browse/BEAM-11227?focusedCommentId=17287496&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17287496 (let's continue comments there)
This PR upgrades the non-vendored Guava version to the latest 30.1-jre, while keeping the version 25.1-jre for certain modules (Hadoop and Cassandra-related) that require the old version of Guava.
Why do I want the latest Guava?
When Beam publishes a recommended version of Guava for Dataflow users (#13737, WIP), I want the recommended version in line with the one in the GCP Libraries BOM (with "-jre" suffix). This is because Google Cloud client libraries are built and tested with the newer version of Guava. I want Beam's Dataflow and Google Cloud Platform modules to be built, tested, and used with the same version of Guava as much as possible.
If we don't do this PR, we would end up in a situation where the GCP Libraries BOM recommends Guava 30 and Beam's GCP BOM recommends Guava 25.
What is the problem with Guava 25?
When a library touches classes or methods that only exist in the newer version of Guava, it fails at runtime with NoClassDefFoundError or NoSuchMethodError. For example, gcsio uses Uninterruptibles.sleepUninterruptibly(java.time.Duration), and Linkage Checker detects the usage. The method only exists in Guava 28 or higher. This might not be a problem for Dataflow-only users for now, but it may cause problems in other use cases of the library. Therefore, I want to recommend the newer version of Guava to GCP users.
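As a rough illustration of this kind of missing-method problem, the following sketch probes at runtime whether a given public method is present on the classpath. It uses only JDK reflection; the class name MethodPresence and its helper are hypothetical, and this is not how Linkage Checker works (that tool analyzes class files statically).

```java
// Hypothetical helper, JDK only: reports whether a public method is present
// on the current classpath without initializing the target class.
final class MethodPresence {
    static boolean hasMethod(String className, String methodName, Class<?>... parameterTypes) {
        try {
            Class<?> type = Class.forName(className, false, MethodPresence.class.getClassLoader());
            type.getMethod(methodName, parameterTypes); // throws if no such public method
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            // Class missing, method missing, or class could not be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        // Prints false when Guava is absent or older than the version that added this overload.
        boolean present = hasMethod(
            "com.google.common.util.concurrent.Uninterruptibles",
            "sleepUninterruptibly",
            java.time.Duration.class);
        System.out.println("sleepUninterruptibly(Duration) present: " + present);
    }
}
```

Running the main method with different Guava versions on the classpath should show the difference the paragraph describes.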
Problem with newer Guava version in Hadoop/Cassandra
If I naively upgrade the Guava version to 30.1-jre, the tests fail with NoSuchMethodError for Futures.addCallback and NoSuchFieldError for DIGIT (CharMatcher). Details are in BEAM-11626.
This PR fixes the problem by keeping the Guava version lower for the Hadoop/Cassandra-related modules.
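When chasing errors like these, it can help to confirm which jar a class was actually loaded from. The following is a minimal sketch using only the JDK; the class name ClassOrigin and the placeholder strings it returns are hypothetical and are not part of Beam or Guava.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class was loaded from.
final class ClassOrigin {
    /** Returns the code source location of the class, or a placeholder if none is recorded. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)"; // JDK core classes have no code source
        }
        return source.getLocation().toString();
    }

    /** Looks the class up by name first, so a missing class is reported instead of thrown. */
    static String locationOf(String className) {
        try {
            return locationOf(Class.forName(className, false, ClassOrigin.class.getClassLoader()));
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // With Guava on the classpath, this prints the jar that supplies CharMatcher.
        System.out.println(locationOf("com.google.common.base.CharMatcher"));
    }
}
```

If the printed jar is not the Guava version the module was built against, a version conflict of the kind described above is a likely cause.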
Where is the Guava dependency declared?
Several Gradle modules declare a dependency on the guava variable. Other than tests, the 3 modules declaring the Guava dependency are sdks/java/io/kinesis, sdks/java/io/google-cloud-platform, and sdks/java/extensions/sql/zetasql.
The sdks/java/io/kinesis module has com.amazonaws:amazon-kinesis-client:1.13.0, built with Guava 26.0-jre, and com.amazonaws:amazon-kinesis-producer:0.14.1, built with Guava 24.1.1-jre.
There is a conflict between org.apache.hadoop:hadoop-yarn-common:2.10.1 (Yarn's WebApp class) and Guava 30. The conflict already exists in Guava 29. Therefore, existing Dataflow Zetasql users will not see a problem with Guava 30.
The sdks/java/maven-archetypes/examples module is a tricky one. I want Hadoop/Cassandra users to use Guava 25.1 and others to use Guava 30.
What's the impact to Beam's Cassandra / Hadoop users?
There's no impact to the Beam Cassandra and Hadoop artifacts. None of the Maven artifacts org.apache.beam:beam-sdks-java-io-hadoop-format:2.27.0, org.apache.beam:beam-sdks-java-io-cassandra:2.27.0, or org.apache.beam:beam-sdks-java-io-hadoop-file-system:2.27.0 declares a Guava dependency.
Instruction for Hadoop / Cassandra Beam users
If Beam Cassandra / Hadoop users use Beam with beam-sdks-java-io-kinesis, beam-sdks-java-io-google-cloud-platform, or beam-sdks-java-extensions-sql-zetasql, they need to pin the Guava version to 25.1-jre. They can use <dependencyManagement> for Maven and force for Gradle.
Linkage Check
https://gist.github.com/suztomo/beb9033d39da1545d117750c0602ae69
No new linkage errors for the default list. There is one error in Yarn's WebApp class (see above), which is a provided dependency of hadoop-client.
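The pinning rule from the instruction for Hadoop / Cassandra users can be sketched as a small check over a project's Beam artifact IDs. This is only an illustration: the class GuavaPinRule is hypothetical, and treating exactly the three Hadoop/Cassandra artifacts named earlier as the "Hadoop / Cassandra" set is an assumption.

```java
import java.util.Set;

// Hypothetical sketch of the pinning rule described in this PR.
final class GuavaPinRule {
    // Assumption: these three artifacts stand in for "Hadoop / Cassandra users".
    static final Set<String> HADOOP_OR_CASSANDRA = Set.of(
        "beam-sdks-java-io-hadoop-format",
        "beam-sdks-java-io-hadoop-file-system",
        "beam-sdks-java-io-cassandra");

    // The modules named in this PR as declaring the Guava dependency.
    static final Set<String> DECLARES_GUAVA = Set.of(
        "beam-sdks-java-io-kinesis",
        "beam-sdks-java-io-google-cloud-platform",
        "beam-sdks-java-extensions-sql-zetasql");

    /** True when the project mixes both groups and should pin Guava to 25.1-jre. */
    static boolean shouldPinGuava25(Set<String> artifactIds) {
        boolean usesHadoopOrCassandra = artifactIds.stream().anyMatch(HADOOP_OR_CASSANDRA::contains);
        boolean pullsNewGuava = artifactIds.stream().anyMatch(DECLARES_GUAVA::contains);
        return usesHadoopOrCassandra && pullsNewGuava;
    }
}
```

For example, a project using both beam-sdks-java-io-cassandra and beam-sdks-java-io-kinesis would need the pin, while a project using only one of them would not.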