Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-11626] Upgrading Guava to 30.1-jre while keeping 25.1-jre for Hadoop/Cassandra modules #13804

Merged
merged 9 commits into from Feb 17, 2021

Conversation

suztomo
Copy link
Contributor

@suztomo suztomo commented Jan 25, 2021

This PR upgrades the non-vendored Guava version to the latest 30.1-jre, while keeping the version 25.1-jre for certain modules (Hadoop and Cassandra-related) that require the old version of Guava.

Why do I want the latest Guava?

When Beam publishes a recommended version of Guava for Dataflow users (#13737, WIP), I want the recommended version in line with the one in the GCP Libraries BOM (with "-jre" suffix). This is because Google Cloud client libraries are built and tested with the newer version of Guava. I want Beam's Dataflow and Google Cloud Platform modules to be built, tested, and used with the same version of Guava as much as possible.

If we don't do this PR, we would end up a situation where the GCP Libraries BOM recommends to use Guava 30 and Beam's GCP BOM recommends Guava 25.

What is the problem with Guava 25?

When a library touch classes or methods that only exist in the newer version of Guava, it fails with NoClassDefFoundError or NoSuchMethodError. For example, gcsio uses Uninterruptibles.sleepUninterruptibly(java.time.Duration) in it and Linkage Checker detects the usage:

(com.google.guava:guava:25.1-jre) com.google.common.util.concurrent.Uninterruptibles's method sleepUninterruptibly(java.time.Duration) is not found;
  referenced by 3 class files
    com.google.cloud.hadoop.gcsio.cooplock.CoopLockOperationDao (com.google.cloud.bigdataoss:gcsio:2.1.6)
    com.google.cloud.hadoop.gcsio.cooplock.CoopLockRecordsDao (com.google.cloud.bigdataoss:gcsio:2.1.6)
    com.google.cloud.hadoop.gcsio.testing.InMemoryObjectEntry (com.google.cloud.bigdataoss:gcsio:2.1.6)

The method only exists in Guava 28 or higher. This might not be a problem for Dataflow-only users for now, but this may cause other use cases of the library. Therefore, I want to recommend the newer version of Guava to GCP users.

Problem with newer Guava version in Hadoop/Cassandra

If I naively upgrade the Guava version to 30.1-jre, the tests failed with NoSuchMethodError for Futures.addCallback and
NoSuchFieldError for DIGIT (CharMatcher). Details are in BEAM-11626.

This PR fixes the problem by keeping the Guava version lower for the Hadoop/Cassandra-related modules.

Where is the Guava dependency declared?

The following Gradle modules declare dependency to the guava variable:

suztomo-macbookpro44% find . -name 'build.gradle' |xargs grep 'library.java.guava'
./sdks/java/core/build.gradle:  shadowTest library.java.guava_testlib
./sdks/java/io/kinesis/build.gradle:  compile library.java.guava
./sdks/java/io/kinesis/build.gradle:  testCompile library.java.guava_testlib
./sdks/java/io/amazon-web-services2/build.gradle:  testCompile library.java.guava_testlib
./sdks/java/io/google-cloud-platform/build.gradle:  compile library.java.guava
./sdks/java/io/contextualtextio/build.gradle:    testCompile library.java.guava_testlib
./sdks/java/extensions/sql/zetasql/build.gradle:  compile library.java.guava
./sdks/java/maven-archetypes/examples/build.gradle:    'guava.version': dependencies.create(project.library.java.guava).getVersion(),
./runners/google-cloud-dataflow-java/build.gradle:  testCompile library.java.guava_testlib

Other than tests, the 3 modules declaring the Guava dependencies are sdks/java/io/kinesis, sdks/java/io/google-cloud-platform, and sdks/java/extensions/sql/zetasql.

  • sdks/java/io/kinesis module has
    • com.amazonaws:amazon-kinesis-client:1.13.0 built with Guava 26.0-jre
    • com.amazonaws:amazon-kinesis-producer:0.14.1 built with Guava 24.1.1-jre
    • Linkage Checker found no new linkage errors for beam-sdks-java-io-kinesis (link).
  • Google Cloud libraries are built with the latest Guava
  • The zetasql client 2020.10.1 is built with Guava 29.
    • Linkage Checker detected a potential conflict between org.apache.hadoop:hadoop-yarn-common:2.10.1 (Yarn's WebApp class) and Guava 30. The conflict already exists in Guava 29. Therefore, existing Dataflow Zetasql users will not see a problem with Guava 30.

The sdks/java/maven-archetypes/examples module is tricky one. I want Hadoop/Cassandra users to use Guava 25.1 and others to use Guava 30.

What's the impact to Beam's Cassandra / Hadoop users?

There's no impact to the Beam Cassandra and Hadoop artifacts. The Maven artifact org.apache.beam:beam-sdks-java-io-hadoop-format:2.27.0, org.apache.beam:beam-sdks-java-io-cassandra:2.27.0, or org.apache.beam:beam-sdks-java-io-hadoop-file-system:2.27.0
does not declare Guava dependency.

Instruction for Hadoop / Cassandra Beam users

If Beam Cassandra / Hadoop users use Beam with beam-sdks-java-io-kinesis, beam-sdks-java-io-google-cloud-platform, or beam-sdks-java-extensions-sql-zetasql, then the users need to pin Guava version to 25.1-jre. They can use <dependencyManagement> for Maven and force for Gradle.

Linkage Check

https://gist.github.com/suztomo/beb9033d39da1545d117750c0602ae69

No new linkage errors for the default list. There is one error in Yarn's WebApp class (see above), which is a provided dependency of hadoop-client.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@codecov
Copy link

codecov bot commented Jan 25, 2021

Codecov Report

Merging #13804 (39c7369) into master (3930d12) will decrease coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #13804      +/-   ##
==========================================
- Coverage   82.97%   82.95%   -0.03%     
==========================================
  Files         469      469              
  Lines       58343    58343              
==========================================
- Hits        48411    48399      -12     
- Misses       9932     9944      +12     
Impacted Files Coverage Δ
...hon/apache_beam/runners/direct/test_stream_impl.py 91.91% <0.00%> (-2.21%) ⬇️
...ks/python/apache_beam/runners/worker/data_plane.py 89.52% <0.00%> (-1.80%) ⬇️
sdks/python/apache_beam/io/iobase.py 84.55% <0.00%> (-0.27%) ⬇️
sdks/python/apache_beam/transforms/util.py 95.66% <0.00%> (-0.18%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py 93.86% <0.00%> (-0.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3930d12...39c7369. Read the comment docs.

@suztomo suztomo marked this pull request as draft January 25, 2021 16:55
@suztomo
Copy link
Contributor Author

suztomo commented Jan 25, 2021

Run SQL Postcommit

@suztomo
Copy link
Contributor Author

suztomo commented Jan 25, 2021

Run Spark ValidatesRunner

@suztomo
Copy link
Contributor Author

suztomo commented Jan 25, 2021

Run Dataflow ValidatesRunner

@suztomo
Copy link
Contributor Author

suztomo commented Jan 25, 2021

Run Java HadoopFormatIO Performance Test

@suztomo
Copy link
Contributor Author

suztomo commented Jan 25, 2021

Run Java PostCommit

@suztomo
Copy link
Contributor Author

suztomo commented Jan 25, 2021

Run Java_Examples_Dataflow PreCommit

@suztomo suztomo changed the title [BEAM-11626] Configuration.testRuntime to have Guava 25.1 for Hadoop modules [BEAM-11626] Upgrading Guava to 30.1-jre while keeping 25.1-jre for Hadoop/Cassandra modules Jan 25, 2021
@suztomo
Copy link
Contributor Author

suztomo commented Jan 25, 2021

34 successful and 1 skipped checks at 15d7a3d.

@suztomo
Copy link
Contributor Author

suztomo commented Jan 26, 2021

Run Java PreCommit

@suztomo
Copy link
Contributor Author

suztomo commented Jan 26, 2021

Run SQL Postcommit

@suztomo
Copy link
Contributor Author

suztomo commented Jan 26, 2021

Run Spark ValidatesRunner

@suztomo
Copy link
Contributor Author

suztomo commented Jan 26, 2021

Run Dataflow ValidatesRunner

@suztomo
Copy link
Contributor Author

suztomo commented Jan 26, 2021

Run Java HadoopFormatIO Performance Test

@suztomo
Copy link
Contributor Author

suztomo commented Jan 26, 2021

Run Java PostCommit

@suztomo
Copy link
Contributor Author

suztomo commented Jan 26, 2021

Run SQL Postcommit

1 similar comment
@suztomo
Copy link
Contributor Author

suztomo commented Jan 26, 2021

Run SQL Postcommit

@ibzib
Copy link
Contributor

ibzib commented Jan 27, 2021

Run SQL Postcommit

Copy link
Contributor

@ibzib ibzib left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is still a draft, but LGTM while I'm here :)

@suztomo suztomo marked this pull request as ready for review January 27, 2021 01:29
@suztomo
Copy link
Contributor Author

suztomo commented Jan 27, 2021

@ibzib Thanks. I see all checks are green. Moved this PR out of draft.

@suztomo
Copy link
Contributor Author

suztomo commented Jan 27, 2021

R: @ibzib @kennknowles @aaltay

For #13740 (comment) by @aaltay

Ack. I think this would be an undocumented hurdle for the impacted users. I am not sure what is the best course of action. Hopefully @kennknowles would have a recommendation.

Let me know if I need to add more information somewhere.

@aaltay
Copy link
Member

aaltay commented Feb 3, 2021

R: @ibzib @kennknowles @aaltay

For #13740 (comment) by @aaltay

Ack. I think this would be an undocumented hurdle for the impacted users. I am not sure what is the best course of action. Hopefully @kennknowles would have a recommendation.

Let me know if I need to add more information somewhere.

We can add a note to the release note about this.

I think a more sustainable solution would be to add a page java sdk, its dependencies, and addressing common issues related to dependencies. (@kennknowles and @tysonjh - would such a page be helpful to users?)

@tysonjh
Copy link
Contributor

tysonjh commented Feb 12, 2021

R: @ibzib @kennknowles @aaltay
For #13740 (comment) by @aaltay

Ack. I think this would be an undocumented hurdle for the impacted users. I am not sure what is the best course of action. Hopefully @kennknowles would have a recommendation.

Let me know if I need to add more information somewhere.

We can add a note to the release note about this.

I think a more sustainable solution would be to add a page java sdk, its dependencies, and addressing common issues related to dependencies. (@kennknowles and @tysonjh - would such a page be helpful to users?)

I think so. There isn't much developer content that would cover this area in the existing website (e.g. https://beam.apache.org/documentation/sdks/java/). It could be a place to host a FAQ or best practices that could include this information, recommending using BOMs where possible, etc.

A dependency issue with Guava is sadly a pretty well discussed problem throughout the web so developers are pretty familiar with the steps for resolution.

@suztomo
Copy link
Contributor Author

suztomo commented Feb 12, 2021

Run SQL Postcommit

@suztomo
Copy link
Contributor Author

suztomo commented Feb 12, 2021

Run Spark ValidatesRunner

@suztomo
Copy link
Contributor Author

suztomo commented Feb 12, 2021

Run Dataflow ValidatesRunner

@suztomo
Copy link
Contributor Author

suztomo commented Feb 12, 2021

Run Java HadoopFormatIO Performance Test

@aaltay
Copy link
Member

aaltay commented Feb 12, 2021

R: @ibzib @kennknowles @aaltay
For #13740 (comment) by @aaltay

Ack. I think this would be an undocumented hurdle for the impacted users. I am not sure what is the best course of action. Hopefully @kennknowles would have a recommendation.

Let me know if I need to add more information somewhere.

We can add a note to the release note about this.
I think a more sustainable solution would be to add a page java sdk, its dependencies, and addressing common issues related to dependencies. (@kennknowles and @tysonjh - would such a page be helpful to users?)

I think so. There isn't much developer content that would cover this area in the existing website (e.g. beam.apache.org/documentation/sdks/java). It could be a place to host a FAQ or best practices that could include this information, recommending using BOMs where possible, etc.

A dependency issue with Guava is sadly a pretty well discussed problem throughout the web so developers are pretty familiar with the steps for resolution.

Got it. It is your call as you are all more familiar with Java specific concerns.

@suztomo
Copy link
Contributor Author

suztomo commented Feb 16, 2021

Run SQL Postcommit

@suztomo
Copy link
Contributor Author

suztomo commented Feb 16, 2021

Run Java PostCommit

@suztomo
Copy link
Contributor Author

suztomo commented Feb 16, 2021

Run Spark ValidatesRunner

@suztomo
Copy link
Contributor Author

suztomo commented Feb 16, 2021

Run Dataflow ValidatesRunner

@suztomo
Copy link
Contributor Author

suztomo commented Feb 16, 2021

Run Java HadoopFormatIO Performance Test

@suztomo
Copy link
Contributor Author

suztomo commented Feb 16, 2021

Run Python PreCommit

@suztomo
Copy link
Contributor Author

suztomo commented Feb 17, 2021

@aaltay It's ready for merge; 34 successful checks, including "Java PostCommit".

so developers are pretty familiar with the steps for resolution.

Yes, I also observe the same. For Beam users, this is just a normal version upgrade of transitive Guava dependency. I've added an item into the "Breaking Changes" section of CHANGES.md to make it easy to identify when this change happened.

Copy link
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the best PR description I have ever read. Thank you!

@kennknowles kennknowles merged commit 8f87957 into apache:master Feb 17, 2021
@kennknowles
Copy link
Member

I agree with @suztomo that generally users know about the issues with Guava. While technically a breaking change (as many dep upgrades are) I think this will improve things greatly for users and will fix potentially surprising bugs.

@iemejia
Copy link
Member

iemejia commented Feb 19, 2021

Sorry to hijack this PR but I wanted to ask @suztomo or someone else to take care of upgrading our vendored gRPC. It is probably a good idea because of the security issue mentioned in https://issues.apache.org/jira/browse/BEAM-11227 Can someone please take a look

@suztomo
Copy link
Contributor Author

suztomo commented Feb 19, 2021

@iemejia Let me create a PR to see what breaks. (I don't know how it's released though)

@kennknowles
Copy link
Member

Thanks @suztomo! If you can get it upgraded and tested I can kick off a release. I don't recall exactly, but I think I can figure it out and document it.

@suztomo
Copy link
Contributor Author

suztomo commented Feb 22, 2021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants