Upgrade calcite in druid #13532

Closed
abhishekagarwal87 opened this issue Dec 8, 2022 · 6 comments

Comments

@abhishekagarwal87
Contributor

abhishekagarwal87 commented Dec 8, 2022

Motivation

We are currently stuck on an older version of Calcite because of the old Guava dependency that comes with Hadoop 2. Until we get rid of Hadoop 2, we cannot upgrade Calcite.

The upgrade would be very helpful because:

  • Recent versions of Calcite contain bug fixes. We have seen some queries fail in the Calcite layer. When we ran these queries with a test patch that uses a newer Calcite, the failures went away.
  • Extending Calcite is easier in later versions. We had to copy a bunch of boilerplate code to add custom syntax; that boilerplate can be removed after the upgrade.
  • Calcite provides hints that Druid developers want to use, but they are not available in the version we currently use.
  • We want to be able to upgrade the Calcite dependency quickly if a security vulnerability is discovered.
  • There is already significant drift between the version we use and the latest Calcite version. Between these versions, some APIs have been deprecated and then removed. The longer we delay, the larger the drift and the more work the upgrade takes.

Proposed Solutions

The major blocker for the upgrade is the old Guava dependency.

Removing Hadoop 2 entirely

We can remove Hadoop 2 entirely and thus rid ourselves of its transitive dependencies. We have a Hadoop 3 distribution profile and will be shipping a Hadoop 3 compatible Druid distribution bundle with 25.0. There is some discussion about it here: https://lists.apache.org/thread/1j5w9dmt1gp8hx31tvrmyomcko4mlp03

However, there are concerns in the community that Hadoop 3 is not a viable alternative. We now have classic batch ingestion as well as SQL-based batch ingestion. We have also added MM-less ingestion on Kubernetes, which lets users run Druid ingestion jobs on common/shared infrastructure. However, MM-less ingestion is still experimental.

Calcite shading

The other option is to shade the Calcite jars. The Calcite dev team is not going to do this, so we would shade the jars ourselves. There is a prototype and it works. I ran into a problem with tests, but it was fixed in Calcite 1.30.0.
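
For readers unfamiliar with shading, here is a minimal sketch of the name mapping that relocation applies. Real shading is done by a build plugin that rewrites class files and the references to them; this snippet only illustrates the effect on fully qualified class names. The `org.apache.druid.shaded.` prefix and the `RelocationSketch` class are assumptions made up for illustration and are not taken from the prototype.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelocationSketch {
    // Hypothetical prefix for relocated packages (an assumption, not from the prototype).
    static final String SHADED_PREFIX = "org.apache.druid.shaded.";

    // Package prefixes to relocate. Guava's root package is com.google.common.
    static final Map<String, String> RELOCATIONS = new LinkedHashMap<>();
    static {
        RELOCATIONS.put("com.google.common.", SHADED_PREFIX + "com.google.common.");
    }

    // Returns the relocated class name, or the input unchanged if no rule matches.
    static String relocate(String className) {
        for (Map.Entry<String, String> rule : RELOCATIONS.entrySet()) {
            if (className.startsWith(rule.getKey())) {
                return rule.getValue() + className.substring(rule.getKey().length());
            }
        }
        return className;
    }

    public static void main(String[] args) {
        // Guava classes bundled with the shaded Calcite jar get a new package name.
        System.out.println(relocate("com.google.common.collect.ImmutableList"));
        // Calcite's own classes keep their names.
        System.out.println(relocate("org.apache.calcite.rel.RelNode"));
    }
}
```

Because the Guava copy bundled with Calcite ends up under a different package name, it no longer collides with the older Guava that Hadoop 2 puts on the classpath.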

Next steps

To proceed with the upgrade, we can go with the shading approach, since a working prototype already exists. We need to figure out where to host the shaded jars. Once that is done, we also need to test SQL thoroughly to avoid regressions.

abhishekagarwal87 changed the title from "Upgrading calcite to latest version" to "Upgrade calcite in druid" on Dec 8, 2022
@gianm
Contributor

gianm commented Dec 9, 2022

Do you know if Hadoop 2 + Guava 19 has been tested? Guava 19 is Calcite's minimum version these days, so we could update to that as a compromise version if it works for Hadoop 2.

I raised a PR here so it can at least run through CI: #13544. That won't exercise all the Hadoop distributions we want to support, though, so we'll need additional testing even if it passes.

@gianm
Contributor

gianm commented Dec 9, 2022

A note about shading: it looks like Apache Beam does this too. Here’s their build file for Calcite: https://github.com/apache/beam/blob/52753a9c854786ad0732af4b8577d1cdcfc66047/vendor/calcite-1_28_0/build.gradle. It relocates Guava (com.google.common) and some other packages.

They do their own releases as well. Here's the latest one: https://mvnrepository.com/artifact/org.apache.beam/beam-vendor-calcite-1_28_0/0.2

@gianm
Contributor

gianm commented Dec 9, 2022

Another question for anyone who has tried this: what went wrong when using the latest Calcite with Guava 16.0.1? From the thread at https://lists.apache.org/thread/oy84c1607hyjhkbop8svtrzlzgj5632q, it seems like this may actually work OK, even though Calcite currently claims a minimum version of 19.

@gianm
Contributor

gianm commented Jan 14, 2023

Relevant action in this sub-thread: https://lists.apache.org/thread/oy84c1607hyjhkbop8svtrzlzgj5632q. Calcite targets a minimum Guava version of 16.0.1 as of https://issues.apache.org/jira/browse/CALCITE-5428. I'm working on integrating this into Druid. So far, I ran into these three things along the way:

https://issues.apache.org/jira/browse/CALCITE-5477
https://issues.apache.org/jira/browse/CALCITE-5478
https://issues.apache.org/jira/browse/CALCITE-5479

Other than those things, there are also a lot of changes in Calcite that require adjustments on the Druid side.

@abhishekagarwal87
Contributor Author

@gianm - there is already a draft PR that you could work off - #12258

@gianm
Contributor

gianm commented Feb 19, 2023

Thanks @abhishekagarwal87. I did indeed start from your PR as a base.
