
Integrate the catalog with the Calcite planner #13686

Closed
wants to merge 33 commits

Conversation

paul-rogers
Contributor

@paul-rogers paul-rogers commented Jan 18, 2023

This PR is currently a draft. Resolving merge conflicts after splitting out some of the code to other PRs.

Prior PRs added the catalog (table metadata) foundations, and an improved set of table functions. This PR brings it all together:

  • Validates the MSQ INSERT and REPLACE statements against the catalog
    • Clustering, partitioning and other table details can be set in the catalog instead of the SQL statement
    • Catalog types are loosely enforced for MSQ. (More work is needed to precisely enforce types.)
    • The catalog can create a "sealed" table: only columns defined in the catalog can be used in MSQ.
  • Allows defining external tables and partial external tables (AKA "connections") in the catalog, then filling in the remaining details at runtime via a table function.
  • Allows parameters (including array parameters) to work with MSQ queries
  • Extends the PARTITION BY clause to accept string literals for the time partitioning
  • Extends MSQ to give the planner control over the type of the emitted segment columns
  • MSQ ITs to validate the new "ad-hoc" table functions
  • Documentation
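
As a rough illustration of the first two points (the table and column names here are invented for this sketch; the authoritative syntax is in the documentation files included in this PR), an ingest against a catalog-defined table might look like:

```sql
-- Hypothetical example: "exampleDatasource" is assumed to be defined in the
-- catalog with its partitioning and clustering already set, so the statement
-- omits PARTITIONED BY and CLUSTERED BY; the catalog supplies them.
INSERT INTO exampleDatasource
SELECT
  TIME_PARSE("timestamp") AS __time,
  page,
  added
FROM ext.exampleInput  -- a hypothetical catalog-defined external table
```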

To allow all the above to work:

  • Validation for MSQ statements moves out of the handlers into a Druid-specific version of the SQL validator.
  • Druid-specific Calcite operator to represent a Druid ingest.
  • The catalog API is passed into the Druid planner (which required changes in the many tests that set up the planner).
  • The catalog can now be enabled in the Broker to allow the planner to interact with the Druid table metadata extension.
  • Many new tests to verify the catalog integration and improved MSQ statement validation.
  • Improved catalog type parsing in anticipation of supporting complex types.
  • Factored out the "per run" items from the planner into a planner toolbox, leaving just the "per session" items in the planner.
  • Resource shuttle now handles "partial table functions" for items defined in the catalog.

Release note

This PR introduces the full catalog functionality. See the documentation files for the details. In this version, the catalog is an extension: you must enable the catalog extension to use the catalog. Enabling the extension creates an additional table in your metadata database. We consider the catalog to be experimental, and the metadata table schema is subject to change.

Table functions, introduced in a prior PR, are production ready and independent of the catalog. "Partial table functions" (define some of the properties in the catalog, some in SQL) are new in this PR and are experimental, along with the catalog itself.
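
As a sketch of a partial table function (all names here are hypothetical; see the docs in this PR for the actual syntax), the catalog might define a connection that supplies the format and schema, while the query supplies the file list:

```sql
-- "myConn" stands in for a catalog-defined partial external table
-- ("connection"); the remaining property (the file list) is filled in
-- at query time via the table function, using an array-valued argument.
SELECT *
FROM TABLE(ext.myConn(files => ARRAY['data1.csv', 'data2.csv']))
```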

Hints to reviewers

Much of this PR is doc files, test code and minor cleanup. The core changes (those that could break a running system if done wrong) are:

  • extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/*
  • sql/src/main/*

The real core of this PR is sql/src/main/java/org/apache/druid/sql/calcite/planner/DruidSqlValidator.java: the place where we moved the former ad-hoc INSERT and REPLACE validation to instead run within the SQL validator.

No runtime code was changed: all the non-trivial changes are in the SQL planner.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@paul-rogers paul-rogers marked this pull request as draft January 18, 2023 00:47
@paul-rogers paul-rogers changed the title Integrates the catalog with the Calcite planner Integrate the catalog with the Calcite planner Jan 18, 2023
* Validate INSERT, REPLACE against the catalog (partial)
* Obtain partitioning, clustering, etc. from catalog
* Strict schema enforcement on ingest
* Parameters work with MSQ queries
* String literals for PARTITIONED BY
* Revise handlers to use new validation logic
* Catalog-defined external tables as MSQ input sources
* Array-valued function arguments (improvements)
* Documentation
PARTITIONED BY 'HOUR'

-- Or
PARTITIONED BY 'PT1H'
Contributor

Would it be possible to make this not work? That is, if the user wants to do a PARTITIONED BY <period>, they have to do PARTITIONED BY TIME_FLOOR(__time, <period>). The reasoning is that people should really stick to the existing partition types. When people partition by wacky periods, they are asking for trouble...

One of the biggest offenders I have encountered is users wanting to partition by "week" (P1W or P7D). It seems like a cool, zany middle ground between day and month, but in reality it is asking for trouble later on.
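
Concretely, the suggestion above is that instead of a bare period string, users would write the explicit form:

```sql
-- Explicit time-floor form in place of PARTITIONED BY 'PT1H'
PARTITIONED BY TIME_FLOOR(__time, 'PT1H')
```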

Contributor Author

Great comment. As it turns out, allowing a string doesn't change much: the user can still say TIME_FLOOR(__time, 'P1W'). The new validation code accepts only a limited set of granularity strings, so if we don't like P1W, we simply don't accept it. The code currently accepts only the strings that match the SQL symbols we already support.

This change, by the way, is a "nice to have" to support moving validation out of ad-hoc code into the Calcite validator. Basically, Calcite literals know about strings, but not about Druid Granularities, so we now translate the SQL literals to strings, then convert those literals to a Granularity during validation. Previously, we created a Granularity in the parser as a "special" field, but that doesn't play by the Calcite rules. It would be easy enough to not accept a string literal in the parser and leave just the SQL keywords and TIME_FLOOR.

Note that "partitioning" here is defined as time partitioning. Thus, we don't really need the TIME_FLOOR(__time part of the expression: we know that partitioning means __time and means using TIME_FLOOR.

Also, in the catalog, partitioning is defined as an ISO 8601 string: P1D. Given that, allowing the same form in the SQL syntax isn't quite as crazy as it might first look. The catalog, also, validates the string to ensure it is one of the supported forms.

That said, aside from the code, do we have a list of the "actually" supported forms? Does anyone actually partition by second or minute, say?
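
If that reading is right, the intent is that the new string form is just an alternate spelling of the existing SQL symbols, for example:

```sql
-- These two are intended to be equivalent: the string is an ISO 8601
-- period matching a granularity the SQL grammar already supports.
PARTITIONED BY 'P1D'
-- same as
PARTITIONED BY DAY
```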

Build fixes
Cleaned up warnings in some IT code
* Revised Broker Guice integration
* Added flush and resync operations to catalog cache
* Fixed issues with listing catalog tables in info schema
* Tests for info schema with the catalog
Simple Python-based Druid SDK
More catalog IT coverage
Partial fixes to handling catalog partial tables
Added a PARAMETERS table to INFORMATION_SCHEMA
Changed info schema numeric columns to BIGINT
Contributor

@imply-cheddar imply-cheddar left a comment

I looked over DruidSqlValidator. Had various commentary on error messages and a few other things. Nothing seemed egregious, mostly just suggestions about wording, etc.

@paul-rogers paul-rogers mentioned this pull request Mar 9, 2023
7 tasks
@techdocsmith
Contributor

@paul-rogers, this PR #14023 moved some things around, including docs/multi-stage-query/reference.md, so that may be the cause of some of the merge conflicts.


github-actions bot commented Feb 2, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 2, 2024
zachjsh added a commit that referenced this pull request Feb 9, 2024
… partitioning (#15836)


This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner (#13686) from @paul-rogers, extending the PARTITION BY clause to accept string literals for the time partitioning.
zachjsh added a commit that referenced this pull request Feb 22, 2024
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner (#13686) from @paul-rogers, refactoring the IngestHandler and subclasses to produce a validated SqlInsert instance node instead of the previous Insert source node. The SqlInsert node is then validated in the Calcite validator. The validation implemented as part of this PR covers only the source node, plus some of the validation that was previously done in the ingest handlers. As part of this change, the partitionedBy clause can be supplied by the table catalog metadata if it exists, and can be omitted from the ingest-time query in that case.

github-actions bot commented Mar 1, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Mar 1, 2024
@zachjsh zachjsh reopened this Mar 12, 2024
@github-actions github-actions bot added Area - Batch Ingestion Area - Streaming Ingestion Area - Dependencies Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 and removed stale labels Mar 12, 2024

@github-actions github-actions bot added the stale label May 12, 2024
github-actions bot commented Jun 9, 2024
@github-actions github-actions bot closed this Jun 9, 2024