
Integrate the catalog with the Calcite planner #13686

Closed
wants to merge 33 commits

Conversation

paul-rogers
Contributor

@paul-rogers paul-rogers commented Jan 18, 2023

This PR is currently a draft. Resolving merge conflicts after splitting out some of the code to other PRs.

Prior PRs added the catalog (table metadata) foundations, and an improved set of table functions. This PR brings it all together:

  • Validates the MSQ INSERT and REPLACE statements against the catalog
    • Clustering, partitioning and other table details can be set in the catalog instead of the SQL statement
    • Catalog types are loosely enforced for MSQ. (More work is needed to precisely enforce types.)
    • The catalog can create a "sealed" table: only columns defined in the catalog can be used in MSQ.
  • Allows defining external tables and partial external tables (AKA "connections") in the catalog, then filling in the remaining details at runtime via a table function.
  • Allows parameters (including array parameters) to work with MSQ queries
  • Extends the PARTITION BY clause to accept string literals for the time partitioning
  • Extends MSQ to give the planner control over the type of the emitted segment columns
  • MSQ ITs to validate the new "ad-hoc" table functions
  • Documentation
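
As a rough illustration of the first two points (the table and column names here are invented for this sketch; the authoritative syntax is in the documentation files included in this PR), an ingest against a catalog-defined table might look like:

```sql
-- Hypothetical example: "exampleDatasource" is assumed to be defined in the
-- catalog with its partitioning and clustering already set, so the statement
-- omits PARTITIONED BY and CLUSTERED BY; the catalog supplies them.
INSERT INTO exampleDatasource
SELECT
  TIME_PARSE("timestamp") AS __time,
  page,
  added
FROM ext.exampleInput  -- a hypothetical catalog-defined external table
```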

To allow all the above to work:

  • Validation for MSQ statements moves out of the handlers into a Druid-specific version of the SQL validator.
  • Druid-specific Calcite operator to represent a Druid ingest.
  • The catalog API is passed into the Druid planner (which required changes in the many tests that set up the planner).
  • The catalog can now be enabled in the Broker to allow the planner to interact with the Druid table metadata extension.
  • Many new tests to verify the catalog integration and improved MSQ statement validation.
  • Improved catalog type parsing in anticipation of supporting complex types.
  • Factored out the "per run" items from the planner into a planner toolbox, leaving just the "per session" items in the planner.
  • Resource shuttle now handles "partial table functions" for items defined in the catalog.

Release note

This PR introduces the full catalog functionality. See the documentation files for the details. In this version, the catalog is an extension: you must enable the catalog extension to use the catalog. Enabling the extension creates an additional table in your metadata database. We consider the catalog to be experimental, and the metadata table schema is subject to change.

Table functions, introduced in a prior PR, are production ready and independent of the catalog. "Partial table functions" (define some of the properties in the catalog, some in SQL) are new in this PR and are experimental, along with the catalog itself.
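
As a sketch of a partial table function (all names here are hypothetical; see the docs in this PR for the actual syntax), the catalog might define a connection that supplies the format and schema, while the query supplies the file list:

```sql
-- "myConn" stands in for a catalog-defined partial external table
-- ("connection"); the remaining property (the file list) is filled in
-- at query time via the table function, using an array-valued argument.
SELECT *
FROM TABLE(ext.myConn(files => ARRAY['data1.csv', 'data2.csv']))
```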

Hints to reviewers

Much of this PR is doc files, test code and minor cleanup. The core changes (those that could break a running system if done wrong) are:

  • extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/*
  • sql/src/main/*

The real core of this PR is sql/src/main/java/org/apache/druid/sql/calcite/planner/DruidSqlValidator.java: the place where we moved the former ad-hoc INSERT and REPLACE validation to instead run within the SQL validator.

No runtime code was changed: all the non-trivial changes are in the SQL planner.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@paul-rogers paul-rogers marked this pull request as draft January 18, 2023 00:47
@paul-rogers paul-rogers changed the title Integrates the catalog with the Calcite planner Integrate the catalog with the Calcite planner Jan 18, 2023
* Validate INSERT, REPLACE against the catalog (partial)
* Obtain partitioning, clustering, etc. from catalog
* Strict schema enforcement on ingest
* Parameters work with MSQ queries
* String literals for PARTITIONED BY
* Revise handlers to use new validation logic
* Catalog-defined external tables as MSQ input sources
* Array-valued function arguments (improvements)
* Documentation
PARTITIONED BY 'HOUR'

-- Or
PARTITIONED BY 'PT1H'
Contributor

Would it be possible to make this not work? That is, if the user wants to do a PARTITIONED BY <period>, they have to do PARTITIONED BY TIME_FLOOR(__time, <period>). The reasoning is that people should really stick to the existing partition types. When people partition by wacky periods, they are asking for trouble...

One of the biggest offenders I have encountered is users wanting to partition by "week" (P1W or P7D). It seems like a cool, zany middle ground between day and month, but in reality it is asking for trouble later on.
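
Concretely, the suggestion above is that instead of a bare period string, users would write the explicit form:

```sql
-- Explicit time-floor form in place of PARTITIONED BY 'PT1H'
PARTITIONED BY TIME_FLOOR(__time, 'PT1H')
```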

Contributor Author

Great comment. As it turns out, allowing a string doesn't change much: the user can still say TIME_FLOOR(__time, 'P1W'). The new validation code accepts only a limited set of granularity strings, so if we don't like P1W, we simply don't accept it. The code currently accepts only the strings that match the SQL symbols we already support.

This change, by the way, is a "nice to have" to support moving validation out of ad-hoc code into the Calcite validator. Basically, Calcite literals know about strings, but not about Druid Granularities, so we now translate the SQL literals to strings, then convert those literals to a Granularity during validation. Previously, we created a Granularity in the parser as a "special" field, but that doesn't play by the Calcite rules. It would be easy enough to not accept a string literal in the parser and leave just the SQL keywords and TIME_FLOOR.

Note that "partitioning" here is defined as time partitioning. Thus, we don't really need the TIME_FLOOR(__time part of the expression: we know that partitioning means __time and means using TIME_FLOOR.

Also, in the catalog, partitioning is defined as an ISO 8601 string: P1D. Given that, allowing the same form in the SQL syntax isn't quite as crazy as it might first look. The catalog, also, validates the string to ensure it is one of the supported forms.

That said, aside from the code, do we have a list of the "actually" supported forms? Does anyone actually partition by second or minute, say?
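
If that reading is right, the intent is that the new string form is just an alternate spelling of the existing SQL symbols, for example:

```sql
-- These two are intended to be equivalent: the string is an ISO 8601
-- period matching a granularity the SQL grammar already supports.
PARTITIONED BY 'P1D'
-- same as
PARTITIONED BY DAY
```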

Build fixes
Cleaned up warnings in some IT code
* Revised Broker Guice integration
* Added flush and resync operations to catalog cache
* Fixed issues with listing catalog tables in info schema
* Tests for info schema with the catalog
Simple Python-based Druid SDK
More catalog IT coverage
Partial fixes to handling catalog partial tables
Added a PARAMETERS table to INFORMATION_SCHEMA
Changed info schema numeric columns to BIGINT
Contributor

@imply-cheddar imply-cheddar left a comment

I looked over DruidSqlValidator. Had various commentary on error messages and a few other things. Nothing seemed egregious, mostly just suggestions about wording, etc.

@paul-rogers paul-rogers mentioned this pull request Mar 9, 2023
7 tasks
@techdocsmith
Contributor

@paul-rogers, this PR #14023 moved some things around, including docs/multi-stage-query/reference.md, so that may be the cause of some of the merge conflicts.


github-actions bot commented Feb 2, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 2, 2024
zachjsh added a commit that referenced this pull request Feb 9, 2024
… partitioning (#15836)


This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner (#13686) from @paul-rogers, extending the PARTITION BY clause to accept string literals for the time partitioning.
zachjsh added a commit that referenced this pull request Feb 22, 2024
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner (#13686) from @paul-rogers, refactoring the IngestHandler and subclasses to produce a validated SqlInsert instance node instead of the previous Insert source node. The SqlInsert node is then validated in the Calcite validator. The validation implemented as part of this PR covers only the source node, plus some of the validation that was previously done in the ingest handlers. As part of this change, the partitionedBy clause can be supplied by the table catalog metadata if it exists, and can be omitted from the ingest-time query in that case.

github-actions bot commented Mar 1, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Mar 1, 2024
@zachjsh zachjsh reopened this Mar 12, 2024
@github-actions github-actions bot added Area - Batch Ingestion Area - Streaming Ingestion Area - Dependencies Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 and removed stale labels Mar 12, 2024

@github-actions github-actions bot added the stale label May 12, 2024
github-actions bot commented Jun 9, 2024
@github-actions github-actions bot closed this Jun 9, 2024