[FLINK-33495][FLINK-33496] Add DISTRIBUTED BY clause #24155

jnh5y · 2024-01-19T20:30:13Z

Adds DISTRIBUTED BY clause to CREATE TABLE
Adds Distribution support to CatalogTable
Adds the classes TableDistribution and SupportsBucketing

Adds distribution to TableDescriptor.

This includes validation.

What is the purpose of the change

This implements part of FLIP-376. Namely it adds the Table API and SQL language support for DISTRIBUTED BY.

Brief change log

Adds DISTRIBUTED BY clause to CREATE TABLE
Adds Distribution support to CatalogTable
Adds the classes TableDistribution and SupportsBucketing

Adds distribution to TableDescriptor.

Verifying this change

This change added tests and can be verified as follows:

Added tests to cover

Parsing and unparsing SQL.
TableDescriptor use
Table resolution and validation
MergeTableLike (CREATE TABLE LIKE)
Operation conversion
Use with the TestValues connector.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): yes
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? yes
If yes, how is the feature documented? (JavaDocs)

jnh5y · 2024-01-19T20:34:53Z

This implements the first part of FLINK-33494 / FLIP-376.

Notably, the things left to do are:
FLINK-34172 (ALTER TABLE support)

flinkbot · 2024-01-19T20:36:42Z

CI report:

85e6fd1 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

jnh5y · 2024-01-19T23:28:52Z

@flinkbot run azure

twalthr

Awesome work @jnh5y. I added a bunch of comments. The biggest one is about moving the interpretation of the algorithm strategy to a later stage (from parsing to, maybe, the conversion to CatalogTable). In the future, we might support DISTRIBUTED BY HASH(YEAR(timestamp)), which is why we shouldn't reserve HASH as a keyword but treat it as an identifier instead. This should already allow

DISTRIBUTED BY `HASH`(a, b)

but we won't go further (i.e. not parsing it as an expression).

twalthr · 2024-01-29T14:49:42Z

flink-table/flink-sql-parser/src/main/codegen/data/Parser.tdd

    "DRAIN"
    "ENFORCED"
    "ESTIMATED_COST"
    "EXTENDED"
    "FUNCTIONS"
+    "HASH"


HASH doesn't need to be a reserved keyword; it could be an identifier (similar to functions like HASH()). A later layer could check for the only two supported algorithms.
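
To illustrate the suggestion above, here is a minimal sketch of what such a later check could look like. It is not code from this PR: the class `DistributionKindSketch`, the enum `Kind`, and the method `resolve` are hypothetical names, and the sketch assumes the parser hands the algorithm over as a plain identifier string (or `null` when the statement names no algorithm).

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch: interpret the distribution algorithm after parsing. */
final class DistributionKindSketch {

    /** Hypothetical enum covering the two algorithms mentioned in the review. */
    enum Kind { HASH, RANGE }

    /**
     * Resolves the identifier the parser produced into a supported algorithm.
     * A null identifier means the statement did not name an algorithm.
     */
    static Optional<Kind> resolve(String identifier) {
        if (identifier == null) {
            return Optional.empty();
        }
        // Compare case-insensitively, as an unquoted SQL identifier would be.
        switch (identifier.toUpperCase(Locale.ROOT)) {
            case "HASH":
                return Optional.of(Kind.HASH);
            case "RANGE":
                return Optional.of(Kind.RANGE);
            default:
                // Anything else is rejected here rather than in the grammar.
                throw new IllegalArgumentException(
                        "Unsupported distribution algorithm: " + identifier);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("hash"));
        System.out.println(resolve(null));
    }
}
```

With this shape the grammar only needs to accept an identifier in that position, and the list of supported algorithms lives in one place outside the parser.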

Ok, I tried removing HASH as a reserved keyword.

I am hitting issues around RANGE being a keyword. I'm trying to read the distributionKind using SqlIdentifier and I haven't quite figured out how to make that work.

The FLIP didn't show the algorithm quoted, which is why I added HASH as a reserved keyword. Are you suggesting that the algorithm (if supplied) must be quoted?

I wasn't aware that RANGE is already a reserved keyword. This of course changes the picture: we don't want to force users to use backticks by default. If this is the case, I guess we have to make HASH a reserved keyword as well.

Do you know the difference between nonReservedKeywords, reservedKeywords, and keywords? I'm wondering if it at least makes sense to still allow hash as a column name without backticks if we add it to the right list. We currently have too many keywords which basically forces the user to use backticks all the time.

twalthr · 2024-01-29T14:50:51Z

flink-table/flink-sql-parser/src/main/codegen/data/Parser.tdd

@@ -520,6 +526,7 @@
  # Please keep the keyword in alphabetical order if new keyword is added.
  nonReservedKeywordsToAdd: [
    # not in core, added in Flink
+    "BUCKETS"


Why is only BUCKETS added here?

I'll admit that I don't have a complete understanding of the lists in Parser.tdd. Which other keywords should be added here?

You should only add or modify code if you know what you are doing ;-)

flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/ddl/SqlCreateTable.java

flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/ddl/SqlDistribution.java

...k-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/utils/ParserResource.java

...in/java/org/apache/flink/table/planner/operations/converters/SqlReplaceTableAsConverter.java

twalthr · 2024-01-29T16:08:22Z

...-planner/src/main/java/org/apache/flink/table/planner/plan/abilities/sink/BucketingSpec.java

+ * No properties. This only checks whether the interface is implemented again during deserialization
+ */
+@JsonIgnoreProperties(ignoreUnknown = true)
+@JsonTypeName("Bucketing")


What is this property good for? Other specs don't have it.

Are you referring to @JsonTypeName("Bucketing") or something else?

I'll admit that I copied this class directly from FLIP-376. PartitioningSpec does have a JsonTypeName.

(If you are referring to JsonIgnoreProperties, that is in plenty of other classes. I could imagine removing it from this class since we only expect to use this as a marker interface. Thoughts?)

> Are you referring to @JsonTypeName("Bucketing") or something else?

Yes, I also don't recall why I added it. Might be a mistake or feedback that I got during ML discussion. In any case we should only add code if we know what it is doing (and we want it that way).

Ah, figured it out... @JsonTypeName("Bucketing") overrides the default type name, which would otherwise be the class name.

Without the annotation, the compiled plan contains

"type" : "BucketingSpec"

and with it, the compiled plan contains

"type" : "Bucketing"

I think that the override makes sense.
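
The naming rule described above can be mimicked in plain Java. This is only an analogy, assuming the type id falls back to the simple class name when no explicit name is given; the `@TypeName` annotation, the `typeId` helper, and the two nested classes are hypothetical stand-ins, not Jackson's or Flink's API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical sketch of "explicit type name overrides the class name". */
final class TypeNameSketch {

    /** Hypothetical stand-in for an annotation such as @JsonTypeName. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface TypeName {
        String value();
    }

    /** No annotation: the type id falls back to the simple class name. */
    static final class BucketingSpec {}

    /** Annotated: the explicit name is used instead of the class name. */
    @TypeName("Bucketing")
    static final class AnnotatedBucketingSpec {}

    /** Returns the explicit name if present, otherwise the simple class name. */
    static String typeId(Class<?> clazz) {
        TypeName name = clazz.getAnnotation(TypeName.class);
        return name != null ? name.value() : clazz.getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(typeId(BucketingSpec.class));
        System.out.println(typeId(AnnotatedBucketingSpec.class));
    }
}
```

An explicit name keeps the serialized identifier stable even if the class is later renamed, which is one reason such an override can be worth keeping.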

...e-planner/src/test/java/org/apache/flink/table/planner/factories/TestValuesTableFactory.java

.../src/test/java/org/apache/flink/table/planner/operations/SqlDdlToOperationConverterTest.java

...table-planner/src/test/scala/org/apache/flink/table/planner/catalog/CatalogTableITCase.scala

...planner/src/main/java/org/apache/flink/table/planner/operations/SqlCreateTableConverter.java

...table/flink-table-common/src/main/java/org/apache/flink/table/catalog/TableDistribution.java

flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/api/TableDescriptor.java

twalthr · 2024-01-30T15:36:42Z

flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/api/TableDescriptor.java

@@ -334,6 +352,12 @@ public Builder format(
            return this;
        }

+        /** Define which columns this table is distributed by. */
+        public Builder distributedBy(CatalogTable.TableDistribution tableDistribution) {


yes, but just distributedInto as in SQL

flink-table/flink-table-common/src/main/java/org/apache/flink/table/catalog/CatalogTable.java

twalthr · 2024-01-30T15:40:53Z

...-planner/src/main/java/org/apache/flink/table/planner/plan/abilities/sink/BucketingSpec.java

flink-table/flink-table-common/src/main/java/org/apache/flink/table/catalog/CatalogTable.java

twalthr

LGTM, thanks @jnh5y for updating all the different locations.

jnh5y · 2024-02-05T19:11:40Z

> LGTM, thanks @jnh5y for updating all the different locations.

@twalthr thanks for the review and help!

- Adds distribution support to CatalogTable
- Adds connector ability SupportsBucketing
- Adds distribution to TableDescriptor.

This closes apache#24155.

jnh5y force-pushed the flink-34494 branch 3 times, most recently from d0310b7 to c5ba1ef, January 19, 2024 23:16

[FLINK-33495][FLINK-33496] Add DISTRIBUTED BY clause

a62914a

* Adds DISTRIBUTED BY clause to CREATE TABLE
* Adds Distribution support to CatalogTable
* Add classes
  - TableDistribution
  - SupportsBucketing

Adds distribution to TableDescriptor. This includes validation.

jnh5y force-pushed the flink-34494 branch from c5ba1ef to a62914a, January 20, 2024 01:29

twalthr reviewed Jan 29, 2024

jnh5y added 6 commits January 29, 2024 15:33

Handling a number of nits.

3f6d974

Responding to more comments.

a06274d

Factoring TableDistribution out into its own class.

70ee925

Adding some convenience methods.

25f31bb

JavaDoc updates and simplifying a test.

a5e79cc

Fixing a test.

6b8ddab

twalthr requested changes Jan 30, 2024

jnh5y added 5 commits January 30, 2024 11:37

Responding to feedback.

68c750b

Adding a CatalogTable.Builder.

cac0bf4

Adding tests for TableDescriptor methods.

e234364

Restore tests and fixing API Tests.

6e0719f

Cleaning up parser changes.

e4aa6ef

jnh5y commented Jan 30, 2024

jnh5y and others added 4 commits January 30, 2024 16:18

Various clean up.

36afc27

Added serialization.

8594330

Allowing for empty distributions for backwards compatibility.

46712af

[FLINK-33495] Finalize DISTRIBUTED BY

79f5e36

twalthr mentioned this pull request Feb 5, 2024

[DRAFT] FLIP-376 DISTRIBUTED BY #24073

Closed

Spotless.

6882a2b

twalthr approved these changes Feb 5, 2024

Test fixes.

f297f77

Using a Java 8 set constructor.

3d98efc

Update to make error messages stable so that tests pass.

85e6fd1

twalthr closed this in 6c9ac5c Feb 19, 2024

flinkbot added the component=TableSQL/Planner label Apr 4, 2024
