
INSERT/REPLACE dimension target column types are validated against source input expressions #15962

Merged: 20 commits, Mar 25, 2024

Conversation

zachjsh (Contributor) commented Feb 24, 2024

Description

This PR contains a portion of the changes from #13686, the inactive draft PR from @paul-rogers for integrating the catalog with the Calcite planner. With these changes, tables that are defined in the catalog have the types of their defined simple dimension columns validated against the source input expressions mapped to them during DML INSERT/REPLACE operations. Complex measure columns are not validated at this time; that will come in a follow-up PR. This PR also enforces sealed / non-sealed mode: if a table is sealed, no undefined columns may be added to it during ingestion. It also addresses the remaining comments from #15836 and #15908.
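
To make the new validation concrete, here is a minimal, hypothetical sketch of the kind of type check involved. It is not this PR's actual code: SqlTypeUtil.canAssignFrom is an existing Calcite utility, but the helper class and error message are invented for illustration.

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.sql.type.SqlTypeUtil;

// Hypothetical helper: a catalog-declared column type must be able to accept
// the type of the source expression mapped to it.
final class TypeAssignmentCheck
{
  static void validate(String colName, RelDataType definedType, RelDataType sourceType)
  {
    // canAssignFrom(to, from) is Calcite's assignability test.
    if (!SqlTypeUtil.canAssignFrom(definedType, sourceType)) {
      throw new IllegalArgumentException(
          "Cannot assign source expression of type " + sourceType
          + " to column [" + colName + "] of type " + definedType
      );
    }
  }
}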
This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zachjsh zachjsh changed the title INSERT/REPLACE target column types are validated against source input expressions INSERT/REPLACE target column types are validated against source input expressions (WIP) Feb 24, 2024
@zachjsh zachjsh changed the title INSERT/REPLACE target column types are validated against source input expressions (WIP) INSERT/REPLACE dimension target column types are validated against source input expressions Mar 7, 2024
@zachjsh zachjsh marked this pull request as ready for review March 7, 2024 05:32
// Scan query lists columns in alphabetical order independent of the
// SQL project list or the defined schema. Here we just check that the
// set of columns is correct, but not their order.
.columns("b", "e", "v0", "v1", "v2", "v3")
Member:

it's hard to interpret the plan like this... what will permute v0 to be the 1st column?
Shouldn't that be in the plan? Or the rename of the columns to their output names?

zachjsh (Contributor Author):

This seems to be an existing issue, as the comment above describes; the results are still written in the proper order and mapped to the appropriate columns.

zachjsh (Contributor Author) commented Mar 14, 2024:

I manually tested this: the columns are written in the proper order, and the results are what you'd expect. It seems that none of the existing DML unit tests check results; the DML test engine in use does not allow SELECTs. Maybe that would be an excellent addition in the near future. We can add full integration tests once this feature is more complete. How does that sound to you?

// with any coercions applied. We update the validated node type of the WITH node here so
// that they are consistent.
if (source instanceof SqlWith) {
  final RelDataType withBodyType = getValidatedNodeTypeIfKnown(((SqlWith) source).body);
  if (withBodyType != null) {
    setValidatedNodeType(source, withBodyType); // inferred continuation of the truncated excerpt
  }
}
Member:

I wonder if this is the right approach; is the issue coming from something like:

  • the WITH may supply nulls
  • the target table doesn't accept null values

I wonder why not make this replaceWithDefault behavior explicit by adding COALESCE(colVal, 0) and the like, so that Calcite is also aware of it... the runtime could remove the COALESCE if it's known to be pointless.
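
As a rough illustration of this suggestion (not code from this PR), the rewrite could wrap a source expression at the SqlNode level roughly like this. SqlStdOperatorTable.COALESCE, SqlOperator#createCall, and SqlLiteral.createExactNumeric are existing Calcite APIs; the method name is invented.

import org.apache.calcite.sql.SqlLiteral;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.sql.parser.SqlParserPos;

// Illustrative only: rewrite expr to COALESCE(expr, 0) so the planner sees the
// default-value semantics explicitly; a later phase could drop the call if it
// can prove the input is never null.
static SqlNode coalesceToDefault(SqlNode expr)
{
  return SqlStdOperatorTable.COALESCE.createCall(
      SqlParserPos.ZERO,
      expr,
      SqlLiteral.createExactNumeric("0", SqlParserPos.ZERO)
  );
}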

zachjsh (Contributor Author):

It seems that this goes beyond null handling; implicitly coerced types coming from Calcite's default-enabled coercion rules do not seem to be updated in the WITH node type, only in the WITH select body node type. This would still be needed.

Member:

this seems to be a consequence of the long comment around line 223, about not knowing the types and so just passing unknown. I think that should be fixed somehow; then this wouldn't be needed either...

kgyrtkirk (Member) commented Mar 20, 2024:

I think we can get around this and the WITH issue with the following approach:

  • override inferUnknownTypes
  • the node will have the full expression including the alias, so access by fieldName will be OK
  • instead of passing an unknownType, I passed a type with artificially created fields for a reference to the Druid table (this could probably be done differently)
  • this way validateTargetType could become, in effect, part of the validation

However... it has led to consistency errors as well: we are reading a SELECT which produces a BIGINT as DOUBLE and similar, which is clearly invalid.

My branch is here.

I think the right way to fix this will be to add proper rewrites at the right point in the query and let the system do its job. I don't think that would even be a hack, as opposed to multiple long comments about doing something which is clearly questionable.

A way to do that could be something like the following (sketched below):

  • override performUnconditionalRewrites to be able to process INSERT nodes as well
  • identify columns by name from SqlInsert#source
  • rewrite all columns of interest with a dummy typed function like druidDouble or similar
  • use the method's validation logic to ensure that the conversion happens correctly
  • the output type of the function will be the supported type for that column, so that won't cause any problems
  • leave the function in place; since it was just a placeholder, it can be removed during native translation without any risk
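
A minimal sketch of that proposal, assuming a validator that extends Calcite's SqlValidatorImpl. The placeholder-wrapping step is left as a comment because the placeholder operator (druidDouble or similar) would be introduced by the fix itself.

import org.apache.calcite.sql.SqlInsert;
import org.apache.calcite.sql.SqlNode;

// Inside a SqlValidatorImpl subclass; a sketch only, not this PR's code.
@Override
protected SqlNode performUnconditionalRewrites(SqlNode node, boolean underFrom)
{
  node = super.performUnconditionalRewrites(node, underFrom);
  if (node instanceof SqlInsert) {
    // Walk ((SqlInsert) node).getSource(), match columns of interest by name,
    // and wrap each in a typed placeholder call (e.g. druidDouble(col)) whose
    // return type is the declared column type; the placeholder can be removed
    // during native translation since it is only a type marker.
  }
  return node;
}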

colName,
typeFactory.createTypeWithNullability(relType, true)
));
if (NullHandling.replaceWithDefault()) {
Member:

I wonder if the result set may contain null even when replaceWithDefault() is true.
FYI: RowSignatures#toRelDataType creates nullable strings regardless of replaceWithDefault.

zachjsh (Contributor Author):

There is a test that includes a String column, and it seems to produce a targetType of VARCHAR NOT NULL when this is false, and a nullable VARCHAR when true, so I think this is handled correctly, but let me know if you think otherwise.

Member:

For string types, default-value mode treats null and '' interchangeably, so it probably should be marked as nullable. To be honest, it's not super well defined, but we do always mark string values as nullable in default-value mode since empty strings behave as null.

For numbers, though, default-value mode effectively means that nulls don't exist.

This default-value mode is deprecated now, so we probably don't have to spend too much time thinking about how things should behave; we should probably just match the computed-schema behavior (strings always nullable in either mode, numbers nullable only in SQL-compatible mode).
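
A small sketch of the rule described above, with an invented helper name; NullHandling here is Druid's org.apache.druid.common.config.NullHandling, and SqlTypeUtil.inCharFamily is an existing Calcite utility.

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.sql.type.SqlTypeUtil;
import org.apache.druid.common.config.NullHandling;

// Hypothetical helper: strings are always nullable; other types are nullable
// only in SQL-compatible mode.
static RelDataType withDruidNullability(RelDataTypeFactory typeFactory, RelDataType type)
{
  final boolean nullable = SqlTypeUtil.inCharFamily(type) || !NullHandling.replaceWithDefault();
  return typeFactory.createTypeWithNullability(type, nullable);
}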

zachjsh (Contributor Author):

Thanks! Fixed

Member:

I think if that's the case, the logic should not live here in this file; it should be in a more central location... it's bad to see these conditions flow everywhere.

@zachjsh zachjsh requested a review from kgyrtkirk March 18, 2024 16:05
/**
* Test the use of catalog specs to drive MSQ ingestion.
*/
public class CatalogInsertTest extends CalciteCatalogInsertTest
Member:

there is a lot of duplication between CatalogInsertTest and CatalogReplaceTest (4 lines differ),

and CatalogQueryTest is pretty similar as well; it just seems to have a different method order...

could you put all of these into some common place (a rule?) and reuse that everywhere?

import org.apache.druid.sql.calcite.table.DatasourceTable;
import org.apache.druid.sql.calcite.table.DruidTable;

public class CalciteCatalogIngestionDmlTest extends CalciteIngestionDmlTest
Member:

why is there no @Test in this "Test"?

Member:

this seems like a base class; maybe a Javadoc comment to explain that would be useful

Member:

I think in that case it should be abstract and not have a Test ending; something like TestBase.

}

@Test
public void testInsertIntoExistingWithIncompatibleTypeAssignment()
Member:

nit: it might be nice to add a few more tests like this to encode some of the behaviors in tests

@zachjsh zachjsh merged commit 8370db1 into apache:master Mar 25, 2024
85 checks passed
@zachjsh zachjsh deleted the validate-target-column-types branch March 25, 2024 16:34
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024