
INSERT/REPLACE dimension target column types are validated against source input expressions #15962

Merged: 20 commits, Mar 25, 2024

Conversation

zachjsh (Contributor) commented Feb 24, 2024

Description

This PR contains a portion of the changes from #13686, the inactive draft PR from @paul-rogers for integrating the catalog with the Calcite planner. With these changes, tables that are defined in the catalog have the types of their defined simple dimension columns validated against the source input expressions mapped to them during DML INSERT/REPLACE operations. Complex measure columns are not validated at this time; that will come in a follow-up PR. This PR also enforces sealed / non-sealed mode: if a table is sealed, no undefined columns may be added to it during ingestion. It also addresses the remaining comments from #15836 and #15908.
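
To make the new validation concrete, here is a minimal, hypothetical sketch of the kind of type check involved. It is not this PR's actual code: SqlTypeUtil.canAssignFrom is an existing Calcite utility, but the helper class and error message are invented for illustration.

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.sql.type.SqlTypeUtil;

// Hypothetical helper: a catalog-declared column type must be able to accept
// the type of the source expression mapped to it.
final class TypeAssignmentCheck
{
  static void validate(String colName, RelDataType definedType, RelDataType sourceType)
  {
    // canAssignFrom(to, from) is Calcite's assignability test.
    if (!SqlTypeUtil.canAssignFrom(definedType, sourceType)) {
      throw new IllegalArgumentException(
          "Cannot assign source expression of type " + sourceType
          + " to column [" + colName + "] of type " + definedType
      );
    }
  }
}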
This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zachjsh zachjsh changed the title INSERT/REPLACE target column types are validated against source input expressions INSERT/REPLACE target column types are validated against source input expressions (WIP) Feb 24, 2024
@zachjsh zachjsh changed the title INSERT/REPLACE target column types are validated against source input expressions (WIP) INSERT/REPLACE dimension target column types are validated against source input expressions Mar 7, 2024
@zachjsh zachjsh marked this pull request as ready for review March 7, 2024 05:32
// Scan query lists columns in alphabetical order independent of the
// SQL project list or the defined schema. Here we just check that the
// set of columns is correct, but not their order.
.columns("b", "e", "v0", "v1", "v2", "v3")
Member:

it's hard to interpret the plan like this... what will permute v0 to be the 1st column?
Shouldn't that be in the plan? Or the rename of the columns to their output names?

zachjsh (Contributor Author):

This seems to be an existing issue, as the comment above describes; the results are still written in the proper order and mapped to the appropriate columns.

zachjsh (Contributor Author) commented Mar 14, 2024:

I manually tested this: the columns are written in the proper order, and the results are what you'd expect. It seems that none of the existing DML unit tests check results; the DML test engine in use does not allow SELECTs. Maybe that would be an excellent addition in the near future. We can add full integration tests once this feature is more complete. How does that sound to you?

// with any coercions applied. We update the validated node type of the WITH node here so
// that they are consistent.
if (source instanceof SqlWith) {
  final RelDataType withBodyType = getValidatedNodeTypeIfKnown(((SqlWith) source).body);
  if (withBodyType != null) {
    setValidatedNodeType(source, withBodyType); // inferred continuation of the truncated excerpt
  }
}
Member:

I wonder if this is the right approach; is the issue coming from something like:

  • the WITH may supply nulls
  • the target table doesn't accept null values

I wonder why not make this replaceWithDefault behavior explicit by adding COALESCE(colVal, 0) and the like, so that Calcite is also aware of it... the runtime could remove the COALESCE if it's known to be pointless.
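
As a rough illustration of this suggestion (not code from this PR), the rewrite could wrap a source expression at the SqlNode level roughly like this. SqlStdOperatorTable.COALESCE, SqlOperator#createCall, and SqlLiteral.createExactNumeric are existing Calcite APIs; the method name is invented.

import org.apache.calcite.sql.SqlLiteral;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.sql.parser.SqlParserPos;

// Illustrative only: rewrite expr to COALESCE(expr, 0) so the planner sees the
// default-value semantics explicitly; a later phase could drop the call if it
// can prove the input is never null.
static SqlNode coalesceToDefault(SqlNode expr)
{
  return SqlStdOperatorTable.COALESCE.createCall(
      SqlParserPos.ZERO,
      expr,
      SqlLiteral.createExactNumeric("0", SqlParserPos.ZERO)
  );
}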

zachjsh (Contributor Author):

It seems that this goes beyond null handling; implicitly coerced types coming from Calcite's default-enabled coercion rules do not seem to be updated in the WITH node type, only in the WITH select body node type. This would still be needed.

Member:

this seems to be a consequence of the long comment around line 223, about not knowing the types and so just passing unknown. I think that should be fixed somehow; then this wouldn't be needed either...

kgyrtkirk (Member) commented Mar 20, 2024:

I think we can get around this and the WITH issue with the following approach:

  • override inferUnknownTypes
  • the node will have the full expression including the alias, so access by fieldName will be OK
  • instead of passing an unknownType, I passed a type with artificially created fields for a reference to the Druid table (this could probably be done differently)
  • this way validateTargetType could become, in effect, part of the validation

However... it has led to consistency errors as well: we are reading a SELECT which produces a BIGINT as DOUBLE and similar, which is clearly invalid.

My branch is here.

I think the right way to fix this will be to add proper rewrites at the right point in the query and let the system do its job. I don't think that would even be a hack, as opposed to multiple long comments about doing something which is clearly questionable.

A way to do that could be something like the following (sketched below):

  • override performUnconditionalRewrites to be able to process INSERT nodes as well
  • identify columns by name from SqlInsert#source
  • rewrite all columns of interest with a dummy typed function like druidDouble or similar
  • use the method's validation logic to ensure that the conversion happens correctly
  • the output type of the function will be the supported type for that column, so that won't cause any problems
  • leave the function in place; since it was just a placeholder, it can be removed during native translation without any risk
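
A minimal sketch of that proposal, assuming a validator that extends Calcite's SqlValidatorImpl. The placeholder-wrapping step is left as a comment because the placeholder operator (druidDouble or similar) would be introduced by the fix itself.

import org.apache.calcite.sql.SqlInsert;
import org.apache.calcite.sql.SqlNode;

// Inside a SqlValidatorImpl subclass; a sketch only, not this PR's code.
@Override
protected SqlNode performUnconditionalRewrites(SqlNode node, boolean underFrom)
{
  node = super.performUnconditionalRewrites(node, underFrom);
  if (node instanceof SqlInsert) {
    // Walk ((SqlInsert) node).getSource(), match columns of interest by name,
    // and wrap each in a typed placeholder call (e.g. druidDouble(col)) whose
    // return type is the declared column type; the placeholder can be removed
    // during native translation since it is only a type marker.
  }
  return node;
}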

colName,
typeFactory.createTypeWithNullability(relType, true)
));
if (NullHandling.replaceWithDefault()) {
Member:

I wonder if the result set may contain null even when replaceWithDefault() is true.
FYI: RowSignatures#toRelDataType creates nullable strings regardless of replaceWithDefault.

zachjsh (Contributor Author):

There is a test that includes a String column, and it seems to produce a targetType of VARCHAR NOT NULL when this is false, and a nullable VARCHAR when true, so I think this is handled correctly, but let me know if you think otherwise.

Member:

For string types, default-value mode treats null and '' interchangeably, so it probably should be marked as nullable. To be honest, it's not super well defined, but we do always mark string values as nullable in default-value mode since empty strings behave as null.

For numbers, though, default-value mode effectively means that nulls don't exist.

This default-value mode is deprecated now, so we probably don't have to spend too much time thinking about how things should behave; we should probably just match the computed-schema behavior (strings always nullable in either mode, numbers nullable only in SQL-compatible mode).
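
A small sketch of the rule described above, with an invented helper name; NullHandling here is Druid's org.apache.druid.common.config.NullHandling, and SqlTypeUtil.inCharFamily is an existing Calcite utility.

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.sql.type.SqlTypeUtil;
import org.apache.druid.common.config.NullHandling;

// Hypothetical helper: strings are always nullable; other types are nullable
// only in SQL-compatible mode.
static RelDataType withDruidNullability(RelDataTypeFactory typeFactory, RelDataType type)
{
  final boolean nullable = SqlTypeUtil.inCharFamily(type) || !NullHandling.replaceWithDefault();
  return typeFactory.createTypeWithNullability(type, nullable);
}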

zachjsh (Contributor Author):

Thanks! Fixed

Member:

I think if that's the case, the logic should not live here in this file; it should be in a more central location... it's bad to see these conditions flow everywhere.

@zachjsh zachjsh requested a review from kgyrtkirk March 18, 2024 16:05
/**
* Test the use of catalog specs to drive MSQ ingestion.
*/
public class CatalogInsertTest extends CalciteCatalogInsertTest
Member:

there is a lot of duplication between CatalogInsertTest and CatalogReplaceTest (4 lines differ),

and CatalogQueryTest is pretty similar as well; it just seems to have a different method order...

could you put all of these into some common place (a rule?) and reuse that everywhere?

import org.apache.druid.sql.calcite.table.DatasourceTable;
import org.apache.druid.sql.calcite.table.DruidTable;

public class CalciteCatalogIngestionDmlTest extends CalciteIngestionDmlTest
Member:

why is there no @Test in this "Test"?

Member:

this seems like a base class; maybe a Javadoc comment to explain that would be useful

Member:

I think in that case it should be abstract and not have a Test ending; something like TestBase.

}

@Test
public void testInsertIntoExistingWithIncompatibleTypeAssignment()
Member:

nit: it might be nice to add a few more tests like this to encode some of the behaviors in tests

@zachjsh zachjsh merged commit 8370db1 into apache:master Mar 25, 2024
85 checks passed
@zachjsh zachjsh deleted the validate-target-column-types branch March 25, 2024 16:34
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024