
Move INSERT & REPLACE validation to the Calcite validator #15908

Merged
merged 14 commits into from
Feb 22, 2024

Conversation

zachjsh
Contributor

@zachjsh zachjsh commented Feb 14, 2024

Description

This PR contains a portion of the changes from the inactive draft PR #13686 from @paul-rogers for integrating the catalog with the Calcite planner. It refactors IngestHandler and its subclasses to produce a validated SqlInsert node instead of the previous Insert source node; the SqlInsert node is then validated in the Calcite validator. The validation implemented as part of this PR covers only the source node, plus some of the validation that was previously done in the ingest handlers. As part of this change, the PARTITIONED BY clause can be supplied by the table catalog metadata if it exists, and can be omitted from the ingest-time query in that case.
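As a hedged illustration of the catalog fallback described above (table and column names here are made up, not from the PR), the second query form becomes legal once the target table's catalog metadata declares a segment granularity:

```sql
-- Explicit PARTITIONED BY, required when the catalog has no granularity
-- recorded for the target table:
INSERT INTO wiki_events
SELECT __time, channel, added
FROM source_table
PARTITIONED BY DAY;

-- With this change, PARTITIONED BY may be omitted when the table's catalog
-- metadata already declares a segment granularity:
INSERT INTO wiki_events
SELECT __time, channel, added
FROM source_table;
```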

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

…ct source node.

Implemented the validateInsert method in DruidSqlValidator to validate the respective node,
and moved much of the validation previously done in the ingest handlers into this
overridden method. In the next commit I will try to pull some of this validation back out,
to make this code change smaller and easier to review.
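The pattern the commit describes, subclassing the planner's validator and overriding its insert-validation hook, can be sketched as follows. All types here (InsertNode, BaseValidator, CatalogAwareValidator) are hypothetical stand-ins; the real code extends Calcite's SqlValidatorImpl and works on SqlInsert nodes, which are not reproduced here.

```java
// Minimal stand-in sketch of the "override validateInsert" pattern.
// Names are simplifications and do not match Druid's actual classes.
class InsertNode {
    final String targetTable;
    final String partitionedBy; // null if PARTITIONED BY was absent from the query

    InsertNode(String targetTable, String partitionedBy) {
        this.targetTable = targetTable;
        this.partitionedBy = partitionedBy;
    }
}

class BaseValidator {
    void validateInsert(InsertNode insert) {
        if (insert.targetTable == null || insert.targetTable.isEmpty()) {
            throw new IllegalArgumentException("missing target table");
        }
    }
}

class CatalogAwareValidator extends BaseValidator {
    private final java.util.Map<String, String> catalogGranularity;

    CatalogAwareValidator(java.util.Map<String, String> catalogGranularity) {
        this.catalogGranularity = catalogGranularity;
    }

    @Override
    void validateInsert(InsertNode insert) {
        super.validateInsert(insert); // keep the base checks
        // Check moved out of the statement handler: PARTITIONED BY must come
        // from either the query itself or the table's catalog metadata.
        if (insert.partitionedBy == null
                && !catalogGranularity.containsKey(insert.targetTable)) {
            throw new IllegalArgumentException(
                "PARTITIONED BY is required when the catalog has no granularity for "
                + insert.targetTable);
        }
    }
}
```

The point of the move is that checks fire during validation, where Calcite can attach query positions to errors, instead of in per-statement handlers.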
@github-actions github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Feb 14, 2024
@zachjsh zachjsh changed the title ingestHanlders Validate sql insert node (WIP) ingestHandlers Validate sql insert node (WIP) Feb 14, 2024
Member

@kgyrtkirk kgyrtkirk left a comment

Left some comments; I'm not sure I understand every bit of it...

Is this code covered by tests?
There are quite a few new incorrect-state checks; it would be nice to cover at least some of them with directed tests.

(SqlNodeList) operands[5],
// Must match DruidSqlReplace.getOperandList()
operands[6],
null // fix this
Member

What needs to be fixed? I would recommend either using FIXME and addressing it before merging the patch, or removing this comment and making sure the bad case can't happen by throwing a reasonable exception.

Contributor Author

@zachjsh zachjsh Feb 16, 2024

I think that in order for this to work properly, the exportFileFormat needs to be changed into an SqlNode and added as an operand. Without this, I don't think that parameterized queries using export capabilities will work. cc @adarshsanjeev

Contributor Author

fixed

{
return new DruidSqlReplace(
pos,
// Must match SqlInsert.getOperandList()
Member

I don't really understand these comments - they don't help me understand... do we need them?
You could create a bunch of local variables with the casted types and name them accordingly - that might help, or even provide a place to add comments...

Contributor Author

Removed the comments; let me know if it's OK now.

if (!query.isA(SqlKind.QUERY)) {
throw InvalidSqlInput.exception("Unexpected SQL statement type [%s], expected it to be a QUERY", query.getKind());
}
return DruidSqlInsert.create(new SqlInsert(
Member

I think the contents of this method could just be pushed into DruidSqlInsert; it seems like all of this belongs there...

...or there is something I've missed...

Contributor Author

See comment here: #15908 (comment)

}
}

private void validateSegmentGranularity(
Member

Seeing a method which is void and named validateSegmentGranularity, I have a feeling the method has lost its context.

Meaning: ideally this should be named something like getEffectiveGranularity and return a Granularity - but in the process of getting the actual granularity it must also identify invalid cases...

Do you see any possible better place for it?
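The shape the reviewer suggests, a method that returns the value it computes and throws while computing it rather than a void validate-only method, could look roughly like this. Everything here is hypothetical (the enum, the conflict policy, the map-based catalog); Druid's real Granularity type and catalog metadata are not reproduced.

```java
import java.util.Map;

class GranularityResolver {
    enum Granularity { HOUR, DAY, ALL }

    // Instead of a void validateSegmentGranularity(...), return the effective
    // value and throw on invalid input while computing it. Callers then cannot
    // forget to use the validated result.
    static Granularity getEffectiveGranularity(
            String table,
            Granularity fromQuery,             // null if PARTITIONED BY was omitted
            Map<String, Granularity> catalog   // catalog-declared granularities
    ) {
        Granularity declared = catalog.get(table);
        if (fromQuery != null) {
            // Hypothetical conflict policy: reject a query granularity that
            // contradicts the catalog rather than silently preferring one.
            if (declared != null && declared != fromQuery) {
                throw new IllegalArgumentException(
                    "PARTITIONED BY " + fromQuery + " conflicts with catalog value " + declared);
            }
            return fromQuery;
        }
        if (declared == null) {
            throw new IllegalArgumentException(
                "PARTITIONED BY is required: no granularity in catalog for " + table);
        }
        return declared;
    }
}
```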

Contributor Author

Good point! Fixed. Let me know if it's better now.

final List<RelDataTypeField> sourceFields = sourceType.getFieldList();
for (final RelDataTypeField sourceField : sourceFields) {
// Check that there are no unnamed columns in the insert.
if (UNNAMED_COLUMN_PATTERN.matcher(sourceField.getName()).matches()) {
Member

Are there any tests for this exception? What if the pattern doesn't match anymore?

Contributor Author

Yes! There are tests in CalciteInsertDmlTest; for example, org.apache.druid.sql.calcite.CalciteInsertDmlTest#testInsertWithUnnamedColumnInSelectStatement.
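For context on the check being discussed: Calcite assigns generated names like EXPR$0, EXPR$1 to select-list expressions that have no alias, and the quoted code rejects such names. A minimal self-contained sketch of that kind of check follows; the pattern here is an assumption of the usual Calcite naming convention, not necessarily Druid's actual UNNAMED_COLUMN_PATTERN.

```java
import java.util.List;
import java.util.regex.Pattern;

class UnnamedColumnCheck {
    // Assumed shape: Calcite's generated aliases look like EXPR$<n>.
    static final Pattern UNNAMED_COLUMN_PATTERN =
        Pattern.compile("^EXPR\\$\\d+$", Pattern.CASE_INSENSITIVE);

    static void requireNamedColumns(List<String> sourceFieldNames) {
        for (String name : sourceFieldNames) {
            if (UNNAMED_COLUMN_PATTERN.matcher(name).matches()) {
                throw new IllegalArgumentException(
                    "Insertion requires columns to be named, but [" + name + "] is unnamed");
            }
        }
    }
}
```

Under this assumption, a source like SELECT __time, 1+1 FROM foo surfaces a second field named EXPR$1 and is rejected.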

if (query instanceof SqlOrderBy) {
SqlOrderBy sqlOrderBy = (SqlOrderBy) query;
SqlNodeList orderByList = sqlOrderBy.orderList;
if (!(orderByList == null || orderByList.equals(SqlNodeList.EMPTY))) {
Member

Why have all this here? Isn't DruidSqlIngest subclassed to have DruidSqlInsert?
Shouldn't this happen in DruidSqlInsert when it gets created?

Contributor Author

Doing it in DruidSqlInsert construction changes the output of unparse; I'm not sure if that's OK. There are a few tests in DruidSqlUnparseTest.java that check the expected output of unparse. I can update these tests, but I'm just not sure whether doing this has other ramifications.

@zachjsh zachjsh marked this pull request as ready for review February 21, 2024 08:58
@zachjsh zachjsh changed the title ingestHandlers Validate sql insert node (WIP) Move INSERT & REPLACE validation to the Calcite validator Feb 21, 2024
// Copied here from MSQE since that extension is not visible here.
public static final String CTX_ROWS_PER_SEGMENT = "msqRowsPerSegment";

public interface ValidatorContext
Member

note: unused interface

Contributor

+1

}

// The target namespace is both the target table ID and the row type for that table.
final SqlValidatorNamespace targetNamespace = getNamespace(insert);
Member

Could we copy getNamespaceOrThrow over from SqlValidatorImpl, or use requireNonNull, just to avoid a possible issue if this ends up being null?

// know names and we match by name.) Thus, we'd have to validate (to know names and types)
// to get the target types, but we need the target types to validate. Catch-22. So, we punt.
final SqlValidatorScope scope;
if (source instanceof SqlSelect) {
Member

note: it seems like most of this is copied over from SqlValidatorImpl#validateInsert, then extended / refactored / had methods extracted / been commented on...
I think it might be challenging to maintain this in the long run - however, validateInsert doesn't seem to change very often.
I think it would be useful to leave some comments about the origins of this method in the Javadoc of the validateInsert method.

for (final RelDataTypeField sourceField : sourceFields) {
// Check that there are no unnamed columns in the insert.
if (UNNAMED_COLUMN_PATTERN.matcher(sourceField.getName()).matches()) {
throw InvalidSqlInput.exception(
Member

Could you replace these exceptions with ones communicating the SqlNode, if that's interesting/valuable...

If it's a general error, that doesn't matter... but this error is specific to a selected column:

        throw buildCalciteContextException(
            "Insertion requires columns to be named....",
            getSqlNodeFor(insert, sourceFields.indexOf(sourceField))
        );

rough getSqlNodeFor method:

  SqlNode getSqlNodeFor(SqlInsert insert, int idx) {
    SqlNode src = insert.getSource();
    if (src instanceof SqlSelect) {
      SqlSelect sqlSelect = (SqlSelect) src;
      SqlNodeList selectList = sqlSelect.getSelectList();
      if (idx < selectList.size()) {
        return selectList.get(idx);
      }
    }
    return src;
  }

@@ -1765,7 +1762,6 @@ public void testErrorWhenInputSourceInvalid()
+ "partitioned by DAY\n"
+ "clustered by channel";
HashMap<String, Object> context = new HashMap<>(DEFAULT_CONTEXT);
Member

I can't comment on the testInsertWithInvalidColumnNameInIngest test case, but the intention of the check is to ensure that something like:

INSERT INTO t SELECT __time, 1+1 FROM foo PARTITIONED BY ALL

is caught; could you change or add something like this as a test case?

@zachjsh zachjsh merged commit 8ebf237 into apache:master Feb 22, 2024
82 of 83 checks passed
@zachjsh zachjsh deleted the validate-sqlInsert-node branch February 22, 2024 19:02
@abhishekagarwal87
Contributor

@zachjsh - This was merged without a committer approval. Can you revert it and open another PR so that you can get a committer approval?

Contributor

@abhishekrb19 abhishekrb19 left a comment

Retroactive approval - LGTM overall. Left a few minor comments which can be addressed in a follow-up.

public void testInsertHourGrain()
{
testIngestionQuery()
.sql("INSERT INTO hourDs\n" +
Contributor

In a follow-up, could you also please add a test for a REPLACE query where the PARTITIONED BY clause is omitted?

}

/**
* If the segment grain is given in the catalog then use this value is used.
Contributor

Suggested change
* If the segment grain is given in the catalog then use this value is used.
If the segment grain is given in the catalog and absent in the PARTITIONED BY clause in the query, then use the value from the catalog.

Comment on lines +54 to +55
@Nullable SqlIdentifier exportFileFormat,
@Nullable SqlNode replaceTimeQuery
Contributor

curious, is the order of args swapped for a reason?

// Copied here from MSQE since that extension is not visible here.
public static final String CTX_ROWS_PER_SEGMENT = "msqRowsPerSegment";

public interface ValidatorContext
Contributor

+1

@@ -1679,7 +1679,6 @@ public void testErrorWithUnableToConstructColumnSignatureWithExtern()
+ "partitioned by DAY\n"
+ "clustered by channel";
HashMap<String, Object> context = new HashMap<>(DEFAULT_CONTEXT);
context.put(PlannerContext.CTX_SQL_OUTER_LIMIT, 100);
Contributor

Curious, are these context parameters not required by these tests anymore?

zachjsh added a commit to zachjsh/druid that referenced this pull request Feb 27, 2024
zachjsh added a commit that referenced this pull request Mar 25, 2024
…urce input expressions (#15962)

* * address remaining comments from #15836

* *  address remaining comments from #15908

* * add test that exposes relational algebra issue

* * simplify test exposing issue

* * fix

* * add tests for sealed / non-sealed

* * update test descriptions

* * fix test failure when -Ddruid.generic.useDefaultValueForNull=true

* * check type assignment based on native Druid types

* * add tests that cover missing jacoco coverage

* * add replace tests

* * add more tests and comments about column ordering

* * simplify tests

* * review comments

* * remove commented line

* * STRING family types should be validated as non-null
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024