
[FLINK-25171] Validation of duplicate fields in ddl sql #18017

Closed
wants to merge 13 commits

Conversation


@jelly-1203 jelly-1203 commented Dec 6, 2021

What is the purpose of the change

The purpose of this pull request is to add validation that the fields of a derived table are not duplicated.

Brief change log

  • Added verification that the fields of a derived table are not duplicated

Verifying this change

  • Added tests to verify the exception is thrown in the correct scenarios
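To make the intent concrete, here is a minimal, hypothetical sketch of this kind of check (the class and method names are invented for illustration, not the actual Flink internals): registering a column fails if the name was already seen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Flink's actual code: reject a duplicate
// column name while collecting the fields declared in a DDL statement.
public class DuplicateColumnCheck {

    static void addColumn(Map<String, String> fieldTypes, String name, String type) {
        // Map.put returns the previous value mapped to the key, or null.
        String oldType = fieldTypes.put(name, type);
        if (oldType != null) {
            throw new IllegalArgumentException(
                    String.format("A column named '%s' already exists in the table.", name));
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        addColumn(fields, "id", "INT");
        addColumn(fields, "name", "STRING");
        try {
            addColumn(fields, "name", "VARCHAR(32)"); // duplicate: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```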

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@jelly-1203 jelly-1203 changed the title [FLINK-25171] Added validation for duplicate fields in derived tables [FLINK-25171] Validation of duplicate fields in derived tables Dec 6, 2021
@flinkbot
Collaborator

flinkbot commented Dec 6, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@flinkbot
Collaborator

flinkbot commented Dec 6, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 99e0300 (Mon Dec 06 07:16:08 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!
  • This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@wenlong88
Contributor

Thanks for the contribution, @jelly-1203, you may need to add some tests for the change.
By the way, you could clean up the description of the PR. The template is guidance to help you fill in the information that is important for reviewers.

@jelly-1203
Author


@wenlong88 Thanks for your comments. I will add tests for this change later and tidy up the description of this PR.

@jelly-1203 jelly-1203 closed this Dec 7, 2021
@jelly-1203 jelly-1203 reopened this Dec 7, 2021
@jelly-1203
Author


@wenlong88 Hello, I have added tests for this change and modified the description of this PR. Please review it again.

@xuyangzhong
Contributor

Hi, thanks for your contribution, @jelly-1203. It seems there are code-style violations in your new code; you can see them in the CI report above. You can follow this doc to set up automatic code formatting in IDEA: https://nightlies.apache.org/flink/flink-docs-master/docs/flinkdev/ide_setup/#code-formatting

@jelly-1203
Author

jelly-1203 commented Dec 8, 2021


Hi @xuyangzhong @wenlong88, thanks for your comments. I have adjusted the code style, but compilation still does not pass. Looking at the error message, I found that org.apache.flink.table.planner.plan.stream.sql.UnionTest executes an SQL statement with repeated fields, so the compilation failed. Could you tell me whether I need to open an issue and modify that SQL?

The code causing the error:

util.tableEnv.executeSql(
  s"""
     |CREATE TABLE t1 (
     |  id int,
     |  ts bigint,
     |  name string,
     |  timestamp_col timestamp(3),
     |  val bigint,
     |  name varchar(32),
     |  timestamp_ltz_col as TO_TIMESTAMP_LTZ(ts, 3),
     |  watermark for timestamp_col as timestamp_col
     |) WITH (
     |  'connector' = 'values',
     |  'bounded' = 'false'
     |)
   """.stripMargin)

@xuyangzhong
Contributor

xuyangzhong commented Dec 9, 2021

Hi, @jelly-1203. You can rename one of the "name" columns for convenience. You don't need to mention the issue here because Git records this change; if someone is confused by the modification, they can find your issue. So just modify it directly, and code review will verify the correctness of the modification. BTW, you may also need to update the generated plan in UnionTest.xml.

@jelly-1203
Author


Hi, @xuyangzhong, thanks for your comment. I will modify it and update the matching logical plan in UnionTest.xml.

Contributor

@xuyangzhong xuyangzhong left a comment


I have left some comments following.

@@ -40,10 +40,9 @@ class UnionTest extends TableTestBase {
|CREATE TABLE t1 (
| id int,
| ts bigint,
| name string,
| name varchar(32),
Contributor


If you want to delete one of them, I think it's better to keep the one that has the same type as the other tables.

if (oldType != null) {
throw new ValidationException(
String.format(
"A column named '%s' already exists in the derived table.",
Contributor


The exception message is confusing (where is the derived table?), because users may only use "create table ..." instead of "create table ... like ...". The previous message has the same problem. IMO you can delete the word 'derived'.

@jelly-1203 jelly-1203 changed the title [FLINK-25171] Validation of duplicate fields in derived tables [FLINK-25171] Validation of duplicate fields in ddl sql Dec 9, 2021
@jelly-1203
Author


Hi @xuyangzhong, thanks for your advice. I will make the adjustments as soon as possible.

Contributor

@wenlong88 wenlong88 left a comment


Thanks for the contribution, I left a comment. I don't think the validation is currently added in the right place.

@@ -494,7 +494,13 @@ private void collectPhysicalFieldsTypes(List<SqlNode> derivedColumns) {
boolean nullable = type.getNullable() == null ? true : type.getNullable();
RelDataType relType = type.deriveType(sqlValidator, nullable);
// add field name and field type to physical field list
physicalFieldNamesToTypes.put(name, relType);
RelDataType oldType = physicalFieldNamesToTypes.put(name, relType);
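The hunk above relies on the value returned by Map.put, which is the previous mapping for the key (or null if there was none), so a single call can both register the field and detect a duplicate. A minimal standalone illustration of that idiom:

```java
import java.util.HashMap;
import java.util.Map;

// Map.put returns the previous value for the key, or null if the key
// was absent, so a non-null return signals a duplicate field name.
public class PutReturnsPrevious {
    public static void main(String[] args) {
        Map<String, String> types = new HashMap<>();
        String first = types.put("ts", "BIGINT");     // null: no previous mapping
        String second = types.put("ts", "TIMESTAMP"); // "BIGINT": duplicate key
        System.out.println(first + " / " + second);   // prints "null / BIGINT"
    }
}
```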
Contributor


I think it is not enough to add the check here: when there are name conflicts involving computed columns or metadata columns, the check here would not work well. You can try to add the validation in appendDerivedColumns.

Author



Hi, @wenlong88 Thanks for your advice; I think it is very meaningful. I will verify it first and make adjustments if there is any problem.

Author



Hi, @wenlong88 Thanks for your advice, which is of great help to me. I found several problems during testing:

  1. Duplicate columns are overwritten when the computed column comes first and the regular column comes last.
  2. With a metadata column first, a computed column in the middle, and a regular column last, if the metadata column has the same name as the regular column, then while generating the computed column the accessibleFieldNamesToTypes.putAll call overwrites the repeated fields.

I have adjusted the code accordingly and added tests for the adjustment. Please help to review it.

if (!result.isEmpty()) {
throw new ValidationException(
"A field name conflict exists between a field of the regular type and a field of the Metadata type.");
}
Contributor


Maybe we can just check for duplication when putting the new Column into columns, at the end of this function?

Author



hi @wenlong88
Thanks for your review and comment. I do not think duplication can be checked when putting the new Column into columns at the end of this function, for the following reasons:

  1. If computed or metadata columns use the overwriting merge strategy, duplicate fields of the same type are allowed.
  2. When adding physical columns and metadata columns to accessibleFieldNamesToTypes, if the metadata column comes first and duplicates a physical column, putAll makes the metadata column overwrite the duplicate physical column, which can result in a generated computed column that is not as expected.
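The putAll behavior described in point 2 can be reproduced with a plain HashMap: duplicate keys are overwritten silently, with no error raised.

```java
import java.util.HashMap;
import java.util.Map;

// putAll silently overwrites duplicate keys: a metadata column with the
// same name replaces the physical column without any error.
public class PutAllOverwrites {
    public static void main(String[] args) {
        Map<String, String> accessible = new HashMap<>();
        accessible.put("name", "STRING");        // physical column

        Map<String, String> metadata = new HashMap<>();
        metadata.put("name", "VARCHAR(32)");     // metadata column, same name

        accessible.putAll(metadata);             // no error, value replaced
        System.out.println(accessible.get("name")); // prints "VARCHAR(32)"
    }
}
```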

Author



hi @wenlong88, do you think my view is correct?

Contributor


hi, @jelly-1203 thanks for the update and analysis. I agree that it is not possible to add a unified check at the end, but I still think the check here can be improved a bit:

  1. it seems that the check here is not relevant to the newly added computed column; if you want to check the duplication, it is better to add it when updating metadataFieldNamesToTypes.
  2. according to the current implementation, regular columns have top priority (we call collectPhysicalFieldsTypes at the beginning), so we may also need to check whether there is a duplicated name in physicalFieldNamesToTypes when trying to add a metadata column or computed column. If we add such a check, the check in 1 is not necessary any more.

what do you think?
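A rough sketch of suggestion 2 above, with invented names (not the actual Flink code): because regular columns are collected first, a metadata or computed column whose name already appears among the physical fields can be rejected at the point where it is appended.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the suggestion above (names are illustrative,
// not the actual Flink internals): regular columns are collected first,
// so a metadata or computed column whose name already exists among the
// physical fields is rejected when it is appended.
public class AppendDerivedColumnsSketch {

    private final Map<String, String> physicalFieldNamesToTypes = new HashMap<>();

    void addPhysicalColumn(String name, String type) {
        physicalFieldNamesToTypes.put(name, type);
    }

    void appendDerivedColumn(String name, String type) {
        if (physicalFieldNamesToTypes.containsKey(name)) {
            throw new IllegalArgumentException(
                    String.format(
                            "A column named '%s' already exists in the regular columns.", name));
        }
        // ... otherwise register the metadata/computed column ...
    }

    public static void main(String[] args) {
        AppendDerivedColumnsSketch sketch = new AppendDerivedColumnsSketch();
        sketch.addPhysicalColumn("two", "STRING");
        try {
            sketch.appendDerivedColumn("two", "TIMESTAMP(3)"); // conflicts with regular column
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```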

Author



hi, @wenlong88 I think your suggestion is reasonable; I will try to modify it.

… columns to put a new column into columns at the end of function.
regularColumn("four", DataTypes.STRING()));

thrown.expect(ValidationException.class);
thrown.expectMessage(
Contributor


in this case, I think it would be better to throw an error stating that there is a duplicate column named 'two' between the metadata columns and the regular columns?

Author


Ok, I'll make this error message more explicit.


@jelly-1203
Author

Hi, @wenlong88 I have adjusted the position of the validation logic and made the error message clearer. Could you please help to review it and see what needs to be improved? Thank you

@jelly-1203
Author

Hi, @wenlong88 Could you please review it and see what needs to be improved?

@wenlong88
Contributor

LGTM, cc @godfreyhe to do the final check

@jelly-1203
Author

Hi, @godfreyhe
Please find time to do the final check. If there is any deficiency, I will continue to improve it.

@jelly-1203
Author

Hi @wenlong88, could you please ping @godfreyhe again? There has been no progress on this issue.

@jelly-1203
Author

anyone?

@wenlong88
Contributor

@jelly-1203 thanks for following up; I will ping @godfreyhe offline to follow up on the PR.

Contributor

@godfreyhe godfreyhe left a comment


Sorry for the late response, LGTM, I will merge it

@godfreyhe godfreyhe closed this in 34de398 Jan 25, 2022
godfreyhe pushed a commit that referenced this pull request Jan 25, 2022
niklassemmler pushed a commit to niklassemmler/flink that referenced this pull request Feb 3, 2022