-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-25171] Validation of duplicate fields in ddl sql #18017
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 99e0300 (Mon Dec 06 07:16:08 UTC 2021) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
Thanks for the contribution, @jelly-1203 , you may need to add some test on the change. |
@wenlong88 Thanks for your comments, I will add tests for this change later and tidy up the description of this pr |
@wenlong88 Hello, I have added test for this change and modified the description of this pr, please help to review it again |
Hi, thanks for your contribution, @jelly-1203 . It seems that you have a bad code style in your new code. You can see it in CI report above. You can follow this doc to set the code style to automate it in IDEA: https://nightlies.apache.org/flink/flink-docs-master/docs/flinkdev/ide_setup/#code-formatting |
Hi, @xuyangzhong @wenlong88 thanks for your comments, I have adjusted the code style, compilation is still not passed, by looking at the error message, found in the org.apache.flink.table.planner.plan.stream.sql.UnionTest, An SQL statement with repeated fields was executed, so the compilation failed. Please kindly ask whether I need to mention an issue and modify the SQL? Error code:
|
Hi, @jelly-1203. You can rename one of the "name" for convenience. You needn't to mention your issue here because the GIT can record this change. If someone confuses this modifier, he can find your issue. So just modify it directly and code review will verify this correctness about this modifier. BTW, you may also need to modify the generated plan in UnionTest.xml. |
Hi, @xuyangzhong, Thanks for your comment, I will modify it and note the match to the logical plan in uniontest.xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left some comments following.
@@ -40,10 +40,9 @@ class UnionTest extends TableTestBase { | |||
|CREATE TABLE t1 ( | |||
| id int, | |||
| ts bigint, | |||
| name string, | |||
| name varchar(32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to delete one of them, I think it's better to keep the one that has the same type with other tables.
if (oldType != null) { | ||
throw new ValidationException( | ||
String.format( | ||
"A column named '%s' already exists in the derived table.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception message is confusing where the derived table is, because users maybe only use "create table ..." instead of "create table ... like ...". But the previous one can also cause this bug. IMO, you can delete the word 'derived'.
…of the name field of table T1 in UnionTest
Hi,@xuyangzhong Thanks for your advice. I will make adjustments as soon as possible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution, I left a comment. I think it is not a good place where the validation added currently.
@@ -494,7 +494,13 @@ private void collectPhysicalFieldsTypes(List<SqlNode> derivedColumns) { | |||
boolean nullable = type.getNullable() == null ? true : type.getNullable(); | |||
RelDataType relType = type.deriveType(sqlValidator, nullable); | |||
// add field name and field type to physical field list | |||
physicalFieldNamesToTypes.put(name, relType); | |||
RelDataType oldType = physicalFieldNamesToTypes.put(name, relType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not enough to add check here, when there is name conflicts between computedColumn or metadata Column, the check here would not work well. you can try to add validation in appendDerivedColumns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not enough to add check here, when there is name conflicts between computedColumn or metadata Column, the check here would not work well. you can try to add validation in appendDerivedColumns.
Hi, @wenlong88 Thanks for your advice. I think it is very meaningful. I will verify it first and make adjustments in time if there is any problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not enough to add check here, when there is name conflicts between computedColumn or metadata Column, the check here would not work well. you can try to add validation in appendDerivedColumns.
Hi, @wenlong88 Thanks for your advice, which is of great help to me. I found several problems during the testing process
- Duplicate columns are overwritten when the compute column is first and the regular column is last
- If the metadata column is first, use the compute column in the middle and the regular column in the rear. If the metadata column name is the same as the regular column name, use the compute column in the middle and the regular column name. In the generated comput column process, namely in accessibleFieldNamesToTypes. PutAll methods covered in repeated fields.
I have adjusted the code accordingly and added test for adjustment. Please help to review it
if (!result.isEmpty()) { | ||
throw new ValidationException( | ||
"A field name conflict exists between a field of the regular type and a field of the Metadata type."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we can just check duplication when put the new Column to the columns, at the end of this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we can just check duplication when put the new Column to the columns, at the end of this function?
hi @wenlong88
Thanks for your review and comment. I do not think duplication can be checked when putting the new Column to the columns, at the end of this function. The reasons are as follows:
- If computeColumn or MetadataColumn uses overwrite's Merge strategy, duplicate fields are allowed for the same type.
- To add physical accessibleFieldNamesToTypes columns and metadata columns, if metadataColumn before, and repeated physical column, leads to putAll, Metadata columns overwrite duplicate physical columns, which can result in a generated computeColumn that is not as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we can just check duplication when put the new Column to the columns, at the end of this function?
hi,@wenlong88 Do you think my view is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi, @jelly-1203 thanks for the update and analysis, I agree that it not possible to add unified check at the end. but I still think that the check here can be improved a bit:
- it seems that the check here is not relevant to the new added ComputingColumn, if you want to check the duplication, it is better to add it when updating metadataFieldNamesToTypes.
- according to current implementation, regular columns have top priority(we collect collectPhysicalFieldsTypes at the beginning), so we may also need to check if there is duplicated name in physicalFieldNamesToTypes when trying to a metadata column or computed column. If we add such check, the check in 1 is not necessary any more.
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi, @jelly-1203 thanks for the update and analysis, I agree that it not possible to add unified check at the end. but I still think that the check here can be improved a bit:
- it seems that the check here is not relevant to the new added ComputingColumn, if you want to check the duplication, it is better to add it when updating metadataFieldNamesToTypes.
- according to current implementation, regular columns have top priority(we collect collectPhysicalFieldsTypes at the beginning), so we may also need to check if there is duplicated name in physicalFieldNamesToTypes when trying to a metadata column or computed column. If we add such check, the check in 1 is not necessary any more.
what do you think?
hi, @wenlong88 I think your suggestion is reasonable, I will try to modify it
… columns to put a new column into columns at the end of function.
regularColumn("four", DataTypes.STRING())); | ||
|
||
thrown.expect(ValidationException.class); | ||
thrown.expectMessage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, I think it would be better to throw with information that there are duplicate column with name 'two' in metadata column and regular column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contributor
Ok, I'll return this error message more explicitly
if (!result.isEmpty()) { | ||
throw new ValidationException( | ||
"A field name conflict exists between a field of the regular type and a field of the Metadata type."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi, @jelly-1203 thanks for the update and analysis, I agree that it not possible to add unified check at the end. but I still think that the check here can be improved a bit:
- it seems that the check here is not relevant to the new added ComputingColumn, if you want to check the duplication, it is better to add it when updating metadataFieldNamesToTypes.
- according to current implementation, regular columns have top priority(we collect collectPhysicalFieldsTypes at the beginning), so we may also need to check if there is duplicated name in physicalFieldNamesToTypes when trying to a metadata column or computed column. If we add such check, the check in 1 is not necessary any more.
what do you think?
Hi, @wenlong88 I have adjusted the position of the verification logic and made the error message return more clear. Could you please help to review it and see what needs to be improved? Thank you |
Hi, @wenlong88 Could you please review it and see what needs to be improved |
LGTM,cc @godfreyhe to do the final check |
Hi, @godfreyhe |
Hi,@wenlong88 could you please ping @godfreyhe again? There is no progress on this issue |
anyone? |
@jelly-1203 thanks for the following up, I would ping @godfreyhe offline to follow the pr. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late response, LGTM, I will merge it
What is the purpose of the change
The purpose of this pull Request is to add validation to the derived table that the fields are duplicated
Brief change log
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation