Flink: support watermark and computed columns #4625
Conversation
Relevant PRs: #2265, #3681
Thank you for this big undertaking @wuwenchi!
I believe Flink 1.15 was released, so I am going to rebase my Flink 1.15 support PR to the release branch and we should have 1.15 support in the next few days.
Just a heads up that anything done here will eventually be required to work with that. But I’d wait to get more feedback, given the large scope of these features.
Happy to see the interest in this!
BTW - Have you given any thought to possibly trying to support tables with hidden partitioning (that use partition transforms other than identity transform) via generated columns? Likely out of scope, but that’s one way I’ve been considering to add support.
@kbendick Thanks for your reply!
Is this feature already supported on the Flink 1.15 branch you are working on now? Or should we wait until the Flink 1.15 branch is created and then add this feature through this PR?
Yes, this feature would be very useful to me, and supporting it is the next item on my plan. If you don't mind, may I join in on this feature?
nice job
I have an uber question regarding storing those properties in Iceberg table properties. Watermark config is very job-specific: different Flink jobs may have different watermark assignments.
There are some discussions in PR #3681.
@stevenzwu Thanks for your reply! We have also considered this issue. So my thoughts are: for the same table
Great job. This feature is useful for a lot of work, and I think we should put some thought into 'Schema' and 'ResolvedSchema'. I look forward to more feature support after we finish the discussion on 'ResolvedSchema', etc.
@@ -52,35 +63,69 @@
 */
public class FlinkSchemaUtil {

  public static final String FLINK_PREFIX = "flink.";

  public static final String COMPUTED_COLUMNS = "computed-columns.";
use computed-column?
Yes, the 's' is redundant; I will fix it.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(new Configuration());
StreamTableEnvironment streamTableEnvironment = StreamTableEnvironment.create(env);
CatalogManager catalogManager = ((TableEnvironmentImpl) streamTableEnvironment).getCatalogManager();
SchemaResolver schemaResolver = catalogManager.getSchemaResolver();
return table.getUnresolvedSchema().resolve(schemaResolver);
Suggested change:
- StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(new Configuration());
- StreamTableEnvironment streamTableEnvironment = StreamTableEnvironment.create(env);
- CatalogManager catalogManager = ((TableEnvironmentImpl) streamTableEnvironment).getCatalogManager();
- SchemaResolver schemaResolver = catalogManager.getSchemaResolver();
- return table.getUnresolvedSchema().resolve(schemaResolver);
+ Configuration configuration = ExecutionEnvironment.getExecutionEnvironment().getConfiguration();
+ TableEnvironment tableEnvironment = TableEnvironment.create(configuration);
+ Preconditions.checkArgument(table.getDescription().isPresent(), "Illegal table.");
+ return tableEnvironment.from(table.getDescription().get()).getResolvedSchema();
Based on your comment (#4246 (comment)), I took a look at our previous proposal and made some suggestions, but they may not be optimal.
Here, I think using 'TableEnvironment' directly is a better choice: it has better commonality, whether streaming, batch, or otherwise, and hides the underlying differences.
Let's see if somebody else has a better solution.
This method is indeed better!
But the description is taken from the table's comment:
@Override
public Optional<String> getDescription() {
  return Optional.of(getComment());
}
The comment does not necessarily exist, and even if it exists, it is not necessarily the path of the table, so tableEnvironment.from may not be able to get the table...
Yes, so if we go this way, we might need to add a comment here.
iceberg/flink/v1.14/flink/src/main/java/org/apache/iceberg/flink/FlinkCatalog.java
Lines 610 to 614 in 1300b8a
// NOTE: We can not create a IcebergCatalogTable extends CatalogTable, because Flink optimizer may use
// CatalogTableImpl to copy a new catalog table.
// Let's re-loading table from Iceberg catalog when creating source/sink operators.
// Iceberg does not have Table comment, so pass a null (Default comment value in Flink).
return CatalogTable.of(schema, null, partitionKeys, table.properties());
Alternatively, we can get the table via its path. It should be in ObjectPath.
TableEnvironment#Table from(String path);
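For illustration, a minimal sketch of that idea, assuming the TableEnvironment already has the Iceberg catalog registered (the catalog, database, and table names below are hypothetical):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.catalog.ResolvedSchema;

public class ResolveByPathExample {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());
    // Assumes a catalog named "iceberg_catalog" was registered beforehand,
    // e.g. via tEnv.executeSql("CREATE CATALOG iceberg_catalog WITH (...)").
    ObjectPath path = new ObjectPath("db", "sample_table");  // hypothetical database/table
    ResolvedSchema schema = tEnv
        .from("iceberg_catalog." + path.getFullName())       // "catalog.database.table"
        .getResolvedSchema();
    System.out.println(schema);
  }
}

As the next comment points out, a freshly created TableEnvironment only knows default_catalog and default_database, so this only works if the catalog is registered first.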
The tableEnvironment here is a newly created environment with only default_catalog and default_database, but this table actually exists in a catalog in the Flink environment, so I think tableEnvironment.from will not find the required table. I also actually tested it and couldn't find it.
It's a bit tricky. I'll debug it locally to see if there's a better way.
- env.getConfig().getConfiguration().set(FlinkConfigOptions.TABLE_EXEC_ICEBERG_INFER_SOURCE_PARALLELISM, false);
+ env.getConfig().getConfiguration()
+     .set(FlinkConfigOptions.TABLE_EXEC_ICEBERG_INFER_SOURCE_PARALLELISM, false)
+     .set(TableConfigOptions.LOCAL_TIME_ZONE, "UTC");
good
@@ -374,7 +378,7 @@ void createIcebergTable(ObjectPath tablePath, CatalogBaseTable table, boolean ig
    throws CatalogException, TableAlreadyExistException {
  validateFlinkTable(table);

-  Schema icebergSchema = FlinkSchemaUtil.convert(table.getSchema());
+  Schema icebergSchema = FlinkSchemaUtil.convert(((ResolvedCatalogTable) table).getResolvedSchema());
is it always safe to type cast to ResolvedCatalogTable?
Yes, in this interface we always get a ResolvedCatalogTable.
nice pr
2. create environment according to executionEnvironment
Thanks @wuwenchi for working on this!
I too would like to see watermark support and computed column support, particularly so that we can have partition transforms.
I have some concern, like @stevenzwu mentioned, about using table properties for things that might be very job-specific.
For example, I know that @chenjunjiedada has recommended that we don't use the write.upsert-enabled config and instead set that per job. While we can't immediately deprecate that configuration (and may never), I agree with his advice and I'd be hesitant to add more table properties that should be job-level properties.
My other concern is using flink. in table property names. While it might be the case today that these features are only supported by Flink (as the write.upsert-enabled configuration has been), we should ideally name things without regard for any specific engine, as hopefully those engines will support such features in the future (or we can help add support for them).
Overall, I really want to commend you on a job well done here. It likely needs some changes, and might benefit from some kind of design discussion first, but truly thank you for making this PR, as I'm sure much of this code will be useful however this winds up getting designed (be it this way or another). And if this code is used, it should definitely be credited.
I'm interested in exploring some of @yittg's ideas from #4251 and combining some of them with your ideas, such as this updated idea of yours in #5000.
As for computed columns, we are trying to support them in Iceberg itself, so that not only Flink but other engines can use them as well; that change is quite large, see #4994 for reference.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
We hope Iceberg can support watermarks and computed columns.
The specific implementation details are as follows:
Background and Motivation
There is a temporal table in Iceberg that needs to be joined in real time with other tables, so a watermark needs to be defined. However, a watermark has requirements on the attributes of its field, and the time fields in the source table may not meet these requirements, so the source table's time fields need to be converted by recalculating them or calling functions.
Why not directly use the Iceberg connector table supported by Flink?
Goal
Proposal
Table property storage format
Example
Then the table properties are saved as:
key format
fixed prefix + field name:
flink.watermark.<field name>
flink.computed-columns.<field name>
value format
the expression defined by the user.
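As a purely illustrative sketch (the column names "ts_ms" and "ts_ltz" and the expressions below are made up), such properties could be written through the Iceberg API like this, following the prefix-plus-field-name key format above:

import org.apache.iceberg.Table;

public class FlinkPropertiesExample {
  // Hypothetical columns: "ts_ms" is a physical column, "ts_ltz" a computed column.
  // The keys follow the "flink." prefix + "computed-columns."/"watermark." + field name convention.
  public static void addFlinkSchemaProperties(Table table) {
    table.updateProperties()
        .set("flink.computed-columns.ts_ltz", "TO_TIMESTAMP_LTZ(ts_ms, 3)")
        .set("flink.watermark.ts_ltz", "ts_ltz - INTERVAL '5' SECOND")
        .commit();
  }
}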
How to add, delete, and modify
1. Flink SQL DDL (only supports adding)
2. Table property syntax (a sketch of both paths follows below)
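A hedged sketch of what the two paths could look like from Flink SQL; the catalog, table, columns, and expressions are illustrative, and the user-facing property keys in the ALTER TABLE statement are an assumption rather than this PR's final syntax:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class AddComputedColumnExample {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());

    // 1. Flink SQL DDL: declare the computed column and watermark when creating the table
    //    (assumes an Iceberg catalog named "ice" is already registered).
    tEnv.executeSql(
        "CREATE TABLE ice.db.sample ("
            + "  id BIGINT,"
            + "  ts_ms BIGINT,"
            + "  ts_ltz AS TO_TIMESTAMP_LTZ(ts_ms, 3),"
            + "  WATERMARK FOR ts_ltz AS ts_ltz - INTERVAL '5' SECOND"
            + ")");

    // 2. Table property syntax: express the same definitions as table properties afterwards
    //    (illustrative keys; the stored Iceberg properties would carry the "flink." prefix).
    tEnv.executeSql(
        "ALTER TABLE ice.db.sample SET ("
            + "  'computed-columns.ts_ltz' = 'TO_TIMESTAMP_LTZ(ts_ms, 3)',"
            + "  'watermark.ts_ltz' = 'ts_ltz - INTERVAL ''5'' SECOND'"
            + ")");
  }
}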
Solution
add table process
alter table properties process
Merge the modified properties with the original properties to get the table's updated properties.
Generate the Flink table from the merged table properties and the original table's schema, and verify the schema. (This prevents errors in the expressions, such as a computed-column expression that references a non-existent function or column.)
If the verification in the previous step succeeds, submit the property modification; if not, report an exception and do not make the final submission (see the sketch below).
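A minimal sketch of that verify-then-commit flow, assuming a hypothetical validateMergedProperties helper that is not part of this PR's actual code:

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.UpdateProperties;

public class AlterPropertiesSketch {
  // Hypothetical helper: rebuilds the Flink schema from the merged properties and the
  // original Iceberg schema, throwing if a computed-column or watermark expression
  // references a non-existent column or function.
  static void validateMergedProperties(Table table, Map<String, String> merged) {
    // ... resolve the expressions against table.schema() ...
  }

  static void alterProperties(Table table, Map<String, String> changes) {
    // 1. Merge the modified properties with the original properties.
    Map<String, String> merged = new HashMap<>(table.properties());
    merged.putAll(changes);

    // 2. Generate/verify the Flink schema from the merged properties and the original schema.
    validateMergedProperties(table, merged);

    // 3. Only submit the property modification if validation succeeded; otherwise the
    //    exception from step 2 propagates and nothing is committed.
    UpdateProperties update = table.updateProperties();
    changes.forEach(update::set);
    update.commit();
  }
}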
get table process
Generate the Flink table by combining the current table properties with the table schema, and verify the schema. (This guards against the table schema having been modified by another engine, which would invalidate a computed-column expression. For example, a computed column is defined as id * 3, but the id column was later deleted using Spark and the corresponding computed-column property was not removed.)
If the verification succeeds, the table is returned.
If it fails, the computed columns and watermarks in the table properties are ignored, and the original table's physical schema is returned directly (see the sketch below).
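A hedged sketch of that fallback, with a hypothetical applyComputedColumnsAndWatermarks helper standing in for the actual resolution logic:

import org.apache.flink.table.api.TableSchema;
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.FlinkSchemaUtil;

public class GetTableSketch {
  // Hypothetical helper: enriches the physical schema with the computed columns and
  // watermarks declared in the table properties, throwing if an expression no longer
  // resolves against the current columns.
  static TableSchema applyComputedColumnsAndWatermarks(Table table, TableSchema physical)
      throws Exception {
    // ... resolve expressions from table.properties() against the physical columns ...
    return physical;
  }

  static TableSchema getFlinkSchema(Table table) {
    // Physical schema converted from the current Iceberg schema.
    TableSchema physical = FlinkSchemaUtil.toSchema(table.schema());
    try {
      return applyComputedColumnsAndWatermarks(table, physical);
    } catch (Exception e) {
      // The schema may have been changed by another engine (e.g. a referenced column was
      // dropped in Spark), so ignore the Flink-specific properties and return the
      // physical schema directly.
      return physical;
    }
  }
}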