Flink: support watermark and computed columns #4625

Closed
wants to merge 4 commits

Conversation

wuwenchi
Contributor

We hope Iceberg can support watermarks and computed columns.
The specific implementation details are as follows:

Background and Motivation

There is a temporal table in Iceberg that needs to be joined in real time with other tables, so a watermark needs to be defined. However, a watermark has requirements on the attributes of the field it is based on, and the time fields in the source table may not meet them, so the source table's time fields need to be converted by recomputing them or calling functions.

Why not directly use the Iceberg connector table supported by Flink?

  1. All Flink tables are stored in memory; when the job ends, the table disappears, and it has to be created again the next time it is needed.
  2. In our scenarios, a temporal table needs to be joined with multiple tables, and the watermark attributes are related only to the temporal table, so they are the same across these jobs. We therefore want to bind these attributes directly to the temporal table, so that when the table is used again later there is no need to look up its watermark settings from other jobs.

Goal

  1. Iceberg supports setting watermarks and computed columns through Flink SQL.
  2. The watermark and computed-column attributes can be modified by changing the table properties.

Proposal

Table property save format

Example

CREATE TABLE tl (
  id INT, 
  id2 AS id * 2, 
  f1 AS TO_TIMESTAMP(FROM_UNIXTIME(id*3)),
  t1 TIMESTAMP(6), 
  t2 AS cast(t1 AS TIMESTAMP(3)), 
  watermark FOR t2 AS t2 - INTERVAL '5' SECOND
);

Then the table properties are saved as:

flink.computed-columns.id2 = `id` * 2
flink.computed-columns.f1 = TO_TIMESTAMP(FROM_UNIXTIME(`id` * 3))
flink.computed-columns.t2 = CAST(`t1` AS TIMESTAMP(3))
flink.watermark.t2 = `t2` - INTERVAL '5' SECOND

key format

fixed prefix + field name:

  • fixed prefix for watermark: flink.watermark.
  • fixed prefix for computed columns: flink.computed-columns.

value format

the expression as defined by the user.

How to add, delete, and modify

1. Flink SQL DDL (only supports adding)

CREATE TABLE `hive_catalog`.`default`.`sample` (
  id INT, 
  id2 AS id * 2, 
  f1 AS TO_TIMESTAMP(FROM_UNIXTIME(id*3)),
  t1 TIMESTAMP(6), 
  t2 AS cast(t1 AS TIMESTAMP(3)), 
  watermark FOR t2 AS t2 - INTERVAL '5' SECOND
);

2. Syntax of table properties

  • add or update
ALTER TABLE `hive_catalog`.`default`.`sample` SET (
	'flink.computed-columns.id2'='id*3'
)
  • delete
ALTER TABLE `hive_catalog`.`default`.`sample` RESET (
	'flink.computed-columns.id2'
)

Solution

add table process

  1. If there is a defined computed column in the table, save its expression to the table property.
  2. If there is a defined watermark in the table, save its expression to the table property.
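
For illustration, a minimal sketch of the two steps above, using Flink's ResolvedSchema API (Column.ComputedColumn, WatermarkSpec, ResolvedExpression.asSerializableString()); the helper class and its placement are hypothetical and not the code of this PR:

import java.util.Map;
import org.apache.flink.table.catalog.Column;
import org.apache.flink.table.catalog.ResolvedSchema;
import org.apache.flink.table.catalog.WatermarkSpec;

class ExpressionProperties {
  private static final String COMPUTED_COLUMN_PREFIX = "flink.computed-columns.";
  private static final String WATERMARK_PREFIX = "flink.watermark.";

  // Copies computed-column and watermark expressions into the Iceberg table properties,
  // using the key format described above (fixed prefix + field name).
  static void save(ResolvedSchema schema, Map<String, String> tableProperties) {
    for (Column column : schema.getColumns()) {
      if (column instanceof Column.ComputedColumn) {
        Column.ComputedColumn computed = (Column.ComputedColumn) column;
        tableProperties.put(
            COMPUTED_COLUMN_PREFIX + computed.getName(),
            computed.getExpression().asSerializableString());
      }
    }
    for (WatermarkSpec watermark : schema.getWatermarkSpecs()) {
      tableProperties.put(
          WATERMARK_PREFIX + watermark.getRowtimeAttribute(),
          watermark.getWatermarkExpression().asSerializableString());
    }
  }
}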

alter table properties process

  1. Merge the modified properties with the original properties to get the updated properties of the table.

  2. Generate the Flink table from the merged table properties and the original table's schema, and verify the schema. (This prevents errors in the expressions, such as referencing a non-existent function or column name in a computed-column expression.)

  3. If the verification in the previous step succeeds, commit the property modification;
    otherwise, throw an exception and do not commit.
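
A rough sketch of this validate-then-commit flow, using Iceberg's UpdateProperties API; the validation step is only a placeholder here, since the actual PR resolves the expressions through Flink's schema resolver:

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.UpdateProperties;

class AlterTableProperties {

  static void alter(Table table, Map<String, String> changes) {
    // 1. Merge the requested changes with the current table properties.
    Map<String, String> merged = new HashMap<>(table.properties());
    merged.putAll(changes);

    // 2. Rebuild the Flink table from the merged properties and the current schema;
    //    this throws if a computed-column or watermark expression is invalid.
    validateFlinkSchema(table, merged);

    // 3. Commit only after validation has passed.
    UpdateProperties update = table.updateProperties();
    changes.forEach(update::set);
    update.commit();
  }

  // Placeholder for the validation described in step 2; the actual PR resolves the
  // expressions against the Flink schema resolver and fails on an invalid expression.
  private static void validateFlinkSchema(Table table, Map<String, String> mergedProperties) {
  }
}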

get table process

  1. Generate the Flink table by combining the current table properties with the table schema, and verify the schema. (This guards against the table schema having been modified by other engines in a way that breaks a computed-column expression. For example, the computed-column expression id * 3 breaks if the id column was later dropped via Spark and the corresponding computed-column property was not removed.)

  2. If the verification succeeds, the table is returned.
    If it fails, the computed columns and watermarks in the table properties are ignored, and the original physical schema of the table is returned directly.
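
And a minimal sketch of the fallback described above; applyStoredExpressions stands in for the real resolution logic and is hypothetical:

import java.util.Map;
import org.apache.flink.table.api.Schema;

class GetTableSchema {

  // Returns the Flink schema exposed for the Iceberg table. If the stored computed-column
  // or watermark expressions no longer resolve against the current schema (for example,
  // a referenced column was dropped by another engine), they are ignored and the plain
  // physical schema is returned.
  static Schema schemaFor(Schema physicalSchema, Map<String, String> tableProperties) {
    try {
      return applyStoredExpressions(physicalSchema, tableProperties);
    } catch (RuntimeException e) {
      // Validation failed: the expressions are stale, so fall back to the physical schema.
      return physicalSchema;
    }
  }

  // Placeholder: the actual PR re-attaches the expressions and validates them with
  // Flink's schema resolver, throwing if an expression cannot be resolved.
  private static Schema applyStoredExpressions(Schema physical, Map<String, String> props) {
    return physical;
  }
}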

@github-actions github-actions bot added the flink label Apr 25, 2022
@wuwenchi
Contributor Author

wuwenchi commented Apr 25, 2022

Relevant PRs: #2265, #3681
Could you help review it? Thanks! @openinx @rdblue @hameizi @zhangjun0x01 @stevenzwu

Contributor

@kbendick kbendick left a comment

Thank you for this big undertaking @wuwenchi!

I believe Flink 1.15 was released, so I am going to rebase my Flink 1.15 support PR to the release branch and we should have 1.15 support in the next few days.

Just a heads up that anything done here will eventually be required to work with that. But I’d wait to get more feedback, given the large scope of these features.

Happy to see the interest in this!

BTW - Have you given any thought to possibly trying to support tables with hidden partitioning (that use partition transforms other than identity transform) via generated columns? Likely out of scope, but that’s one way I’ve been considering adding support.

@wuwenchi
Contributor Author

wuwenchi commented Apr 26, 2022

@kbendick Thanks for your reply!

I believe Flink 1.15 was released, so I am going to rebase my Flink 1.15 support PR to the release branch and we should have 1.15 support in the next few days.

Is this feature already supported on the Flink 1.15 branch you are working on now? Or should we wait until Flink 1.15 is branched out and then support this feature through this PR?

Have you given any thought to possibly trying to support tables with hidden partitioning (that use partition transforms other than identity transform) via generated columns? Likely out of scope, but that’s one way I’ve been considering adding support.

Yes, this feature would be very useful to me, and my next plan is to support it. If you don't mind, can I join in working on it?

@Zhangg7723
Contributor

nice job

@wuwenchi
Contributor Author

wuwenchi commented May 1, 2022

@kbendick Can we consider this approach to implementing partition transforms: #4251? @yittg

@stevenzwu
Contributor

I have an uber question regarding storing those settings in Iceberg table properties.

Watermark config is very job-specific. Different Flink jobs may have different watermark assignments.

flink.watermark.t2 = `t2` - INTERVAL '5' SECOND

There are some discussions in PR #3681

@wuwenchi
Contributor Author

wuwenchi commented May 2, 2022

@stevenzwu Thanks for your reply!

We have also considered this issue.
But in practice, the more common scenario for our users is that a table needs to be joined with many other tables, and each join may be a separate job, so the watermark attributes of these jobs are actually the same.
Without this feature, users have to look up the watermark property from the other jobs in order to create a new one.

So my thoughts are, for the same table:

  1. If you need different watermark attributes, you can use the Flink connector: create a connector table for each job and specify a different watermark on each.
  2. If you need the same watermark attributes, you can use the Iceberg table directly.

Contributor

@hililiwei hililiwei left a comment

Great job. This feature is useful for a lot of work, and I think we should put some thought into 'Schema' and 'ResolvedSchema'. I look forward to more feature support after we finish the discussion on 'ResolvedSchema' and related topics.

@@ -52,35 +63,69 @@
*/
public class FlinkSchemaUtil {

public static final String FLINK_PREFIX = "flink.";

public static final String COMPUTED_COLUMNS = "computed-columns.";
Contributor

use computed-column ?

Contributor Author

Yes, the 's' is redundant; I will fix it.

Comment on lines 274 to 278
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(new Configuration());
StreamTableEnvironment streamTableEnvironment = StreamTableEnvironment.create(env);
CatalogManager catalogManager = ((TableEnvironmentImpl) streamTableEnvironment).getCatalogManager();
SchemaResolver schemaResolver = catalogManager.getSchemaResolver();
return table.getUnresolvedSchema().resolve(schemaResolver);
Contributor

Suggested change
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(new Configuration());
StreamTableEnvironment streamTableEnvironment = StreamTableEnvironment.create(env);
CatalogManager catalogManager = ((TableEnvironmentImpl) streamTableEnvironment).getCatalogManager();
SchemaResolver schemaResolver = catalogManager.getSchemaResolver();
return table.getUnresolvedSchema().resolve(schemaResolver);
Configuration configuration = ExecutionEnvironment.getExecutionEnvironment().getConfiguration();
TableEnvironment tableEnvironment = TableEnvironment.create(configuration);
Preconditions.checkArgument(table.getDescription().isPresent(), "Illegal table.");
return tableEnvironment.from(table.getDescription().get()).getResolvedSchema();

Based on your comment (#4246 (comment)), I took a look at our previous proposal and made some suggestions, but it may not be optimal.
Here, I think using 'TableEnvironment' directly is a better choice. It is more general, working for streaming, batch, or otherwise, and it hides the underlying differences.
Let's see if somebody else has a better solution.

Contributor Author

This method is indeed better!
But the description is taken from the table's comment:

    @Override
    public Optional<String> getDescription() {
        return Optional.of(getComment());
    }

The comment does not necessarily exist, and even if it does, it is not necessarily the path of the table, so tableEnvironment.from may not be able to get the table...

Contributor

@hililiwei hililiwei May 5, 2022

Yes, so if we go this way, we might need to add a comment here.

// NOTE: We can not create a IcebergCatalogTable extends CatalogTable, because Flink optimizer may use
// CatalogTableImpl to copy a new catalog table.
// Let's re-loading table from Iceberg catalog when creating source/sink operators.
// Iceberg does not have Table comment, so pass a null (Default comment value in Flink).
return CatalogTable.of(schema, null, partitionKeys, table.properties());

Alternatively, we can get the table via Path. It should be in ObjectPath.
TableEnvironment#Table from(String path);

Contributor Author

The tableEnvironment here is a newly created environment with only default_catalog and default_database, but this table actually exists in a catalog in the Flink environment, so I think tableEnvironment.from cannot find the required table. I actually tested it and the table couldn't be found.

Contributor

It's a bit tricky. I'll debug it locally to see if there's a better way.

env.getConfig().getConfiguration().set(FlinkConfigOptions.TABLE_EXEC_ICEBERG_INFER_SOURCE_PARALLELISM, false);
env.getConfig().getConfiguration()
.set(FlinkConfigOptions.TABLE_EXEC_ICEBERG_INFER_SOURCE_PARALLELISM, false)
.set(TableConfigOptions.LOCAL_TIME_ZONE, "UTC");
Contributor

good

@@ -374,7 +378,7 @@ void createIcebergTable(ObjectPath tablePath, CatalogBaseTable table, boolean ig
throws CatalogException, TableAlreadyExistException {
validateFlinkTable(table);

Schema icebergSchema = FlinkSchemaUtil.convert(table.getSchema());
Schema icebergSchema = FlinkSchemaUtil.convert(((ResolvedCatalogTable) table).getResolvedSchema());
Contributor

Is it always safe to type-cast to ResolvedCatalogTable?

Contributor Author

Yes, in this interface we always get a ResolvedCatalogTable.

@Zhangg7723
Contributor

nice pr

2. create environment according to executionEnvironment
Contributor

@kbendick kbendick left a comment

Thanks @wuwenchi for working on this!

I too would like to see watermark support and computed column support, particularly so that we can have partition transforms.

I have some concern, like @stevenzwu mentioned, about using table properties for things that might be very job-specific.

For example, I know that @chenjunjiedada has recommended that we don’t use the write.upsert-enabled config and instead set that per job. While we can’t immediately deprecate that configuration (and may never), I agree with his advice and I’d be hesitant to add more table properties that should be job-level properties.

My other concern is using flink. in table property names. While it might be the case today that these features are only supported by Flink (such as write.upsert-enabled configuration has been), we should ideally name things without regard for any specific engine as hopefully those engines will support such features in the future (or we can help add support for them).

Overall, I do really want to commend you on a job well done for this. I think it likely needs some changes, and might benefit from some kind of design discussion first, but truly thank you for making this PR as I’m sure much of this code will likely be useful for however this winds up getting designed (be it this way or another). And if this code is used it should definitely be credited.

I’m interested in exploring some of @yittg’s ideas from #4251 and combining some of them with your ideas, such as this updated idea of yours here #5000

@hililiwei
Contributor

As for computed columns, we are trying to support them in Iceberg itself, so that not only Flink but other engines can use them as well; however, that change is quite large. Just for reference: #4994.


github-actions bot commented Aug 8, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 8, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Aug 16, 2024