Conversation

@luoyuxia (Contributor) commented Mar 20, 2025

Purpose

Linked issue: close #430

Brief change log

  1. Introduce LakeStoragePluginSetUp, which loads the LakeStoragePlugin by data lake format
  2. Introduce LakeCatalog to create tables in the lake
  3. When a table is created with lake enabled, create the table in the lake via LakeCatalog (see the sketch after this list)
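
A minimal sketch of the flow, assuming hypothetical names (fromDataLakeFormat, isDataLakeEnabled, and the exact factory/createTable signatures are illustrative, not the real Fluss API):

// Resolve the LakeStoragePlugin registered for the configured lake format.
LakeStoragePlugin plugin =
        LakeStoragePluginSetUp.fromDataLakeFormat(DataLakeFormat.PAIMON, classLoader);

// Build a LakeCatalog from the plugin and the cluster configuration
// (assumed factory chain, for illustration only).
LakeCatalog lakeCatalog = plugin.createLakeStorage(configuration).createLakeCatalog();

// When a table is created with lake enabled, mirror it into the lake as well.
if (tableDescriptor.isDataLakeEnabled()) {
    lakeCatalog.createTable(tablePath, tableDescriptor);
}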

Tests

LakeEnabledTableCreateITCase

API and Format

Documentation

@luoyuxia luoyuxia force-pushed the create-lake-table branch 4 times, most recently from ad3e4a7 to b768922 on March 21, 2025 03:12
<useTransitiveDependencies>true</useTransitiveDependencies>
<useTransitiveFiltering>true</useTransitiveFiltering>
<includes>
<include>org.apache.flink:flink-shaded-hadoop-2-uber</include>
@luoyuxia (Contributor, Author):

Paimon requires Hadoop bundled, so we include it in the Paimon plugin dir:
https://paimon.apache.org/docs/master/flink/quick-start/

@luoyuxia luoyuxia force-pushed the create-lake-table branch from b768922 to 74248a9 on March 21, 2025 08:09
@wuchong (Member) commented Mar 22, 2025

Is it ready to review? @luoyuxia

}

// set pk
if (tableDescriptor.getSchema().getPrimaryKey().isPresent()) {
@luoyuxia (Contributor, Author):

At first, I introduced additional offset and timestamp columns to enable Fluss to subscribe to the data in the lake from a given offset or timestamp via the Fluss client. But now, I feel we can remove these two additional columns, at least for now:

  1. Currently, we mainly focus on Flink reading the historical data in Paimon and the real-time data in Fluss, so the offset and timestamp columns are not used. Introducing these columns may bring unnecessary complexity at this early stage.

  2. Subscribing via the offset and timestamp columns only works for log tables, and only for Paimon with bucket-num specified. But in Paimon it's recommended not to set bucket-num, so the offset and timestamp columns would be useless in most cases.

Still, we keep the possibility of supporting subscription to the data in the lake from a given offset or timestamp in the future. We can then introduce an option to enable this feature for lake tables.

@luoyuxia (Contributor, Author):

After discussion, we still keep the ability to subscribe via offset/timestamp, so let's introduce another column, bucket, to help us subscribe via bucket + offset.
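
A minimal sketch of how these system columns could be used, assuming a hypothetical LogScanner#subscribe(bucket, offset) signature on the Fluss client (illustrative only):

// Suppose a batch query over the Paimon table yielded, per __bucket, the
// maximum __offset already tiered into the lake.
Map<Integer, Long> lakeEndOffsets = new HashMap<>();
lakeEndOffsets.put(0, 1024L);
lakeEndOffsets.put(1, 987L);

// Resume the real-time read from Fluss right after those offsets.
for (Map.Entry<Integer, Long> entry : lakeEndOffsets.entrySet()) {
    int bucket = entry.getKey();
    long nextOffset = entry.getValue() + 1;
    logScanner.subscribe(bucket, nextOffset); // hypothetical signature
}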

@luoyuxia luoyuxia force-pushed the create-lake-table branch from 74248a9 to 569d5e9 on March 24, 2025 03:41
@luoyuxia luoyuxia marked this pull request as ready for review March 24, 2025 03:51
@luoyuxia luoyuxia force-pushed the create-lake-table branch from 569d5e9 to 8e06422 on March 24, 2025 03:52
@luoyuxia (Contributor, Author):

> Is it ready to review? @luoyuxia

Yes, it's ready to review now.

@luoyuxia luoyuxia requested review from leonardBang and wuchong March 24, 2025 03:52
@luoyuxia (Contributor, Author):

@wuchong @leonardBang Could you please help review?

@luoyuxia luoyuxia force-pushed the create-lake-table branch from 8e06422 to 85e28a2 on March 25, 2025 04:17
@leonardBang (Contributor) left a comment:

Thanks @luoyuxia for the contribution. The PR looks generally good to me; I only left some minor comments.

public static final String OFFSET_COLUMN_NAME = "__offset";
public static final String TIMESTAMP_COLUMN_NAME = "__timestamp";
public static final String BUCKET_COLUMN_NAME = "__bucket";

@leonardBang (Contributor):

For the system metadata columns, could we expose their configuration to users to avoid potential column-name conflicts?

@luoyuxia (Contributor, Author):

Currently, these system metadata columns are used by the Fluss client to subscribe from a given timestamp/offset. I'd prefer to keep them fixed for now, since most systems' metadata columns are fixed, and fixed columns are easier to understand.
I think we can make them configurable in the future if we find that helpful; it's a compatible change.

String.format(
"The table %s already exists in %s catalog, please "
+ "first drop the table in %s catalog.",
tablePath, dataLakeFormat, dataLakeFormat));
@leonardBang (Contributor):

Both dropping the existing table and suggesting a new table name make sense in this case; perhaps the latter is better?

@luoyuxia (Contributor, Author):

I'll suggest both of them in the message.

<!-- end exclude for lakehouse-paimon -->
<exclude>com.alibaba.fluss.lakehouse.cli.*</exclude>
<exclude>com.alibaba.fluss.kafka.*</exclude>
<exclude>com.alibaba.fluss.lake.paimon.FlussDataTypeToPaimonDataType</exclude>
@leonardBang (Contributor):

Would adding a full-types test in LakeEnabledTableCreateITCase be better?

@luoyuxia (Contributor, Author):

I added the full set of Fluss types, but we still have to keep this exclusion since Fluss doesn't support array, map, and row types, so the maximum line coverage can only reach 65%.
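
For illustration, a minimal sketch of the kind of mapping FlussDataTypeToPaimonDataType performs. The Fluss-side type API shown (getTypeRoot()/isNullable()) is an assumption mirroring Flink's type system, not the exact Fluss interface:

static org.apache.paimon.types.DataType toPaimonType(com.alibaba.fluss.types.DataType flussType) {
    switch (flussType.getTypeRoot()) {
        case BOOLEAN:
            return new org.apache.paimon.types.BooleanType(flussType.isNullable());
        case INT:
            return new org.apache.paimon.types.IntType(flussType.isNullable());
        case BIGINT:
            return new org.apache.paimon.types.BigIntType(flussType.isNullable());
        case STRING:
            return new org.apache.paimon.types.VarCharType(
                    flussType.isNullable(), org.apache.paimon.types.VarCharType.MAX_LENGTH);
        default:
            // Fluss doesn't support array/map/row types yet, so branches for
            // them can't be exercised from LakeEnabledTableCreateITCase.
            throw new UnsupportedOperationException("Unsupported type: " + flussType);
    }
}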

@leonardBang (Contributor):

makes sense to me

@luoyuxia luoyuxia force-pushed the create-lake-table branch 4 times, most recently from f5ef0f4 to eea9c6a on April 17, 2025 11:53
        CoreOptions.ChangelogProducer.INPUT.toString());
} else {
    // for log table, need to set bucket, offset and timestamp
    schemaBuilder.column(BUCKET_COLUMN_NAME, DataTypes.INT());
@leonardBang (Contributor):

I mean we need to check whether the original schema contains a system column name like __bucket, to avoid conflicts with users' original columns.

@wuchong (Member) left a comment:

@luoyuxia the pull request looks good in general. I left some minor comments.

for (Schema.Column column : tableDescriptor.getSchema().getColumns()) {
    String columnName = column.getName();
    if (systemColumns.containsKey(columnName)) {
        throw new InvalidTableException(
@wuchong (Member):

I created issue #810 to avoid creating tables with system columns even when the table is not lake-enabled.

@wuchong (Member) commented Apr 30, 2025

Besides, could you rename the module fluss-lake-format-paimon to fluss-lake-paimon? We will introduce the tiering service in fluss-flink instead of fluss-lake/, so fluss-lake/ may only contain lake formats (e.g., fluss-lake-iceberg, fluss-lake-delta).

@luoyuxia luoyuxia force-pushed the create-lake-table branch 2 times, most recently from 8c14974 to baf72d7 on May 7, 2025 03:19
@luoyuxia (Contributor, Author) commented May 7, 2025

@wuchong Comments addressed

@luoyuxia luoyuxia force-pushed the create-lake-table branch from baf72d7 to d4887cb on May 9, 2025 08:05
@wuchong wuchong merged commit bd9e1c4 into apache:main May 10, 2025
4 checks passed
ZmmBigdata pushed a commit to ZmmBigdata/fluss that referenced this pull request Jun 20, 2025
polyzos pushed a commit to polyzos/fluss that referenced this pull request Aug 30, 2025
polyzos pushed a commit to Alibaba-HZY/fluss that referenced this pull request Aug 31, 2025

Development

Successfully merging this pull request may close these issues.

Synchronously create Lake tables when Fluss's lakehouse is enabled during Fluss table creation
