Flink: Fixes flink sink failed due to updating partition spec #7171

Merged
merged 8 commits into from
May 18, 2023


ConeyLiu
Contributor

@ConeyLiu ConeyLiu commented Mar 22, 2023

We use a SerializableTable instance to create the IcebergStreamWriter. The PartitionSpec of a SerializableTable is fixed and does not change after the job starts, while the PartitionSpec used by IcebergFilesCommitter is refreshed whenever the table snapshot changes. As a result, updating the partition spec from another job can fail the Flink sink job, because the committer uses the wrong partition spec to write those DataFiles/DeleteFiles to the ManifestFile.

For example, we got the following error when updating the partition spec:

(screenshot of the error; image not included)

In this patch, we use the correct partition spec to write the staging manifest file.
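
The mismatch described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `Spec`, `WrittenFile`, `TableState`, and both helper methods are simplified stand-ins and not Iceberg classes, and resolving the spec from the id recorded on the file is only one assumed way to pick "the correct partition spec"; the patch itself may do it differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the mismatch. None of these types are Iceberg classes. */
final class StagingManifestSpecSketch {

  /** Stand-in for a partition spec: an id plus a human-readable description. */
  static final class Spec {
    final int specId;
    final String description;

    Spec(int specId, String description) {
      this.specId = specId;
      this.description = description;
    }
  }

  /** Stand-in for a data or delete file, which records the id of the spec it was written with. */
  static final class WrittenFile {
    final String path;
    final int specId;

    WrittenFile(String path, int specId) {
      this.path = path;
      this.specId = specId;
    }
  }

  /** Stand-in for table metadata: every known spec plus the id of the current one. */
  static final class TableState {
    private final Map<Integer, Spec> specsById = new HashMap<>();
    private int currentSpecId = -1;

    /** Registers a spec and makes it the current one, as a spec update from another job would. */
    void updateSpec(Spec spec) {
      specsById.put(spec.specId, spec);
      currentSpecId = spec.specId;
    }

    Spec currentSpec() {
      return specsById.get(currentSpecId);
    }

    Spec specById(int specId) {
      Spec spec = specsById.get(specId);
      if (spec == null) {
        throw new IllegalArgumentException("Unknown spec id: " + specId);
      }
      return spec;
    }
  }

  /** Problematic choice: whatever spec the refreshed table currently reports. */
  static Spec specFromRefreshedTable(TableState table) {
    return table.currentSpec();
  }

  /** Assumed safer choice: the spec the file was actually written with. */
  static Spec specFromFile(TableState table, WrittenFile file) {
    return table.specById(file.specId);
  }

  public static void main(String[] args) {
    TableState table = new TableState();
    table.updateSpec(new Spec(0, "day(ts)"));

    // The writer pins spec 0 at job start and writes a file with it.
    WrittenFile file = new WrittenFile("data-0.parquet", table.currentSpec().specId);

    // Another job updates the partition spec while the sink is running.
    table.updateSpec(new Spec(1, "day(ts), bucket(16, id)"));

    System.out.println(specFromRefreshedTable(table).specId); // prints 1, not the file's spec
    System.out.println(specFromFile(table, file).specId);     // prints 0, the file's spec
  }
}
```

In this model the staging manifest would be written with spec 1 for a file partitioned by spec 0 unless the spec is taken from the side that produced the file.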

@github-actions github-actions bot added the flink label Mar 22, 2023
@ConeyLiu
Contributor Author

cc @chenjunjiedada @stevenzwu @rdblue Pls take a look when you are free. Thanks a lot.

@zinking
Contributor

zinking commented Mar 27, 2023

@szehon-ho @stevenzwu can we get this merged ?

@ConeyLiu
Contributor Author

Thanks @hililiwei, comment has been addressed. Pls take another look.
Also CC @jackye1995 @nastra @Fokko, could you take a look when you are free? Thanks in advance

@stevenzwu stevenzwu self-requested a review April 10, 2023 02:03
@stevenzwu
Contributor

We use the current PartitionSpec for IcebergStreamWriter, which is fixed and will not change after the job started. While the PartitionSpec for IcebergStreamWriter is refreshed with the table snapshot changing.

@ConeyLiu is there a typo in the above description? The two sentences seem to contradict each other.

@ConeyLiu
Contributor Author

@stevenzwu I'm sorry for the mistake. I just updated the description.

@ConeyLiu ConeyLiu changed the title Flink: IcebergFilesCommitter should use same PartitionSpec as the IcebergStreamWriter Flink: Fixes flink sink failed due to updating partition spec May 8, 2023
@ConeyLiu
Contributor Author

ConeyLiu commented May 8, 2023

Thanks @stevenzwu for the review, and I am sorry for the late response; I was tied up with other work. I updated the implementation of the fix. Please take another look when you are free, thanks a lot.


specId =
getStagingManifestSpecId(harness.getOperator().getOperatorStateBackend(), checkpointId);
Assert.assertEquals(
Contributor Author

Checking that the staging manifest files are written with the old partition spec.

@chenjunjiedada
Collaborator

I am wondering if we should just pass the same read-only SerializableTable to IcebergFilesCommitter so that it also uses the same table spec as the IcebergStreamWriter/RowDataTaskWriterFactory.

@stevenzwu, we also have a requirement to migrate the table without restarting the Flink job, since users may have thousands of production streaming jobs online. Right now I don't have a full solution in mind; my early thinking is to notify the task manager to update the writer after a checkpoint. Do you have a similar requirement as well? Any ideas?

@stevenzwu
Contributor

@stevenzwu, we also have a requirement to migrate the table without restarting the Flink job, since users may have thousands of production streaming jobs online. Right now I don't have a full solution in mind; my early thinking is to notify the task manager to update the writer after a checkpoint. Do you have a similar requirement as well? Any ideas?

@chenjunjiedada we can probably take this discussion to a separate issue. I remember some previous asks in this area about handling table schema evolution without manual intervention, but I couldn't find the PR or issue. There are two slightly different asks.

  1. The table schema is already updated/synced via an external mechanism (like a control plane). The writer and committer just need to pick up the latest schema (or partition spec) without a job restart.
  2. The writer needs to detect that the table schema is out of sync with the record schema, automatically update the table schema, and write with the latest schema.

Case 1 can be implemented by resolving the write schema (or partition spec) during task initialization rather than during job initialization. Writers periodically check (e.g. every checkpoint cycle) whether the table schema or partition spec has changed; if it has, the writers can fail the job. Restart and task initialization will then load the latest schema and spec. However, this does bring a scalability concern, because every writer task (hundreds or more) needs to load an Iceberg table from the catalog to retrieve the schema and partition spec.

Case 2 can be implemented similarly, but it is riskier: bad records (schema) can cause unintended changes in the Iceberg table schema.
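
The per-checkpoint check described for case 1 could look roughly like the sketch below. It is only an illustration under stated assumptions: `SpecChangeGuard` and `checkOnCheckpoint` are hypothetical names, not Flink or Iceberg APIs, and the `IntSupplier` stands in for whatever mechanism would load the table from the catalog and return its current spec id (which is exactly the per-task catalog load that raises the scalability concern).

```java
import java.util.function.IntSupplier;

/** Hypothetical guard for case 1; not part of Flink or Iceberg. */
final class SpecChangeGuard {
  // Spec id observed when the task was initialized.
  private final int specIdAtTaskInit;
  // Assumed callback that reports the table's latest spec id, e.g. by reloading the table.
  private final IntSupplier latestSpecId;

  SpecChangeGuard(IntSupplier latestSpecId) {
    this.latestSpecId = latestSpecId;
    this.specIdAtTaskInit = latestSpecId.getAsInt();
  }

  /**
   * Intended to be called once per checkpoint cycle. Throws if the spec changed so that the
   * job fails, restarts, and picks up the latest spec during task initialization.
   */
  void checkOnCheckpoint() {
    int latest = latestSpecId.getAsInt();
    if (latest != specIdAtTaskInit) {
      throw new IllegalStateException(
          "Partition spec changed from " + specIdAtTaskInit + " to " + latest
              + "; failing the task so a restart loads the latest spec");
    }
  }
}
```

A writer task would construct the guard during task initialization and call `checkOnCheckpoint()` on each checkpoint; the same shape could compare schema ids instead of spec ids.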

Contributor

@stevenzwu stevenzwu left a comment

this is very close now. the latest approach seems cleaner. left a few nit comments

@stevenzwu stevenzwu merged commit 4e87cff into apache:master May 18, 2023
13 checks passed
@stevenzwu
Contributor

thanks @ConeyLiu for the fix and @Reo-LEI and @hililiwei for the review

@ConeyLiu
Contributor Author

Thanks all. Will submit backport PRs.

@ConeyLiu ConeyLiu deleted the fixes-flink-sink-failed branch May 19, 2023 05:00
ConeyLiu added a commit to ConeyLiu/iceberg that referenced this pull request May 22, 2023
ConeyLiu added a commit to ConeyLiu/iceberg that referenced this pull request May 22, 2023
ConeyLiu added a commit to ConeyLiu/iceberg that referenced this pull request May 23, 2023
stevenzwu pushed a commit that referenced this pull request May 23, 2023
Co-authored-by: xianyangliu <xianyangliu@tencent.com>
stevenzwu pushed a commit that referenced this pull request May 23, 2023
Co-authored-by: xianyangliu <xianyangliu@tencent.com>