
[HUDI-4521] Ensure flink coordinator do re-commit during restart for changing write.tasks number (i.e., write parallelism)#6273

Open
YuweiXiao wants to merge 2 commits into apache:master from YuweiXiao:HUDI-4521

Conversation

@YuweiXiao
Contributor


What is the purpose of the pull request

Patch to ensure StreamWriteOperatorCoordinator does the re-commit during restart when the write.tasks parameter changes.

For details, please check out HUDI-4521.

Brief change log

  • Persist write.tasks parameter (i.e., parallelism) in writers' ckp states.
  • Include parallelism and total event number in the bootstrap metadata event.
  • Use parallelism & total event number to determine whether or not to do the re-commit, rather than relying on the slot number (i.e., the current write.tasks setup).
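The decision described in the last bullet can be sketched as follows. This is a minimal illustration, not Hudi's actual implementation: the class RecommitCheck, the method shouldRecommit, and the lastRunParallelism parameter are hypothetical names.

```java
import java.util.Arrays;

public class RecommitCheck {
  // Each writer reports how many metadata states it restored from its
  // checkpointed state (0 for a fresh task after a parallelism increase).
  static boolean shouldRecommit(int lastRunParallelism, int[] numOfMetadataStates) {
    int total = Arrays.stream(numOfMetadataStates).sum();
    // Re-commit only when every metadata state of the last run is present,
    // regardless of how many slots the current run happens to have.
    return total == lastRunParallelism;
  }

  public static void main(String[] args) {
    // Last run used 4 writers; current run has 8 slots, 4 of them empty.
    System.out.println(shouldRecommit(4, new int[]{1, 1, 1, 1, 0, 0, 0, 0})); // true
    // One writer's flush from the last run is missing: do not re-commit.
    System.out.println(shouldRecommit(4, new int[]{1, 1, 1, 0, 0, 0, 0, 0})); // false
  }
}
```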

Verify this pull request

Unit tests are added to verify the changes.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

ValidationUtils.checkArgument(Arrays.stream(eventBuffer).allMatch(evt -> evt != null && evt.isBootstrap()));
List<WriteMetadataEvent> events = Arrays.stream(eventBuffer).filter(evt -> !evt.getInstantTime().equals("")).collect(Collectors.toList());
String instant = events.stream().map(WriteMetadataEvent::getInstantTime).reduce((a, b) -> a.equals(b) ? a : "").orElse("");
// instant and parallelism should be unique
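The reduce step in the snippet above collapses the instant times to their common value, or to "" when any two disagree. A standalone demo of that trick (class and method names here are mine, not Hudi's):

```java
import java.util.List;

public class InstantReduceDemo {
  // Returns the shared instant time, or "" if the events disagree or the list is empty.
  static String uniqueInstant(List<String> instants) {
    return instants.stream()
        .reduce((a, b) -> a.equals(b) ? a : "")
        .orElse("");
  }

  public static void main(String[] args) {
    System.out.println(uniqueInstant(List.of("20220802", "20220802"))); // 20220802
    System.out.println(uniqueInstant(List.of("20220802", "20220803"))); // empty string
  }
}
```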
Contributor


The fix may be right but there are some thoughts about code engineering:

  1. we should not put static config options in WriteMetadataEvent which is a dynamic metadata event from write task.

  2. do we really need to care about the parallelism of last run ? Can we just merge the bootstrap events first before sending it to the coordinator ?

Contributor Author


We have to persist the parallelism somewhere because it is used to validate whether all flushes from the last run completed, and it may change across runs. So I chose the most straightforward place to store it (i.e., in the metadata state). Another option is to let the user pass in the parallelism of the last run through configs, which is not user-friendly because users would need to understand the logic behind it.

Merging bootstrap events before sending is a good point! I will improve it, and it seems we could remove the transient variable mergeCount based on it.

Contributor

@danny0405 danny0405 Aug 2, 2022


We have to persist the parallelism somewhere because it is used to validate if all flushes in the last run complete. And it may change across runs

Did you notice that we send an empty bootstrap event even if there is nothing in the metadata state? There are two cases where the metadata state is empty:

  1. the first run of the app
  2. the parallelism is increased, say from 4 to 8

And in these cases, we still send a bootstrap event to the coordinator so that it can be used for validation.

Contributor Author

@YuweiXiao YuweiXiao Aug 2, 2022


Yes, and those empty events will be excluded because their instant time is empty.

Let's say we increase write.tasks from 4 to 8. During restart, we will receive 8 metadata events (with 4 empty bootstrap events, and potentially more empty ones if some tasks failed to flush in the last run). To know whether the remaining non-empty metadata events are the full collection of events from the last run, we have to check against the parallelism of the last run (i.e., 4) rather than the new parallelism (i.e., 8).
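The 4-to-8 scenario can be illustrated with a toy contrast between the two checks. This is a hedged sketch with invented names (OldVsNewCheck, oldCheck, newCheck); it ignores the merged-event case and only models instant-time emptiness:

```java
import java.util.Arrays;

public class OldVsNewCheck {
  // Old-style logic: commit when every slot of the CURRENT run reported a
  // non-empty bootstrap event; this can never pass after the parallelism grows.
  static boolean oldCheck(String[] instants) {
    return Arrays.stream(instants).noneMatch(String::isEmpty);
  }

  // New-style logic: count the non-empty events against the LAST run's parallelism.
  static boolean newCheck(String[] instants, int lastRunParallelism) {
    return Arrays.stream(instants).filter(s -> !s.isEmpty()).count() == lastRunParallelism;
  }

  public static void main(String[] args) {
    // write.tasks went 4 -> 8: four real events plus four empty ones.
    String[] instants = {"t", "t", "t", "t", "", "", "", ""};
    System.out.println(oldCheck(instants));    // false: the re-commit would be skipped
    System.out.println(newCheck(instants, 4)); // true: all 4 events of the last run arrived
  }
}
```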

Contributor


The operator state backend spreads the events evenly across the write tasks, and I want to note that the empty bootstrap event is also valid for coordinator validation.

But here is a problem: when we decrease the parallelism, one write task may hold several bootstrap events. We should merge these events before sending them to the coordinator, or the coordinator may commit eagerly before it receives all the bootstrap events.
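The "merge before send" idea can be sketched like this. The Event class below is a deliberately minimal stand-in, not Hudi's WriteMetadataEvent, and the numOfMetadataState field is an assumption modeled on the accessor quoted elsewhere in this review:

```java
import java.util.ArrayList;
import java.util.List;

public class BootstrapMerge {
  static class Event {
    final String instant;
    final List<String> writeStatuses;
    final int numOfMetadataState; // how many last-run states this event covers

    Event(String instant, List<String> writeStatuses, int numOfMetadataState) {
      this.instant = instant;
      this.writeStatuses = writeStatuses;
      this.numOfMetadataState = numOfMetadataState;
    }
  }

  // Collapse all events a task restored into one before sending it upstream.
  static Event merge(List<Event> restored) {
    String instant = restored.stream().map(e -> e.instant)
        .reduce((a, b) -> a.equals(b) ? a : "").orElse("");
    List<String> statuses = new ArrayList<>();
    restored.forEach(e -> statuses.addAll(e.writeStatuses));
    int total = restored.stream().mapToInt(e -> e.numOfMetadataState).sum();
    return new Event(instant, statuses, total);
  }

  public static void main(String[] args) {
    // Parallelism went 8 -> 4: this task restored two of the old states and
    // sends a single merged event, so the coordinator cannot commit early.
    Event merged = merge(List.of(
        new Event("20220803000000", List.of("s0"), 1),
        new Event("20220803000000", List.of("s1"), 1)));
    System.out.println(merged.numOfMetadataState); // 2
  }
}
```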

Contributor Author


Yes, that is why I added the eventNumOfTask/mergeCount variables and changed the eager-commit behavior. Following your 'merge before send' suggestion, mergeCount is no longer needed.

Contributor


Yes, merging events before sending them to the coordinator makes the logic much simpler.

@danny0405 danny0405 added the priority:medium Moderate impact; usability gaps label Aug 3, 2022
@danny0405 danny0405 self-assigned this Aug 3, 2022
@YuweiXiao
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

hudi-bot commented Aug 3, 2022

CI report:

Bot commands: @hudi-bot supports the following command:
  • @hudi-bot run azure: re-run the last Azure build

return Option.empty();
}

int totalNumOfMetadataStates = events.stream().mapToInt(WriteMetadataEvent::getNumOfMetadataState).sum();
Contributor


We can handle the bootstrap events just as before, because we already merge them before sending.

Contributor Author


What does "before" mean here? We have to validate the number of events against the parallelism of the last run.

Contributor


we have to validate number of events against parallelism of the last run.

There is no need to validate it, the checkpoint mechanism would ensure the integrity of the metadata events.

Contributor Author


Hey Danny, it is true that the validation is not necessary if we use the same write.tasks across runs. But that may not be the case when users change the write parallelism across runs.

For example, when the user reduces the parallelism from 8 to 4, the coordinator will receive 4 merged events with the merging optimization you proposed. To determine whether we can do the re-commit, we need to validate the events against 8 rather than only checking that all 4 event buffers are non-empty. This is also why I now persist the parallelism in the writer state, since it may be needed in the next run.

I am not sure whether there is a better solution, and the fix may not be helpful in most cases. But for use cases where the source purely relies on the Flink checkpoint to determine the consumption offset, mis-handling the re-commit during restart may cause the loss of one batch of data on the Hudi side (or duplicate data).

Contributor Author


And in the case where we increase the parallelism (e.g., 4 to 8), empty bootstrap events will be sent by the writers. Following the old logic, the re-commit may be skipped because the last message received by the coordinator may be an empty event rather than one carrying an instant time.

Contributor


we need to validate the events against 8 rather than only checking if all 4 event buffer are not empty

We do not need to, because we know that each write task sends exactly one event.

Contributor Author

@YuweiXiao YuweiXiao Aug 15, 2022


Hello, is there any progress on this PR?

Hey, sorry for the late reply. Could you help me understand why the validation is not necessary? The events sent by the writers could be empty or carry metadata, which should be normal when the parallelism changes.

@danny0405
Contributor

Hello, is there any progress on this PR?

@danny0405 danny0405 added engine:flink Flink integration writer-core labels Aug 15, 2022
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 26, 2024