[HUDI-6366] Prevent flink offline table service rerun completed instant by hbgstc123 · Pull Request #8950 · apache/hudi

hbgstc123 · 2023-06-13T02:26:30Z

If a Flink offline table service job fails after committing the compaction/clustering instant, and a restart strategy is enabled, the completed instant will be rerun.

As a consequence, if the completed instant has already been archived, the rerun fails with a file-not-found error; if it has not been archived, the rerun produces duplicate base files.

Change Logs

Check whether the compaction/clustering instant is still pending in the active timeline before running the Flink offline table service.

Impact

none

Risk level (write none, low medium or high below)

none

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

danny0405 · 2023-06-13T10:11:04Z

if not archived, then there will be duplicated base files.

How are these duplicates generated?

hbgstc123 · 2023-06-13T16:46:02Z

if not archived, then there will be duplicated base files.

How are these duplicates generated?

For compaction: suppose the first run compacts fileID1_timestamp1.log and fileID1_0-1-0_timestamp1.parquet and generates fileID1_0-1-0_timestamp2.parquet, and the job fails after the compaction is committed. When the job restarts and reruns this compaction instant, the second run compacts fileID1_timestamp1.log and fileID1_0-1-0_timestamp1.parquet again, but generates fileID1_0-1-1_timestamp2.parquet, and then fails to complete because the instant was already completed in the first run. The two files fileID1_0-1-0_timestamp2.parquet and fileID1_0-1-1_timestamp2.parquet are duplicates.

Clustering is similar but worse: the second run generates a parquet file with a new file ID, so subsequent reads of the table return wrong results.
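
The compaction case above can be illustrated with a small, self-contained sketch. It assumes base file names follow the layout `<fileId>_<writeToken>_<instantTime>.parquet`, as in the example names in this comment, and flags any (file ID, instant) pair that has more than one base file. `DuplicateBaseFileCheck` and `findDuplicates` are hypothetical names used only for illustration; they are not part of Hudi.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Hudi class): finds base files that share a file ID and instant. */
class DuplicateBaseFileCheck {

  /** Returns only the groups that hold two or more base files for the same file ID and instant. */
  static Map<String, List<String>> findDuplicates(List<String> fileNames) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (String name : fileNames) {
      int dot = name.lastIndexOf('.');
      if (dot < 0) {
        continue; // no extension, ignore
      }
      // Assumed layout: <fileId>_<writeToken>_<instantTime>.<extension>
      String[] parts = name.substring(0, dot).split("_");
      if (parts.length != 3) {
        continue; // not a base file under the assumed layout, e.g. fileID1_timestamp1.log
      }
      String key = parts[0] + "@" + parts[2]; // fileId@instantTime
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
    }
    // Keep only groups where the same instant wrote the same file ID more than once.
    groups.values().removeIf(files -> files.size() < 2);
    return groups;
  }

  public static void main(String[] args) {
    List<String> files = List.of(
        "fileID1_timestamp1.log",
        "fileID1_0-1-0_timestamp1.parquet",
        "fileID1_0-1-0_timestamp2.parquet",  // written by the first run
        "fileID1_0-1-1_timestamp2.parquet"); // written by the rerun after restart
    System.out.println(findDuplicates(files));
  }
}
```

Only the write token differs between the two flagged files, which is why grouping by file ID and instant time is enough to spot them in this toy example.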

danny0405 · 2023-06-14T09:48:01Z

the job fail after compaction committed, then job restart and rerun this compaction instant, this second run will again compact fileID1_timestamp1.log and fileID1_0-1-0_timestamp1.parquet,

We should skip the completed instant; is that the behavior of the current code?

      // fetch the instant based on the configured execution sequence
      HoodieTimeline pendingCompactionTimeline = table.getActiveTimeline().filterPendingCompactionTimeline();

hbgstc123 · 2023-06-14T10:24:44Z

the job fail after compaction committed, then job restart and rerun this compaction instant, this second run will again compact fileID1_timestamp1.log and fileID1_0-1-0_timestamp1.parquet,

We should skip the completed instant, is that the behavior of current code then?
     // fetch the instant based on the configured execution sequence
      HoodieTimeline pendingCompactionTimeline = table.getActiveTimeline().filterPendingCompactionTimeline();

Yes, but that is the client code; when a task fails and restarts, that code is not rerun.

danny0405

+1, nice catch ~

danny0405 · 2023-06-14T12:41:24Z

...e/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringPlanSourceFunction.java

  public void open(Configuration parameters) throws Exception {
-    // no operation
+    isPending = StreamerUtil.createMetaClient(conf).getActiveTimeline()
+        .getInstantsAsStream().anyMatch(i -> clusteringInstantTime.equals(i.getTimestamp()) && !i.isCompleted());


There is a method named containsInstant; can we use that?

Maybe we can also move the code into run so that there is no need to keep the class member isPending.
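
A minimal sketch of the shape this suggestion points at, under stated assumptions: the pending check runs at the top of `run`, so it is re-evaluated on every task restart and no field is needed. The types below (`PlanSourceSketch`, `Timeline`, `State`) are simplified stand-ins invented for illustration; they are not the Hudi or Flink classes, and the real change may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a plan source task; not a Hudi or Flink class. */
class PlanSourceSketch {
  enum State { REQUESTED, INFLIGHT, COMPLETED }

  /** Toy active timeline: instant time -> state. Archived instants are simply absent. */
  static final class Timeline {
    private final Map<String, State> instants = new HashMap<>();

    void put(String instantTime, State state) {
      instants.put(instantTime, state);
    }

    void archive(String instantTime) {
      instants.remove(instantTime);
    }

    /** Pending means present in the active timeline and not yet completed. */
    boolean isPending(String instantTime) {
      State state = instants.get(instantTime);
      return state != null && state != State.COMPLETED;
    }
  }

  private final Timeline timeline;
  private final String instantTime;
  private final List<String> operations;

  PlanSourceSketch(Timeline timeline, String instantTime, List<String> operations) {
    this.timeline = timeline;
    this.instantTime = instantTime;
    this.operations = operations;
  }

  /** Invoked on every (re)start of the task; returns the operations emitted downstream. */
  List<String> run() {
    // The check lives here rather than in open(), so no isPending member is required.
    if (!timeline.isPending(instantTime)) {
      return List.of(); // completed or archived instant: emit nothing
    }
    return new ArrayList<>(operations);
  }

  public static void main(String[] args) {
    Timeline timeline = new Timeline();
    timeline.put("timestamp2", State.INFLIGHT);
    PlanSourceSketch source = new PlanSourceSketch(timeline, "timestamp2", List.of("op-1", "op-2"));

    System.out.println("first run: " + source.run());
    timeline.put("timestamp2", State.COMPLETED); // commit succeeds, then the job fails
    System.out.println("after restart: " + source.run());
    timeline.archive("timestamp2");              // instant later archived
    System.out.println("after archive: " + source.run());
  }
}
```

In this toy model both failure modes from the description collapse into the same branch: a completed instant and an archived instant are both "not pending", so the restarted task emits no work.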

yihua · 2023-06-14T23:38:17Z

Rebased this PR to fix build error.

hudi-bot · 2023-06-15T12:43:23Z

CI report:

3c3905b UNKNOWN
87974a6 Azure: SUCCESS

hbgstc123 force-pushed the fix_rerun_complete_service_instant branch from 3c3905b to 379ab53 Compare June 13, 2023 02:40

danny0405 self-assigned this Jun 14, 2023

danny0405 added engine:flink Flink integration area:table-service Table services usability priority:medium Moderate impact; usability gaps labels Jun 14, 2023

danny0405 approved these changes Jun 14, 2023

danny0405 reviewed Jun 14, 2023

yihua force-pushed the fix_rerun_complete_service_instant branch from d9be799 to 79add0e Compare June 14, 2023 23:38

hbgstc123 force-pushed the fix_rerun_complete_service_instant branch from 79add0e to 102ae95 Compare June 15, 2023 06:43

hbg added 3 commits June 15, 2023 15:46

[HUDI-6366] Prevent flink offline table service rerun complete instant

87dde37

[HUDI-6366] move checking logic into run method

9c7e088

[HUDI-6366] move checking logic into run method

87974a6

hbgstc123 force-pushed the fix_rerun_complete_service_instant branch from 102ae95 to 87974a6 Compare June 15, 2023 07:47

yihua merged commit cc33f53 into apache:master Jun 15, 2023
