Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-5822] Fix FileId not found exception when FileId is passed to HoodieMergeHa... #7997

Closed
wants to merge 2 commits into from

Conversation

voonhous
Copy link
Member

@voonhous voonhous commented Feb 20, 2023

...ndle

Applying the fix from #5185 will fix write issues for MOR tables, but will cause write issues for COW tables.

More information on how to reproduce the COW error in this jira issue:
https://issues.apache.org/jira/browse/HUDI-5822

This addresses the issue raised here: #5782

Change Logs

  1. Complement the change of getLatestFileSlices -> getAllFileGroups in [HUDI-3758] Fix duplicate fileId error in MOR table type with flink bucket hash Index  #5185

Impact

  1. Allow jobs to recover properly
  2. Ensure that the correct fileSlices are being read

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@voonhous voonhous marked this pull request as draft February 20, 2023 13:00
@voonhous voonhous marked this pull request as ready for review February 20, 2023 13:54
@voonhous voonhous changed the title Fix FileId not found exception when FileId is passed to HoodieMergeHa… [HUDI-5822] Fix FileId not found exception when FileId is passed to HoodieMergeHa… Feb 20, 2023
@voonhous voonhous marked this pull request as draft February 20, 2023 13:56
@voonhous voonhous changed the title [HUDI-5822] Fix FileId not found exception when FileId is passed to HoodieMergeHa… [HUDI-5822] Fix write and read correctness issue when a rollback is performed Feb 21, 2023
@bvaradar bvaradar added the pr:wip Work in Progress/PRs label Feb 22, 2023
@bvaradar bvaradar self-assigned this Feb 22, 2023
@voonhous voonhous changed the title [HUDI-5822] Fix write and read correctness issue when a rollback is performed [HUDI-5822] Fix FileId not found exception when FileId is passed to HoodieMergeHa Feb 22, 2023
@voonhous voonhous changed the title [HUDI-5822] Fix FileId not found exception when FileId is passed to HoodieMergeHa [HUDI-5822] Fix FileId not found exception when FileId is passed to HoodieMergeHa... Feb 22, 2023
@voonhous voonhous marked this pull request as ready for review February 22, 2023 07:50
@voonhous voonhous requested review from bvaradar and SteNicholas and removed request for bvaradar and SteNicholas February 22, 2023 08:04
@SteNicholas
Copy link
Member

@voonhous, could you please add the test cases for the changes?

@voonhous
Copy link
Member Author

@voonhous, could you please add the test cases for the changes?

Hmmm, not easy to write a test for this. This bug can only be triggered if JM's rollback is running behind TM.

@voonhous
Copy link
Member Author

@hbgstc123 @TengHuo for visibility

@SteNicholas
Copy link
Member

@voonhous, could you add the unit tests for bootstrapIndexIfNeed?

@voonhous
Copy link
Member Author

@SteNicholas @danny0405 Can you please help to review this commit? Thank you

@voonhous
Copy link
Member Author

@hudi-bot run azure

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@@ -115,16 +115,24 @@ protected HoodieRollbackRequest getRollbackRequestForAppend(String markerFilePat
// TODO(HUDI-1517) use provided marker-file's path instead
Option<HoodieLogFile> latestLogFileOption = FSUtils.getLatestLogFile(table.getMetaClient().getFs(), partitionPath, fileId,
HoodieFileFormat.HOODIE_LOG.getFileExtension(), baseCommitTime);

// Log file can be deleted if the commit to rollback is also the commit that created the fileGroup
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 here

Copy link
Contributor

@danny0405 danny0405 Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, seems here is the code you wanna folk from the listing based rollback strategy:

              // In case all data was inserts and the commit failed, delete the file belonging to that commit
              // We do not know fileIds for inserts (first inserts are either log files or base files),
              // delete all files for the corresponding failed commit, if present (same as COW)
              hoodieRollbackRequests.add(getHoodieRollbackRequest(partitionPath, filesToDelete));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Contributor

@danny0405 danny0405 Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split this PR into 2, one is for the marker based rollback fix, another is for the testing of flink stream writer with hashing index enabled? I mean we create two JIRA issues here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, good idea!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate PR: #8077

@danny0405
Copy link
Contributor

Close because it should be fixed in #8077

@danny0405 danny0405 closed this Mar 20, 2023
@voonhous
Copy link
Member Author

@danny0405 This is not fixed... The rollback standardises the rollback logic between markers and listing for MOR tables.

The bug is still present in COW tables... TM might run ahead of of JM. If we continue to use getAllFileGroups, TM might still be able to fetch the fileGroup that is destined to be deleted, causing the fileId/Group not found in issue.

@danny0405
Copy link
Contributor

TM might run ahead of of JM

How could this happen? The write tasks on JM would wait for the JM coordinator to finish initialization then start to hande the writiing process.

@voonhous
Copy link
Member Author

voonhous commented Mar 20, 2023

JM

2023-02-20 11:36:34,826 INFO  org.apache.hudi.client.BaseHoodieWriteClient                 [] - Begin rollback of instant 20230220112929727
2023-02-20 11:36:34,833 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Loading HoodieTableMetaClient from hdfs://hudi_path
2023-02-20 11:36:34,947 INFO  org.apache.hudi.common.table.HoodieTableConfig               [] - Loading table properties from hdfs://hudi_path/.hoodie/hoodie.properties
2023-02-20 11:36:34,952 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://hudi_path
2023-02-20 11:36:34,952 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Loading Active commit timeline for hdfs://hudi_path
2023-02-20 11:36:35,320 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded instants upto : Option{val=[==>20230220112929727__commit__INFLIGHT]}
2023-02-20 11:36:35,321 INFO  org.apache.hudi.common.table.view.FileSystemViewManager      [] - Creating View Manager with storage type :REMOTE_FIRST
2023-02-20 11:36:35,321 INFO  org.apache.hudi.common.table.view.FileSystemViewManager      [] - Creating remote first table view
2023-02-20 11:36:35,323 INFO  org.apache.hudi.client.BaseHoodieWriteClient                 [] - Scheduling Rollback at instant time : 20230220113634829 (exists in active timeline: true), with rollback plan: false
2023-02-20 11:36:35,612 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded instants upto : Option{val=[==>20230220113634829__rollback__REQUESTED]}
2023-02-20 11:36:35,612 INFO  org.apache.hudi.table.action.rollback.BaseRollbackPlanActionExecutor [] - Requesting Rollback with instant time [==>20230220113634829__rollback__REQUESTED]
2023-02-20 11:36:35,620 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded instants upto : Option{val=[==>20230220113634829__rollback__REQUESTED]}
2023-02-20 11:36:35,694 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Checking for file exists ?hdfs://hudi_path/.hoodie/20230220113634829.rollback.requested
2023-02-20 11:36:35,706 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Create new file for toInstant ?hdfs://hudi_path/.hoodie/20230220113634829.rollback.inflight
2023-02-20 11:36:35,709 INFO  org.apache.hudi.table.action.rollback.CopyOnWriteRollbackActionExecutor [] - Clean out all base files generated for commit: [==>20230220112929727__commit__INFLIGHT]
2023-02-20 11:36:35,720 INFO  org.apache.hudi.table.action.rollback.CopyOnWriteRollbackActionExecutor [] - Time(in ms) taken to finish rollback 11
2023-02-20 11:36:35,720 INFO  org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Rolled back inflight instant 20230220112929727
2023-02-20 11:36:35,721 INFO  org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Index rolled back for commits [==>20230220112929727__commit__INFLIGHT]
2023-02-20 11:36:35,725 INFO  org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Deleting instant=[==>20230220112929727__commit__INFLIGHT]
2023-02-20 11:36:35,725 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Deleting instant [==>20230220112929727__commit__INFLIGHT]
2023-02-20 11:36:35,728 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Removed instant [==>20230220112929727__commit__INFLIGHT]
2023-02-20 11:36:35,728 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Deleting instant [==>20230220112929727__commit__REQUESTED]
2023-02-20 11:36:35,731 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Removed instant [==>20230220112929727__commit__REQUESTED]
2023-02-20 11:36:35,732 INFO  org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Deleted pending commit [==>20230220112929727__commit__REQUESTED]
2023-02-20 11:36:35,733 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Checking for file exists ?hdfs://hudi_path/.hoodie/20230220113634829.rollback.inflight
2023-02-20 11:36:35,786 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Create new file for toInstant ?hdfs://hudi_path/.hoodie/20230220113634829.rollback
2023-02-20 11:36:35,786 INFO  org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor [] - Rollback of Commits [20230220112929727] is complete
2023-02-20 11:36:35,805 INFO  org.apache.hudi.common.fs.FSUtils                            [] - Removed directory at hdfs://hudi_path/.hoodie/.temp/20230220112929727
2023-02-20 11:36:35,806 INFO  org.apache.hudi.metrics.HoodieMetrics                        [] - Sending rollback metrics (duration=973, numFilesDeleted=2)
2023-02-20 11:36:35,812 INFO  org.apache.hudi.client.BaseHoodieWriteClient                 [] - Generate a new instant time: 20230220113635812 action: commit
2023-02-20 11:36:35,815 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded instants upto : Option{val=[20230220113634829__rollback__COMPLETED]}

TM

 11:36:33,837 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://hdfs_path
2023-02-20 11:36:33,837 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Loading Active commit timeline for hdfs://hdfs_path
2023-02-20 11:36:33,840 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded instants upto : Option{val=[==>20230220112929727__commit__INFLIGHT]}
2023-02-20 11:36:33,841 INFO  org.apache.hudi.client.BaseHoodieClient                      [] - Embedded Timeline Server is disabled. Not starting timeline service
2023-02-20 11:36:33,843 INFO  org.apache.hudi.common.table.view.FileSystemViewManager      [] - Creating View Manager with storage type :REMOTE_FIRST
2023-02-20 11:36:33,843 INFO  org.apache.hudi.common.table.view.FileSystemViewManager      [] - Creating remote first table view
2023-02-20 11:36:33,849 INFO  org.apache.hudi.common.table.view.AbstractTableFileSystemView [] - Took 2 ms to read  0 instants, 0 replaced file groups
2023-02-20 11:36:33,850 INFO  org.apache.hudi.sink.common.AbstractStreamWriteFunction      [] - Send bootstrap write metadata event to coordinator, task[0].
2023-02-20 11:36:33,850 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded instants upto : Option{val=[==>20230220112929727__commit__INFLIGHT]}
2023-02-20 11:36:33,853 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - bucket_write: dim_buyer_info_test (1/2)#0 (e2e7a38d69393b8d814ad48544573435_829468138015e9cb689e833f1800885e_0_0) switched from INITIALIZING to RUNNING.
2023-02-20 11:36:33,857 INFO  org.apache.hudi.sink.CleanFunction                           [] - Executor executes action [wait for cleaning finish] success!
2023-02-20 11:36:33,860 INFO  org.apache.hudi.sink.bucket.BucketStreamWriteFunction        [] - Loading Hoodie Table dim_buyer_info_test, with path hdfs://hdfs_path/age=0
2023-02-20 11:36:33,860 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Loading HoodieTableMetaClient from hdfs://hdfs_path
2023-02-20 11:36:33,867 INFO  org.apache.hudi.common.table.HoodieTableConfig               [] - Loading table properties from hdfs://hdfs_path/.hoodie/hoodie.properties
2023-02-20 11:36:33,868 INFO  org.apache.hudi.common.util.ClusteringUtils                  [] - Found 0 files in pending clustering operations
2023-02-20 11:36:33,868 INFO  org.apache.hudi.common.table.view.AbstractTableFileSystemView [] - Building file system view for partition (age=1)
2023-02-20 11:36:33,872 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://hdfs_path
2023-02-20 11:36:33,872 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Loading Active commit timeline for hdfs://hdfs_path
2023-02-20 11:36:33,878 INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline   [] - Loaded instants upto : Option{val=[==>20230220112929727__commit__INFLIGHT]}
2023-02-20 11:36:33,879 INFO  org.apache.hudi.common.table.view.AbstractTableFileSystemView [] - addFilesToView: NumFiles=2, NumFileGroups=1, FileGroupsCreationTime=6, StoreTimeTaken=1
2023-02-20 11:36:33,880 INFO  org.apache.hudi.common.table.view.FileSystemViewManager      [] - Creating View Manager with storage type :REMOTE_FIRST
2023-02-20 11:36:33,880 INFO  org.apache.hudi.common.table.view.FileSystemViewManager      [] - Creating remote first table view
2023-02-20 11:36:33,880 INFO  org.apache.hudi.sink.bucket.BucketStreamWriteFunction        [] - bootstrapIndexIfNeed with timeline: [[==>20230220112929727__commit__INFLIGHT]]
2023-02-20 11:36:33,880 INFO  org.apache.hudi.common.table.HoodieTableMetaClient           [] - Loading HoodieTableMetaClient from hdfs://hdfs_path
2023-02-20 11:36:33,880 INFO  org.apache.hudi.sink.bucket.BucketStreamWriteFunction        [] - Should load this partition bucket 0 with fileID 00000000-ee86-4b41-a704-9e075dd253d8
2023-02-20 11:36:33,880 INFO  org.apache.hudi.sink.bucket.BucketStreamWriteFunction        [] - Adding fileID 00000000-ee86-4b41-a704-9e075dd253d8 to the bucket 0 of partition age=1. 

Explanation

TM's BucketStreamWriteFunction#processElement is executing at 2023-02-20 11:36:33,880, while JM completed the rollback at 2023-02-20 11:36:35,815 (2 seconds after TM's BucketStreamWriteFunction#processElement)

Hope this log paints a clearer picture.

@voonhous
Copy link
Member Author

voonhous commented Mar 21, 2023

@danny0405 Running the tests included with this PR without the fix fails... Hence, I believe the fix in this PR is still required.

@danny0405 danny0405 reopened this Mar 21, 2023
@danny0405
Copy link
Contributor

@danny0405 Running the tests included with this PR without the fix fails... i.e. The fix in this PR is still required.

Got ya, can you rebase with the latest master and force-push again

@voonhous
Copy link
Member Author

@danny0405 Running the tests included with this PR without the fix fails... i.e. The fix in this PR is still required.

Got ya, can you rebase with the latest master and force-push again

Rebased and force-pushed.

@danny0405
Copy link
Contributor

@hudi-bot run azure

1 similar comment
@danny0405
Copy link
Contributor

@hudi-bot run azure

@danny0405
Copy link
Contributor

@voonhous The CI can not be triggered, can you fire another PR instead.

@voonhous voonhous marked this pull request as draft March 22, 2023 02:47
@voonhous voonhous marked this pull request as ready for review March 22, 2023 02:48
@voonhous voonhous closed this Mar 22, 2023
@voonhous
Copy link
Member Author

@voonhous The CI can not be triggered, can you fire another PR instead.

#8263

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr:wip Work in Progress/PRs
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

None yet

5 participants