[HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues #4716
62b022e to 059ba5b
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
@@ -264,17 +265,6 @@ private static void processRollbackMetadata(HoodieActiveTimeline metadataTableTi
partitionToAppendedFiles.get(partition).merge(new Path(path).getName(), size, fileMergeFn);
});
}

if (pm.getWrittenLogFiles() != null && !pm.getWrittenLogFiles().isEmpty()) {
This is addressing HUDI-3322
@hudi-bot run azure
Here is my understanding of the fix.
We fix the rollback plan generated in ListingBasedStrategy to just include the log files with full sizes.
For the marker-based strategy, we set file sizes to -1, but fixed the set of log files to be included in the rollback plan.
So, we still have an issue with how we reconcile or merge multiple metadata records.
For example:
Rec1: file1 delta size 100 (commit1)
Rec2: file1 delta size 200 (commit2)
Rec3: file1 full size 350 (rollback)
When we merge all three of these records from the metadata table, what does the final resolved record look like?
With respect to MDT bootstrap, we ensure that bootstrap is triggered before starting any operation.
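To make the merge question concrete, here is a minimal sketch of how the resolved size for file1 depends on the merge function. The `SizeRecord` class and the two merge strategies (summing vs. taking the maximum) are hypothetical illustrations, not Hudi's actual payload classes or merge logic.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

final class FileSizeMergeSketch {
  /** Hypothetical stand-in for one metadata record: a file name and a reported size. */
  static final class SizeRecord {
    final String file;
    final long size;

    SizeRecord(String file, long size) {
      this.file = file;
      this.size = size;
    }
  }

  /** Folds all records for the same file into one size using the given merge function. */
  static Map<String, Long> resolve(List<SizeRecord> records, BinaryOperator<Long> mergeFn) {
    Map<String, Long> resolved = new LinkedHashMap<>();
    for (SizeRecord r : records) {
      resolved.merge(r.file, r.size, mergeFn);
    }
    return resolved;
  }

  public static void main(String[] args) {
    List<SizeRecord> records = List.of(
        new SizeRecord("file1", 100L),   // Rec1: delta size (commit1)
        new SizeRecord("file1", 200L),   // Rec2: delta size (commit2)
        new SizeRecord("file1", 350L));  // Rec3: full size (rollback)

    // Treating every record as a delta double-counts the full size from the rollback.
    System.out.println(resolve(records, Long::sum)); // {file1=650}
    // Treating every record as a full size and keeping the largest one.
    System.out.println(resolve(records, Long::max)); // {file1=350}
  }
}
```

Under these assumptions, mixing delta-sized and full-sized records is only safe if the merge function does not add them together.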
...-common/src/main/java/org/apache/hudi/table/action/rollback/MarkerBasedRollbackStrategy.java
@@ -38,14 +38,6 @@
"type": "long",
"doc": "Size of this file in bytes"
}
}], "default":null },
This may not be backwards compatible when reading rollback metadata written with 0.10.0 or previous versions.
I've researched this a bit, since my intuition was that removing fields is a BWC change in Avro (Avro also supports forward compatibility, where you read data produced with a new schema using an old one, which is essentially the same case).
And it seems that field deletion is indeed BWC in Avro:
https://docs.confluent.io/platform/current/schema-registry/avro.html
This is an excerpt from Avro's docs:
https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
Hmm, interesting. Can you try this out explicitly? Write an Avro file using the master branch, and then try to read it using this branch. Or you can just try it out using a standalone Java main class, whichever works.
Just wanted to ensure we are good here.
Done with one pass.
If you don't mind, can we add tests for the fix in this patch? The tests should fail without the fix and pass with it.
Added assertions that only the latest log-file could have been modified by the instant that is being rolled back
Removed invalid assertion
… blocks to be reverted
Adding comments
…to make sure it's not ingesting intermediate step upon bootstrapping
…the one blocks have been appended to
bb05ceb to 550d8e3
When we do a rollback, the value in the plan (carrying the mapping of path to size) is only used to determine whether we should append a Rollback Block or delete files. After we have actually appended the Rollback Block, we now only modify the record related to the file we appended the block to (previously we would also update all the log-files from the
There are already tests covering this, which were failing in #4556 (which is how I came to fix these issues).
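The difference between reporting every log file of a file group and reporting only the one the Rollback Block was appended to can be sketched as follows. The `LogFile` class, its fields, and both helper methods are hypothetical simplifications for illustration and do not mirror Hudi's actual rollback classes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class RollbackTargetSketch {
  /** Hypothetical, simplified log file: its file group, base instant, and log version. */
  static final class LogFile {
    final String fileId;
    final String baseInstant;
    final int version;

    LogFile(String fileId, String baseInstant, int version) {
      this.fileId = fileId;
      this.baseInstant = baseInstant;
      this.version = version;
    }

    @Override
    public String toString() {
      return fileId + "_" + baseInstant + ".log." + version;
    }
  }

  /** Overly expansive selection: every log file in the file group with that base instant. */
  static List<LogFile> allInFileGroup(List<LogFile> files, String fileId, String baseInstant) {
    return files.stream()
        .filter(f -> f.fileId.equals(fileId) && f.baseInstant.equals(baseInstant))
        .collect(Collectors.toList());
  }

  /** Narrow selection: only the highest-versioned log file, i.e. the one appended to. */
  static Optional<LogFile> latestInFileGroup(List<LogFile> files, String fileId, String baseInstant) {
    return allInFileGroup(files, fileId, baseInstant).stream()
        .max(Comparator.comparingInt(f -> f.version));
  }

  public static void main(String[] args) {
    List<LogFile> files = List.of(
        new LogFile("f1", "001", 1),
        new LogFile("f1", "001", 2),
        new LogFile("f1", "001", 3),
        new LogFile("f2", "001", 1));

    System.out.println(allInFileGroup(files, "f1", "001"));          // [f1_001.log.1, f1_001.log.2, f1_001.log.3]
    System.out.println(latestInFileGroup(files, "f1", "001").get()); // f1_001.log.3
  }
}
```

In this sketch only the second selection would produce a metadata record, so sizes of untouched log files are not re-reported.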
Just one comment on the Avro BWC. Once verified, we are good to go ahead.
@nsivabalan here's the test:
For the following schema:
Working as expected
cool. thanks!
…es (apache#4716)
This change is addressing issues in regards to Metadata Table observing ingesting duplicated records leading to it persisting incorrect file-sizes for the files referred to in those records.
There are multiple issues that were leading to that:
- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.
This change will unblock Stack of PRs based off apache#4556
What is the purpose of the pull request
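This change addresses issues where the Metadata Table (MT) ingests duplicated records, leading it to persist incorrect file sizes for the files referred to in those records.
Multiple issues were leading to that:
- [HUDI-3322] Incorrect Rollback Plan generation: the Rollback Plan generated for MOR tables was overly expansive, listing all log-files with the latest base-instant as affected by the rollback, which led to invalid MT records being ingested for those files.
- [HUDI-3343] Metadata Table including uncommitted log files during bootstrap: since the MT is bootstrapped at the end of the commit operation's execution (after FS activity, but before committing to the timeline), it was incorrectly ingesting some files that were part of the intermediate state of the operation being committed.
This change will unblock the stack of PRs based off #4556.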
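The HUDI-3343 problem (bootstrap picking up files from an operation that has not been committed yet) can be illustrated with a small filter over a file listing. This is only a sketch of the idea: the `ListedFile` class, the example paths, and the notion of passing in a set of completed instants are assumptions made for illustration, not the implementation used in this PR.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class BootstrapFilterSketch {
  /** Hypothetical listing entry: a file path plus the instant that wrote it. */
  static final class ListedFile {
    final String path;
    final String writtenByInstant;

    ListedFile(String path, String writtenByInstant) {
      this.path = path;
      this.writtenByInstant = writtenByInstant;
    }
  }

  /** Keeps only the files written by instants that are already completed on the timeline. */
  static List<String> committedOnly(List<ListedFile> listing, Set<String> completedInstants) {
    return listing.stream()
        .filter(f -> completedInstants.contains(f.writtenByInstant))
        .map(f -> f.path)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<ListedFile> listing = List.of(
        new ListedFile("p1/.f1_001.log.1", "001"),
        new ListedFile("p1/.f1_001.log.2", "002")); // written by the operation still in flight

    // Only instant 001 has been committed, so the second file must not be ingested.
    System.out.println(committedOnly(listing, Set.of("001"))); // [p1/.f1_001.log.1]
  }
}
```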
Brief change log
writtenLogFilesSize
payload
HoodieWriteStat
w/in ListingBasedRollbackRequest
RollbackPlan generation
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
This PR fixes the following tests, which started to fail after the changes in #4556:
TestHoodieSparkMergeOnReadTableRollback#testMORTableRestore
TestHoodieSparkMergeOnReadTableRollback#testRollbackWithDeltaAndCompactionCommit
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.