[HUDI-5279] move logic for deleting active instant to HoodieActiveTimeline#7196
YannByron merged 6 commits into apache:master
e9b0ac6 to eea310e
@hudi-bot run azure
Path inFlightCommitFilePath = getInstantFileNamePath(instant.getFileName());
try {
  if (metaClient.getFs().exists(inFlightCommitFilePath)) {
    boolean result = metaClient.getFs().delete(inFlightCommitFilePath, false);
It seems the core change here is renaming a method: deleteInstantFileIfExists -> deleteInstantIfExists? If so, I would suggest keeping the name as it is.
No.
The major change is that when the archive operation needs to delete active instants, it now uses HoodieActiveTimeline's deleteInstantIfExists instead of calling the naked API (fs.delete). After this change, HoodieActiveTimeline becomes the standard route for operating on active instants.
To support this, the original methods deleteInstantFile (used in the current class and its subclass) and deleteInstantFileIfExists (used outside) are merged into one and renamed to deleteInstantIfExists.
"calling the naked api (fs.delete)."
deleteInstantFile(HoodieInstant instant) also takes a HoodieInstant param, so what's the difference here?
The old way of deleting active instants during archiving goes through deleteArchivedInstantFiles -> deleteFilesParallelize in HoodieTimelineArchiver, and deleteFilesParallelize calls fs.delete() directly.
public static <T, R> Map<T, R> parallelizeProcess(
    HoodieEngineContext hoodieEngineContext,
    int parallelism,
    SerializableFunction<T, R> pairFunction,
This looks like a general method for parallel processing, so it should not live in FSUtils. Maybe we can add a tool method similar to HoodieTimelineArchiver#deleteFilesParallelize for hoodie instants, for example HoodieTimelineArchiver#deleteInstantsParallelize.
if (result) {
  LOG.info("Removed instant " + instant);
} else {
  throw new HoodieIOException("Could not delete instant " + instant);
This method saves one invocation of fs.exists, so let's keep it.
1145ecb to 47f4bd8
@hudi-bot run azure
The CI has passed. @XuQianJin-Stars
A few notes:
- at the timeline API level, we should not allow ignoring errors; this affects data integrity
- there is another API, org.apache.hudi.common.table.timeline.HoodieActiveTimeline#deleteInstantFile(org.apache.hudi.common.table.timeline.HoodieInstant), to be consolidated/dedup'ed
- pls file a JIRA as this touches a critical code path; also properly fill in the PR template "Impact" and "Risk" sections
- pls help increase UT coverage for critical APIs wherever applicable
success &= result.getValue();
Map<HoodieInstant, Boolean> result = context.mapToPair(
    instants,
    instant -> ImmutablePair.of(
hide implementation: use Pair.of()
boolean success = true;
for (Map.Entry<HoodieInstant, Boolean> entry : result.entrySet()) {
  LOG.info("Archived and deleted instant " + entry.getKey().toString() + " : " + entry.getValue());
  success &= entry.getValue();
You can chain this onto result using the stream API's allMatch().
If we throw an error directly when the delete fails, no return value is required.
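The allMatch() suggestion above can be sketched roughly as follows. This is a minimal, self-contained illustration using plain java.util types; the class name DeleteResults, the method allSucceeded, and the String keys standing in for HoodieInstant are assumptions made for the example, not code from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DeleteResults {

  // Hypothetical helper: true only if every per-instant delete reported success.
  // It plays the role of the manual `success &= entry.getValue()` loop.
  static <K> boolean allSucceeded(Map<K, Boolean> results) {
    return results.values().stream().allMatch(Boolean::booleanValue);
  }

  public static void main(String[] args) {
    // String keys stand in for HoodieInstant in this sketch.
    Map<String, Boolean> results = new LinkedHashMap<>();
    results.put("001.commit", true);
    results.put("002.commit", false);
    System.out.println(allSucceeded(results));
  }
}
```

Like the loop that starts from success = true, this returns true for an empty map. Because allMatch() short-circuits on the first false value, any per-entry logging would have to happen separately from the reduction.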
  deleteInstantFileIfExists(instant, true);
}

public boolean deleteInstantFileIfExists(HoodieInstant instant, boolean exceptionIfFailToDelete) {
- public boolean deleteInstantFileIfExists(HoodieInstant instant, boolean exceptionIfFailToDelete) {
+ public boolean deleteInstantFileIfExists(HoodieInstant instant, boolean throwIfFailed) {
In what case do we want to silence the failure? This relates to timeline integrity, so we should prefer to fail out loud.
I think there is no case where this should be silenced. https://github.com/apache/hudi/pull/7196/files#r1032441151
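The "fail out loud" behaviour discussed above could look roughly like the sketch below. It uses java.nio.file on the local file system rather than Hudi's file system abstraction, and the class name InstantFileDeleter and its method are hypothetical stand-ins, not the PR's API. A missing file is tolerated, while any I/O failure is always rethrown, so there is no flag that could silence it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class InstantFileDeleter {

  // Hypothetical stand-in for a timeline-level delete: returns false when the
  // file is already absent, and rethrows every I/O failure as unchecked.
  static boolean deleteIfExists(Path instantFile) {
    try {
      return Files.deleteIfExists(instantFile);
    } catch (IOException e) {
      throw new UncheckedIOException("Could not delete instant file " + instantFile, e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("001", ".inflight");
    System.out.println(deleteIfExists(file)); // file existed and was removed
    System.out.println(deleteIfExists(file)); // already gone, nothing to delete
  }
}
```

Files.deleteIfExists also covers the existence check and the delete in a single call, which is in the spirit of the earlier remark about saving an fs.exists invocation.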
Path inFlightCommitFilePath = getInstantFileNamePath(instant.getFileName());
Path commitFilePath = getInstantFileNamePath(instant.getFileName());
We need to be careful about which commits can be deleted. Based on its usage, this API is designed to delete only requested/inflight or empty clean commits. (One exception is in org.apache.hudi.cli.commands.TestRepairsCommand#testShowFailedCommits, where this API is misused to delete completed commits in tests; we should make a separate test helper for that.)
We can't delete completed commit instants, since that breaks the timeline's integrity. So we should name it properly at the variable level as well as the API level.
Then we should add some documentation for this method. inFlightCommitFilePath doesn't represent requested and empty clean instants.
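One way to document and enforce the constraint as the reviewer states it is a precondition check before the delete. The sketch below is purely illustrative: the State enum is a simplified, hypothetical stand-in rather than Hudi's own type, the class and method names are invented for the example, and the "empty clean commit" case mentioned above is deliberately left out.

```java
class InstantGuard {

  // Hypothetical, simplified instant states for this sketch only.
  enum State { REQUESTED, INFLIGHT, COMPLETED }

  /**
   * Rejects deletion of completed instants; only pending (requested or
   * inflight) instants pass the check in this sketch.
   */
  static void checkDeletable(String instantTime, State state) {
    if (state == State.COMPLETED) {
      throw new IllegalStateException(
          "Refusing to delete completed instant " + instantTime);
    }
  }

  public static void main(String[] args) {
    checkDeletable("001", State.INFLIGHT); // passes silently
    try {
      checkDeletable("002", State.COMPLETED);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```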
    instants,
    instant -> ImmutablePair.of(
        instant,
        metaClient.getActiveTimeline().deleteInstantFileIfExists(instant, false)),
The original invocation is ignoreFailure=false; you inverted the flag here by passing exceptionIfFailToDelete=false, which silences the error.
I was wrong about that. I thought the original logic was the ignoreFailure=false case. Will fix this.
I thought there was a case that needed to ignore this error. Will correct this.
I know that and have done this, but Danny suggested that this PR focus on one question.
On filing a JIRA: after this PR is updated, there is almost no risk or impact.
The existing UTs are enough to cover this improvement.
47f4bd8 to 619effa
private void deleteArchivedInstants(HoodieEngineContext context,
                                    HoodieTableMetaClient metaClient,
                                    List<HoodieInstant> instants) {
  if (!instants.isEmpty()) {
    context.foreach(
        instants,
        instant -> metaClient.getActiveTimeline().deleteInstantFileIfExists(instant),
        Math.min(instants.size(), config.getArchiveDeleteParallelism())
No need for a separate method: this method overloads the other one, and only the signature differs, which confuses people about which calls which. This is just a one-line invocation. You can call context.foreach() once for the two lists together by chaining them.
OK, I didn't notice there was another method with the same name.
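The "chain the two lists and iterate once" idea from this thread might be sketched as below. It uses a plain Java forEach in place of the engine context's foreach, and the class ChainedDelete, its methods, and the String instants are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ChainedDelete {

  // Hypothetical helper: concatenate both lists so a single pass covers them.
  static <T> List<T> chain(List<T> first, List<T> second) {
    return Stream.concat(first.stream(), second.stream())
        .collect(Collectors.toList());
  }

  // Stand-in for one foreach(...) invocation over the combined list.
  static <T> void deleteAll(List<T> first, List<T> second, Consumer<T> deleter) {
    chain(first, second).forEach(deleter);
  }

  public static void main(String[] args) {
    List<String> deleted = new ArrayList<>();
    deleteAll(Arrays.asList("001", "002"), Arrays.asList("003"), deleted::add);
    System.out.println(deleted);
  }
}
```

With a single call site there is no need for a second, similarly named helper whose only difference is its signature.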
3b1e398 to 7f6d467
@hudi-bot run azure
43684b3 to a7d0518
This reverts commit 4a760ca.
Change Logs
Currently, only the logic for deleting active instants lives outside HoodieActiveTimeline. I think it's better for all operations on active instants to be concentrated in HoodieActiveTimeline.
Impact
none
Risk level (write none, low medium or high below)
none