New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-571] Add 'commits show archived' command to CLI #1274
Conversation
@satishkotha can you please look into the failing build ? |
@n3nash Fixed now, please take a look |
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level I'm okay with the approach, will review more deeply by tomorrow
archivedTimeline = new HoodieArchivedTimeline(this); | ||
} | ||
return archivedTimeline; | ||
public synchronized HoodieArchivedTimeline getArchivedTimeline(String startTs, String endTs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to remove if (archivedTimeline == null)
check ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
archivedTimeline as instance variable does not make sense because we are creating archive timeline for small time window (depending on command line arguments). So we will create mutiple archivedTimelines when someone is paginating through archived commit files. null check prevents us from doing that.
In the long term, when we have lazy loading, we can consider creating one instance of ArchivedTimeline that stores all metadata. we can bring back instance variable at that time. Until we have that, this does not make sense.
Let me know if you think there is a better way to organize this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the instance variable loaded lazily only when getArchivedTimeline() is called in which case subsequent cases can use that instance variable. I think the concern here is that the only method we expose now is getArchivedTimeline(String startTs, String endTs) and caching the instance variable from that doesn't make sense as you are saying as well. I'm wondering if we should keep an instance variable and keep the same method as before getArchivedTimeline() and then the constructor below is also light. Finally, we expose a method on the archived timeline to filter(startTs, endTs). This way we don't have to create multiple objects for each window invocation. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially wanted to do that. But HoodieDefaultTimeline(base class) keeps track of instants in private class variable. And we do all filtering operations on top of that. So, if we reuse same HoodieArchivedTimeline object, we will likely have to change structure of how instants are kept in memory for base class. I can talk to you more in person tomorrow. But right now, i prefer creating multiple instances of archived timeline because its less error prone IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After talking in person, changed behavior to create single instance of "HoodieArchivedTimeline" and load metadata for all commits.
Note that this means we read all archived files first. Then do a second pass for details of commits in specific time range. This increases overall time taken by first command. In the example dataset, it took ~20 minutes for initial metadata to load.
Then subsequent commands are few seconds each.
With previous approach we only do one pass on files. All commands are few seconds each.
So I think we need to improve metadata to reduce time taken by first step with new approach.
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTimeline.java
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, took a second pass, left some more comments, once addressed we should be able to merge this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@satishkotha left 1 high level comment on the structuring of the constructor, if we can come to a conclusion on that, will merge as rest looks good.
@CliOption(key = {"headeronly"}, help = "Print Header Only", unspecifiedDefaultValue = "false") | ||
final boolean headerOnly) | ||
throws IOException { | ||
if (StringUtils.isNullOrEmpty(startTs)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a defaultValue during the cliOption itself so you don't have to use these checks ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CLIOption attribute can only be assigned a static value. i think default static value is not that friendly. I chose to pick '10 days before today' which is likely to be commonly used. Let me know if you prefer default constant date instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, that makes sense
loadInstants(startTs, endTs); | ||
} | ||
|
||
public void removeInstantDetailsFromMemory(String startTs, String endTs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit : s/removeInstantDetailsFromMemory/clearInstantDetailsFromMemory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, renamed
/** | ||
* Get all instants (commits, delta commits) that produce new data, in the active timeline. | ||
*/ | ||
public HoodieTimeline getCommitsTimeline() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothchandar can you ack that this is okay to move to this class ? This way the ArchivedTimeline can also use these methods (and is aligned with our thoughts on having the same operations on archived and active timeline..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothchandar just pinging to resurface this since it might be lost in the large number of notifications :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@n3nash can't think of anything top of my mind.. Should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@satishkotha 2 minor comments, rest looks good to me, can merge after it's addressed. Also, once you've addressed the comments, could you please left a comment on the PR that you've addressed it ?
Addressed comments and made some more improvements. Please take a look |
@satishkotha one conflict, please rebase and then I can merge this |
ff43a04
to
596f329
Compare
@n3nash Fixed conflict. Note that test has been moved from hudi-common to hudi-client because sequence file based writing for archive log is no longer supported. Please take a look after Travis completes building |
@@ -135,6 +135,7 @@ | |||
/** | |||
* Total number of rollback blocks seen in a compaction operation. | |||
*/ | |||
@Nullable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid this change as part of this diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. Without this we are not able to read certain archived commits
@@ -290,7 +291,7 @@ public long getTotalRollbackBlocks() { | |||
return totalRollbackBlocks; | |||
} | |||
|
|||
public void setTotalRollbackBlocks(Long totalRollbackBlocks) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, we can address this as part of some other refactoring diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. Without this we are not able to read certain archived commits
10f6622
to
2c94b17
Compare
Hi guys, it seems that there 's a little problem with the regex pattern Just raise a PR #4521 trying to fix it. Wish you're interested and help me review? Thanks a lot. |
What is the purpose of the pull request
Add command to show archived commits. This is useful for debugging historical timeline.
Brief change log
Other items planned (to be part of later diffs):
Verify this pull request
This pull request is already covered by existing tests, such as:
`hoodie:$dataset->commits show archived