Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-571] Add 'commits show archived' command to CLI #1274

Merged
merged 1 commit into from Feb 5, 2020

Conversation

satishkotha
Copy link
Member

@satishkotha satishkotha commented Jan 23, 2020

What is the purpose of the pull request

Add command to show archived commits. This is useful for debugging historical timeline.

Brief change log

  • There is already 'show archived commits' command, but the output is not at all useful. Refactor common timeline methods to HoodieDefaultTimeline to reuse between Active and Archived timeline to make archived commits look similar to active commits
  • Modify tests to use archived timeline instead of reading files explicitly

Other items planned (to be part of later diffs):

  • Modify all other commands to have 'archived' command. 'savepoints show archived' etc
  • There is no pagination/lazy loading support and if a really long time range is specified, this can result in OOM. To keep the diff small, I'm sending this one first. After I get feedback, i can send another diff to support lazy loading (or pagination).

Verify this pull request

This pull request is already covered by existing tests, such as:

  • TestHoodieCommitArchiveLog.java
  • Manually verified the change by running CLI locally. Example output:
    `hoodie:$dataset->commits show archived

| CommitTime    | Total Bytes Written| Total Files Added| Total Files Updated| Total Partitions Written| Total Records Written| Total Update Records Written| Total Errors|
|=========================================================================================================================================================================|
| 20200113224332| 45.7 GB            | 0                | 86                 | 4                       | 97806475             | 107538                      | 0           |
| 20200113223441| 840.5 MB           | 0                | 1618               | 258                     | 1214615              | 15512                       | 0           |
| 20200113223441| 840.5 MB           | 0                | 1618               | 258                     | 1214615              | 15512                       | 0           |`

@n3nash n3nash self-assigned this Jan 23, 2020
@n3nash
Copy link
Contributor

n3nash commented Jan 23, 2020

@satishkotha can you please look into the failing build ?

@satishkotha
Copy link
Member Author

@satishkotha can you please look into the failing build ?

@n3nash Fixed now, please take a look

Copy link
Contributor

@n3nash n3nash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level I'm okay with the approach, will review more deeply by tomorrow

archivedTimeline = new HoodieArchivedTimeline(this);
}
return archivedTimeline;
public synchronized HoodieArchivedTimeline getArchivedTimeline(String startTs, String endTs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to remove if (archivedTimeline == null) check ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

archivedTimeline as instance variable does not make sense because we are creating archive timeline for small time window (depending on command line arguments). So we will create mutiple archivedTimelines when someone is paginating through archived commit files. null check prevents us from doing that.

In the long term, when we have lazy loading, we can consider creating one instance of ArchivedTimeline that stores all metadata. we can bring back instance variable at that time. Until we have that, this does not make sense.

Let me know if you think there is a better way to organize this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the instance variable loaded lazily only when getArchivedTimeline() is called in which case subsequent cases can use that instance variable. I think the concern here is that the only method we expose now is getArchivedTimeline(String startTs, String endTs) and caching the instance variable from that doesn't make sense as you are saying as well. I'm wondering if we should keep an instance variable and keep the same method as before getArchivedTimeline() and then the constructor below is also light. Finally, we expose a method on the archived timeline to filter(startTs, endTs). This way we don't have to create multiple objects for each window invocation. WDYT ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially wanted to do that. But HoodieDefaultTimeline(base class) keeps track of instants in private class variable. And we do all filtering operations on top of that. So, if we reuse same HoodieArchivedTimeline object, we will likely have to change structure of how instants are kept in memory for base class. I can talk to you more in person tomorrow. But right now, i prefer creating multiple instances of archived timeline because its less error prone IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking in person, changed behavior to create single instance of "HoodieArchivedTimeline" and load metadata for all commits.

Note that this means we read all archived files first. Then do a second pass for details of commits in specific time range. This increases overall time taken by first command. In the example dataset, it took ~20 minutes for initial metadata to load.
Then subsequent commands are few seconds each.

With previous approach we only do one pass on files. All commands are few seconds each.

So I think we need to improve metadata to reduce time taken by first step with new approach.

Copy link
Contributor

@n3nash n3nash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, took a second pass, left some more comments, once addressed we should be able to merge this

Copy link
Contributor

@n3nash n3nash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@satishkotha left 1 high level comment on the structuring of the constructor, if we can come to a conclusion on that, will merge as rest looks good.

@CliOption(key = {"headeronly"}, help = "Print Header Only", unspecifiedDefaultValue = "false")
final boolean headerOnly)
throws IOException {
if (StringUtils.isNullOrEmpty(startTs)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a defaultValue during the cliOption itself so you don't have to use these checks ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLIOption attribute can only be assigned a static value. i think default static value is not that friendly. I chose to pick '10 days before today' which is likely to be commonly used. Let me know if you prefer default constant date instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, that makes sense

loadInstants(startTs, endTs);
}

public void removeInstantDetailsFromMemory(String startTs, String endTs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : s/removeInstantDetailsFromMemory/clearInstantDetailsFromMemory

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, renamed

/**
* Get all instants (commits, delta commits) that produce new data, in the active timeline.
*/
public HoodieTimeline getCommitsTimeline() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinothchandar can you ack that this is okay to move to this class ? This way the ArchivedTimeline can also use these methods (and is aligned with our thoughts on having the same operations on archived and active timeline..)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinothchandar just pinging to resurface this since it might be lost in the large number of notifications :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@n3nash can't think of anything top of my mind.. Should be fine.

Copy link
Contributor

@n3nash n3nash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@satishkotha 2 minor comments, rest looks good to me, can merge after it's addressed. Also, once you've addressed the comments, could you please left a comment on the PR that you've addressed it ?

@satishkotha
Copy link
Member Author

Addressed comments and made some more improvements. Please take a look

@n3nash
Copy link
Contributor

n3nash commented Feb 4, 2020

@satishkotha one conflict, please rebase and then I can merge this

@satishkotha
Copy link
Member Author

@n3nash Fixed conflict. Note that test has been moved from hudi-common to hudi-client because sequence file based writing for archive log is no longer supported. Please take a look after Travis completes building

@@ -135,6 +135,7 @@
/**
* Total number of rollback blocks seen in a compaction operation.
*/
@Nullable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid this change as part of this diff

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Without this we are not able to read certain archived commits

@@ -290,7 +291,7 @@ public long getTotalRollbackBlocks() {
return totalRollbackBlocks;
}

public void setTotalRollbackBlocks(Long totalRollbackBlocks) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, we can address this as part of some other refactoring diff

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Without this we are not able to read certain archived commits

@zhangyue19921010
Copy link
Contributor

Hi guys, it seems that there 's a little problem with the regex pattern private static final Pattern ARCHIVE_FILE_PATTERN = Pattern.compile("^\\.commits_\\.archive\\.([0-9]*)$");

Just raise a PR #4521 trying to fix it. Wish you're interested and help me review? Thanks a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants