Skip to content

Comments

HDDS-4683. ListKeys: do lookup in dir and file tables#1811

Closed
rakeshadr wants to merge 11 commits intoapache:HDDS-2939from
rakeshadr:HDDS-4683
Closed

HDDS-4683. ListKeys: do lookup in dir and file tables#1811
rakeshadr wants to merge 11 commits intoapache:HDDS-2939from
rakeshadr:HDDS-4683

Conversation

@rakeshadr
Copy link
Contributor

What changes were proposed in this pull request?

This task is to perform #listKeys by searching the startKey, keyPrefix in Dir & File Tables.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4683

How was this patch tested?

Added UT.

Copy link
Contributor

@linyiqun linyiqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rakeshadr , haven't started the review. As I find there is one another similar task HDDS-4332(ListFileStatus - do lookup in directory and file tables). What's the major differences in these two task implementation? That would be helpful to review this PR if I know this.

@rakeshadr
Copy link
Contributor Author

@rakeshadr , haven't started the review. As I find there is one another similar task HDDS-4332(ListFileStatus - do lookup in directory and file tables). What's the major differences in these two task implementation? That would be helpful to review this PR if I know this.

Thank you @linyiqun for your help in reviews. Following are the points I learned after comparing listKeys and listStatus:

  1. ListKeys will skip the given startKey and won't be added to the finalList
  2. ListKeys will add trailing slash for the directory key names.
  3. ListKeys will recursively list all the keys starting from keyPrefix and startKey.
For ex: FS structure like, 

/a
/a/b
/a/b/c
/a/b/d

/a/b/file1
a/b/c/file2

#listKeys("/a/b") returns array values like, [a/b/, a/b/file1, a/b/c/, a/b/d/, a/b/c/file2]

Copy link
Contributor

@linyiqun linyiqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rakeshadr for the comments.
I take the initial review. Some places still not fully get the point. Please have a look.

startKey = OmUtils.normalizeKey(startKey, true);
keyPrefix = OmUtils.normalizeKey(keyPrefix, true);
startKey = OmUtils.normalizeKey(startKey, preserveTrailingSlash);
keyPrefix = OmUtils.normalizeKey(keyPrefix, preserveTrailingSlash);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully get this, why we don't preserveTrailingSlash once isOmLayoutVersionV1 is enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @linyiqun for the comments.

Yes, this logic is not required. I will remove this unnecessary isOmLayoutVersionV1 check in the PR.

I was having a thinking that, in V1 there won't be any trailing slash in the "keyName". As we know, in existing code a trailing slash in the keyName represents a directory path in KeyTable.

}
if (!StringUtils.equals(dirName, keyPrefix)) {
listKeysRecursivelyFromKeyPrefixV1(volumeName, bucketName, dirName,
maxKeys, keyList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxKeys should be updated when doing recursively list keys

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxKeys represents the limit and the batch size. During iteration and seeks, keyList.size() shouldn't exceed this limit. So, maxKeys is static.

keyList.size() <= maxKeys

Copy link
Contributor

@linyiqun linyiqun Jan 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, one improvement I am thinking: can we do a check if keyList.size() already reach maxKeys, if reached, no need to call listKeysRecursivelyFromKeyPrefixV1 for dir and can break the loop. This can avoid unnecessary listKeysRecursivelyFromKeyPrefixV1 be invoked for subsequent sub-dirs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, will do the check.

// Iteration-2:
// startKeyPos="dir1/dir2/dir3", parentKeyPrefix="dir1/dir2"
// Iteration-3:
// startKeyPos="dir1/dir2", parentKeyPrefix="dir1"
Copy link
Contributor

@linyiqun linyiqun Jan 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Edit my previous comments:
In the recursively search of startKey dir1/dir2/dir3, dir1/dir2, will it add the repeated entries in parent search?

E.g.
There are 5 files:
dir1/dir2/dir3/file1
dir1/dir2/dir3/file2
dir1/dir2/dir3/file3
dir1/dir2/dir3/file4

Iteration-1:
startKeyPos="dir1/dir2/dir3/file3", parentKeyPrefix="dir1/dir2/dir3"
keylist[dir1/dir2/dir3/file4]

Iteration-2:
startKeyPos="dir1/dir2/dir3", parentKeyPrefix="dir1/dir2"
keylist[dir1/dir2/dir3/file4, dir1/dir2/dir3/file1, dir1/dir2/dir3/file2, dir1/dir2/dir3/file3, dir1/dir2/dir3/file4]

In Iteration-2, will it search for the dir1/dir2/dir3 again? I think it will be skipped as dir1/dir2/dir3 is the new startKey will be skipped.

Copy link
Contributor Author

@rakeshadr rakeshadr Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @linyiqun for the example case. Please see my below reply.

In Iteration-2, will it search for the dir1/dir2/dir3 again? I think it will be skipped as dir1/dir2/dir3 is the new startKey will be skipped.

Yes, it will skip the startKey from the resultList when startKey is not empty. Also, it won't search all the files under dir1/dir2/dir3. Please read the below paragraph for the algorithm used in this patch.

Reason: the listKeysV1() algorithm is written like a template fashion with the following steps irrespective of startKey empty or non-empty:
step-1) It will fetch all the files under the given path.
step-2) It will fetch all the directories under the given path.

If the startKey is a directory, then the assumption is that all the files under this key was already visited in previous batch. So, in your example, you have passed startKeyPos="dir1/dir2/dir3". Here it will check any nextString directory exists after the startKeyPos in DirTable. In your example case, we don't have any , so will return empty dir list. Assume, there was "dir1/dir2/dir4" exists, then it would have added to the result list.
Now, it won't go to FileTable and list all keys after startKeyPos="dir1/dir2/dir3", instead it will move startKeyPos pointer to previous dir "dir1/dir2" and do a check !StringUtils.equals(parentStartKeyPath, keyPrefix). This return false and we have finished the listing key operation.

So the resultList in Iteration-2 case is, empty list []

Hope this explanation is clear to you.

protected List<OzoneFileStatus> listOzoneFileStatusesV1(boolean recursive,
String startKey, int countEntries, long numEntries, String clientAddress,
String volumeName, String bucketName, String keyName,
boolean sortAddress, boolean listKeys) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update this as a private method, I see this is privately used in this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

}

@Test
public void testListKeysWithNotNormalizedPath() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current test case only cover few cases of listKeysV1. Actually like startKey, maxKeys of listKeys are not tested. However, I also didn't find a good unit test added for original listKeys call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will add more test cases. I'm thinking something like large dirs and sub-dirs/files.

Also, there is an open point: #listKeys("key-a-"), basically partial keyName and startKey. For example, actual keyName stored in DB are "key-a-10", "key-a-11", "key-a-12"..."key-z-10", "key-z-11", "key-z-12".
I will raise separate jira and work it on.

Copy link
Contributor

@linyiqun linyiqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rakeshadr , I read the doc of listKey algorithm. And some comments I left below. Please have a look.

}
} else {
// TODO: Handle startKey not starting with keyPrefix. Should we suport
// this case?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If startKey does not start with key prefix, we can just compare the their value.

  • If key prefix already behind the startKey, this search equals to the listKeys(key prefix , startKey='') as all entries with key prefix should behind the startKey.
  • If key prefix not behind the startKey, the empty list expected to be returned since startKey behind all the prefix key entries.

Does this handling make the sense here?

@rakeshadr rakeshadr force-pushed the HDDS-2939 branch 3 times, most recently from ea6116a to 37178c5 Compare February 1, 2021 05:37
@rakeshadr rakeshadr force-pushed the HDDS-2939 branch 3 times, most recently from bc6d651 to ec62cdb Compare February 5, 2021 09:21
@mukul1987
Copy link
Contributor

@rakeshadr I think the patch needs to be rebased, there are multiple unrelated changes.

@rakeshadr rakeshadr force-pushed the HDDS-2939 branch 2 times, most recently from 2c7fc4a to e01a1ef Compare February 15, 2021 05:55
@rakeshadr
Copy link
Contributor Author

Thanks a lot @linyiqun for the detailed reviewing on this patch.

On a second thought, I'd like to avoid more complex logic on the server side. I thought of re-using the existing proxy#listStatus() client API and making recursion logic at the client side. I've created a new PR-1954. I'd greatly appreciate your feedback on that. Thanks again!

Its just a first implementation and I think few more follow up jira tasks required to fully refine the client side logic by introducing batch fetching on levels incrementally etc etc. Will plan that work one-by-one.

@rakeshadr rakeshadr closed this Feb 23, 2021
@linyiqun
Copy link
Contributor

On a second thought, I'd like to avoid more complex logic on the server side. I thought of re-using the existing proxy#listStatus() client API and making recursion logic at the client side. I've created a new PR-1954. I'd greatly appreciate your feedback on that. Thanks again!

@rakeshadr . Will have a look recent days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants