Skip to content

HDDS-6428. [Merge rocksdb in datanode] Add prefix iterator support to RDBTable.#3176

Merged
nandakumar131 merged 2 commits intoapache:HDDS-3630from
guihecheng:HDDS-6428
Mar 29, 2022
Merged

HDDS-6428. [Merge rocksdb in datanode] Add prefix iterator support to RDBTable.#3176
nandakumar131 merged 2 commits intoapache:HDDS-3630from
guihecheng:HDDS-6428

Conversation

@guihecheng
Copy link
Contributor

@guihecheng guihecheng commented Mar 8, 2022

What changes were proposed in this pull request?

Add prefix iterator support to RDBTable.

A sample from the metadata column family of the current per-container rocksdb instance as follows:

"#BCSID" -> 9
"#BLOCKCOUNT" -> 1
"#BYTESUSED" -> 10240000

After rocksdb-merge, it will be like:

"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004#BCSID" -> 9
"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004#BLOCKCOUNT" -> 1
"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004#BYTESUSED" -> 10240000

Those chars as a prefix are the containerID encoded in IOS-8859-1(An extension of the ASCII charset).
After rocksdb-merge, every container on the disk will get all its metadata KV pairs in a per-disk rocksdb instance.
When we want to access the metadata KV of a specific container, we have to do a seek operation in the column family to the right postion.
Here we use a fixed-length prefix(an encoded containerID) so as to utilize the Prefix Seek feature of rocksdb to speed up the seek operation, FYI: https://github.com/facebook/rocksdb/wiki/Prefix-Seek.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6428

How was this patch tested?

new ut added.

@guihecheng guihecheng force-pushed the HDDS-6428 branch 2 times, most recently from 07348c4 to 7109588 Compare March 8, 2022 11:25
@mukul1987
Copy link
Contributor

@guihecheng can we please add an example of entries in the table which we expect ? And what is the need for fixed prefix length support, that will help in reviewing this further.

@guihecheng
Copy link
Contributor Author

@mukul1987 Oh, sure, I'll add more descriptions about the sample prefixed keys after rocksdb-merge and explain the need to have fixed length prefixes soon.

@guihecheng
Copy link
Contributor Author

@mukul1987 descriptions updated above, thanks~

Copy link
Contributor

@nandakumar131 nandakumar131 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the patch looks good @guihecheng. Thanks for working on this.

Comment on lines 85 to 86
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid Arrays.copyOf here as the value is not modified/passed outside. We can implement our own comparison logic instead.

Suggested change
(prefix == null || Arrays.equals(
Arrays.copyOf(rocksDBIterator.key(), prefix.length), prefix));
private static boolean startsWith(byte[] prefix, byte[] value) {
if (prefix==null)
return true;
if (value==null)
return false;
int length = prefix.length;
if (value.length < length)
return false;
for (int i=0; i<length; i++)
if (prefix[i] != value[i])
return false;
return true;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a good idea to prevent copy, thanks~

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to have prefixLength associated with RDBTable. Prefix is something which should be passed as an argument. It is not applicable at table level.

For getRangeKVs calls we can use MetadataKeyFilters.MetadataKeyFilter for passing prefix to be matched.

If we need the Iterator optimized prefix filter to be used, we can add a new getSequentialRangeKVs method which also takes prefix as an argument. (Since we cannot use MetadataKeyFilters.MetadataKeyFilter as prefix filter for Iterator creation)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is better to prevent a prefixLength set at the table level.

Here I tried to have unified interface for different schemas, so I'll try to use the MetadataKeyFilters.MetadataKeyFilter to pass prefix but not introducing new method dedicated for schema v3.

Copy link
Contributor Author

@guihecheng guihecheng Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK...I found it not so easy to pass then extract prefix using the MetadataKeyFilters.MetadataKeyFilter, because the KeyFilter interface doesn't offer convenient interface for that and hard to refactor. And also we haven't introduced the fixed-length prefix stuff for schemaV3 keys and we can't assume the char encoding in general purpose classes like MetadataKeyFilters.MetadataKeyFilter or RDBTable.

I'm trying to add a new parameter prefix to the getRangeKVs & getSequentialRangeKVs interfaces.
I feel that it is painful to do so, but I would like to have abstracted interfaces for all schemas instead of if-else checks in each call place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not set prefix length at Table level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

markgui added 2 commits March 23, 2022 17:50
@guihecheng
Copy link
Contributor Author

Hi @nandakumar131 , I'm sorry that I have to do a force push since I did a rebase on HDDS-3630, because the new changes will conflict with the merged HDDS-6404.

Here I did remove the setFixedPrefixLength at table level, but to pass the necessary prefix to the rocksdb, I have to add a new parameter to the interfaces getRangeKVs and getSequentialRangeKVs .
Originally I tends to extract the prefix in the startKey with the prefixLength, but it is not good.
And I tried to extract the prefix from the KeyFilters, but the prefix should go through a proper the codec classes in TypedTable, but the KeyFilter has its own way to manage positive and negative prefixes, which is hard to refactor.

Copy link
Contributor

@nandakumar131 nandakumar131 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @guihecheng for updating the PR.

@nandakumar131 nandakumar131 merged commit b22dfd7 into apache:HDDS-3630 Mar 29, 2022
guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Apr 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants