HDDS-6312. Use KeyPrefixContainer table to accelerate the process of DELETE/UPDATE events #3082

smengcl merged 13 commits into apache:master
Conversation
@adoroszlai @avijayanhwx @ferhui Could you help to review this PR?

@JacksonYao287 Could you please help review?

@avijayanhwx Could you help to check this issue? In a small cluster, the connection from Recon to OM is quite normal, with a 10-minute interval. Please ignore the content in the following block; each request is from Recon to OM to get delta updates. But in a big cluster, since the process is too slow, the interval between these connections can be far longer than 10 minutes.

Thanks @symious for the patch and sorry for the long delay in review. Can you please resolve merge conflicts?

@adoroszlai Thanks for the review. Updated the patch, please have a check.

Thanks @symious for the patch. There's another patch causing minor conflicts. Would you merge the latest master into the PR branch again? I have resolved the conflict on my branch (I tried to push to your PR branch but got no permission), you can take this as a reference: https://github.com/smengcl/hadoop-ozone/blob/HDDS-6312/hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/ContainerKeyMapperTask.java#L249-L275

@smengcl Updated the PR, please have a look.
```java
// When reading from byte[], we can always expect to have the key, version
// and containerId parts in the byte array.
byte[] keyBytes = ArrayUtils.subarray(rawData,
    0, rawData.length - Long.BYTES * 2 - 2);
```
We are encoding and decoding the DB entry value manually, with error-prone offset calculations here. But since ContainerKeyPrefixCodec was already doing this, this should be acceptable.

I wonder if it makes sense to just add a new proto message type, so we can leverage protobuf instead? If it is worth it I think we should, depending on how much more efficient (in space and time) it can be if we switch to protobuf.

Just trying to open up a discussion here. IMO it is fine to use manual encode/decode here to make it consistent with the containerKeyTable we already have (see the sketch after this thread). Maybe in a future jira we can switch both tables to use protobuf to serialize the persisted DB value and call those tables _V2.
Noted with thanks, I'll try to create a new ticket for this transition.
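Purely for context, here is a self-contained sketch of the kind of manual, offset-based decoding being discussed. The assumed value layout (keyPrefix bytes, a one-byte delimiter, an 8-byte version, another delimiter, an 8-byte containerId) and all class and method names are illustrative assumptions, not the exact codec in this patch:

```java
import java.nio.charset.StandardCharsets;

import com.google.common.primitives.Longs;
import org.apache.commons.lang3.ArrayUtils;

// Illustrative decoder, assuming
// rawData = keyPrefix | delim(1) | version(8) | delim(1) | containerId(8).
public final class KeyPrefixContainerDecodeSketch {
  public static void decode(byte[] rawData) {
    // Key prefix: everything before the two longs and two delimiter bytes.
    byte[] keyBytes = ArrayUtils.subarray(rawData,
        0, rawData.length - Long.BYTES * 2 - 2);
    String keyPrefix = new String(keyBytes, StandardCharsets.UTF_8);

    // Version: the 8 bytes between the two delimiters.
    byte[] versionBytes = ArrayUtils.subarray(rawData,
        rawData.length - Long.BYTES * 2 - 1,
        rawData.length - Long.BYTES - 1);
    long keyVersion = Longs.fromByteArray(versionBytes);

    // Container ID: the trailing 8 bytes.
    byte[] containerIdBytes = ArrayUtils.subarray(rawData,
        rawData.length - Long.BYTES, rawData.length);
    long containerId = Longs.fromByteArray(containerIdBytes);

    System.out.printf("keyPrefix=%s version=%d containerId=%d%n",
        keyPrefix, keyVersion, containerId);
  }
}
```

The fused offset arithmetic is exactly what makes this style error-prone, which motivates the protobuf idea above.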
…/spi/ReconContainerMetadataManager.java Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
…/spi/impl/ReconContainerMetadataManagerImpl.java Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
…/spi/impl/ReconContainerMetadataManagerImpl.java Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
@smengcl Updated the patch, please have a look.

@smengcl Thank you for the review.

@smengcl Do you have any suggestions on the failed unit test?
The failure looks related at a first glance: https://github.com/apache/ozone/runs/8008158105

```
Error: org.apache.hadoop.ozone.recon.recovery.TestReconOmMetadataManagerImpl.testUpdateOmDB  Time elapsed: 0.249 s  <<< FAILURE!
java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at org.apache.hadoop.ozone.recon.recovery.TestReconOmMetadataManagerImpl.testUpdateOmDB(TestReconOmMetadataManagerImpl.java:139)
```

But line 139 in TestReconOmMetadataManagerImpl.java doesn't really match any meaningful code on your branch, hmm. I ran the same test locally and it passed. Retriggering the CI.
Thanks @symious for the patch. Thanks @adoroszlai for the review.

@symious , question -- what is the reason for adding KeyPrefixContainer?

@szetszwo It is to solve the slowness of processing.

@symious , I understand that this is to fix the slowness. Well done! My question is why we cannot reuse ContainerKeyPrefix.
The slowness was coming from the operation of "getContainerForKeyPrefixes": since the existing containerKeyTable is keyed by containerId first, finding the containers for a given key prefix required scanning the whole table. With the help of "KeyPrefixContainer", the operation of "getContainerForKeyPrefixes" will be quite fast; see the sketch below.
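To make the mechanism concrete, here is a generic, runnable sketch using plain Java collections (not the Recon API; the string key layout and all names are invented). Once keyPrefix is the leading component of a sorted table's key, all matching entries are contiguous, so a seek plus a short range scan replaces the full-table scan:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Generic illustration of prefix-seek vs. full scan in a sorted table.
public final class PrefixSeekDemo {
  public static void main(String[] args) {
    // Pretend key layout: "<keyPrefix>/<version>/<containerId>" -> count.
    NavigableMap<String, Integer> keyContainerTable = new TreeMap<>();
    keyContainerTable.put("vol/bucket/key1/0/1001", 1);
    keyContainerTable.put("vol/bucket/key1/0/1002", 1);
    keyContainerTable.put("vol/bucket/key2/0/1003", 1);

    String prefix = "vol/bucket/key1/";
    // Seek to the first entry at or after the prefix, then stop once past it;
    // only the matching range is visited, never the whole table.
    keyContainerTable.tailMap(prefix, true).entrySet().stream()
        .takeWhile(e -> e.getKey().startsWith(prefix))
        .forEach(e -> System.out.println("matched: " + e.getKey()));
  }
}
```

RocksDB iterators support the same seek-then-scan pattern, which is what makes the new table's key ordering effective.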
@symious , sorry, I still cannot understand. I probably have not asked the right question. These two classes (shown below) look the same. Why do we need to add KeyPrefixContainer?

```java
public class ContainerKeyPrefix {
  private long containerId;
  private String keyPrefix;
  private long keyVersion = -1;
}
```

```java
public class KeyPrefixContainer {
  private String keyPrefix;
  private long keyVersion = -1;
  private long containerId = -1;
}
```

How does it help?
@szetszwo Oh, I got your question. I think I created a new class for readability. It is kind of confusing to have one class serve two tables with different key orderings. It's been a while since this change, but I remember I was using the same class for the implementation at first, and I was getting confused when dealing with the two tables with different orderings of containerId and keyPrefix. By using a different class, it's easier to understand and write the logic.
@symious , If the code is the same, it is better to reuse it.

```java
// SCMMetadataStoreImpl
private Table<BigInteger, X509Certificate> validCertsTable;
private Table<BigInteger, X509Certificate> validSCMCertsTable;
private Table<BigInteger, X509Certificate> revokedCertsTable;
```

For example, we won't create a different class for each of these tables.
Reasonable to me, also.
```java
@Override
public int hashCode() {
  return Objects.hash(containerId, keyPrefix, keyPrefix);
}
```
@symious , another question -- the repeated keyPrefix, keyPrefix looks like a typo. For fixing it, we will replace the last keyPrefix with keyVersion as below.

```java
return Objects.hash(containerId, keyPrefix, keyVersion);
```

However, the hash function will be changed. Now the question is -- is it an incompatible change? Has the hash value been used or written anywhere in the DB?
Oh, thanks for spotting that.

I didn't find any persistence of the hash value. The only place I found it being used is in the "getContainerForKeyPrefixes" function, which should not cause incompatibilities.
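For reference, a minimal sketch of the corrected hashCode, paired with a matching equals, for a class shaped like KeyPrefixContainer; the constructor and field set are assumptions based on the fields quoted earlier, not the exact class in the patch:

```java
import java.util.Objects;

// Sketch of KeyPrefixContainer with the typo fixed: hash keyVersion
// instead of repeating keyPrefix. Since the hash is only used in-memory
// (never persisted), changing it is a compatible fix.
public final class KeyPrefixContainer {
  private final String keyPrefix;
  private final long keyVersion;
  private final long containerId;

  public KeyPrefixContainer(String keyPrefix, long keyVersion, long containerId) {
    this.keyPrefix = keyPrefix;
    this.keyVersion = keyVersion;
    this.containerId = containerId;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof KeyPrefixContainer)) {
      return false;
    }
    KeyPrefixContainer that = (KeyPrefixContainer) o;
    return containerId == that.containerId
        && keyVersion == that.keyVersion
        && Objects.equals(keyPrefix, that.keyPrefix);
  }

  @Override
  public int hashCode() {
    return Objects.hash(containerId, keyPrefix, keyVersion);
  }
}
```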
What changes were proposed in this pull request?
Recon stores the mapping of ContainerKeyPrefix in a local RocksDB. When Recon is applying DELETE or UPDATE events from OM, it searches the whole table for each to-be-deleted record.

In a big cluster, the record count in this table can be very large, and the search loop for each record is very slow. In our cluster there are 90 million records and each loop costs over 70 seconds; if a batch of delta OM events has 100 DELETE or UPDATE events, it takes about two hours to apply them.
This ticket is to accelerate the process with the help of a new local table.
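As a rough illustration of the scheme (plain Java maps, hypothetical names; the real patch uses RocksDB tables), the new reverse table turns a DELETE event into a direct lookup instead of a scan over every containerKeyTable entry:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical two-table sketch: a reverse index from keyPrefix to the
// containers holding it makes key DELETE events cheap.
public final class DeleteEventSketch {
  // Old direction: "containerId/keyPrefix" -> key count.
  static final Map<String, Integer> containerKeyTable = new HashMap<>();
  // New direction: keyPrefix -> containerIds holding that key.
  static final Map<String, Set<Long>> keyPrefixContainerTable = new HashMap<>();

  static void put(long containerId, String keyPrefix) {
    containerKeyTable.put(containerId + "/" + keyPrefix, 1);
    keyPrefixContainerTable.computeIfAbsent(keyPrefix, k -> new TreeSet<>())
        .add(containerId);
  }

  static void onDeleteEvent(String keyPrefix) {
    // Direct lookup replaces scanning the whole containerKeyTable.
    Set<Long> containers = keyPrefixContainerTable.remove(keyPrefix);
    if (containers != null) {
      for (long containerId : containers) {
        containerKeyTable.remove(containerId + "/" + keyPrefix);
      }
    }
  }

  public static void main(String[] args) {
    put(1001L, "vol/bucket/key1");
    put(1002L, "vol/bucket/key1");
    onDeleteEvent("vol/bucket/key1");
    System.out.println("remaining entries: " + containerKeyTable.size()); // 0
  }
}
```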
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6312
How was this patch tested?
Unit test.