HDDS-5126. Recon should check new containers of a container report with batch #2172
avijayanhwx merged 1 commit into apache:master
@avijayanhwx @GlenGeng please take a look, thanks!
@avijayanhwx, can you please take a look at this?
avijayanhwx left a comment
Thank you for working on this @JacksonYao287. The approach looks good to me; I have a couple of comments inline.
But I still think this is not a scalable solution. We have to bootstrap Recon with the SCM DB on startup (HDDS-2852) to solve this problem, as well as something like HDDS-4355.
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconContainerManager.java
...p-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/scm/TestReconContainerManager.java
@avijayanhwx thanks for your review!
Yes, I agree. But this patch can speed up container report processing in Recon, since checking in a batch is faster than checking one by one. When the number of containers in an Ozone cluster is not particularly large, I think this will be a significant improvement. BTW, I will try to complete HDDS-4177 (Bootstrap Recon SCM Container DB) later.
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconContainerManager.java
stream().filter() may be simpler and more straightforward here.
Here I want to divide "containerReplicaProtoList" into two parts, and both parts will be used in the subsequent code. If I use "stream().filter()" instead, I only get one of the two parts, so I would have to call "stream().filter()" twice.
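For reference, a one-pass split is also possible with streams via `Collectors.partitioningBy`, which yields both parts at once. This is only a minimal sketch, not the code in this patch: the class and method names are hypothetical, and plain `Long` IDs stand in for the replica protos.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: split reported IDs into "already known" and "new" in one pass.
class PartitionSketch {

  // Key true -> IDs already known locally, key false -> IDs that are new.
  static Map<Boolean, List<Long>> splitByKnown(List<Long> reported, Set<Long> known) {
    return reported.stream()
        .collect(Collectors.partitioningBy(id -> known.contains(id)));
  }

  public static void main(String[] args) {
    Set<Long> known = new HashSet<>(Arrays.asList(1L, 2L));
    Map<Boolean, List<Long>> parts =
        splitByKnown(Arrays.asList(1L, 2L, 3L, 4L), known);
    System.out.println("existing: " + parts.get(true));
    System.out.println("new: " + parts.get(false));
  }
}
```

The resulting map always contains both the `true` and the `false` key, so neither part needs a null check.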
I think this code can be simpler; the lambda seems too redundant here.
If we finally adopt the batch solution, can we drop this singular method?
Thanks for the review! Maybe we can keep it here for now and remove it later in another patch.
I think the original check is kind of optimistic locking here?
@symious Thanks for the review! "addNewContainer" will call "ContainerStateMap#contains", which is exactly what "containerExist" does, so there is no need for it here.
There are still some steps that need to be taken before addNewContainer calls ContainerStateMap$contains; with the contains check here, the performance would be better.
Besides, in ContainerStateManagerImpl, the writeLock would be required to check if the container exists, so it's not reasonable to involve the lock here.
In my opinion, "check the existence and add" should be an atomic operation (just like a CAS operation), and this is actually what ContainerStateManagerImpl#addContainer does (as you mentioned, a write lock is involved):
lock.writeLock().lock();
try {
  if (!containers.contains(containerID)) {
    // add the container
  }
} finally {
  lock.writeLock().unlock();
}
So if we call "containerExist" before "addNewContainer", it is like making a compare operation before an atomic CAS operation, which seems redundant.
"There are still some steps that need to be taken before addNewContainer calls ContainerStateMap$contains"
Yes, so the later we call ContainerStateMap$contains, the better the result we may get, because the container's existence may change within the time window in which these steps are executed.
"Check the existence and add" isn't meant to be an atomic operation here; it's just used to avoid some unnecessary operations if other clients have already updated the containerId.
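The two positions in this thread can be illustrated side by side. The sketch below is not Ozone code: the class, its fields, and the read-lock `contains` are assumptions made for illustration. It keeps the atomic check-and-add under a write lock and puts a cheap pre-check in front of it, so that the preparatory steps (for example an RPC) are skipped when the container is already known.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of "cheap pre-check, then atomic check-and-add".
class ContainerRegistrySketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Set<Long> containers = new HashSet<>();

  // Counts how often the (simulated) expensive preparation ran.
  int expensivePreparations = 0;

  // Cheap existence check; assumed here to need only the read lock.
  boolean contains(long id) {
    lock.readLock().lock();
    try {
      return containers.contains(id);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Atomic "check and add": returns true only if the ID was newly added.
  boolean addIfAbsent(long id) {
    lock.writeLock().lock();
    try {
      return containers.add(id);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // The pre-check is only an optimization; correctness comes from addIfAbsent.
  boolean register(long id) {
    if (contains(id)) {
      return false; // fast path: skip the preparation entirely
    }
    expensivePreparations++; // stand-in for the steps taken before the add
    return addIfAbsent(id);
  }
}
```

In this sketch the pre-check can still race with a concurrent add, which is harmless: the worst case is one wasted preparation, and `addIfAbsent` still rejects the duplicate.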
...ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconContainerReportHandler.java
@avijayanhwx @symious @GlenGeng please take a look. 2. Fix a bug: currently, a closed container may be reported by a datanode, and the pipeline of this container may already be closed, so Recon cannot sync this pipeline info from SCM. So Recon should have this ability and a no-open container to it's
avijayanhwx left a comment
I am OK with getting this patch in as it is now, but would like to revisit/remove it when we do SCM DB bootstrap on Recon. Generally, I am not for adding ad hoc APIs in SCM to support Recon. Slowly, the SCM API layer will become harder to maintain. @symious / @GlenGeng if you are ok with this approach as well, I can get this in.
@avijayanhwx Thanks for the review. I think we can keep the following check, since the inner check @JacksonYao287 mentioned would involve additional write lock operations.
@avijayanhwx please take a look, what is your opinion?
avijayanhwx left a comment
I have added a TODO in HDDS-4177 to revisit this logic. For now, I am +1 with this approach since it is causing an OOM issue.
Thanks for working on this @JacksonYao287 and the reviews @symious, @GlenGeng.
What changes were proposed in this pull request?
In my test environment, 400,000 containers exist. When bootstrapping a new Recon, every container is checked and added to the Recon DB. Currently, Recon checks all the containers in a container report one by one, and each check takes an RPC call to SCM. This is too slow, and in my test environment it leads to a Recon OOM, because too many containers waiting to be consumed pile up in the message queue. It is better for Recon to check the new containers of a container report in a batch.
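The idea can be sketched as follows. This is a simplified illustration rather than the patch itself: `ContainerLookup` is a hypothetical stand-in for the SCM client, the batch size is an arbitrary parameter, and plain `Long` IDs replace the real container types. Unknown IDs are collected first and then looked up in fixed-size batches, so the number of round trips drops from one per container to one per batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: check unknown containers in batches instead of one by one.
class BatchCheckSketch {

  // Stand-in for a remote lookup; NOT the real SCM client API.
  interface ContainerLookup {
    // Returns the subset of ids the remote side knows about (one round trip).
    List<Long> getExisting(List<Long> ids);
  }

  // Cuts the list into consecutive chunks of at most batchSize elements.
  static List<List<Long>> toBatches(List<Long> ids, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<Long>> batches = new ArrayList<>();
    for (int i = 0; i < ids.size(); i += batchSize) {
      int end = Math.min(i + batchSize, ids.size());
      batches.add(new ArrayList<>(ids.subList(i, end)));
    }
    return batches;
  }

  // Adds newly confirmed containers to localDb and returns the number of remote calls.
  static int addNewContainers(List<Long> reported, Set<Long> localDb,
                              ContainerLookup remote, int batchSize) {
    List<Long> unknown = new ArrayList<>();
    for (Long id : reported) {
      if (!localDb.contains(id)) {
        unknown.add(id);
      }
    }
    int calls = 0;
    for (List<Long> batch : toBatches(unknown, batchSize)) {
      localDb.addAll(remote.getExisting(batch));
      calls++;
    }
    return calls;
  }

  public static void main(String[] args) {
    Set<Long> localDb = new HashSet<>(Arrays.asList(1L));
    int calls = addNewContainers(
        Arrays.asList(1L, 2L, 3L, 4L, 5L), localDb, ids -> ids, 2);
    System.out.println("remote calls: " + calls + ", known containers: " + localDb.size());
  }
}
```

With four unknown containers and a batch size of 2, the sketch makes two remote calls where a one-by-one check would make four.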
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5126
How was this patch tested?
unit test