Skip to content

HDDS-5126. Recon should check new containers of a container report with batch#2172

Merged
avijayanhwx merged 1 commit intoapache:masterfrom
JacksonYao287:HDDS-5126
May 10, 2021
Merged

HDDS-5126. Recon should check new containers of a container report with batch#2172
avijayanhwx merged 1 commit intoapache:masterfrom
JacksonYao287:HDDS-5126

Conversation

@JacksonYao287
Copy link
Contributor

What changes were proposed in this pull request?

in my test environment, 400000 containers exist. when bootstrap a new recon, every container will be checked and added to recondb.but , for now , recon check all the containers in a container report one by one , and each check will take a rpc call to scm. this is too slow and in my test environment , it leads to recon oom, because too many containers to be consumed are waitting in the message queue . It is better for recon to check new containers of a container report with batch.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5126

Please replace this section with the link to the Apache JIRA)

How was this patch tested?

unit test

@JacksonYao287 JacksonYao287 force-pushed the HDDS-5126 branch 4 times, most recently from c442ab3 to ac893dd Compare April 22, 2021 12:00
@avijayanhwx avijayanhwx self-requested a review April 22, 2021 14:12
@JacksonYao287
Copy link
Contributor Author

@avijayanhwx @GlenGeng please take a review, thx!

@bshashikant
Copy link
Contributor

@avijayanhwx , can you please take a look at this?

Copy link
Contributor

@avijayanhwx avijayanhwx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @JacksonYao287. Approach looks good to me, I have couple of comments inline.

But, I still think this is not a scalable solution. We have to boostrap Recon with the SCM DB on startup (HDDS-2852) to solve this problem as well as something like HDDS-4355.

@JacksonYao287
Copy link
Contributor Author

@avijayanhwx thanks for your review!

But, I still think this is not a scalable solution. We have to boostrap Recon with the SCM DB on startup (HDDS-2852) to solve this problem as well as something like HDDS-4355.

yes, I agree. but this patch can Speed ​​up processing the container report for recon. Batch is faster than one-by-one. when the number of containers in an ozone is not particularly large, i think this will have a Significant improvement.

BTW, I will try to complete HDDS-4177.Bootstrap Recon SCM Container DB later.

@JacksonYao287 JacksonYao287 force-pushed the HDDS-5126 branch 2 times, most recently from ee5799a to 5a95af5 Compare April 25, 2021 02:59
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stream().filter() may be simpler and more straightforward here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here , I want to divide "containerReplicaProtoList" into two parts , and both of the two parts will be used in the subsequent code。if i use "stream().filter()" instead, only one of the two parts got, so I have to call "stream().filter()" twice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these codes can be simpler, seems too redundant for lambda here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we finally adopt the batch solution, can we drop this singular method ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the review! maybe we can maintain it here for the present, and remove it later in another patch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original check is kind of optimistic locking here?

Copy link
Contributor Author

@JacksonYao287 JacksonYao287 Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@symious Thanks for the review! "addNewContainer" will call "ContainerStateMap#contains" ,and this is what "containerExist" exactly does, so no need here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still some steps that need to be taken before addNewContainer calls ContainerStateMap$contains, with the contains check here, the performance would be better.
Besides, in ContainerStateManagerImpl, the writeLock would be required to check if the container exists, it's not reasonable to involve the lock here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in my opinion ,"check the existence and add" should be an atomic operation( just like CAS operation), and actually this is what ContainerStateManagerImpl#addContainer does (as you mentioned that a write lock is involved)

lock.writeLock().lock();

try {
if (!containers.contains(containerID)) {
 add it;
}  finally {
lock.writeLock().unlock();
}

so, if we call "containerExist" before "addNewContainer", it seem like that we make a compare operation before an atomic CAS operation, and it seems redundant.

There are still some steps that need to be taken before addNewContainer calls ContainerStateMap$contains

yes, so the later we call ContainerStateMap$contains , the better result we may get, because the container existence maybe changed within the time window, in which these steps is executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"check the existence and add" isn't meant to be an atomic operation here, it's just used to avoid some unnecessary operations if the other clients have updated the containerId already.

@JacksonYao287 JacksonYao287 force-pushed the HDDS-5126 branch 3 times, most recently from 6aae097 to beb21e0 Compare April 27, 2021 11:45
@adoroszlai adoroszlai changed the title HDDS-5126.Recon should check new containers of a container report wit… HDDS-5126. Recon should check new containers of a container report with batch Apr 27, 2021
@JacksonYao287 JacksonYao287 reopened this Apr 29, 2021
@JacksonYao287
Copy link
Contributor Author

@avijayanhwx @symious @GlenGeng please take a review.
I rewrite this pr , there a two big changes :
1 add a new SCM service "getExistContainerWithPipelinesInBatch" , which will ask scm that which containers in the given list can be found by scm and will never throw an exception at client-side. "ReconContainerManager#checkAndAddNewContainerBatch" now use this method instead of "getContainerWithPipelineBatch" , which may throw an exception and drop the whole reports in a heartbeat.

2 fix a bug. for now , a closed container may be reported by datanode, and the pipeline of this container may be closed , recon can not sync this pipeline info from scm. so recon should have this ability and a no-open container to it's containerStateMap without the corresponding pipeling info.

Copy link
Contributor

@avijayanhwx avijayanhwx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with getting this patch as it is now, but would like to revisit/remove it when we do SCM DB boostrap on Recon. Generally, I am not for adding ad hoc APIs in SCM to support Recon. Slowly, the SCM API layer will become harder to maintain. @symious / @GlenGeng if you are ok with this approach as well, I can get this in.

@symious
Copy link
Contributor

symious commented May 4, 2021

@avijayanhwx Thanks for the review. I think we can keep the following check.

      if (!containerExist(containerID)) {
        addNewContainer(containerID.getId(), containerWithPipeline);
      }

Since the inner check @JacksonYao287 mentioned would involve additional write lock operations.

@JacksonYao287
Copy link
Contributor Author

@avijayanhwx Thanks for the review. I think we can keep the following check.

      if (!containerExist(containerID)) {
        addNewContainer(containerID.getId(), containerWithPipeline);
      }

Since the inner check @JacksonYao287 mentioned would involve additional write lock operations.

@avijayanhwx please take a look, what is your opinion?

@JacksonYao287 JacksonYao287 requested a review from avijayanhwx May 9, 2021 03:24
Copy link
Contributor

@avijayanhwx avijayanhwx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a TODO in HDDS-4177 to revisit this logic. For now, I am +1 with this approach since it is causing an OOM issue.

@avijayanhwx avijayanhwx merged commit a16e648 into apache:master May 10, 2021
@avijayanhwx
Copy link
Contributor

Thanks for working on this @JacksonYao287 and the reviews @symious, @GlenGeng.

@JacksonYao287 JacksonYao287 deleted the HDDS-5126 branch July 20, 2021 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants