Swarm introduces linear performance degradation as #networks or #containers increases #1752

Open
ezrasilvera opened this Issue Feb 3, 2016 · 10 comments

Contributor

ezrasilvera commented Feb 3, 2016

It seems that Swarm introduces linear performance degradation as the number of networks and/or the number of containers increases. I was focusing mainly on starting a container and creating a network, but this issue affects many other commands, like network delete, container delete, etc.

I made some profiling analysis and I believe that the source of this performance problem is the usage of RefreshNetworks() and RefreshContainers().

I think there are a few issues with the current design/implementation:

- Duplication: when we initiate "create container", for example, Swarm explicitly calls refreshNetwork; however, when the create-container event is received, Swarm initiates another refresh on that specific engine.
- Redundancy: when creating a container or a network, Swarm calls refreshNetwork, which goes over all engines and all networks on each engine. When we create a resource (container or network) we could instead do a focused discovery (e.g., during createNetwork, send a ListNetworks only for the ID of the newly created network).

As Docker now supports overlay networks the situation becomes much worse as each engine may see many networks!

I also ran some tests that eliminate some of the redundant refresh calls, and indeed most of the Swarm overhead was removed.

A real fix will probably require some more investigation, as the relevant functions are currently also used in situations where you do want to discover everything (like a Swarm restart).

[graph attachment: swarm-nw]
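To make the scaling concrete, here is a minimal sketch (invented types, not the actual Swarm code) contrasting a full RefreshNetworks()-style re-list against a focused, single-ID refresh. With E engines and N networks per engine, the full re-list does O(E·N) work on every create, while the focused variant stays constant per operation:

```go
// Sketch only: fakeEngine is a stand-in for a Swarm-managed Docker engine;
// return values model the size of the ListNetworks payload per call.
package main

import "fmt"

type fakeEngine struct {
	name     string
	networks map[string]string // network ID -> name
	apiCalls int               // counts simulated ListNetworks round-trips
}

// refreshAllNetworks mimics today's RefreshNetworks(): re-list everything,
// so the payload grows with the total number of networks on the engine.
func (e *fakeEngine) refreshAllNetworks() int {
	e.apiCalls++
	return len(e.networks)
}

// refreshOneNetwork mimics a focused refresh: list only the given ID,
// so the payload is constant regardless of how many networks exist.
func (e *fakeEngine) refreshOneNetwork(id string) int {
	e.apiCalls++
	if _, ok := e.networks[id]; ok {
		return 1
	}
	return 0
}

func main() {
	e := &fakeEngine{name: "node-1", networks: map[string]string{}}
	for i := 0; i < 1000; i++ {
		e.networks[fmt.Sprintf("nw-%d", i)] = fmt.Sprintf("net%d", i)
	}
	fmt.Println("full refresh payload:", e.refreshAllNetworks())          // 1000
	fmt.Println("focused refresh payload:", e.refreshOneNetwork("nw-42")) // 1
}
```

Multiply the full-refresh payload by the number of engines (Swarm refreshes each one) and by two (the duplication point above), and the linear degradation in the graph follows.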

@amitshukla amitshukla added this to the 1.1.1 milestone Feb 3, 2016


Contributor

ezrasilvera commented Feb 3, 2016

@abronan I'm not sure this is a network-specific issue; I think it's a wider issue related to the overall Swarm "discovery" mechanism. The ability to create user-defined networks just brought the issue to the surface as the number of networks increases.
However, we also see similar behavior related to the number of containers.


Contributor

vieux commented Feb 3, 2016

@ezrasilvera thanks for the comprehensive report.

We are definitely concerned about this and we want to improve performance.
I agree with you that refreshNetworks() is the biggest issue: networks are the only place where an API call on one engine can affect other engines without Swarm knowing about it.

That's why we are currently doing a refresh on every engine, and that's really not ideal.

We are about to release 1.1.0 (tomorrow), but we are going to work on this very soon and release a 1.1.1 as soon as we reach a significant improvement.

Do you mind sharing the branch you used for the reduced refreshNW tests?


Contributor

ezrasilvera commented Feb 3, 2016

@vieux sure, will do.
I agree that networks have a cross-engine effect. However, when you handle a specific command (i.e., create container or network), you can limit the discovery. For example, we could change RefreshNetworks on the engine to accept an ID/filter, so that we do ListNetworks only for the network ID that was just created. And when creating/starting a container, you only need to update the container list linked to that container's network.
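A rough sketch of that proposal (hypothetical signature, not the real Swarm/engine API): the filter is applied server-side in the real remote API, but the shape of the change is the same — no filter means a full listing (needed on a Swarm restart), an ID filter means a constant-size listing after a create:

```go
// Sketch only: listNetworks stands in for the engine's ListNetworks call,
// extended to accept optional network IDs as a filter.
package main

import "fmt"

type network struct{ ID, Name string }

func listNetworks(all []network, ids ...string) []network {
	if len(ids) == 0 {
		return all // no filter: full discovery, as RefreshNetworks does today
	}
	want := map[string]bool{}
	for _, id := range ids {
		want[id] = true
	}
	var out []network
	for _, n := range all {
		if want[n.ID] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	engineNetworks := []network{{"n1", "bridge"}, {"n2", "overlay-a"}, {"n3", "overlay-b"}}
	// After "network create" returned ID n3, refresh just that entry:
	fmt.Println(len(listNetworks(engineNetworks, "n3"))) // 1
	// On a full resync (e.g. Swarm restart), refresh everything:
	fmt.Println(len(listNetworks(engineNetworks))) // 3
}
```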


Contributor

vieux commented Feb 3, 2016

@ezrasilvera agreed. Also, engine 1.10 (to be released tomorrow) adds events for network creation/deletion/connect/disconnect.

This removes the need for most of the refreshNetworks() calls.
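The event-driven alternative can be sketched as follows (simplified stand-in types, not Swarm's real event or cache structures): instead of re-listing all networks on every change, consume the engine's event stream and patch the cached state for just the network named in each event:

```go
// Sketch only: apply patches a per-network membership cache from a single
// event, so no full ListNetworks round-trip is needed per change.
package main

import "fmt"

type event struct {
	Action string // "create", "destroy", "connect", "disconnect"
	NetID  string
}

// networkCache maps a network ID to the set of connected container IDs.
type networkCache map[string]map[string]bool

func (c networkCache) apply(ev event, containerID string) {
	switch ev.Action {
	case "create":
		c[ev.NetID] = map[string]bool{}
	case "destroy":
		delete(c, ev.NetID)
	case "connect":
		if set, ok := c[ev.NetID]; ok {
			set[containerID] = true
		}
	case "disconnect":
		if set, ok := c[ev.NetID]; ok {
			delete(set, containerID)
		}
	}
}

func main() {
	cache := networkCache{}
	cache.apply(event{"create", "nw1"}, "")
	cache.apply(event{"connect", "nw1"}, "c1")
	fmt.Println(len(cache), len(cache["nw1"])) // 1 1
	cache.apply(event{"disconnect", "nw1"}, "c1")
	cache.apply(event{"destroy", "nw1"}, "")
	fmt.Println(len(cache)) // 0
}
```

A full refresh would still be needed at startup or after a missed/dropped event, which matches the earlier point that the existing refresh paths are also used when you do want to discover everything.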


Contributor

chanwit commented Feb 4, 2016

I think this relates to #1750, with a better and nicer graph here.


Contributor

ezrasilvera commented Feb 4, 2016

@chanwit I don't think those issues are related. This issue is not specific to the latest version and will not be noticed with a single network; you need to create multiple (many! :-) ) networks in order to see it.


Contributor

chanwit commented Feb 4, 2016

> You need to create multiple (many :-) ) networks in order to see that.

@ezrasilvera I see, that is the point. When you create new networks, the cluster store needs to handle additional load. I still think that's the cause; we saw it clearly in our lab, where the cluster store's CPU was spiking.

However, I'll also keep looking for the root cause.


Contributor

ezrasilvera commented Feb 4, 2016

@chanwit I fully agree that the cluster store also contributes to performance degradation, and we have seen that as well: intensive CPU utilization and traffic.
However, for this issue I think we actually know the root cause, and I believe we can eliminate most of the overhead, especially since, as @vieux said, we can now listen to network events and get rid of refreshNetwork().
I would keep a separate issue directly related to the cluster store, as there might be some unrelated problems there as well.


Contributor

chanwit commented Feb 4, 2016

@ezrasilvera thanks! I'll keep following this issue, as it would solve my problem too :)


Contributor

ezrasilvera commented Feb 16, 2016

I tested again with the latest code, which includes @vieux's changes from #1777. The results are better, but there is still a gap compared to the pure engine. I'm going to analyze the root cause later on.

[graph attachment: create_nw_v1]
