
Having large numbers of networks can cause node allocation to fail #2655

Open
dperny opened this issue Jun 4, 2018 · 2 comments
dperny (Collaborator) commented Jun 4, 2018

Since 9fa9ce1, nodes are now allocated with a network attachment for every network.

Unfortunately, this can cause the size of a Node object to exceed the maximum raft message size, which prevents it from being committed to the object store.

From moby/moby#36792

Forgot to mention: the same error was triggered by a different action than the one initially reported in this issue. In my case it appeared when creating ~1700 networks and deploying one service on each network.

Jun 01 00:39:52 ip-172-16-0-128 dockerd[1284]: time="2018-06-01T00:39:52.220386603Z" level=error msg="Failed to commit allocation of network resources for node rul9pnxcc2hpj3o7eya1redpk" error="raft: raft message is too large and can't be sent" module=node node.id=281crvu

There are many possible fixes, including disallowing too many attachments, but one way or another we must avoid producing raft messages this large.
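To make the failure mode concrete, here is a minimal Go sketch of why per-network attachments make the serialized Node object grow linearly until it crosses a raft message size limit. All names, sizes, and the limit itself are assumptions for illustration, not swarmkit's actual values or code:

```go
package main

import "fmt"

// maxRaftMsgSize is a hypothetical raft message size limit (the real
// limit depends on the raft configuration in use).
const maxRaftMsgSize = 128 * 1024

// approxAttachmentSize is an assumed rough serialized size per network
// attachment (network ID, addresses, driver state).
const approxAttachmentSize = 256

// nodeObjectSize estimates the serialized size of a Node object that
// carries one attachment per network: a fixed base plus a linear term.
func nodeObjectSize(baseSize, attachments int) int {
	return baseSize + attachments*approxAttachmentSize
}

func main() {
	base := 4 * 1024 // assumed base Node object size
	for _, n := range []int{10, 500, 1700} {
		size := nodeObjectSize(base, n)
		fmt.Printf("%d attachments -> ~%d bytes (over limit: %v)\n",
			n, size, size > maxRaftMsgSize)
	}
}
```

Under these assumed numbers, a node in a cluster with ~1700 networks serializes to several hundred kilobytes, well past the hypothetical limit, which matches the "raft message is too large" error in the log above.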

/cc @ctelfer

@ddtmachado

Hey @dperny, commit 9fa9ce1 also degraded the performance of Docker managers and of service creation in general, in the situation described by @eduardolundgren.

I can share more precise values, steps, and the details of the stress test on the next run, but right now I see roughly a 300% increase in CPU usage on the swarm leader (average use was 15% CPU while running 17.09, versus 50% on versions after that commit, such as 17.12, 18.03, and 18.05).

The basic scenario was a swarm cluster of 3 managers and 12 workers, with a script that created 2000 networks and 2000 services, each service using one of those networks.

@eduardolundgren

@dperny any update on this issue?
