
Set grpc max message size to 128MB. #2375

Merged
anshulpundir merged 1 commit into moby:master on Sep 19, 2017

Conversation

anshulpundir (Contributor)

Signed-off-by: Anshul Pundir anshul.pundir@docker.com

codecov bot commented Sep 15, 2017

Codecov Report

Merging #2375 into master will increase coverage by 0.09%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2375      +/-   ##
==========================================
+ Coverage   60.35%   60.45%   +0.09%     
==========================================
  Files         128      128              
  Lines       26267    26268       +1     
==========================================
+ Hits        15854    15880      +26     
+ Misses       9011     8993      -18     
+ Partials     1402     1395       -7

@nishanttotla (Contributor) left a comment

I'd like to know the full side effects of doing this. My hunch is that message encoding gets affected for larger message size limits. Also related: https://stackoverflow.com/questions/34128872/google-protobuf-maximum-size

Ping @stevvooe.

@wsong commented Sep 15, 2017

I'd also recommend that we make this configurable. The default value for this parameter can stay at 128MB or whatever, but it'd be nice if we read this value out of /etc/docker/daemon.json and passed it into the swarmkit code in the moby/moby repo.
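(Illustration only: this option was never added, and the names below are hypothetical.) The proposal would amount to reading a value from /etc/docker/daemon.json in moby/moby and passing it down when constructing the manager, falling back to a default:

// Hypothetical sketch; neither defaultMaxMsgSize nor a configurable limit exists in swarmkit.
const defaultMaxMsgSize = 128 << 20 // 128MB

// maxMsgSize picks the configured limit (e.g. read from daemon.json by the
// caller) or falls back to the default.
func maxMsgSize(configured int) int {
	if configured > 0 {
		return configured
	}
	return defaultMaxMsgSize
}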

@anshulpundir (Contributor, Author) commented Sep 15, 2017

I'd also recommend that we make this configurable. The default value for this parameter can stay at 128MB or whatever, but it'd be nice if we read this value out of /etc/docker/daemon.json and passed it into the swarmkit code in the moby/moby repo.

Agreed @wsong

I'd like to know the full side effects of doing this.

I'd like to also :)

My hunch is that message encoding gets affected for larger message size limits.

I think encoding and decoding depend on the actual message size and the structure of the data. For example, a deeply nested struct would be more costly to encode/decode. I don't think changing the limit affects encoding/decoding performance. @nishanttotla

@aluzzardi (Member)

The default value for this parameter can stay at 128MB or whatever, but it'd be nice if we read this value out of /etc/docker/daemon.json and passed it into the swarmkit code in the moby/moby repo.

I'm not sure about that - when moving to the "proper" solution (streaming), the config option won't make sense anymore.

I don't know if it makes sense to introduce a new option in a patch release just so we can deprecate it immediately after.

@@ -231,6 +231,7 @@ func New(config *Config) (*Manager, error) {
 	grpc.Creds(config.SecurityConfig.ServerTLSCreds),
 	grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
 	grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
+	grpc.MaxMsgSize(1024 * 1024 * 128),
Review comment (Member): nit: This could be a constant.

Review comment (Contributor): Use shift syntax: 128 << 20.
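Taken together, the two suggestions would make the option read roughly like the sketch below (the constant name is illustrative; whether the merged commit adopted it is not shown here):

// Sketch of the suggested change; grpcMaxMsgSize is a hypothetical name.
const grpcMaxMsgSize = 128 << 20 // 128MB

opts := []grpc.ServerOption{
	grpc.Creds(config.SecurityConfig.ServerTLSCreds),
	grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
	grpc.MaxMsgSize(grpcMaxMsgSize),
}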

@aluzzardi (Member)

Overall looks good

What is going to happen when we reach 128M? Any way we could log a proper error message?
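For context on the failure mode: when a message exceeds the configured limit, grpc-go fails the RPC with a ResourceExhausted status (the exact error text varies by version), which a caller can detect and log. A rough sketch using google.golang.org/grpc/status and google.golang.org/grpc/codes (the RPC call itself is hypothetical):

// Hypothetical call site; the error-handling pattern is the point here.
_, err := client.SomeRPC(ctx, req)
if s, ok := status.FromError(err); ok && s.Code() == codes.ResourceExhausted {
	// The message was larger than the configured max message size.
	log.Printf("gRPC message exceeded the size limit: %v", err)
}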

@nishanttotla (Contributor)

I think some log messages should be included to bubble up errors from hitting the gRPC limit (either in this PR or as an immediate follow-up).

If we increase the limit to 128MB, then is it worth the engineering effort to implement streaming in addition? If we do implement streaming later, then we'd want to bring the limit down again. This may cause backward compatibility issues.

I agree with @aluzzardi that if we introduce this option, it doesn't make sense to immediately deprecate it.

@anshulpundir (Contributor, Author)

We just met and agreed not to add a knob to adjust the max gRPC message size.
I will manually test the positive test case today and investigate adding an integration test for positive and negative scenarios.

@stevvooe (Contributor)

-1 on making this configurable.

If we increase the limit to 128MB, then is it worth the engineering effort to implement streaming in addition?

Yes, this is a horrible thing to have as a scaling limit and it needs to be addressed. Besides patching this as a workaround, the next focus should be actually fixing this to handle arbitrary snapshot sizes. Increasing the message size limit and calling it good is not an acceptable engineering solution. Increasing this limit exposes us to message floods on all services, not just the raft service.

Is there anything we can add to log messages that are over some threshold, so that we can have an idea of the level of impact this is having? Prom counters bucketed by size might be appropriate. Such large messages will have an effect on stability at around 10MB and up with protobuf. The allocations required for processing a 128MB message may cause stability issues. I'd really like to know if there are messages larger than 1MB transiting swarm GRPC endpoints, as those are likely causing "buffer bloat" and allocation spikes. We really need an easy way for users to understand when they are impacted by this problem.

@wsong commented Sep 18, 2017

Yeah, after meeting offline, I agree that this should not be configurable.

Adding a metric to track large (i.e. > 1MB) GRPC messages is definitely a good idea, but it should probably go in a separate PR. Logging the length of every single message before sending it might have some unforeseen performance implications.

@stevvooe (Contributor)

Logging the length of every single message before sending it might have some unforeseen performance implications.

Not really what I was suggesting:

if len(msg) > 1 << 20 {
	// only log messages larger than 1MB
}

Better would be to have bucketed counters (unrolled below):

if len(msg) >= 1 << 20 {
  gt1MB++
}

if len(msg) >= 4 << 20 {
  gt4MB++
}

if len(msg) >= 10 << 20 {
  gt10MB++
}

if len(msg) >= 16 << 20 {
  gt16MB++
}
...

Notice that this is a cumulative counter setup, rather than bucketed, which allows easy aggregates.
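Worth noting, since Prom counters are mentioned above: a Prometheus histogram gives exactly this cumulative-by-threshold behavior, because histogram buckets are cumulative. A sketch using github.com/prometheus/client_golang/prometheus (the metric name and instrumentation point are illustrative):

// Cumulative size buckets at 1MB, 4MB, 10MB, 16MB, 64MB and 128MB.
var msgSizeBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "grpc_message_size_bytes", // illustrative name
	Help:    "Observed size of gRPC messages, in bytes.",
	Buckets: []float64{1 << 20, 4 << 20, 10 << 20, 16 << 20, 64 << 20, 128 << 20},
})

func init() {
	prometheus.MustRegister(msgSizeBytes)
}

// On the send/receive path, wherever msg is available:
msgSizeBytes.Observe(float64(len(msg)))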

probably go in a separate PR.

Sure, but let's make sure it gets backported: right now, this condition is hard to diagnose and debug.

@anshulpundir (Contributor, Author)

Manually tested for now: positive test case with around 120MiB, negative with around 240MiB. Will explore the metrics and error reporting in a separate PR.

@andrewhsu (Member)

Hmm...looks like the linter changed a few hours ago: golang/lint#319

I'm seeing these errors from the CircleCI job:

🐳 lint
ca/keyreadwriter.go:190:2: redundant if ...; err != nil check, just return error instead.
manager/allocator/network.go:175:2: redundant if ...; err != nil check, just return error instead.
manager/controlapi/network.go:99:2: redundant if ...; err != nil check, just return error instead.
manager/controlapi/service.go:59:2: redundant if ...; err != nil check, just return error instead.
manager/controlapi/service.go:164:2: redundant if ...; err != nil check, just return error instead.
manager/controlapi/service.go:484:2: redundant if ...; err != nil check, just return error instead.
manager/dispatcher/dispatcher.go:857:3: redundant if ...; err != nil check, just return error instead.
manager/logbroker/broker_test.go:405:3: redundant if ...; err != nil check, just return error instead.
manager/logbroker/broker_test.go:527:3: redundant if ...; err != nil check, just return error instead.
manager/logbroker/broker_test.go:658:3: redundant if ...; err != nil check, just return error instead.
manager/orchestrator/update/updater.go:387:4: redundant if ...; err != nil check, just return error instead.
manager/scheduler/scheduler.go:95:2: redundant if ...; err != nil check, just return error instead.
manager/state/raft/raft.go:1286:2: redundant if ...; err != nil check, just return error instead.
manager/state/raft/raft.go:1851:2: redundant if ...; err != nil check, just return error instead.
manager/state/raft/storage/storage.go:229:2: redundant if ...; err != nil check, just return error instead.
manager/state/raft/transport/transport.go:238:2: redundant if ...; err != nil check, just return error instead.
make: *** [lint] Error 1
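For reference, the pattern the new golint check flags, and the rewrite it asks for, look like this (illustrative snippet; flush is a stand-in, not actual swarmkit code):

// What the linter flags at the end of a function returning error:
if err := flush(); err != nil {
	return err
}
return nil

// What it wants instead:
return flush()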

@marcusmartins

@anshulpundir @nishanttotla the CI is failing on lint. Do you know why?

@thaJeztah (Member)

Can we pin the linter to a specific version to prevent that?

@wsong commented Sep 19, 2017

Alternatively, it may be faster to just open a separate PR to fix those places in the code; it looks like there might not be too many.

@anshulpundir (Contributor, Author) commented Sep 19, 2017

Alternatively, it may be faster to just open a separate PR to fix those places in the code; it looks like there might not be too many.

Doing this right now.

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
anshulpundir merged commit b1bcc05 into moby:master on Sep 19, 2017
anshulpundir deleted the msg_size branch on September 19, 2017 at 18:03
anshulpundir pushed a commit that referenced this pull request Sep 19, 2017
* Fix linter errors.

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit aa2c48b)
Signed-off-by: Andrew Hsu <andrewhsu@docker.com>

* Set grpc max message size to 128MB. (#2375)

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit b1bcc05)
Signed-off-by: Andrew Hsu <andrewhsu@docker.com>

* run vndr again because it now deletes unused files

Signed-off-by: Andrew Hsu <andrewhsu@docker.com>
vieux pushed a commit that referenced this pull request Sep 19, 2017
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit b1bcc05)
Signed-off-by: Victor Vieux <victorvieux@gmail.com>
anshulpundir added a commit that referenced this pull request Sep 22, 2017
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit b1bcc05)
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
anshulpundir added a commit that referenced this pull request Sep 22, 2017
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit b1bcc05)
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
8 participants