Set grpc max message size to 128MB. #2375

Merged

merged 1 commit into docker:master from anshulpundir:msg_size on Sep 19, 2017

Conversation

@anshulpundir
Contributor

anshulpundir commented Sep 15, 2017

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
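
A minimal sketch of how a 128MB limit can be applied with grpc-go, assuming current grpc-go option names; the exact options and wiring in this PR's diff may differ:

package example

import "google.golang.org/grpc"

// maxMsgSize mirrors the 128MB limit discussed in this PR.
const maxMsgSize = 128 << 20

// newServer raises the receive and send limits on the server side.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMsgSize),
		grpc.MaxSendMsgSize(maxMsgSize),
	)
}

// dial raises the client-side limits as well, since clients enforce their own.
func dial(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // illustrative; swarmkit dials with TLS credentials
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
	)
}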

codecov bot commented Sep 15, 2017

Codecov Report

Merging #2375 into master will increase coverage by 0.09%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2375      +/-   ##
==========================================
+ Coverage   60.35%   60.45%   +0.09%     
==========================================
  Files         128      128              
  Lines       26267    26268       +1     
==========================================
+ Hits        15854    15880      +26     
+ Misses       9011     8993      -18     
+ Partials     1402     1395       -7
@nishanttotla

I'd like to know the full side effects of doing this. My hunch is that message encoding gets affected for larger message size limits. Also related: https://stackoverflow.com/questions/34128872/google-protobuf-maximum-size

Ping @stevvooe.

wsong commented Sep 15, 2017

I'd also recommend that we make this configurable. The default value for this parameter can stay at 128MB or whatever, but it'd be nice if we read this value out of /etc/docker/daemon.json and passed it into the swarmkit code in the moby/moby repo.

Contributor

anshulpundir commented Sep 15, 2017

I'd also recommend that we make this configurable. The default value for this parameter can stay at 128MB or whatever, but it'd be nice if we read this value out of /etc/docker/daemon.json and passed it into the swarmkit code in the moby/moby repo.

Agreed @wsong

I'd like to know the full side effects of doing this.

I'd like to also :)

My hunch is that message encoding gets affected for larger message size limits.

I think encoding and decoding depend on the actual message size and the structure of the data. For example, a deeply nested struct would be more costly to encode/decode. I don't think changing the limit affects encoding/decoding performance. @nishanttotla

Contributor

aluzzardi commented Sep 18, 2017

The default value for this parameter can stay at 128MB or whatever, but it'd be nice if we read this value out of /etc/docker/daemon.json and passed it into the swarmkit code in the moby/moby repo.

I'm not sure about that - when moving to the "proper" solution (streaming), the config option won't make sense anymore.

I don't know if it makes sense to introduce a new option in a patch release just so we can deprecate it immediately after.

(Outdated review thread on manager/manager.go)

Contributor

aluzzardi commented Sep 18, 2017

Overall looks good

What is going to happen when we reach 128M? Any way we could log a proper error message?

Contributor

nishanttotla commented Sep 18, 2017

I think some log messages should be included to bubble up errors from hitting the gRPC limit (either in this PR or as an immediate follow up).

If we increase the limit to 128MB, then is it worth the engineering effort to implement streaming in addition? If we do implement streaming later, then we'd want to bring the limit down again. This may cause backward compatibility issues.

I agree with @aluzzardi that if we introduce this option, it doesn't make sense to immediately deprecate it.
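
For context, grpc-go reports an over-limit message as a ResourceExhausted status error; a minimal sketch of how such a failure could be detected and logged (the wrapper and its name are illustrative, not part of this PR):

package example

import (
	"context"

	"github.com/sirupsen/logrus"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callWithSizeCheck wraps a gRPC call and emits a clearer log line when the
// call fails because a message exceeded the configured maximum size.
func callWithSizeCheck(ctx context.Context, do func(context.Context) error) error {
	err := do(ctx)
	if status.Code(err) == codes.ResourceExhausted {
		// grpc-go rejects oversized messages with codes.ResourceExhausted.
		logrus.WithError(err).Error("gRPC message exceeded the maximum message size")
	}
	return err
}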

Contributor

anshulpundir commented Sep 18, 2017

We just met and agreed to not add the knob to adjust the max grpc message size.
I will manually test the positive test case today and investigate adding an integration test for positive and negative scenarios.

Contributor

stevvooe commented Sep 18, 2017

-1 on making this configurable.

If we increase the limit to 128MB, then is it worth the engineering effort to implement streaming in addition?

Yes, this is a horrible thing to have as a scaling limit and it needs to be addressed. Besides patching this as a workaround, the next focus should be actually fixing this to handle arbitrary snapshot sizes. Increasing the message size limit and calling it good is not an acceptable engineering solution. Increasing this limit exposes us to message floods on all services, not just the raft service.

Is there anything we can add to log messages that are over some threshold, so that we can have an idea of the level of impact this is having? Prom counters bucketed by size might be appropriate. Such large messages will have an effect on stability at around 10MB and up with protobuf. The allocations required for processing a 128MB message may cause stability issues. I'd really like to know if there are messages that are larger than 1MB transiting swarm GRPC endpoints, as those are likely causing "buffer bloat" and allocation spikes. We really need to have an easy way for users to understand when they are impacted by this problem.

wsong commented Sep 18, 2017

Yeah, after meeting offline, I agree that this should not be configurable.

Adding a metric to track large (i.e. > 1MB) GRPC messages is definitely a good idea, but it should probably go in a separate PR. Logging the length of every single message before sending it might have some unforeseen performance implications.

Contributor

stevvooe commented Sep 18, 2017

Logging the length of every single message before sending it might have some unforeseen performance implications.

Not really what I was suggesting:

if len(msg) > 1 << 20 {
 // log it.
}

Better would be to have bucketed counters (unrolled below):

if len(msg) >= 1 << 20 {
  gt1MB++
}

if len(msg) >= 4 << 20 {
  gt4MB++
}

if len(msg) >= 10 << 20 {
  gt10MB++
}

if len(msg) >= 16 << 20 {
  gt16MB++
}
...

Notice that this is a cumulative counter setup, rather than bucketed, which allows easy aggregates.

probably go in a separate PR.

Sure, but let's make sure it gets backported: right now, this condition is hard to diagnose and debug.
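
A Prometheus histogram gives this cumulative behavior out of the box; a minimal sketch, assuming plain prometheus/client_golang (the metric name and wiring are illustrative, not from this PR):

package example

import "github.com/prometheus/client_golang/prometheus"

// grpcMessageBytes records the size of gRPC messages sent. Histogram buckets
// are cumulative upper bounds, so "messages larger than N" falls out as the
// total count minus the bucket at N.
var grpcMessageBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "swarm_grpc_message_bytes", // illustrative name
	Help:    "Size of gRPC messages sent, in bytes.",
	Buckets: []float64{1 << 20, 4 << 20, 10 << 20, 16 << 20, 64 << 20, 128 << 20},
})

func init() {
	prometheus.MustRegister(grpcMessageBytes)
}

// observeMessageSize would be called just before a message is sent.
func observeMessageSize(msg []byte) {
	grpcMessageBytes.Observe(float64(len(msg)))
}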

Contributor

anshulpundir commented Sep 19, 2017

Manually tested for now. Positive test case for around 120MiB. Negative for around 240MiB. Will explore the metrics and error reporting in a separate PR.

andrewhsu commented Sep 19, 2017

Hmm...looks like the linter changed a few hours ago: golang/lint#319

I'm seeing these errors from the circleci job:

🐳 lint
ca/keyreadwriter.go:190:2: redundant if ...; err != nil check, just return error instead.
manager/allocator/network.go:175:2: redundant if ...; err != nil check, just return error instead.
manager/controlapi/network.go:99:2: redundant if ...; err != nil check, just return error instead.
manager/controlapi/service.go:59:2: redundant if ...; err != nil check, just return error instead.
manager/controlapi/service.go:164:2: redundant if ...; err != nil check, just return error instead.
manager/controlapi/service.go:484:2: redundant if ...; err != nil check, just return error instead.
manager/dispatcher/dispatcher.go:857:3: redundant if ...; err != nil check, just return error instead.
manager/logbroker/broker_test.go:405:3: redundant if ...; err != nil check, just return error instead.
manager/logbroker/broker_test.go:527:3: redundant if ...; err != nil check, just return error instead.
manager/logbroker/broker_test.go:658:3: redundant if ...; err != nil check, just return error instead.
manager/orchestrator/update/updater.go:387:4: redundant if ...; err != nil check, just return error instead.
manager/scheduler/scheduler.go:95:2: redundant if ...; err != nil check, just return error instead.
manager/state/raft/raft.go:1286:2: redundant if ...; err != nil check, just return error instead.
manager/state/raft/raft.go:1851:2: redundant if ...; err != nil check, just return error instead.
manager/state/raft/storage/storage.go:229:2: redundant if ...; err != nil check, just return error instead.
manager/state/raft/transport/transport.go:238:2: redundant if ...; err != nil check, just return error instead.
make: *** [lint] Error 1
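
For reference, the golint finding above flags this shape of code; a minimal before/after sketch (the type and function names are illustrative, not taken from the files listed):

package example

import "errors"

type store struct{ dirty bool }

func (s *store) flush() error {
	if s.dirty {
		return errors.New("flush failed")
	}
	return nil
}

// Before: flagged as "redundant if ...; err != nil check, just return error instead."
func saveBefore(s *store) error {
	if err := s.flush(); err != nil {
		return err
	}
	return nil
}

// After: return the error from the call directly.
func saveAfter(s *store) error {
	return s.flush()
}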

Member

marcusmartins commented Sep 19, 2017

@anshulpundir @nishanttotla the CI is failing on lint. Do you know why?

Member

thaJeztah commented Sep 19, 2017

Can we pin the linter to a specific version to prevent that?

wsong commented Sep 19, 2017

Alternately, it may be faster to just open up a separate PR to fix those places in the code; it looks like there might not be too many.

Contributor

anshulpundir commented Sep 19, 2017

Alternately, it may be faster to just open up a separate PR to fix those places in the code; it looks like there might not be too many.

Doing this right now.

Set grpc max message size to 128MB.
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>

anshulpundir merged commit b1bcc05 into docker:master on Sep 19, 2017

3 checks passed

ci/circleci: Your tests passed on CircleCI!
codecov/project: 60.45% (target 0%)
dco-signed: All commits are signed

anshulpundir deleted the anshulpundir:msg_size branch on Sep 19, 2017

andrewhsu added a commit to andrewhsu/swarmkit that referenced this pull request Sep 19, 2017

Set grpc max message size to 128MB. (docker#2375)
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit b1bcc05)
Signed-off-by: Andrew Hsu <andrewhsu@docker.com>

anshulpundir added a commit that referenced this pull request Sep 19, 2017

[17.06] backport fixes for max grpc message size and lint errors (#2378)
* Fix linter errors.

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit aa2c48b)
Signed-off-by: Andrew Hsu <andrewhsu@docker.com>

* Set grpc max message size to 128MB. (#2375)

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit b1bcc05)
Signed-off-by: Andrew Hsu <andrewhsu@docker.com>

* run vndr again because it now deletes unused files

Signed-off-by: Andrew Hsu <andrewhsu@docker.com>

vieux added a commit that referenced this pull request Sep 19, 2017

Set grpc max message size to 128MB. (#2375)
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit b1bcc05)
Signed-off-by: Victor Vieux <victorvieux@gmail.com>

anshulpundir added a commit that referenced this pull request Sep 22, 2017

Set grpc max message size to 128MB. (#2375)
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit b1bcc05)
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
