
Increase gRPC request timeout to 20 seconds for sending snapshots #2391

Merged

nishanttotla merged 1 commit into docker:master from nishanttotla:increase-grpc-timeout on Oct 9, 2017

Conversation

8 participants
@nishanttotla
Contributor

nishanttotla commented Oct 2, 2017

Since #2375 was merged, we've been getting context deadline exceeded errors while sending large raft snapshots. After investigation by @anshulpundir and me, it appears that the 2-second gRPC context timeout is too short for sending large snapshots. This PR increases the default value to 45 seconds. Testing from @davidwilliamson confirms that this fixes the issue.

cc @anshulpundir @aluzzardi @stevvooe @aaronlehmann @marcusmartins @mghazizadeh

Signed-off-by: Nishant Totla nishanttotla@gmail.com
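
For context, the mechanism in question is the per-request deadline the raft transport places on each outgoing gRPC call. A minimal sketch of that shape, assuming simplified stand-in names (raftSender, peer, sendTimeout) rather than the actual swarmkit types:

```go
package transport

import (
	"context"
	"time"

	"github.com/coreos/etcd/raft/raftpb" // etcd raft import path as commonly vendored circa 2017
)

// raftSender is a stand-in for the gRPC client that delivers raft messages.
type raftSender interface {
	ProcessRaftMessage(ctx context.Context, m *raftpb.Message) error
}

type peer struct {
	client      raftSender
	sendTimeout time.Duration // 2s before this PR; too short for large snapshots
}

// send delivers one raft message, failing with "context deadline exceeded"
// if the transfer does not complete within sendTimeout.
func (p *peer) send(ctx context.Context, m raftpb.Message) error {
	ctx, cancel := context.WithTimeout(ctx, p.sendTimeout)
	defer cancel()
	return p.client.ProcessRaftMessage(ctx, &m)
}
```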

@dperny
Member

dperny commented Oct 2, 2017

I'm pretty sure that's going to mean it takes 45 seconds to recognize that a raft member is down.

@stevvooe
Contributor

stevvooe commented Oct 2, 2017

I think @dperny's concern is worth investigation.

@nishanttotla
Contributor

nishanttotla commented Oct 2, 2017

@dperny let me test this to make sure, but your concern is legitimate. This is only the worst case though, right? Does changing this timeout mean we need to adjust other Raft settings?

@davidwilliamson in your tests, is it possible to check for @dperny's concern above?

@codecov

codecov bot commented Oct 2, 2017

Codecov Report

Merging #2391 into master will increase coverage by 0.18%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2391      +/-   ##
==========================================
+ Coverage    60.5%   60.68%   +0.18%     
==========================================
  Files         128      128              
  Lines       26319    26332      +13     
==========================================
+ Hits        15923    15980      +57     
+ Misses       9006     8958      -48     
- Partials     1390     1394       +4
@nishanttotla
Contributor

nishanttotla commented Oct 2, 2017

@dperny why do you say that it will take longer to detect a down Raft member? Is the "heartbeat" done through a gRPC call that times out at 2 seconds?

Anyway, based on a discussion with @anshulpundir, I've updated this PR to only increase the time limit for when we send a snapshot. This is a little hacky perhaps, but should avoid other issues.

I would still suggest that @davidwilliamson test this change out before we consider merging it. I will create a build shortly.
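
The targeted change has roughly this shape: keep the short deadline for ordinary raft traffic and choose a longer one only when the outgoing message carries a snapshot. A sketch under that assumption (not the exact diff; timeoutFor is an illustrative name):

```go
package transport

import (
	"time"

	"github.com/coreos/etcd/raft/raftpb"
)

// timeoutFor picks the per-request gRPC deadline from the raft message type,
// so only snapshot sends (MsgSnap) get the longer timeout.
func timeoutFor(m raftpb.Message, sendTimeout, snapshotTimeout time.Duration) time.Duration {
	if m.Type == raftpb.MsgSnap {
		return snapshotTimeout
	}
	return sendTimeout
}
```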

@anshulpundir

Spoke with @nishanttotla offline about making this more targeted, so that it only affects snapshot send messages.

@anshulpundir
Contributor

anshulpundir commented Oct 2, 2017

Oops, didn't see @nishanttotla's previous message, where he already mentioned this. Please ignore.

@dperny
Member

dperny commented Oct 2, 2017

PTAL, this is less of a hack, but might be over-engineered. WDYT? @nishanttotla @anshulpundir

master...dperny:increase-snapshot-timeout

@stevvooe
Contributor

stevvooe commented Oct 2, 2017

It would be good to see that this is tunable.

Do we have a PR for the real solution?

@nishanttotla
Contributor

nishanttotla commented Oct 2, 2017

@dperny looks alright to me, I think that is the right way to engineer it. The only qualm I have is that if we introduce this LargeSendTimeout option, it stays around even after we move to streaming large snapshots.

What we also need to figure out is whether the core change (in peer.go) is the right one to begin with.

@aluzzardi
Contributor

aluzzardi commented Oct 2, 2017

Had the same concern as @dperny - I see it got addressed, looks good to me.

@stevvooe Agree with the tunable part generally speaking, however I'd hope we get rid of this workaround very soon, and then the tunable option would become useless.

Can we do a round of manual testing to validate this? Specifically, detection of down Raft nodes with and without the patch, as well as sending a large snapshot.

Not sure what to think about the current value of 45 seconds. Given that the max snapshot size is 128MB, you'd need a transfer speed of at least 23 Mbit/s (128 MB ≈ 1024 Mbit, and 1024 Mbit / 45 s ≈ 23 Mbit/s) to make it happen within the timeout limit, which I think is a reasonable expectation?

There are a few edge cases, but I don't think we can really address those. I believe the raft state machine counts any message for heartbeat purposes. If the peer goes down just as we're sending a snapshot, it would take 43 additional seconds to detect that it's down. But that's really an edge case and we can't really work around that, can we?

@davidwilliamson

davidwilliamson commented Oct 2, 2017

@aluzzardi re: "Given the max snapshot size is 128MB, you'd need a transfer speed of at least 23 Mbit/s to make it happen within the timeout limit"

What is the expected behavior if there are, say, five snapshots (current + four historical), i.e., docker swarm update --max-snapshots 4?

Is the 45-second timeout (and thus the bandwidth calculation) per snapshot, or for the entire history?

@marcusmartins
Member

marcusmartins commented Oct 2, 2017

@davidwilliamson --max-snapshots only controls how many snapshots to keep locally; they don't get synced over the wire, so they don't influence that calculation.

@anshulpundir
Contributor

anshulpundir commented Oct 2, 2017

What is the expected behavior if there are, say, five snapshots (current + four historical), i.e., docker swarm update --max-snapshots 4?

Only the latest snapshot is sent, @davidwilliamson.

@anshulpundir
Contributor

anshulpundir commented Oct 2, 2017

+1 on making this tunable, since the previous value of 2s for the timeout also seems arbitrary.

@nishanttotla nishanttotla changed the title from "Increase gRPC request timeout to 45 seconds" to "Increase gRPC request timeout to 20 seconds for sending snapshots" on Oct 2, 2017

@nishanttotla
Contributor

nishanttotla commented Oct 2, 2017

Added a SendTimeoutSnapshot field to NodeOptions so that the value can be made configurable in the future.

Also reduced the value of the large timeout to 20 seconds. Assuming a 100Mbps connection, about 240MB can be sent in a period of 20 seconds. This is well over our allowed limit.

SendTimeout    time.Duration
TLSCredentials credentials.TransportCredentials
KeyRotator     EncryptionKeyRotator

@anshulpundir
Contributor

anshulpundir Oct 3, 2017

It would be nice to be able to modify this without a code change, for testing etc. Is that currently possible?

@nishanttotla
Contributor

nishanttotla Oct 3, 2017

That would have to be wired through the CLI and require more changes. I don't think that should be part of this PR.

@anshulpundir
Contributor

anshulpundir Oct 3, 2017

I wasn't recommending that it be a part of this PR. But I was curious whether there is an easy way to add customizable options to swarmkit. Is the answer no? Also, is the CLI the only way to do it?

@nishanttotla
Contributor

nishanttotla Oct 3, 2017

It's certainly possible (and not hard) to make this configurable at the SwarmKit level, but assuming that it'll mostly be used through the Docker API, it's not very valuable unless it's done end-to-end. This is just my opinion.
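
For illustration, "configurable at the SwarmKit level" would mean the embedder sets the value when building the node's options. The sketch below mirrors only the two timeout fields discussed in this PR (the real NodeOptions has many more) and uses the defaults this PR settles on:

```go
package main

import (
	"fmt"
	"time"
)

// nodeOptions mirrors only the timeout fields discussed here; illustrative,
// not the actual swarmkit NodeOptions definition.
type nodeOptions struct {
	SendTimeout         time.Duration // ordinary raft messages
	SendTimeoutSnapshot time.Duration // raft snapshot transfers
}

// withDefaults fills in the values this PR uses when the embedder does not
// override them, which is what keeps the setting tunable in code today.
func (o nodeOptions) withDefaults() nodeOptions {
	if o.SendTimeout == 0 {
		o.SendTimeout = 2 * time.Second
	}
	if o.SendTimeoutSnapshot == 0 {
		o.SendTimeoutSnapshot = 20 * time.Second
	}
	return o
}

func main() {
	opts := nodeOptions{}.withDefaults()
	fmt.Println(opts.SendTimeout, opts.SendTimeoutSnapshot) // 2s 20s
}
```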

Review thread on manager/state/raft/transport/mock_raft_test.go (outdated)
@aaronlehmann

This looks good to me after the outstanding comments are addressed.

The same problem might exist for AppendEntries. Snapshots are most likely to take a long time to send, but sending a lot of entries outside a snapshot could take a long time too.

I think it is safe to increase the timeout for the general case. It's true that it might take a leader longer to notice that a follower node is down, but this case isn't particularly important. It's more important when the leader goes down and some other node needs to start a leader election, but this is handled by the Raft state machine and I don't think the timeout has any influence on this. Of course, it would be a good idea to test any changes here carefully.

@nishanttotla
Contributor

nishanttotla commented Oct 3, 2017

@aaronlehmann for AppendEntries, are all entries required to be sent in a single gRPC request? It might be worth increasing the timeout for that too.

I understand that increasing the overall timeout may work out, but given that we're able to be specific here, I think we should only increase the timeout where necessary.

@anshulpundir
Contributor

anshulpundir commented Oct 3, 2017

are all entries required to be sent in a single gRPC request? It might be worth increasing the timeout for that too.

Even if all the entries are sent together, does increasing the gRPC message size have any effect on this? @nishanttotla @aaronlehmann Basically I'm curious why you feel that the timeout for that also needs to be increased.

@anshulpundir
Contributor

anshulpundir commented Oct 3, 2017

MaxSizePerMsg, which limits the number of entries in an append message, is set to 64K. Depending on the size of each entry, a larger gRPC message size can lead to larger append messages.

Also, depending on the size of each entry, MaxSizePerMsg should probably be lowered.
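
For reference, MaxSizePerMsg is a field on the etcd/raft configuration that caps how much entry data gets packed into a single append (MsgApp) message. A sketch of where it sits; the surrounding values are illustrative rather than swarmkit's exact settings:

```go
package raftcfg

import "github.com/coreos/etcd/raft" // etcd raft import path as commonly vendored circa 2017

// newRaftConfig shows where MaxSizePerMsg lives. Lowering it makes each
// MsgApp smaller (and quicker to send within the gRPC deadline) at the cost
// of more round trips while a follower catches up.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:              id,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   64 * 1024, // the "64K" mentioned above (illustrative)
		MaxInflightMsgs: 256,
	}
}
```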

@stevvooe
Contributor

stevvooe commented Oct 3, 2017

LGTM

@aaronlehmann
Collaborator

aaronlehmann commented Oct 4, 2017

for AppendEntries, are all entries required to be sent in a single gRPC request? It might be worth increasing the timeout for that too.

I think it would be fine to split these across multiple gRPC requests if we detect that a lot of data would be sent. It would also be okay to increase the timeout. Whatever's easiest.

@anshulpundir

We can increase the timeout specifically for append-entries messages, if that's possible. Otherwise, reducing MaxSizePerMsg seems simpler.

@nishanttotla
Contributor

nishanttotla commented Oct 5, 2017

Based on a chat with @anshulpundir, we have decided to pursue the case of AppendEntries in a follow-up PR. We think more discussion is needed for that.

Increase gRPC request timeout to 20 seconds when sending snapshots
Signed-off-by: Nishant Totla <nishanttotla@gmail.com>
@nishanttotla
Contributor

nishanttotla commented Oct 6, 2017

After more discussion, @anshulpundir and I have decided to just increase the timeout for AppendEntries as well, since it seems right, and we still have the upper limit of 128MB for the gRPC message. I've updated that. I think we can now merge this PR.

cc @aaronlehmann
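
In other words, the message-type selection sketched earlier now covers both cases; roughly (illustrative, not the exact merged code):

```go
package transport

import "github.com/coreos/etcd/raft/raftpb"

// needsLargeTimeout reports whether an outgoing raft message should get the
// longer 20s deadline: snapshots and append-entries batches, both of which
// can carry payloads bounded by the 128MB gRPC message limit.
func needsLargeTimeout(m raftpb.Message) bool {
	return m.Type == raftpb.MsgSnap || m.Type == raftpb.MsgApp
}
```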

@nishanttotla nishanttotla merged commit 1e80cfb into docker:master Oct 9, 2017

3 checks passed

ci/circleci: Your tests passed on CircleCI!
codecov/project: 60.68% (target 0%)
dco-signed: All commits are signed

@nishanttotla nishanttotla deleted the nishanttotla:increase-grpc-timeout branch Oct 9, 2017

@nishanttotla
Contributor

nishanttotla commented Oct 9, 2017

@thaJeztah @andrewhsu we must cherry-pick this PR along with the gRPC limit increase. This must go into 17.09 as well.
