Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase gRPC request timeout to 20 seconds for sending snapshots #2391

Merged
merged 1 commit into from
Oct 9, 2017

Conversation

nishanttotla
Copy link
Contributor

Since #2375 was merged, we've been getting context deadline exceeded while sending large raft snapshots. After investigation from @anshulpundir and I, it seems that the gRPC context timeout was too short at 2 seconds to be sufficient for sending large snapshots. This PR increases the default value to 45 seconds. Testing from @davidwilliamson confirms that this fixes the issue.

cc @anshulpundir @aluzzardi @stevvooe @aaronlehmann @marcusmartins @mghazizadeh

Signed-off-by: Nishant Totla nishanttotla@gmail.com

@dperny
Copy link
Collaborator

dperny commented Oct 2, 2017

i'm pretty sure that's going to mean it takes 45 seconds to recognize that a raft member is down.

@stevvooe
Copy link
Contributor

stevvooe commented Oct 2, 2017

I think @dperny's concern is worth investigation.

@nishanttotla
Copy link
Contributor Author

@dperny let me test this to make sure, but your concern is legitimate. This is only the worst case though right? Does changing this timeout mean we need to adjust other Raft settings?

@davidwilliamson in your tests, is it possible to check for @dperny's concern above?

@codecov
Copy link

codecov bot commented Oct 2, 2017

Codecov Report

Merging #2391 into master will increase coverage by 0.18%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2391      +/-   ##
==========================================
+ Coverage    60.5%   60.68%   +0.18%     
==========================================
  Files         128      128              
  Lines       26319    26332      +13     
==========================================
+ Hits        15923    15980      +57     
+ Misses       9006     8958      -48     
- Partials     1390     1394       +4

@nishanttotla
Copy link
Contributor Author

@dperny why did you claim that detecting down Raft members will take longer to detect when down? Is the "heartbeat" done through a gRPC call that times out at 2 seconds?

Anyway, based on a discussion with @anshulpundir, I've updated this PR to only increase the time limit for when we send a snapshot. This is a little hacky perhaps, but should avoid other issues.

I would still suggest that @davidwilliamson test this change out before we consider merging it. I will create a build shortly.

Copy link
Contributor

@anshulpundir anshulpundir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke with @nishanttotla offline about making this a more targeted to only affect snap send messages.

@anshulpundir
Copy link
Contributor

Oops, didn't see @nishanttotla previous message, where he already mentioned this. Please ignore.

@dperny
Copy link
Collaborator

dperny commented Oct 2, 2017

PTAL, this is less of a hack, but might be over-engineered. WDYT? @nishanttotla @anshulpundir

master...dperny:increase-snapshot-timeout

@stevvooe
Copy link
Contributor

stevvooe commented Oct 2, 2017

It would be good to see that this is tunable.

Do we have a PR for the real solution?

@nishanttotla
Copy link
Contributor Author

@dperny looks alright to me, I think that is the right way to engineer it. The only qualm I have with doing this is that if we introduce this LargeSendTimeout option, then it just stays, even after we use streaming to send large snapshots.

What we also need to figure out is whether the core change (in peer.go) is the right one to begin with.

@aluzzardi
Copy link
Member

Had the same concern as @dperny - I see it got addressed, looks good to me.

@stevvooe Agree with the tunable part generally speaking, however I'd hope we'd get rid of this workaround very soon and therefore the tunable option would become useless.

Can we do a round of manual testing to validate this? Specifically, raft down nodes with and without the patch as well as sending a large snapshot.

Not sure what to think about the current value of 45 seconds. Given the max snapshot size is 128MB, you'd need a transfer speed of at least 23 mbit/s to make it happen within the timeout limit, which I think is a reasonable expectation?

There are a few edge cases but I don't think we can really address those. I believe the raft state machine counts any message for heart beat purposes. If the peer goes down just as we're sending a snapshot, it would take 43 additional seconds to detect it's down. But that's really a edge case and we can't really work around that, can we?

@davidwilliamson
Copy link

@aluzzardi re: "Given the max snapshot size is 128MB, you'd need a transfer speed of at least 23 mbit/s to make it happen within the timeout limit"

What is the expected behavior if there are, say, five snapshots (current + four historical), i.e., docker swarm update --max-snapshots 4 ?

Is the 45 second timeout (and thus the bandwidth calculation) per snapshot, or for the entire history?

@marcusmartins
Copy link

@davidwilliamson --max-snapshot only control how many snapshots to keep locally, they don't get synced over the wire so they don't influence that calculation.

@anshulpundir
Copy link
Contributor

What is the expected behavior if there are, say, five snapshots (current + four historical), i.e., docker swarm update --max-snapshots 4 ?

Only the latest snapshot is sent @davidwilliamson

@anshulpundir
Copy link
Contributor

+1 on making this tunable, since the previous value of 2s for the timeout also seems arbitrary.

// adjust timeout to be higher for when a snapshot is being sent. This
// is to accommodate for the fact that snapshots can be large.
if m.Type == raftpb.MsgSnap {
timeout = 45 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define a constant for the default ?

@nishanttotla nishanttotla changed the title Increase gRPC request timeout to 45 seconds Increase gRPC request timeout to 20 seconds for sending snapshots Oct 2, 2017
@nishanttotla
Copy link
Contributor Author

Added a SendTimeoutSnapshot field to NodeOptions so that the value can be made configurable in the future.

Also reduced the value of the large timeout to 20 seconds. Assuming a 100Mbps connection, about 240MB can be sent in a period of 20 seconds. This is well over our allowed limit.

SendTimeout time.Duration
TLSCredentials credentials.TransportCredentials
KeyRotator EncryptionKeyRotator
SendTimeout time.Duration
Copy link
Contributor

@anshulpundir anshulpundir Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to be able to modify this without a code change, for testing etc. Is that currently possible ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would have to be wired through the CLI and require more changes. I don't think that should be part of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't recommending that it be a part of this PR. But, I was curious if there is an easy way to add customizable options to swarmkit. Is the answer no ? Also, is CLI the only way to do it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly possible (and not hard) to make this configurable at the SwarmKit level, but assuming that it'll mostly be used through the Docker API means it's not very valuable unless it's done end-to-end. This is just my opinion.

Raft: mr,
HeartbeatInterval: 3 * time.Second,
SendTimeout: 2 * time.Second,
SendTimeoutSnapshot: 20 * time.Second,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still suggest making this (and others too) a constant and using it here and in raft.go

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constants won't make this clearer, just harder to read.

Copy link
Contributor

@anshulpundir anshulpundir Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, its better to have it defined in one place than inline in many places. Agreed that that other place is test code, but still.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do that in a follow-up if we think it's required.

Copy link
Contributor

@anshulpundir anshulpundir Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a simple change, what do you think ? Also, I don't think there is a rush to get this in since 17.03 RC2 isn't waiting for this.

SendTimeout time.Duration
// SendTimeoutSnapshot is the timeout on the sending snapshots to other raft
// nodes. Leave this as 0 to get the default value.
SendTimeoutSnapshot time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be SendSnapshotTimeout.

Copy link
Collaborator

@aaronlehmann aaronlehmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me after the outstanding comments are addressed.

The same problem might exist for AppendEntries. Snapshots are most likely to take a long time to send, but sending a lot of entries outside a snapshot could take a long time too.

I think it is safe to increase the timeout for the general case. It's true that it might take a leader longer to notice that a follower node is down, but this case isn't particularly important. It's more important when the leader goes down and some other node needs to start a leader election, but this is handled by the Raft state machine and I don't think the timeout has any influence on this. Of course, it would be a good idea to test any changes here carefully.

@nishanttotla
Copy link
Contributor Author

@aaronlehmann for AppendEntries, are all entries required to be sent in a single gRPC request? It might be worth increasing the timeout for that too.

I understand that increasing the overall timeout may work out, but given that we're able to be specific here, I think we should only increase timeout where necessary.

@anshulpundir
Copy link
Contributor

are all entries required to be sent in a single gRPC request? It might be worth increasing the timeout for that too.

Even if all the entries are sent together, does increasing the grpc message size have any affect on this ? @nishanttotla @aaronlehmann Basically I'm curious why you feel that the timeout for that also needs to be increased.

@anshulpundir
Copy link
Contributor

The MaxSizePerMsg, which limits the number of entries in an append message, is set to 64K. Depending on the size of each entry, a larger grpc message size can lead to larger append messages.

Also, depending on the size of each entry, MaxSizePerMsg should probably be lowered.

@stevvooe
Copy link
Contributor

stevvooe commented Oct 3, 2017

LGTM

@aaronlehmann
Copy link
Collaborator

for AppendEntries, are all entries required to be sent in a single gRPC request? It might be worth increasing the timeout for that too.

I think it would be fine to split these across multiple gRPC requests if we detect that a lot of data would be sent. It would also be okay to increase the timeout. Whatever's easiest.

Copy link
Contributor

@anshulpundir anshulpundir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can increase the timeout for specifically for append entries message, if thats possible. Otherwise, reducing MaxSizePerMsg seems simpler.

@nishanttotla
Copy link
Contributor Author

Based on a chat with @anshulpundir, we have decided to pursue the case of AppendEntries in a follow-up PR. We think more discussion is needed for that.

Signed-off-by: Nishant Totla <nishanttotla@gmail.com>
@nishanttotla
Copy link
Contributor Author

After more discussion, @anshulpundir and I have decided to just increase the timeout for AppendEntries, since it seems right, and we still have the upper limit of 128MB for the gRPC message. I've updated that. I think we can now merge this PR.

cc @aaronlehmann

@nishanttotla nishanttotla merged commit 1e80cfb into moby:master Oct 9, 2017
@nishanttotla nishanttotla deleted the increase-grpc-timeout branch October 9, 2017 13:12
@nishanttotla
Copy link
Contributor Author

@thaJeztah @andrewhsu we must cherry pick this PR along with the gRPC limit increase. This must go into 17.09 as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants