
[ca]: Fix GetRemoteSignedCertificate retry #2063

Merged: 1 commit into moby:master on Mar 30, 2017

Conversation

@cyli (Contributor) commented Mar 28, 2017:

@aaronlehmann discovered an error where, if an external CA was unreachable, a new node join would call IssueNodeCertificate on the remote CA 3 times at 5-second intervals, resulting in 3 different node entries. The expected behavior was that GetRemoteSignedCertificate would call IssueNodeCertificate on the remote CA once, and then just keep calling NodeCertificateStatus.

The issue seems to be that we had a 5-second context timeout on the call to NodeCertificateStatus. However, if the call to NodeCertificateStatus errors (which it would on timeout), GetRemoteSignedCertificate just returns rather than retrying NodeCertificateStatus later.

This PR updates GetRemoteSignedCertificate so that if the call to NodeCertificateStatus times out after 5 seconds, it retries after an exponential backoff. If the call errors for some other reason (not a timeout), it closes the existing connection, opens a new one, and tries again. This means that GetRemoteSignedCertificate could take a very long time, which I think is the expected behavior.
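
A minimal sketch of the retry loop described above, with hypothetical stand-ins (pollStatus, redial, and waitForCertificate are made-up names, not the actual swarmkit code):

package ca

import (
    "context"
    "time"
)

// pollStatus stands in for one NodeCertificateStatus RPC; redial stands in
// for closing the current connection and dialing a new one.
type pollStatus func(ctx context.Context) (cert []byte, err error)
type redial func() (pollStatus, error)

func waitForCertificate(ctx context.Context, poll pollStatus, dial redial) ([]byte, error) {
    backoff := 500 * time.Millisecond
    for {
        // Each status poll gets its own 5-second deadline, independent of
        // the caller's context.
        attemptCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
        cert, err := poll(attemptCtx)
        cancel()

        switch {
        case err == nil:
            return cert, nil
        case ctx.Err() != nil:
            // The overall call was cancelled: stop retrying.
            return nil, ctx.Err()
        case attemptCtx.Err() == context.DeadlineExceeded:
            // Only the per-attempt deadline fired; the certificate is
            // probably still pending, so back off and poll again.
            time.Sleep(backoff)
            if backoff < 8*time.Second {
                backoff *= 2
            }
        default:
            // Any other error: assume the connection is bad, close it,
            // and retry on a fresh one.
            newPoll, derr := dial()
            if derr != nil {
                return nil, derr
            }
            poll = newPoll
        }
    }
}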

@cyli force-pushed the fix-node-certificate-reiussue-retry branch from 602714c to 4a94b21 on March 28, 2017 06:49
@codecov bot commented Mar 28, 2017:

Codecov Report

Merging #2063 into master will increase coverage by 0.05%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2063      +/-   ##
==========================================
+ Coverage   54.21%   54.26%   +0.05%     
==========================================
  Files         111      111              
  Lines       19354    19366      +12     
==========================================
+ Hits        10492    10509      +17     
- Misses       7597     7603       +6     
+ Partials     1265     1254      -11


@cyli force-pushed the fix-node-certificate-reiussue-retry branch from 4a94b21 to 9564d69 on March 28, 2017 07:18
select {
case err = <-completed:
    assert.Equal(t, grpc.Code(err), codes.DeadlineExceeded)
case <-time.After(570 * time.Second):
Collaborator: Is this time.Second meant to be time.Millisecond? And if so, is 70ms of slack enough?

Collaborator: Also, the comment above says 5 seconds.

@cyli (author): Sorry, yes it was supposed to be time.Millisecond. I can bump this up to 1 second.

@cyli force-pushed the fix-node-certificate-reiussue-retry branch from 9564d69 to 1c7f89d on March 28, 2017 17:19
select {
case err = <-completed:
    assert.Equal(t, grpc.Code(err), codes.DeadlineExceeded)
case <-time.After(1250 * time.Millisecond):
Collaborator: I'd bump the timeout to a few seconds. You can never be too careful. Also please update the comment above, which I don't think is quite right.

select {
case <-completed:
    assert.FailNow(t, "GetRemoteSignedCertificate should wait at least 3 seconds")
case <-time.After(1 * time.Second):
Collaborator: Comment above says 5 seconds, assertion failure message says 3 seconds...

@aaronlehmann (Collaborator): cc @binman-docker

@aaronlehmann (Collaborator):

This is tricky because the client side has no way to distinguish an unresponsive manager from a CA that is taking a long time to issue the certificate. The client needs to decide whether to try reconnecting to another manager, but it doesn't have much information to use for making that decision.

If IssueNodeCertificate succeeded, it's unlikely that NodeCertificateStatus would be hanging due to the manager being unresponsive. So I think the current behavior in this PR of not closing the connection when these time out is probably correct.

time.Sleep(expBackoff.Proceed(nil))
select {
case <-ctx.Done():
    conn.Close(false)
Collaborator: I'm thinking the parameter to Close should be true, since it's really GetRemoteSignedCertificate that's timing out, and not necessarily anything wrong with the manager. But one could make an argument for keeping it as false.

@cyli (author): That's fair.


@aaronlehmann added this to the 17.04.0 milestone on Mar 28, 2017
@cyli force-pushed the fix-node-certificate-reiussue-retry branch 5 times, most recently from fa5f8be to 3ac8569, on March 28, 2017 22:11
@cyli (author) commented Mar 28, 2017:

@aaronlehmann I think I've addressed all the comments, and updated GetRemoteSignedCertificate to close the existing connection and open a new one if NodeCertificateStatus fails. PTAL

@aaronlehmann (Collaborator): LGTM

@aaronlehmann (Collaborator): ping @diogomonica please review

@aaronlehmann (Collaborator): Not a big deal, but the new unit tests will run twice:

https://github.com/docker/swarmkit/blob/71c51110949592cbc50f25ede769bd0e8f535993/ca/certificates_test.go#L52-L61

Maybe they should exit early if testutils.External is false?

@cyli (author) commented Mar 29, 2017:

@aaronlehmann makes sense, will skip if it's true (because we don't actually need an external CA server, and it's just one more listening port eating up resources :))
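
For reference, a minimal sketch of the guard being discussed (the test name is illustrative; testutils.External is the real flag used by the ca tests):

package ca_test

import (
    "testing"

    "github.com/docker/swarmkit/ca/testutils"
)

func TestGetRemoteSignedCertificateRetry(t *testing.T) { // illustrative name
    if testutils.External {
        // This test fakes its own CA and never contacts an external
        // signing server, so skip it in the external-CA pass rather than
        // run it twice.
        t.Skip("test does not require an external CA")
    }
    // ... test body elided ...
}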

@aaronlehmann (Collaborator): Thanks. Can you do it in #2064 as well?

There are probably a few existing ca tests that are being run redundantly. I might look into that later.

@cyli (author) commented Mar 29, 2017:

Yep, will do.


case err != nil: // this was a deadline exceeded error - we need to figure out which context
    select { // the entire `GetRemoteSignedCertificate` call context was cancelled - return the error
    case <-ctx.Done():
@diogomonica (Contributor) commented Mar 29, 2017: Why do we need to check ctx.Done() in two places, here and at line 864?

@cyli (author): Good point, I can simplify this by adding err == nil && to the next case.
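
For illustration, a sketch of the simplified control flow (handlePollResult, issued, and closeConn are hypothetical names, not the actual swarmkit code). Guarding the success case with err == nil lets one cancellation check cover both paths:

package ca

import "context"

// handlePollResult decides what to do after one NodeCertificateStatus poll.
// done reports whether the caller should stop retrying.
func handlePollResult(ctx context.Context, err error, issued bool, cert []byte, closeConn func(success bool)) (done bool, result []byte, outErr error) {
    switch {
    case ctx.Err() != nil:
        // The entire GetRemoteSignedCertificate context was cancelled:
        // close the connection and give up.
        closeConn(false)
        return true, nil, ctx.Err()
    case err == nil && issued:
        // The RPC succeeded and the certificate was issued.
        return true, cert, nil
    default:
        // Per-attempt deadline exceeded or another RPC error: keep
        // retrying (backoff or reconnect happens in the caller).
        return false, nil, err
    }
}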

Commit (title truncated): …in 5 seconds it errors

with a context deadline exceeded and does not retry.  Update it so that if the node has not
been updated within 5 seconds, attempt to get the node status again after an exponential
backoff.

If NodeCertificateStatus errors with some other error (not context deadline exceeded),
GetRemoteSignedCertificate will try again with a different connection.

Signed-off-by: cyli <ying.li@docker.com>
@cyli force-pushed the fix-node-certificate-reiussue-retry branch from 6731b83 to b727e68 on March 29, 2017 23:18
@diogomonica merged commit 74cb08b into moby:master on Mar 30, 2017
@cyli deleted the fix-node-certificate-reiussue-retry branch on March 30, 2017 00:29
func TestGetRemoteSignedCertificateWithPending(t *testing.T) {
    t.Parallel()
    if testutils.External {
        // we don't actually need an external signing server, since we're faking a CA TestCAServerUpdateRootCA which
Collaborator: What's the connection to TestCAServerUpdateRootCA?

@cyli (author): That's weird, I think it was a failed search and replace :(

