manager: Always run the watch server #2323
Conversation
The watch server was wrongly changed to only run on the leader node. It needs to run on all managers because this is one of the RPC services that is not proxied to the leader (since all nodes receive events through Raft).

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #2323      +/-   ##
==========================================
- Coverage   60.27%   60.23%   -0.04%
==========================================
  Files         128      128
  Lines       25972    25972
==========================================
- Hits        15654    15645       -9
- Misses       8918     8946      +28
+ Partials     1400     1381      -19
==========================================
```
```diff
@@ -491,6 +491,10 @@ func (m *Manager) Run(parent context.Context) error {
 	healthServer.SetServingStatus("Raft", api.HealthCheckResponse_NOT_SERVING)
 	localHealthServer.SetServingStatus("ControlAPI", api.HealthCheckResponse_NOT_SERVING)

+	if err := m.watchServer.Start(ctx); err != nil {
```
Non-blocking: probably doesn't matter much, but would it make sense to start this after the raft node has started up and joined (when the other watches are set)? Otherwise if something starts watching right away, would they get the events from the raft node starting up and loading all of its state from disk?
Hmm, where in the code would you suggest? I don't think the end of this startup process is signaled to higher-level code, but I may be forgetting something.
You're right, there's no guarantee or anything. But we start the other watches at https://github.com/docker/swarmkit/pull/2323/files#diff-8077df928eb040c7c69eea83f15e3c9dL542, after we've finished loading raft state from disk and we've observed a leader and cluster state - possibly starting the watch server can also happen at that time?
I like the suggestion, but I'm worried it would trigger the error here, which is not recoverable:
I'd prefer to just fix the regression in this PR, and a future change that's coordinated with some added retry/backoff in moby/moby could delay starting the watch server.
Ah ok, sounds good.
LGTM!
- moby/swarmkit#2323 (fix for watch server being run only on leader) Signed-off-by: Ying <ying.li@docker.com>
- moby/swarmkit#2309 (updating the service spec version when rolling back)
- moby/swarmkit#2310 (fix for slow swarm shutdown)
- moby/swarmkit#2323 (run watchapi server on all managers)

Signed-off-by: Ying <ying.li@docker.com>
The watch server was wrongly changed to only run on the leader node. It needs to run on all managers because this is one of the RPC services that is not proxied to the leader (since all nodes receive events through Raft).
This is a regression introduced by #2310 (sorry). It never made it into the moby tree, and it is covered by tests in that tree.
cc @cyli @aluzzardi