Shutdown servers when resetting cluster membership #295
e80421f to bc4314c
Hi, thanks for the PR. Have you seen #198? Could you please make the functionality dependent on this field?
Hmm, there might also be the question of how long to wait for connections to drain. Currently, we're waiting 5 seconds, but that could be configurable. We could have |
Sounds good to me. By default (in case |
0902544 to 87196d3
Added the Also, the original intention was to solve the |
7aa3265 to 62764f9
roosterfish left a comment
Thank you for this.
Good call on also allowing the timeout to be set for the core listeners.
Please have a look at my few comments.
    if err != nil {
        return err
    }
    err = d.endpoints.Shutdown()
Shouldn't we rather extend Down() instead of adding another Shutdown()? As we are already looping through the listeners in Down() you can also call endpoint.ShutdownServer() from there.

Indeed. That was my original implementation, as you can see from the fact that this separation was introduced in the second commit (will squash before merging).
However, the reason behind this is a bit complicated. First of all, Down() shutting down the listeners is a good idea: it lets us stop accepting new connections, and it would be best to call it early in such cases. But we can't necessarily shut down the servers just yet, as this may be called while handling a request (e.g. if bootstrap fails, we're "resetting" the cluster during that request). Because this happens during a request, we cannot gracefully complete the request and shut down the server on the same thread (at least with how microcluster's code flows), which is why reverter.Fail() was moved into a goroutine. If we forcefully shut down the servers while a connection is still in progress, the client will get an EOF with no explanation of what actually happened (it doesn't get the error we're supposed to send back, which is the original reason this PR was sent).
We could go back to how I did this previously and have Down() also shut down the servers as you suggest, but the same-thread response + shutdown issue I mentioned could still occur, like it does in clusterMemberPut (the listeners are stopped before writing the reply).

If I understand correctly, this is now fixed because you run StopListeners within the reExec goroutine?
    // Running the revert actions in a goroutine will address this issue: while the revert happens,
    // we'll be able to return and write the HTTP response and then close the connection, finally
    // allowing the Servers to gracefully shutdown, and the clients to be happy.
    go reverter.Fail()
Is this working just because the sending of the response is faster than the revert? I fear we require some form of synchronization so that it works consistently.
Can we do something like this to flush the response beforehand?

Indeed, we're doing this to allow the request to actually finish and the response to be written back before closing down the server. The reverter also includes shutting down the server, so we need to finish our request beforehand.
The ManualResponse bit from clusterMemberPut is interesting. Note that resetClusterMember is called before we actually get to write any response. If resetClusterMember caused the servers to shut down before the response was written, it would result in another EOF error. There is a way in which ManualResponse may help, though it would be a bit odd: stop the servers in the ManualResponse.
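The synchronization question in this thread can be explored with a minimal sketch that uses only Go's net/http and is not microcluster's code. In the standard library, http.Server.Shutdown stops accepting new connections and then waits for active connections to become idle, so a shutdown started in a goroutine from inside a handler should still let that handler's response reach the client. The failingJoin function, the error text, and the 5-second timeout are hypothetical names and values chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// failingJoin (hypothetical) simulates a join request that fails and triggers
// a server shutdown from inside the handler. It returns what the client saw
// and the error returned by Shutdown.
func failingJoin() (int, string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, "", err
	}

	srv := &http.Server{}
	shutdownDone := make(chan error, 1)
	srv.Handler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Shutdown waits for this very request, so it must not run on the
		// handler's goroutine. A fresh context is used because r.Context()
		// ends together with the request.
		go func() {
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()
			shutdownDone <- srv.Shutdown(ctx)
		}()
		http.Error(w, "join failed: simulated error", http.StatusInternalServerError)
	})
	go srv.Serve(ln)

	resp, err := http.Get("http://" + ln.Addr().String())
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()

	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", err
	}
	return resp.StatusCode, string(b), <-shutdownDone
}

func main() {
	status, body, err := failingJoin()
	fmt.Printf("status=%d body=%q shutdown error=%v\n", status, body, err)
}
```

In this sketch the client receives the error response rather than an EOF. Whether the same holds for the reverter depends on what else it tears down, which the sketch does not model.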
b1d7762 to 5cb95f0
41bbf5d to d26cd86
roosterfish left a comment
Can you please again squash the commits to clean up the history? Please also check my comment on the endpoint ctx.
d26cd86 to 69c7a9d
Squashed the commits. I won't change the We'd have that call duplicated, and it would have been worse if there were more error cases. Having a |
Thanks for squashing.
That's a good point so let's keep the defer where it is
This is right but doesn't account for the call to

    ff := func() string {
        return "hello from ff"
    }
    ff2 := func() string {
        defer fmt.Println("hello from ff2")
        return ff()
    }
    fmt.Println(ff2())

I would propose to just run
Sorry, maybe I wasn't clear enough about the squashing. I saw your comment here and thought you would squash only the changes that are reverted anyway in one of the follow-up commits. We group the changes by package (indicated as a prefix in the commit message). This means your changes to the daemon's config would go in a commit with a message like "internal/daemon: Adding DrainConnectionsTimeout field to config", and so on.
69c7a9d to b6c2a37
Yes, it does print
Done.
roosterfish left a comment
You are right about the defer; my example doesn't make sense, as it returns the value, which changes the effective order of the prints. Forget about it :)
Almost there, just two smaller comments.
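For readers following the defer discussion, here is a small runnable variant of the ff/ff2 snippet. It is only an illustration of Go's defer semantics, not code from the PR: the deferOrder helper and the event strings are made up for this sketch. It records events in a slice instead of printing, so the ordering is explicit: the return expression is evaluated first, then the deferred call runs, and only then does the caller receive the value.

```go
package main

import "fmt"

// deferOrder records the order in which things happen, mirroring the
// ff/ff2 snippet from the review.
func deferOrder() []string {
	var events []string

	ff := func() string {
		events = append(events, "ff evaluated")
		return "hello from ff"
	}

	ff2 := func() string {
		// Runs after ff() has produced the return value, but before ff2
		// hands that value back to its caller.
		defer func() { events = append(events, "deferred call in ff2") }()
		return ff()
	}

	result := ff2()
	events = append(events, "caller received: "+result)
	return events
}

func main() {
	for _, e := range deferOrder() {
		fmt.Println(e)
	}
}
```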
b6c2a37 to b526e5d
Seeing these errors in the logs now: Amended |
Hi, I have run some extended tests and it looks like we face two other issues here.
This is because as soon as the daemon receives the signal, it cancels its main So we have to detach
b526e5d to bf7c167
Right. I initially had
If we are going to use a separated context ( |
roosterfish left a comment
Thanks for the efforts, LGTM!
@claudiubelu wanted to merge now but saw the commits aren't signed. Could you please sign them?
Adds configuration option for draining server connections before shutting it down. Signed-off-by: Claudiu Belu <claudiu.belu@canonical.com>
Adds configuration option for draining core server connections before shutting it down. Signed-off-by: Claudiu Belu <claudiu.belu@canonical.com>
When closing a listener, the server won't accept any new connections. However, we should still allow for the current connections / requests to finish. Calling ``server.Shutdown`` allows us to wait for them to finish. We'll be doing this only if drainConnectionsTimeout has been set for the server. Note that server.Serve always returns an error, and it will return http.ErrServerClosed if server.Shutdown or server.Close was called, and net.ErrClosed if the listener was closed. Signed-off-by: Claudiu Belu <claudiu.belu@canonical.com>
Note that in the case we fail to bootstrap / join the cluster, we'll be resetting a few things, including the cluster membership. This includes the HTTPS and unix socket servers we have open by closing them. However, we cannot gracefully shutdown the servers, as there's at least one connection that is still open: the bootstrap / join request. Forcing the connection to close before we're able to write the request response will result in the client getting an EOF error, and no information regarding the failure. Shutting down the server in a goroutine will address this issue: we'll be able to return and write the HTTP response and then close the connection, finally allowing the Servers to gracefully shutdown, and the clients to be happy. Signed-off-by: Claudiu Belu <claudiu.belu@canonical.com>
Signed-off-by: Claudiu Belu <claudiu.belu@canonical.com>
bf7c167 to a4b91df
Done. Sorry, I forgot to sign them. :)
This addresses a regression introduced with #295. Previously, you were able to close endpoints (listeners) without closing the underlying server. As part of allowing graceful shutdown of servers (and listeners), we added code to implicitly shut down the listener's underlying server for proper cleanup. However, it looks like the code always assumed that the underlying server never gets closed, which allows instructing a remote Microcluster member to join the cluster without closing the connection that was used to instruct this member to join the cluster.
The PR introduces another flag that explicitly allows shutting down an endpoint's (listener's) underlying server. We can do this in most cases, but should not do it in the following two:
* Close the server when a remote client instructed to join an existing cluster
* Close the server when restarting one of the extension servers. This allows keeping active connections alive.
In the k8s-snap, we're seeing a significant amount of EOF errors client-side, typically when operations such as k8s bootstrap or k8s join-cluster are run and an error occurs server-side.
When closing a listener, the server won't accept any new connections. However, we should still allow for the current connections / requests to finish. Calling server.Shutdown allows us to wait for them to finish. We'll be doing this only if drainConnectionsTimeout has been set for the server.
Note that in the case we fail to bootstrap / join the cluster, we'll be resetting a few things, including the cluster membership. This includes the HTTPS and unix socket servers we have open by closing them.
However, we cannot gracefully shutdown the servers, as there's at least one connection that is still open: the bootstrap / join request. Forcing the connection to close before we're able to write the request response will result in the client getting an EOF error, and no information regarding the failure.
Running the revert actions in a goroutine will address this issue: while the revert happens, we'll be able to return and write the HTTP response and then close the connection, finally allowing the Servers to gracefully shutdown, and the clients to be happy. As a bonus, this will also significantly reduce the amount of time spent by the bootstrap / join requests in case of failures.