@thraxil thraxil commented Sep 26, 2017

Adds proper handling of signals.

On SIGTERM or SIGINT, it cleanly shuts down the goroutine that's querying the alerts and then does a graceful shutdown of the HTTP server (allowing it to finish serving in-flight requests and close the TCP connections first).

On SIGHUP, it does the same clean shutdown process, then re-reads the config file (note: it doesn't reload environment variables), and restarts things in place.

This relies on functionality added in Go 1.7 (the context library) and 1.8 (support for graceful shutdown on the HTTP server), so it will now only compile with Go 1.8+. The centurylink/golang-builder Docker image that was being used is stuck on 1.5 and its corresponding GitHub repo hasn't been touched in two years, so I also switched it to use the plain golang:1.8 Docker image.

Since I know you're not really Go programmers, and this relies on some pretty Go-specific constructs, here's a bit of explanation of how the tricky bits work:

First, with:

sigs := make(chan os.Signal, 1)
signal.Notify(sigs, os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)

It's setting up a channel, sigs, that can be passed os.Signals. That is passed to signal.Notify, along with a list of the signals that we're interested in being notified about. When one of those signals is seen, it is passed onto the sigs channel.

A couple lines later, it does:

signal := <-sigs

<- is the "read from channel" operator. So that reads a signal value from the sigs channel and assigns it to signal. If a channel is empty, <- blocks until a value shows up. I.e., the main thread of execution in the program pretty much sits here waiting for the program to get one of the three signal types that we care about. Meanwhile the alertsCollection and HTTP server are running in background goroutines and doing their thing, so it's not like the program is hung. Once we get one of those signals, we proceed to shut things down. If it's a SIGHUP, we then reload the config and restart; if it's one of the other signals, we just exit.
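To make the blocking read and the reload-or-exit decision concrete, here is a minimal standalone sketch. It is not the code from this PR: action, waitForAction, notifyChannel and demo are names made up for the illustration, and demo pushes signal values onto the channel by hand so the example terminates instead of waiting for a real signal.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// action classifies a signal the way the description above does:
// SIGHUP means "reload", anything else means "exit".
// (Hypothetical helper for this sketch, not a function from the PR.)
func action(sig os.Signal) string {
	if sig == syscall.SIGHUP {
		return "reload"
	}
	return "exit"
}

// waitForAction blocks on the channel until a signal value shows up.
func waitForAction(sigs <-chan os.Signal) string {
	sig := <-sigs // blocks while the channel is empty
	return action(sig)
}

// notifyChannel shows the real wiring: the runtime pushes matching
// signals onto the returned channel. A long-running program would call
// waitForAction on it in a loop.
func notifyChannel() chan os.Signal {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)
	return sigs
}

// demo feeds three signal values into a buffered channel by hand and
// records what the main goroutine would do for each of them.
func demo() []string {
	sigs := make(chan os.Signal, 1)
	var got []string
	for _, sig := range []os.Signal{syscall.SIGHUP, os.Interrupt, syscall.SIGTERM} {
		sigs <- sig
		got = append(got, waitForAction(sigs))
	}
	return got
}

func main() {
	fmt.Println(demo())
}
```

In a real program the loop would be built around notifyChannel(): start the background work, call waitForAction, shut the background work down, and go around again only when the answer is "reload".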

The other tricky part is the context stuff. You can think of a context in Go as a cancellable thing that can be chained together into structures, which can then all be cancelled at once. They are commonly used in Go for timeouts, deadlines, and explicit cancellation. It's probably helpful to think of a context as both the context and an "entangled" cancel function. If the cancel function is called, the associated context (and any child contexts that have been derived from it) knows that it has been cancelled.
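As a small illustration of the "entangled" cancel function and of cancellation flowing down to child contexts, here is a sketch using only the standard context package. The names cancelled and demo are made up for the example and are not part of this PR.

```go
package main

import (
	"context"
	"fmt"
)

// cancelled reports whether ctx has been cancelled, without blocking.
func cancelled(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}

// demo derives a child context from a parent and reports the child's
// state before and after the parent's cancel function is called.
func demo() (before, after bool) {
	parent, cancelParent := context.WithCancel(context.Background())
	child, cancelChild := context.WithCancel(parent)
	defer cancelChild() // release the child's resources in any case

	before = cancelled(child)
	cancelParent() // cancels parent and everything derived from it
	after = cancelled(child)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println("child cancelled before:", before)
	fmt.Println("child cancelled after: ", after)
}
```

Note that the child is never cancelled directly here; calling the parent's cancel function is enough.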

The HTTP graceful shutdown API takes a context parameter to enable a hard timeout. Here:

ctx, cancel := context.WithTimeout(bgcontext, 1*time.Second)
... s.Shutdown(ctx); ...

A context ctx is created that will automatically be cancelled after one second. It is passed to the Shutdown() call. It will spend up to a second trying to finish serving in-flight requests and close connections, but no longer. The later call to cancel() explicitly makes sure the context has been cancelled, even if it never timed out (not strictly necessary, but good practice).
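Here is a self-contained sketch of the same pattern, assuming a throwaway server on a free loopback port rather than the real one in this PR. shutdownDemo is a made-up name, and context.Background() stands in for bgcontext.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// shutdownDemo starts an HTTP server in a background goroutine and then
// shuts it down gracefully with a one-second hard limit. It reports
// whether Serve stopped with http.ErrServerClosed, plus any error from
// Listen or Shutdown.
func shutdownDemo() (closedCleanly bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: pick any free port
	if err != nil {
		return false, err
	}
	s := &http.Server{Handler: http.NotFoundHandler()}

	served := make(chan error, 1)
	go func() { served <- s.Serve(ln) }()

	// ctx is cancelled automatically after one second; the deferred
	// cancel() releases its timer if Shutdown finishes sooner.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	if err := s.Shutdown(ctx); err != nil {
		return false, err
	}
	// After Shutdown, Serve returns http.ErrServerClosed.
	return <-served == http.ErrServerClosed, nil
}

func main() {
	ok, err := shutdownDemo()
	fmt.Println("closed cleanly:", ok, "error:", err)
}
```

If in-flight requests were still running when the second elapsed, Shutdown would return the context's error instead of nil.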

For the alertsCollection, which is also running in the background, polling Graphite every interval, we also pass it a context, and the associated cancel function is returned to our main thread as alertscancel. When we call alertscancel(), it cancels the context that the alertsCollection is holding onto.

alertsCollection's main loop is in Run(). Previously, it was just a plain loop that would poll Graphite, then sleep for an interval and repeat forever. It still does that, but now there's a select in there with two cases. A select on multiple empty channels in Go blocks until one of the channels can be read from. The context's Done() method returns a channel that is closed once the context has been cancelled, and a read from a closed channel returns immediately. So as soon as alertscancel() is called back in the main goroutine, that case unblocks and it can break out of the loop. The other channel that the select is trying to read from comes from time.After(), which just pushes a value onto the channel after a specified duration, so that's simply taking the place of the old sleep.

If the context is cancelled while it is off polling Graphite or sending out alert emails, it will complete that work before it gets back to the select and sees that the context is cancelled. It would be possible to pass the context down to those functions so they could abort themselves immediately (and I might send that commit later), but it's probably fine to let it finish out its complete cycle.
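The shape of that loop, reduced to a standalone sketch: run and demo are hypothetical names, and the poll callback stands in for the Graphite polling and alerting work. This is not the actual Run() from the PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// run calls poll, then waits for either cancellation or the interval,
// and repeats. It returns the number of completed polls.
func run(ctx context.Context, interval time.Duration, poll func()) int {
	polls := 0
	for {
		poll() // always runs to completion, even if ctx is cancelled meanwhile
		polls++
		select {
		case <-ctx.Done(): // closed channel: context was cancelled
			return polls
		case <-time.After(interval): // takes the place of the old sleep
		}
	}
}

// demo cancels the context from inside the third poll, to show that the
// poll in progress finishes and the loop stops at the next select.
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	calls := 0
	return run(ctx, time.Millisecond, func() {
		calls++
		if calls == 3 {
			cancel()
		}
	})
}

func main() {
	fmt.Println("polls completed:", demo())
}
```

Cancelling during the third poll does not interrupt it; the loop only notices when it gets back to the select, which matches the behaviour described above.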

`centurylink/ca-certs` is stuck on 1.5 and hasn't been updated in two
years :(

graceful shutdown of the http server requires Go 1.8 or higher.

This switches the base image to `golang:1.8`. The resulting image is a
bit larger, but should otherwise work the same.

nothing from the config file is used until we start up services
@coveralls

Coverage Status

Coverage decreased (-2.3%) to 28.81% when pulling ad4b75f on thraxil:handle-signals into 1044bab on ccnmtl:master.

@sdreher sdreher self-assigned this Sep 28, 2017
@sdreher sdreher merged commit a54887e into ccnmtl:master Oct 2, 2017