Skip to content

Commit

Permalink
clean-up fabio service entries in Consul on dirty exit (#664)
Browse files Browse the repository at this point in the history
* consul: fix nits

Signed-off-by: Pires <pjpires@gmail.com>

* docs: add registry.consul.register.deregistercriticalserviceafter option

Signed-off-by: Pires <pjpires@gmail.com>

* register: clean-up fabio service entries in Consul on dirty exit

This is achieved by:
- Adding a TTL check to any services fabio may register into Consul, including
  its own;
- Repurpose the goroutine (that exists per service that fabio registers) that
  periodically guarantees the service is registered in Consul to also reset the
  respective TTL check clock;
- Set a short deadline in Consul to remove the service in case the TTL check is
  `critical` after said deadline is expired (time is non-deterministic as fabio
  doesn't control when the Consul reaper runs);

Signed-off-by: Pires <pjpires@gmail.com>

* changelog: mention #663 and #664

Signed-off-by: Pires <pjpires@gmail.com>
  • Loading branch information
pires authored and aaronhurt committed Jul 8, 2019
1 parent 9c92965 commit ea3a365
Show file tree
Hide file tree
Showing 5 changed files with 80 additions and 13 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,15 @@

Thanks to [@murphymj25](https://github.com/murphymj25) for the patch.

* [PR #664](https://github.com/fabiolb/fabio/issues/664): Clean-up fabio service entries in Consul on dirty exit

In the case fabio dies abruptly, the steps to deregister any fabio-related services in Consul will not take place.
In certain cases this could result in duplicate service and check entries after fabio restarts, especially if fabio runs in a Docker container.

This PR addresses this issue by registering an additional TTL check that acts as a deadman switch and removes abandoned service registrations eventually.

Thanks to [@pires](https://github.com/pires) for the [report](https://github.com/fabiolb/fabio/issues/663) and the fix.

* [PR #669](https://github.com/fabiolb/fabio/pull/669): Add option for downgrading tracing IDs to 64 bit

When tracing is enabled, fabio injected 128 bit root span IDs if necessary.
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
---
title: "registry.consul.register.deregisterCriticalServiceAfter"
---

`registry.consul.register.deregisterCriticalServiceAfter` configures the time the
service health check is allowed to be in state `critical` until Consul automatically
deregisters it.
If fabio is still running, the service will be re-registered almost immediately after
being deleted by Consul.

At the time of this writing, Consul enforces a minimum value of one minute and runs
its reaper process every thirty seconds.

The default is

registry.consul.register.deregisterCriticalServiceAfter = 90m
3 changes: 1 addition & 2 deletions registry/consul/backend.go
Original file line number Diff line number Diff line change
Expand Up @@ -99,11 +99,10 @@ func (b *be) Deregister(service string) error {

func (b *be) DeregisterAll() error {
log.Printf("[DEBUG]: consul: Deregistering all registered aliases.")
for name, dereg := range b.dereg {
for _, dereg := range b.dereg {
if dereg == nil {
continue
}
log.Printf("[INFO] consul: Deregistering %q", name)
dereg <- true // trigger deregistration
<-dereg // wait for completion
}
Expand Down
63 changes: 53 additions & 10 deletions registry/consul/register.go
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,12 @@ import (
"github.com/hashicorp/consul/api"
)

const (
TTLInterval = time.Second * 15
TTLRefreshInterval = time.Second * 10
TTLDeregisterCriticalServiceAfter = time.Minute
)

// register keeps a service registered in consul.
//
// When a value is sent in the dereg channel the service is deregistered from
Expand Down Expand Up @@ -46,31 +52,43 @@ func register(c *api.Client, service *api.AgentServiceRegistration) chan bool {
log.Printf("[INFO] consul: Registered fabio with id %q", service.ID)
log.Printf("[INFO] consul: Registered fabio with address %q", service.Address)
log.Printf("[INFO] consul: Registered fabio with tags %q", strings.Join(service.Tags, ","))
log.Printf("[INFO] consul: Registered fabio with health check to %q", service.Check.HTTP)
for _, check := range service.Checks {
log.Printf("[INFO] consul: Registered fabio with check %+v", check)
}

return service.ID
}

deregister := func(serviceID string) {
log.Printf("[INFO] consul: Deregistering %s", service.Name)
log.Printf("[INFO] consul: Deregistering %q", service.Name)
c.Agent().ServiceDeregister(serviceID)
}

passTTL := func(serviceTTLID string) {
c.Agent().UpdateTTL(serviceTTLID, "", api.HealthPassing)
}

dereg := make(chan bool)
go func() {
var serviceID string
var serviceTTLCheckId string

for {
if !registered(serviceID) {
serviceID = register()
serviceTTLCheckId = computeServiceTTLCheckId(serviceID)
// Pass the TTL check right now so traffic can be served immediately.
passTTL(serviceTTLCheckId)
}

select {
case <-dereg:
deregister(serviceID)
dereg <- true
return
case <-time.After(time.Second):
// continue
case <-time.After(TTLRefreshInterval):
// Reset the TTL check clock.
passTTL(serviceTTLCheckId)
}
}
}()
Expand Down Expand Up @@ -115,14 +133,39 @@ func serviceRegistration(cfg *config.Consul, serviceName string) (*api.AgentServ
Address: ip.String(),
Port: port,
Tags: cfg.ServiceTags,
Check: &api.AgentServiceCheck{
HTTP: checkURL,
Interval: cfg.CheckInterval.String(),
Timeout: cfg.CheckTimeout.String(),
TLSSkipVerify: cfg.CheckTLSSkipVerify,
DeregisterCriticalServiceAfter: cfg.CheckDeregisterCriticalServiceAfter,
// Set the checks for the service.
//
// Both checks must pass for Consul to consider the service healthy and therefore serve the fabio instance to clients.
Checks: []*api.AgentServiceCheck{
// If fabio doesn't exit cleanly, it doesn't auto-deregister itself from Consul.
// In order to address this, we introduce a TTL check to prove the fabio instance is alive and able to route this service.
// The TTL check must be refreshed before its timeout is crossed.
// If the timeout is crossed, the check fails.
// If the check fails, Consul considers this service to have become unhealthy.
// If the check is failing (critical) after DeregisterCriticalServiceAfter is elapsed, the Consul reaper will remove it from Consul.
// For more info, read https://www.consul.io/api/agent/check.html#deregistercriticalserviceafter.
{
CheckID: computeServiceTTLCheckId(serviceID),
TTL: TTLInterval.String(),
DeregisterCriticalServiceAfter: TTLDeregisterCriticalServiceAfter.String(),
},
// HTTP check is meant to confirm fabio health endpoint is reachable from the Consul agent.
// If the check fails, Consul considers this service to have become unhealthy.
// If the check fails and registry.consul.register.deregisterCriticalServiceAfter is set, the service will be deregistered from Consul.
// For more info, read https://www.consul.io/api/agent/check.html#deregistercriticalserviceafter.
{
HTTP: checkURL,
Interval: cfg.CheckInterval.String(),
Timeout: cfg.CheckTimeout.String(),
TLSSkipVerify: cfg.CheckTLSSkipVerify,
DeregisterCriticalServiceAfter: cfg.CheckDeregisterCriticalServiceAfter,
},
},
}

return service, nil
}

func computeServiceTTLCheckId(serviceID string) string {
return strings.Join([]string{serviceID, "ttl"}, "-")
}
2 changes: 1 addition & 1 deletion registry/consul/service.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ func NewServiceMonitor(client *api.Client, config *config.Consul, dc string) *Se
}

// Watch monitors the consul health checks and sends a new
// configuration to the updates channnel on every change.
// configuration to the updates channel on every change.
func (w *ServiceMonitor) Watch(updates chan string) {
var lastIndex uint64
for {
Expand Down

0 comments on commit ea3a365

Please sign in to comment.