@misterbisson has reported that the on_change handler of replicas appears to be firing spuriously after a long period of operation, causing a failover even when the primary appears to be healthy.
This appears to be a bug in the way we're marking the time for the snapshot in Consul. We can reproduce a minimal test case as follows.
We'll stand up a Consul server and a Consul agent container; the agent container runs under ContainerPilot and sends a trivial health check for a service named "mysql" to Consul. The onChange handler asks Consul for the JSON blob associated with the current status of the service, so that we can capture it in the logs. We'll then register a separate TTL check named "backup" with the agent and mark it passing once. Run the targets as follows:
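The minimal ContainerPilot config bound into the agent looks something like the following (a sketch reconstructed from the description above; the exact poll and TTL values are assumptions, not the blueprint's actual config):

```json
{
  "consul": "consul:8500",
  "services": [
    {
      "name": "mysql",
      "port": 3306,
      "health": "/bin/true",
      "poll": 2,
      "ttl": 10
    }
  ],
  "backends": [
    {
      "name": "mysql",
      "poll": 2,
      "onChange": "curl -s http://localhost:8500/v1/health/service/mysql"
    }
  ]
}
```

Registering the backup check with the agent and marking it passing once would be roughly the following against Consul's HTTP API (assumed commands, requiring a live agent on localhost):

```
# register a TTL check named "backup" (note: not bound to any service)
curl -s -X PUT http://localhost:8500/v1/agent/check/register \
  -d '{"Name": "backup", "TTL": "10s"}'

# mark it passing once; the TTL starts counting down from here
curl -s -X PUT http://localhost:8500/v1/agent/check/pass/backup
```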
After 10 seconds, the TTL for "backup" will expire and the on_change handler will fire!
```
2016/09/23 18:59:31 health
2016/09/23 18:59:31 [WARN] agent: Check 'backup' missed TTL, is now critical
2016/09/23 18:59:31 [INFO] agent: Synced check 'backup'
2016/09/23 18:59:33 mysql.health.RunWithTimeout start
2016/09/23 18:59:33 mysql.health.Cmd.Start
2016/09/23 18:59:33 mysql.health.run waiting for PID 62:
2016/09/23 18:59:33 health
2016/09/23 18:59:33 mysql.health.run complete
2016/09/23 18:59:33 mysql.health.RunWithTimeout end
2016/09/23 18:59:33 mysql.health.RunWithTimeout start
2016/09/23 18:59:33 mysql.health.Cmd.Start
2016/09/23 18:59:33 mysql.health.run waiting for PID 63:
2016/09/23 18:59:33 [{"Node":{"Node":"1a9e1eb44133","Address":"172.17.0.3","TaggedAddresses":null,"CreateIndex":0,"ModifyIndex":0},"Service":{"ID":"mysql-1a9e1eb44133","Service":"mysql","Tags":null,"Address":"172.17.0.3","Port":3306,"EnableTagOverride":false,"CreateIndex":0,"ModifyIndex":0},"Checks":[{"Node":"1a9e1eb44133","CheckID":"mysql-1a9e1eb44133","Name":"mysql-1a9e1eb44133","Status":"passing","Notes":"TTL for mysql set by containerpilot","Output":"ok","ServiceID":"mysql-1a9e1eb44133","ServiceName":"mysql","CreateIndex":0,"ModifyIndex":0},{"Node":"1a9e1eb44133","CheckID":"serfHealth","Name":"Serf Health Status","Status":"passing","Notes":"","Output":"Agent alive and reachable","ServiceID":"","ServiceName":"","CreateIndex":0,"ModifyIndex":0},{"Node":"1a9e1eb44133","CheckID":"backup","Name":"backup","Status":"critical","Notes":"","Output":"TTL expired","ServiceID":"","ServiceName":"","CreateIndex":0,"ModifyIndex":0}]}]
```
When this happens we can check the status of the mysql service with `curl -s http://localhost:8500/v1/health/service/mysql | jq .` and see the following:
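Pretty-printed and abridged to the Checks array (the data here is reproduced from the log line above, nothing added), the blob shows the "backup" check going critical with an empty ServiceID:

```json
{
  "Checks": [
    {
      "CheckID": "mysql-1a9e1eb44133",
      "Status": "passing",
      "ServiceID": "mysql-1a9e1eb44133",
      "ServiceName": "mysql"
    },
    {
      "CheckID": "serfHealth",
      "Status": "passing",
      "ServiceID": "",
      "ServiceName": ""
    },
    {
      "CheckID": "backup",
      "Status": "critical",
      "Output": "TTL expired",
      "ServiceID": "",
      "ServiceName": ""
    }
  ]
}
```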
Unfortunately this isn't a new bug, but splitting the snapshot from the health check seems to have revealed it, particularly as we've started running this blueprint in situations closer to real-world use, like autopilotpattern/wordpress.
The root problem is that when we register a check it isn't bound to a particular service, so when the check fails the entire node is marked as unhealthy. We could bind the check to a particular "ServiceID", but that would require some kind of "dummy service" for backups. Given the low frequency of this check, I'm just going to swap the check out for reading the last snapshot time directly from the kv store rather than adding that complexity.
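For reference, binding the check to the service instance would look something like this against Consul's agent API (an assumed sketch of the rejected alternative, not the blueprint's actual code):

```
# Registering the TTL check with a ServiceID means an expired TTL marks
# only that service instance critical, rather than the whole node.
curl -s -X PUT http://localhost:8500/v1/agent/check/register -d '{
  "Name": "backup",
  "ServiceID": "mysql-1a9e1eb44133",
  "TTL": "10s"
}'
```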
In my reproduction above I've also run into what appears to be a ContainerPilot bug that I didn't think was causing missed health checks, but it turns out it is. This is TritonDataCenter/containerpilot#178 (comment), so I'm going to hop on that ASAP.
Here's the design change I'm going to make to cover this bug, in the `manage.py` `snapshot_task`:

1. Attempt to obtain a local file lock on the snapshot process. (Exit on fail.)
2. Replica nodes ask Consul for the value of the `LAST_BACKUP` key. (This value will be a timestamp.) Continue only if the difference between that value and the current time is more than the snapshot interval.
3. Attempt to obtain a lock on the `BACKUP_LOCK` key in Consul for running the snapshot, with a TTL equal to the `BACKUP_TTL`. Exit on fail. If the node gets the lock:
   - snapshot the DB
   - push to Manta
   - write the current timestamp to the `LAST_BACKUP` key in Consul
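A minimal sketch of that task in Python, using Consul's HTTP KV and session APIs (the lock-file path, interval, TTL value, and the snapshot/upload helpers are assumptions, not the blueprint's actual code):

```python
import fcntl
import json
import time
import urllib.request

CONSUL = "http://localhost:8500"   # assumed Consul address
SNAPSHOT_INTERVAL = 24 * 60 * 60   # assumed interval, in seconds
BACKUP_TTL = "120s"                # assumed lock TTL


def backup_due(last_backup, now, interval=SNAPSHOT_INTERVAL):
    """Continue only if the last snapshot is older than the interval."""
    return (now - last_backup) > interval


def snapshot_db():
    """Hypothetical helper: dump the database to a local file."""


def push_to_manta():
    """Hypothetical helper: upload the dump to Manta."""


def snapshot_task():
    # 1. local file lock; exit if a snapshot is already in flight here
    lockfile = open("/tmp/snapshot.lock", "w")
    try:
        fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return

    # 2. read the LAST_BACKUP timestamp from the Consul KV store
    with urllib.request.urlopen(f"{CONSUL}/v1/kv/LAST_BACKUP?raw") as r:
        last_backup = float(r.read())
    if not backup_due(last_backup, time.time()):
        return

    # 3. take a session-backed lock on BACKUP_LOCK with a TTL
    req = urllib.request.Request(
        f"{CONSUL}/v1/session/create",
        data=json.dumps({"TTL": BACKUP_TTL}).encode(), method="PUT")
    with urllib.request.urlopen(req) as r:
        session = json.load(r)["ID"]
    req = urllib.request.Request(
        f"{CONSUL}/v1/kv/BACKUP_LOCK?acquire={session}",
        data=b"", method="PUT")
    with urllib.request.urlopen(req) as r:
        if r.read().strip() != b"true":
            return  # another node holds the lock

    snapshot_db()
    push_to_manta()

    # 4. record the new snapshot time in the KV store
    req = urllib.request.Request(
        f"{CONSUL}/v1/kv/LAST_BACKUP",
        data=str(time.time()).encode(), method="PUT")
    urllib.request.urlopen(req)
```

Because the interval comparison is a pure function, the "has it been long enough" decision can be tested without a live Consul server.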