
Snapshot TTL expiry causes Primary to be marked unhealthy #59

Closed
tgross opened this issue Sep 23, 2016 · 3 comments

tgross commented Sep 23, 2016

@misterbisson has reported that the on_change handler on replicas appears to fire spuriously after a long period of operation, causing a failover even though the primary appears to be healthy.

This appears to be a bug in the way we're marking the time for the snapshot in Consul. We can reproduce a minimal test case as follows.

We'll stand up a Consul server container and a Consul agent container; the agent container runs under ContainerPilot and sends a trivial health check for a service named "mysql" to Consul. The onChange handler asks Consul for the JSON blob describing the current status of the service, so that it ends up in the logs. Run the two containers as follows:

docker run -d -p 8500:8500 --name consul_server \
       progrium/consul:latest \
       -server -bootstrap -ui-dir /ui

docker run -it -p 8501:8500 --name consul_agent \
       -v $(pwd)/containerpilot.json:/etc/containerpilot.json \
       -v $(pwd)/on_change.sh:/on_change.sh \
       --link consul_server:consul \
       autopilotpattern/mysql \
       /usr/local/bin/containerpilot \
       /usr/local/bin/consul agent -data-dir=/data -config-dir=/config \
                                   -rejoin -retry-join consul -retry-max 10 \
                                   -retry-interval 10s

The minimal ContainerPilot config we're bind-mounting into the agent container is:

{
  "consul": "localhost:8500",
  "logging": {
    "level": "DEBUG"
  },
  "services": [
    {
      "name": "mysql",
      "port": 3306,
      "health": "echo health",
      "poll": 2,
      "ttl": 5
    }
  ],
  "backends": [
    {
      "name": "mysql",
      "poll": 7,
      "onChange": "curl -s http://localhost:8500/v1/health/service/mysql"
    }
  ]
}

We'll then register a new health check with the agent for the backup, and mark it passing once:

# push new check to agent
docker exec -it consul_agent \
       curl -s -d '{"Name": "backup", "TTL": "10s"}' \
       http://localhost:8500/v1/agent/check/register

# push pass TTL to agent
docker exec -it consul_agent \
       curl -v -s http://localhost:8500/v1/agent/check/pass/backup

After 10 seconds, the TTL for "backup" will expire and the on_change handler will fire!

2016/09/23 18:59:31 health
    2016/09/23 18:59:31 [WARN] agent: Check 'backup' missed TTL, is now critical
    2016/09/23 18:59:31 [INFO] agent: Synced check 'backup'
2016/09/23 18:59:33 mysql.health.RunWithTimeout start
2016/09/23 18:59:33 mysql.health.Cmd.Start
2016/09/23 18:59:33 mysql.health.run waiting for PID 62:
2016/09/23 18:59:33 health
2016/09/23 18:59:33 mysql.health.run complete
2016/09/23 18:59:33 mysql.health.RunWithTimeout end
2016/09/23 18:59:33 mysql.health.RunWithTimeout start
2016/09/23 18:59:33 mysql.health.Cmd.Start
2016/09/23 18:59:33 mysql.health.run waiting for PID 63:
2016/09/23 18:59:33 [{"Node":{"Node":"1a9e1eb44133","Address":"172.17.0.3","TaggedAddresses":null,"CreateIndex":0,"ModifyIndex":0},"Service":{"ID":"mysql-1a9e1eb44133","Service":"mysql","Tags":null,"Address":"172.17.0.3","Port":3306,"EnableTagOverride":false,"CreateIndex":0,"ModifyIndex":0},"Checks":[{"Node":"1a9e1eb44133","CheckID":"mysql-1a9e1eb44133","Name":"mysql-1a9e1eb44133","Status":"passing","Notes":"TTL for mysql set by containerpilot","Output":"ok","ServiceID":"mysql-1a9e1eb44133","ServiceName":"mysql","CreateIndex":0,"ModifyIndex":0},{"Node":"1a9e1eb44133","CheckID":"serfHealth","Name":"Serf Health Status","Status":"passing","Notes":"","Output":"Agent alive and reachable","ServiceID":"","ServiceName":"","CreateIndex":0,"ModifyIndex":0},{"Node":"1a9e1eb44133","CheckID":"backup","Name":"backup","Status":"critical","Notes":"","Output":"TTL expired","ServiceID":"","ServiceName":"","CreateIndex":0,"ModifyIndex":0}]}]

When this happens, we can check the status of the mysql service with curl -s http://localhost:8500/v1/health/service/mysql | jq . and see the following:

[
  {
    "Node": {
      "Node": "1a9e1eb44133",
      "Address": "172.17.0.3"
    },
    "Service": {
      "ID": "mysql-1a9e1eb44133",
      "Service": "mysql",
      "Tags": null,
      "Address": "172.17.0.3",
      "Port": 3306
    },
    "Checks": [
      {
        "Node": "1a9e1eb44133",
        "CheckID": "mysql-1a9e1eb44133",
        "Name": "mysql-1a9e1eb44133",
        "Status": "passing",
        "Notes": "TTL for mysql set by containerpilot",
        "Output": "ok",
        "ServiceID": "mysql-1a9e1eb44133",
        "ServiceName": "mysql"
      },
      {
        "Node": "1a9e1eb44133",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": ""
      },
      {
        "Node": "1a9e1eb44133",
        "CheckID": "backup",
        "Name": "backup",
        "Status": "critical",
        "Notes": "",
        "Output": "TTL expired",
        "ServiceID": "",
        "ServiceName": ""
      }
    ]
  }
]
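
Note that in this output the critical "backup" check has an empty ServiceID, while the passing mysql check is bound to the service. A quick way to see that distinction (assuming jq is available):

curl -s http://localhost:8500/v1/health/service/mysql | \
  jq '.[0].Checks[] | {CheckID, ServiceID, Status}'

Because the failing check is registered at the node level, it drags the whole node's entry for the service to critical for anything watching this endpoint.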

Unfortunately this isn't a new bug, but splitting the snapshot out of the health check seems to have revealed it, particularly now that we're running this blueprint in situations closer to real-world use, like autopilotpattern/wordpress.

The root problem is that the check we register isn't bound to a particular service, so when it fails, the entire node is marked unhealthy. We could bind the check to a particular "ServiceID", but that would mean we need some kind of "dummy service" for backups. Given how infrequently this check runs, I'm instead going to swap it out for reading the last snapshot time directly from the kv store rather than adding that complexity.
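
For reference, Consul's check-register payload does accept a ServiceID field, so binding the check would look roughly like the following (the service ID shown is the one from this repro, and the kv key name is purely illustrative):

# bind the check to the mysql service instead of the node
docker exec -it consul_agent \
       curl -s -d '{"Name": "backup", "TTL": "10s", "ServiceID": "mysql-1a9e1eb44133"}' \
       http://localhost:8500/v1/agent/check/register

# the simpler alternative: read the last snapshot time straight from the kv store
curl -s http://localhost:8500/v1/kv/last-backup?raw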


In my reproduction above I've also run into what appears to be a ContainerPilot bug that I didn't think was causing missed health checks, but it turns out it is. This is TritonDataCenter/containerpilot#178 (comment), so I'm going to hop on that ASAP.

@tgross tgross added the bug label Sep 23, 2016
@tgross tgross self-assigned this Sep 23, 2016
@tgross tgross changed the title Snapshot TTL causes Primary to be marked unhealthy Snapshot TTL expiry causes Primary to be marked unhealthy Sep 23, 2016

tgross commented Sep 27, 2016

Here's the design change I'm going to make to cover this bug, in the manage.py snapshot_task (a rough sketch follows the list):

  • Attempt to obtain a local file lock on the snapshot process. (Exit on fail.)
  • Replica nodes ask Consul for the value of the LAST_BACKUP key. (This value will be a timestamp.) Continue only if the difference between that value and the current time is more than the snapshot interval.
  • Attempt to obtain a lock on the BACKUP_LOCK key in Consul for running the snapshot, with a TTL equal to the BACKUP_TTL. Exit on fail. If the node gets the lock:
    • snapshot the DB
    • push to Manta
    • write the current timestamp to the LAST_BACKUP key in Consul
    • release the BACKUP_LOCK key from Consul
  • Release the local file lock
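
A minimal sketch of that flow, assuming the python-consul client; write_snapshot() and upload_to_manta() are stand-ins for the real manage.py helpers, and the interval and TTL values are placeholders:

import fcntl
import time

import consul

BACKUP_TTL = 120          # seconds; TTL on the Consul lock session
BACKUP_INTERVAL = 86400   # desired seconds between snapshots


def write_snapshot():
    pass  # placeholder: take the actual DB snapshot here


def upload_to_manta():
    pass  # placeholder: push the snapshot to Manta here


def snapshot_task():
    c = consul.Consul()

    # local file lock so only one snapshot task runs on this node at a time
    lockfile = open('/tmp/snapshot.lock', 'w')
    try:
        fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        return  # another snapshot is already in flight locally

    try:
        # skip if LAST_BACKUP is newer than the snapshot interval
        _, last = c.kv.get('LAST_BACKUP')
        if last and time.time() - float(last['Value']) < BACKUP_INTERVAL:
            return

        # take the BACKUP_LOCK key via a session with a TTL, so the lock
        # self-expires if this node dies mid-snapshot
        session = c.session.create(ttl=BACKUP_TTL)
        if not c.kv.put('BACKUP_LOCK', 'locked', acquire=session):
            return  # some other node holds the lock

        try:
            write_snapshot()                        # snapshot the DB
            upload_to_manta()                       # push to Manta
            c.kv.put('LAST_BACKUP', str(time.time()))
        finally:
            c.kv.put('BACKUP_LOCK', '', release=session)  # release the Consul lock
    finally:
        fcntl.flock(lockfile, fcntl.LOCK_UN)        # release the local file lock

The session TTL means a crashed node can never wedge backups indefinitely: the lock is released when the session expires and another node can pick up the work.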

tgross commented Sep 28, 2016

The above was implemented in #61

tgross commented Oct 3, 2016
