
Snapshot TTL expiry causes Primary to be marked unhealthy #59

Closed
tgross opened this issue Sep 23, 2016 · 3 comments

tgross commented Sep 23, 2016

@misterbisson has reported that the on_change handler on replicas appears to fire spuriously after a long period of operation, causing a failover even though the primary appears to be healthy.

This appears to be a bug in the way we're marking the time for the snapshot in Consul. We can reproduce a minimal test case as follows.

We'll stand up a Consul server container and a Consul agent container; the agent container runs under ContainerPilot and sends a trivial health check for a service named "mysql" to Consul. The onChange handler asks Consul for the JSON blob describing the current status of the service, so that it ends up in the logs. Run the two containers as follows:

docker run -d -p 8500:8500 --name consul_server \
       progrium/consul:latest \
       -server -bootstrap -ui-dir /ui

docker run -it -p 8501:8500 --name consul_agent \
       -v $(pwd)/containerpilot.json:/etc/containerpilot.json \
       -v $(pwd)/on_change.sh:/on_change.sh \
       --link consul_server:consul \
       autopilotpattern/mysql \
       /usr/local/bin/containerpilot \
       /usr/local/bin/consul agent -data-dir=/data -config-dir=/config \
                                   -rejoin -retry-join consul -retry-max 10 \
                                   -retry-interval 10s

The minimal ContainerPilot config we're bind-mounting into the agent container is:

{
  "consul": "localhost:8500",
  "logging": {
    "level": "DEBUG"
  },
  "services": [
    {
      "name": "mysql",
      "port": 3306,
      "health": "echo health",
      "poll": 2,
      "ttl": 5
    }
  ],
  "backends": [
    {
      "name": "mysql",
      "poll": 7,
      "onChange": "curl -s http://localhost:8500/v1/health/service/mysql"
    }
  ]
}

We'll then register a new health check with the agent for the backup, and mark it passing once:

# push new check to agent
docker exec -it consul_agent \
       curl -s -d '{"Name": "backup", "TTL": "10s"}' \
       http://localhost:8500/v1/agent/check/register

# push pass TTL to agent
docker exec -it consul_agent \
       curl -v -s http://localhost:8500/v1/agent/check/pass/backup

After 10 seconds, the TTL for "backup" will expire and the on_change handler will fire!

2016/09/23 18:59:31 health
    2016/09/23 18:59:31 [WARN] agent: Check 'backup' missed TTL, is now critical
    2016/09/23 18:59:31 [INFO] agent: Synced check 'backup'
2016/09/23 18:59:33 mysql.health.RunWithTimeout start
2016/09/23 18:59:33 mysql.health.Cmd.Start
2016/09/23 18:59:33 mysql.health.run waiting for PID 62:
2016/09/23 18:59:33 health
2016/09/23 18:59:33 mysql.health.run complete
2016/09/23 18:59:33 mysql.health.RunWithTimeout end
2016/09/23 18:59:33 mysql.health.RunWithTimeout start
2016/09/23 18:59:33 mysql.health.Cmd.Start
2016/09/23 18:59:33 mysql.health.run waiting for PID 63:
2016/09/23 18:59:33 [{"Node":{"Node":"1a9e1eb44133","Address":"172.17.0.3","TaggedAddresses":null,"CreateIndex":0,"ModifyIndex":0},"Service":{"ID":"mysql-1a9e1eb44133","Service":"mysql","Tags":null,"Address":"172.17.0.3","Port":3306,"EnableTagOverride":false,"CreateIndex":0,"ModifyIndex":0},"Checks":[{"Node":"1a9e1eb44133","CheckID":"mysql-1a9e1eb44133","Name":"mysql-1a9e1eb44133","Status":"passing","Notes":"TTL for mysql set by containerpilot","Output":"ok","ServiceID":"mysql-1a9e1eb44133","ServiceName":"mysql","CreateIndex":0,"ModifyIndex":0},{"Node":"1a9e1eb44133","CheckID":"serfHealth","Name":"Serf Health Status","Status":"passing","Notes":"","Output":"Agent alive and reachable","ServiceID":"","ServiceName":"","CreateIndex":0,"ModifyIndex":0},{"Node":"1a9e1eb44133","CheckID":"backup","Name":"backup","Status":"critical","Notes":"","Output":"TTL expired","ServiceID":"","ServiceName":"","CreateIndex":0,"ModifyIndex":0}]}]

When this happens, we can check the status of the mysql service with curl -s http://localhost:8500/v1/health/service/mysql | jq . and see the following:

[
  {
    "Node": {
      "Node": "1a9e1eb44133",
      "Address": "172.17.0.3"
    },
    "Service": {
      "ID": "mysql-1a9e1eb44133",
      "Service": "mysql",
      "Tags": null,
      "Address": "172.17.0.3",
      "Port": 3306
    },
    "Checks": [
      {
        "Node": "1a9e1eb44133",
        "CheckID": "mysql-1a9e1eb44133",
        "Name": "mysql-1a9e1eb44133",
        "Status": "passing",
        "Notes": "TTL for mysql set by containerpilot",
        "Output": "ok",
        "ServiceID": "mysql-1a9e1eb44133",
        "ServiceName": "mysql"
      },
      {
        "Node": "1a9e1eb44133",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": ""
      },
      {
        "Node": "1a9e1eb44133",
        "CheckID": "backup",
        "Name": "backup",
        "Status": "critical",
        "Notes": "",
        "Output": "TTL expired",
        "ServiceID": "",
        "ServiceName": ""
      }
    ]
  }
]
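
Note that in this output the critical "backup" check has an empty ServiceID, while the passing mysql check is bound to the service. A quick way to see that distinction (assuming jq is available):

curl -s http://localhost:8500/v1/health/service/mysql | \
  jq '.[0].Checks[] | {CheckID, ServiceID, Status}'

Because the failing check is registered at the node level, it drags the whole node's entry for the service to critical for anything watching this endpoint.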

Unfortunately this isn't a new bug, but splitting the snapshot out of the health check seems to have revealed it, particularly now that we're running this blueprint in situations closer to real-world use, like autopilotpattern/wordpress.

The root problem is that the check we register isn't bound to a particular service, so when it fails, the entire node is marked unhealthy. We could bind the check to a particular "ServiceID", but that would mean we need some kind of "dummy service" for backups. Given how infrequently this check runs, I'm instead going to swap it out for reading the last snapshot time directly from the kv store rather than adding that complexity.
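
For reference, Consul's check-register payload does accept a ServiceID field, so binding the check would look roughly like the following (the service ID shown is the one from this repro, and the kv key name is purely illustrative):

# bind the check to the mysql service instead of the node
docker exec -it consul_agent \
       curl -s -d '{"Name": "backup", "TTL": "10s", "ServiceID": "mysql-1a9e1eb44133"}' \
       http://localhost:8500/v1/agent/check/register

# the simpler alternative: read the last snapshot time straight from the kv store
curl -s http://localhost:8500/v1/kv/last-backup?raw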


In my reproduction above I've also run into what appears to be a ContainerPilot bug that I didn't think was causing missed health checks, but it turns out it is. This is TritonDataCenter/containerpilot#178 (comment), so I'm going to hop on that ASAP.

@tgross tgross added the bug label Sep 23, 2016
@tgross tgross self-assigned this Sep 23, 2016
@tgross tgross changed the title Snapshot TTL causes Primary to be marked unhealthy Snapshot TTL expiry causes Primary to be marked unhealthy Sep 23, 2016

tgross commented Sep 27, 2016

Here's the design change I'm going to make to cover this bug, in the manage.py snapshot_task (a rough sketch follows the list):

  • Attempt to obtain a local file lock on the snapshot process. (Exit on fail.)
  • Replica nodes ask Consul for the value of the LAST_BACKUP key. (This value will be a timestamp.) Continue only if the difference between that value and the current time is more than the snapshot interval.
  • Attempt to obtain a lock on the BACKUP_LOCK key in Consul for running the snapshot, with a TTL equal to the BACKUP_TTL. Exit on fail. If the node gets the lock:
    • snapshot the DB
    • push to Manta
    • write the current timestamp to the LAST_BACKUP key in Consul
    • release the BACKUP_LOCK key from Consul
  • Release the local file lock
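
A minimal sketch of that flow, assuming the python-consul client; write_snapshot() and upload_to_manta() are stand-ins for the real manage.py helpers, and the interval and TTL values are placeholders:

import fcntl
import time

import consul

BACKUP_TTL = 120          # seconds; TTL on the Consul lock session
BACKUP_INTERVAL = 86400   # desired seconds between snapshots


def write_snapshot():
    pass  # placeholder: take the actual DB snapshot here


def upload_to_manta():
    pass  # placeholder: push the snapshot to Manta here


def snapshot_task():
    c = consul.Consul()

    # local file lock so only one snapshot task runs on this node at a time
    lockfile = open('/tmp/snapshot.lock', 'w')
    try:
        fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        return  # another snapshot is already in flight locally

    try:
        # skip if LAST_BACKUP is newer than the snapshot interval
        _, last = c.kv.get('LAST_BACKUP')
        if last and time.time() - float(last['Value']) < BACKUP_INTERVAL:
            return

        # take the BACKUP_LOCK key via a session with a TTL, so the lock
        # self-expires if this node dies mid-snapshot
        session = c.session.create(ttl=BACKUP_TTL)
        if not c.kv.put('BACKUP_LOCK', 'locked', acquire=session):
            return  # some other node holds the lock

        try:
            write_snapshot()                        # snapshot the DB
            upload_to_manta()                       # push to Manta
            c.kv.put('LAST_BACKUP', str(time.time()))
        finally:
            c.kv.put('BACKUP_LOCK', '', release=session)  # release the Consul lock
    finally:
        fcntl.flock(lockfile, fcntl.LOCK_UN)        # release the local file lock

The session TTL means a crashed node can never wedge backups indefinitely: the lock is released when the session expires and another node can pick up the work.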

tgross commented Sep 28, 2016

The above was implemented in #61

tgross commented Oct 3, 2016
