
Bootstrap etcd data from a special dot #359

Open
prisamuel opened this Issue Mar 27, 2018 · 11 comments


prisamuel commented Mar 27, 2018

  • Back up etcd data to a "special" dot in the dotmesh cluster. The bootstrapping system can notice the etcd dot alongside an empty etcd, and restore the data into etcd from the dot.

Recovery options:

  • automatic master election based on a timestamp written alongside the etcd data on each node.
  • manual 'dm recover', where the user chooses which node to recover etcd data from. This is the preferred option, as it avoids implementing Raft ourselves.

@prisamuel prisamuel added the task label Mar 27, 2018


alaric-dotmesh commented Mar 27, 2018

With further thought, I think the local storage should just be a ZFS filesystem in the pool and not a dot. There's no point in replicating it with dotmesh when all nodes see the same etcd and snapshot it, we don't need an infinite history of commits, and the concurrency limit of writing to a dot might be a problem.

We should store a "backup snapshot counter" in etcd. Every node will periodically dump all of etcd into a local ZFS filesystem: either whenever a "dirty" flag has been set (the flag is set after operations that update state we really care about), or N minutes after the last dump. Each dump atomically gets and increments the snapshot counter in etcd, and writes the snapshot counter and a wall-clock time into ZFS. The writes into ZFS should be done atomically, by writing into a new file called snapshot.tmp and then renaming it to snapshot when finished. The file should have a simple header containing the snapshot counter and wall time, a delimiter, then the raw etcd dump in JSON format.

On startup, the node should consider several cases.

Inputs:

  1. There is a snapshot file in local ZFS (because this zpool isn't new).
  2. There is a snapshot counter in etcd (because the etcd state already exists).

Cases:

  1. No snapshot counter in ZFS, nor etcd - new node on a new etcd - proceed as usual
  2. No snapshot counter in ZFS, one in etcd - new node added to existing cluster, proceed as usual
  3. Snapshot counter in ZFS, none in etcd - go into RECOVERY MODE
  4. Snapshot counter in ZFS and it's lower than the one in etcd - existing node returning to an existing cluster, proceed as usual
  5. Snapshot counter in ZFS and it's higher than the one in etcd - REFUSE TO START, as etcd has been rolled back. Require manual confirmation to start (which will cause the loss of our snapshot state in ZFS, as the backup system will later replace it with an older one), or let the user nuke etcd and re-start recovery mode with this node present.
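The five cases reduce to a small pure function over the two inputs. A sketch (names are illustrative, not dotmesh's actual identifiers; nil represents an absent counter, and equal counters are treated as the node rejoining the same cluster):

```go
package main

// StartupMode enumerates the outcomes of the startup checks.
type StartupMode int

const (
	ModeNormal StartupMode = iota
	ModeRecovery
	ModeRefuseToStart
)

// decideStartupMode maps the two inputs (snapshot counter in local ZFS,
// snapshot counter in etcd; nil means absent) onto the five cases.
func decideStartupMode(zfsCounter, etcdCounter *uint64) StartupMode {
	switch {
	case zfsCounter == nil:
		// Cases 1 and 2: new node (fresh zpool), with or without an
		// existing etcd - proceed as usual.
		return ModeNormal
	case etcdCounter == nil:
		// Case 3: local state exists but etcd is empty - recovery.
		return ModeRecovery
	case *zfsCounter <= *etcdCounter:
		// Case 4: existing node returning to an existing cluster.
		return ModeNormal
	default:
		// Case 5: etcd has been rolled back behind our local snapshot.
		return ModeRefuseToStart
	}
}
```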

"RECOVERY MODE" is:

  1. Set a flag so that all RPCs (except the special recovery ones explained below), all Docker plugin actions, etc. return a useful error message saying something like "This node is in recovery mode (link to docs about what to do)"
  2. Announce in etcd our snapshot counter and wall-clock time of last snapshot, filed under our server ID: /dotmesh-io/recovery/offers/SERVER ID
  3. Watch for /dotmesh-io/recovery/proceed to be created.
  4. If /dotmesh-io/recovery/proceed is set to our server ID, start a goroutine that:
    1. Reads our ZFS snapshot into etcd, except for anything under /dotmesh-io/recovery
    2. Sets /dotmesh-io/recovery/complete to something.
  5. Wait for /dotmesh-io/recovery/complete to be set to something.
  6. Unset the recovery flag, and proceed with normal operation.
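Steps 2 through 6 of recovery mode could be sketched as follows. The kvStore struct is a hypothetical stand-in for the handful of etcd operations needed, not the real etcd client API, and restoreSnapshot is assumed to read the local ZFS snapshot back into etcd while skipping keys under /dotmesh-io/recovery:

```go
package main

// kvStore abstracts the few operations the recovery flow needs; the
// real implementation would sit on top of the etcd client.
type kvStore struct {
	Set     func(key, value string)
	WaitFor func(key string) string // blocks until key exists, returns its value
}

// runRecovery follows steps 2-6 for one node in recovery mode. The
// caller is responsible for step 1 (setting the recovery flag) and the
// tail of step 6 (unsetting it afterwards).
func runRecovery(kv kvStore, serverID, counter, wallTime string, restoreSnapshot func()) {
	// Step 2: announce our snapshot counter and wall-clock time,
	// filed under our server ID.
	kv.Set("/dotmesh-io/recovery/offers/"+serverID, counter+" "+wallTime)
	// Step 3: wait for the operator (via dm) to pick a node.
	chosen := kv.WaitFor("/dotmesh-io/recovery/proceed")
	// Step 4: if it's us, restore the snapshot and signal completion.
	if chosen == serverID {
		restoreSnapshot()
		kv.Set("/dotmesh-io/recovery/complete", "done")
	}
	// Step 5: every node blocks until recovery is complete.
	kv.WaitFor("/dotmesh-io/recovery/complete")
	// Step 6: caller unsets the recovery flag and resumes normal operation.
}
```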

The special recovery RPCs are just:

  1. DotmeshRPC.GetRecoveryStatus - returns the recovery flag, and everything under /dotmesh-io/recovery from etcd (but in a nice format).
  2. DotmeshRPC.InitiateRecovery - sets /dotmesh-io/recovery/proceed to a specified server ID.

The dm command line tool needs to have an interface that calls DotmeshRPC.GetRecoveryStatus and dumps the results. If we're waiting for proceed then it suggests the most recent (highest snapshot counter) node, and lets the user choose to set proceed to it with DotmeshRPC.InitiateRecovery, or to pick another node (for some reason) for DotmeshRPC.InitiateRecovery, or to do nothing and wait.
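The "suggest the most recent node" part of the dm interface is just a max over the announced offers. A sketch (struct and field names are assumptions mirroring the /dotmesh-io/recovery/offers/SERVER ID entries):

```go
package main

// RecoveryOffer mirrors an entry under /dotmesh-io/recovery/offers/:
// each node in recovery mode announces its snapshot counter and the
// wall-clock time of its last snapshot.
type RecoveryOffer struct {
	ServerID  string
	Counter   uint64
	WallClock string
}

// suggestRecoveryNode returns the offer with the highest snapshot
// counter: the node dm would propose before the user confirms (or
// overrides) the choice via DotmeshRPC.InitiateRecovery.
func suggestRecoveryNode(offers []RecoveryOffer) (RecoveryOffer, bool) {
	if len(offers) == 0 {
		return RecoveryOffer{}, false
	}
	best := offers[0]
	for _, o := range offers[1:] {
		if o.Counter > best.Counter {
			best = o
		}
	}
	return best, true
}
```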

Have I missed anything?


lukemarsden commented Mar 27, 2018

LGTM, make it so! :)


alaric-dotmesh commented Mar 27, 2018

The snapshot counter in etcd should be kept somewhere OTHER than /dotmesh-io/recovery, so that it gets restored from the snapshot. /dotmesh-io/recovery is purely for the running state of a recovery operation. However, we must ensure that the snapshot counter is the last thing to be recovered from the snapshot, just before /dotmesh-io/recovery/complete is set, as the snapshot counter's presence is the marker used by newly-starting nodes to tell if they need to go into recovery or not.

Let's consider edge cases of nodes starting up while other nodes are at various points in the process...

  1. Any node that starts before recovery is completed won't find a snapshot counter in etcd, so will go into recovery. If recovery is already proceeding they'll write their details into the registry of available snapshots and carry on to block for proceed then complete. That's OK.
  2. A node that starts just as recovery has ended, and sees the snapshot already there, starts up as usual.
  3. A node with a more recent snapshot than the one picked for recovery just shrugs and runs with the older state. This might be a mistake, so here's a modification to the snapshot algorithm: when writing a snapshot to local ZFS storage, if the previous snapshot there is NEWER (has a higher snapshot counter) than the snapshot we're about to write, move it to a name that won't get overwritten (orphan-snapshot-at-COUNTER-TIMESTAMP) and log the fact so somebody knows it's there.

lukemarsden commented Mar 27, 2018

Does case 3 in #359 (comment) interact badly with case 5 in your "Cases" list in #359 (comment)?


lukemarsden commented Mar 27, 2018

How do we coordinate unsetting the dirty flag if it's in etcd?

@lukemarsden lukemarsden reopened this Mar 27, 2018


lukemarsden commented Mar 27, 2018

One of the benefits of storing the backup in a dot is that it can be pushed to a backup cluster as part of a backup that just involves pushing all the dots in a cluster to another cluster.

Maybe we should back up etcd to a dot and to a per-node etcd-backup filesystem for recovery mode?

This covers:
(a) loss of etcd but retaining (some of) the zfs filesystems
(b) complete cluster loss and recovery from off-site backup

Recovery mode and its UX should ideally be able to recover from a dot with some special marking if no etcd-backup exists?


alaric-dotmesh commented Mar 28, 2018

The dirty flag is per-node in RAM. That node sets it when it knows it's written something fun to etcd; it has to be per-node as it's unset when that node has done a snapshot. The goroutine that does the snapshots should clear the flag as soon as it's noticed it's set and started the snapshot, so if another event puts something cool in etcd while the snapshot is happening, a new snapshot will happen right after to make sure it's saved.
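The key subtlety above is clearing the flag before the snapshot starts, not after. A minimal sketch, assuming a per-node in-RAM flag (the type and method names are illustrative):

```go
package main

import "sync/atomic"

// snapshotter holds the per-node dirty flag in RAM. The snapshot
// goroutine would call runOnce on a ticker (and also N minutes after
// the last snapshot, per the earlier design).
type snapshotter struct {
	dirty int32
}

// markDirty is called after any operation that writes state we really
// care about to etcd.
func (s *snapshotter) markDirty() {
	atomic.StoreInt32(&s.dirty, 1)
}

// runOnce performs one iteration of the snapshot loop, returning true
// if it took a snapshot. The flag is cleared atomically BEFORE take()
// runs, so a write that lands mid-snapshot re-sets it and triggers a
// follow-up snapshot on the next iteration.
func (s *snapshotter) runOnce(take func()) bool {
	if atomic.CompareAndSwapInt32(&s.dirty, 1, 0) {
		take()
		return true
	}
	return false
}
```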


alaric-dotmesh commented Mar 28, 2018

Yeah, I like @lukemarsden's idea of putting the snapshots into a dot as a "second generation" while still having the simplicity and immediacy of a local dump! Let's say that the snapshot goroutine, if the current node is master for the snapshot dot, snapshots into it (and just leaves a marker in its snapshot ZFS filesystem pointing to the snapshot dot filesystem, which the recovery process follows - perhaps even a literal symlink...)


lukemarsden commented Apr 3, 2018

Alternative: https://github.com/kopeio/etcd-manager

Update: doesn't run on Kube, so not suitable for our use-case.


lukemarsden commented Apr 3, 2018

Alternative: see if upgrading etcd operator solves our woes.

Update: it doesn't.


lukemarsden commented Apr 5, 2018

We've done a lot of the recovery work in rpc.go.
