
When killed, etcdv3 node can require manual intervention to bring back #7628

Closed

doodles526 opened this issue Mar 29, 2017 · 7 comments

@doodles526

This was discovered while running a benchmark on a 3-node etcd cluster. The issue was only reproduced on a single node.

It appears that the BoltDB backend lags behind the snapshots written to disk, so a hard kill of an etcd member can result in etcdmain: database file (/data/etcd/member/snap/db index 7622690) does not match with snapshot (index 8081536). when it starts back up. After this error appears the node starts to flap; the issue can be fixed by deleting the latest snapshot and WAL file. It appears that the snapshot is written to disk before the db file is, and that on boot etcd has no automated method of recovering from this.

Log output

Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.287615 I | etcdserver: applying snapshot at index 7607808...
Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.291440 I | etcdserver: saved snapshot at index 7599475
Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.291867 I | etcdserver: compacted raft log at 7594475
Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.295647 I | etcdserver: saved snapshot at index 7602829
Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.295980 I | etcdserver: compacted raft log at 7597829
Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.298568 I | etcdserver: saved snapshot at index 7600706
Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.300390 I | etcdserver: saved snapshot at index 7607808
Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.300554 I | etcdserver: compacted raft log at 7602808
Mar 28 21:44:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:24.302218 I | etcdserver: raft applied incoming snapshot at index 7622690
Mar 28 21:44:29 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:29.642527 I | etcdserver: recovering lessor...
Mar 28 21:44:29 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:29.648641 I | etcdserver: finished recovering lessor
Mar 28 21:44:29 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:29.648663 I | etcdserver: restoring mvcc store...
Mar 28 21:44:50 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:50.554203 I | pkg/fileutil: purged file /data/etcd/member/snap/000000000000007d-000000000073ddc2.snap successfully
Mar 28 21:44:50 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:50.554290 I | pkg/fileutil: purged file /data/etcd/member/snap/000000000000007d-000000000073e427.snap successfully
Mar 28 21:44:50 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:50.554345 I | pkg/fileutil: purged file /data/etcd/member/snap/000000000000007d-000000000073e822.snap successfully
Mar 28 21:44:50 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:50.554397 I | pkg/fileutil: purged file /data/etcd/member/snap/000000000000007d-000000000073ed71.snap successfully
Mar 28 21:44:50 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:50.554480 I | pkg/fileutil: purged file /data/etcd/member/snap/000000000000007d-000000000073f16f.snap successfully
Mar 28 21:44:53 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:44:53.814295 I | wal: segmented wal file /data/etcd/member/wal/0000000000000017-0000000000792046.wal is created
Mar 28 21:45:07 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:07.766653 I | rafthttp: receiving database snapshot [index:8081536, from f69252dd496581b1] ...
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.915615 I | snap: saved database snapshot to disk [total bytes: 1589116928]
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.915661 I | rafthttp: received and saved database snapshot [index: 8081536, from: f69252dd496581b1] successfully
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.915874 I | raft: 11561d69df4299ed [commit: 8006839, lastindex: 8006839, lastterm: 125] starts to restore snapshot [index: 8081536, term: 125]
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.915911 I | raft: log [committed=8006839, applied=8006839, unstable.offset=8006840, len(unstable.Entries)=0] starts to restore snapshot [index: 8081536, term: 125]
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.915947 I | raft: 11561d69df4299ed restored progress of 11561d69df4299ed [next = 8081537, match = 8081536, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.915975 I | raft: 11561d69df4299ed restored progress of 1fd3856b78fe333e [next = 8081537, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.915992 I | raft: 11561d69df4299ed restored progress of f69252dd496581b1 [next = 8081537, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.916004 I | raft: 11561d69df4299ed [commit: 8081536] restored snapshot [index: 8081536, term: 125]
Mar 28 21:45:13 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:13.920984 I | etcdserver: raft applied incoming snapshot at index 8081536
Mar 28 21:45:16 aws-us-east-1-memory etcd697: etcd.19  | /app/etcd_runner.sh: line 6:    24 Killed                  etcd --initial-cluster-state new --initial-cluster-token $TOKEN --name etcd697.aws-us-east-1-memory.19.dblayer.com --data-dir /data/etcd --heartbeat-interval 500 --election-timeout 5000 --snapshot-count 1000 --listen-peer-urls http://10.197.80.130:2380 --advertise-client-urls 'http://10.197.80.130:2379' --initial-advertise-peer-urls http://10.197.80.130:2380 --listen-client-urls 'http://10.197.80.130:2379,http://127.0.0.1:2379' --initial-cluster 'etcd697.aws-us-east-1-memory.19.dblayer.com=http://10.197.80.130:2380,etcd675.aws-us-east-1-memory.18.dblayer.com=http://10.197.80.131:2380,etcd616.aws-us-east-1-memory.20.dblayer.com=http://10.197.80.132:2380'
Mar 28 21:45:16 aws-us-east-1-memory etcd697: start.13 | etcd.19 process exited.
Mar 28 21:45:16 aws-us-east-1-memory etcd697: start.13 | Stopping Cron Scheduler.
Mar 28 21:45:23 aws-us-east-1-memory etcd697: can't setuid.  syscall.Setuid(1500): operation not supported
Mar 28 21:45:23 aws-us-east-1-memory etcd697: running as uid 0
Mar 28 21:45:23 aws-us-east-1-memory etcd697: start.13 | Adding Procfile entry [Type: "DAEMON_PROCESS", Name: "etcd", Schedule: "", Command: "/app/etcd_runner.sh"].
Mar 28 21:45:23 aws-us-east-1-memory etcd697: start.13 | Starting etcd process.
Mar 28 21:45:23 aws-us-east-1-memory etcd697: start.13 | Starting Cron Scheduler.
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338478 I | etcdmain: etcd Version: 3.1.4
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338629 I | etcdmain: Git SHA: 41e52eb
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338638 I | etcdmain: Go Version: go1.7.5
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338643 I | etcdmain: Go OS/Arch: linux/amd64
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338649 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338698 N | etcdmain: the server is already initialized as member before, starting as etcd member...
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338787 I | embed: listening for peers on http://10.197.80.130:2380
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338862 I | embed: listening for client requests on 10.197.80.130:2379
Mar 28 21:45:23 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:23.338907 I | embed: listening for client requests on 127.0.0.1:2379
Mar 28 21:45:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:24.339415 W | etcdserver: another etcd process is running with the same data dir and holding the file lock.
Mar 28 21:45:24 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:24.339458 W | etcdserver: waiting for it to exit before starting...
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.851222 W | snap: skipped unexpected non snapshot file 000000000000007d-000000000060b62f.snap.broken
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.851247 W | snap: skipped unexpected non snapshot file 00000000007b5080.snap.db
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.852483 I | etcdserver: recovered store from snapshot at index 8081536
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.852504 I | etcdserver: name = etcd697.aws-us-east-1-memory.19.dblayer.com
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.852510 I | etcdserver: data dir = /data/etcd
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.852516 I | etcdserver: member dir = /data/etcd/member
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.852521 I | etcdserver: heartbeat = 500ms
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.852525 I | etcdserver: election = 5000ms
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.852530 I | etcdserver: snapshot count = 1000
Mar 28 21:45:27 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:27.852542 I | etcdserver: advertise client URLs = http://10.197.80.130:2379
Mar 28 21:45:28 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:28.127894 I | etcdserver: restarting member 11561d69df4299ed in cluster 861e1d72f0835e7a at commit index 8157572
Mar 28 21:45:28 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:28.131329 I | raft: 11561d69df4299ed became follower at term 125
Mar 28 21:45:28 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:28.131363 I | raft: newRaft 11561d69df4299ed [peers: [11561d69df4299ed,1fd3856b78fe333e,f69252dd496581b1], term: 125, commit: 8157572, applied: 8081536, lastindex: 8157572, lastterm: 125]
Mar 28 21:45:28 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:28.131515 I | etcdserver/api: enabled capabilities for version 3.1
Mar 28 21:45:28 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:28.131539 I | etcdserver/membership: added member 11561d69df4299ed [http://10.197.80.130:2380] to cluster 861e1d72f0835e7a from store
Mar 28 21:45:28 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:28.131547 I | etcdserver/membership: added member 1fd3856b78fe333e [http://10.197.80.132:2380] to cluster 861e1d72f0835e7a from store
Mar 28 21:45:28 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:28.131554 I | etcdserver/membership: added member f69252dd496581b1 [http://10.197.80.131:2380] to cluster 861e1d72f0835e7a from store
Mar 28 21:45:28 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:28.131561 I | etcdserver/membership: set the cluster version to 3.1 from store
Mar 28 21:45:36 aws-us-east-1-memory etcd697: etcd.19  | 2017-03-28 21:45:36.527334 C | etcdmain: database file (/data/etcd/member/snap/db index 7622690) does not match with snapshot (index 8081536).
@doodles526
Author

To get this node to come back, I needed to manually delete the latest snapshot and WAL file and let it recover the rest from the other nodes in the cluster.
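
For anyone hitting the same flap, here is a minimal sketch of that manual workaround, assuming the member is stopped first and uses the default /data/etcd/member layout seen in the logs above; the hard-coded path and newest-file heuristic are illustrative only, not an officially supported recovery procedure:

```go
// Sketch only: remove the newest .snap and .wal files of a stopped member
// so that, on restart, it falls back to an older snapshot and re-fetches
// the rest of the data from its peers.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// newest returns the lexically greatest match; snap and wal file names
// encode term/sequence and index as fixed-width hex, so lexical order
// matches age order.
func newest(pattern string) (string, bool) {
	matches, _ := filepath.Glob(pattern)
	if len(matches) == 0 {
		return "", false
	}
	sort.Strings(matches)
	return matches[len(matches)-1], true
}

func main() {
	member := "/data/etcd/member" // data dir of the flapping member (assumption)
	for _, pattern := range []string{
		filepath.Join(member, "snap", "*.snap"),
		filepath.Join(member, "wal", "*.wal"),
	} {
		f, ok := newest(pattern)
		if !ok {
			continue
		}
		fmt.Println("removing", f)
		if err := os.Remove(f); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```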

@heyitsanthony
Contributor

Possible candidate for failpoint testing. Start etcd so it snapshots frequently and inject a sleep before the db sync, kill it, then check if the node restarts cleanly.
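
A rough sketch of what such a check could look like, assuming an etcd binary on PATH and a throwaway data dir; a proper failpoint test would inject the sleep inside the snapshot path (e.g. via gofail) rather than relying on external timing as this sketch does:

```go
// Sketch of the kill/restart check: run etcd with a tiny --snapshot-count,
// SIGKILL it while it is snapshotting, restart it, and see whether it keeps
// running or exits (e.g. with the "database file ... does not match with
// snapshot" error). Timing-based rather than a real in-process failpoint.
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func startEtcd() (*exec.Cmd, error) {
	cmd := exec.Command("etcd",
		"--data-dir", "/tmp/issue7628.etcd",
		"--snapshot-count", "10", // snapshot very frequently
	)
	return cmd, cmd.Start()
}

func main() {
	cmd, err := startEtcd()
	if err != nil {
		panic(err)
	}
	// Drive some writes with etcdctl in another terminal, then hard-kill.
	time.Sleep(10 * time.Second)
	cmd.Process.Signal(syscall.SIGKILL)
	cmd.Wait()

	// Restart and check that the member stays up instead of flapping.
	cmd, err = startEtcd()
	if err != nil {
		panic(err)
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		fmt.Println("member exited after restart (reproduced):", err)
	case <-time.After(30 * time.Second):
		fmt.Println("member still running after restart (clean)")
		cmd.Process.Kill()
	}
}
```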

@heyitsanthony heyitsanthony added this to the v3.2.0 milestone Apr 11, 2017
@fanminshi fanminshi self-assigned this Apr 26, 2017
@fanminshi
Member

@doodles526 could you provide exact steps to reproduce the issue?

@hasbro17
Contributor

Steps to reproduce the above issue:
Have a snapshot snapshot.db ready from some previous cluster.

Restore the data-dir from the snapshot for the first member of our cluster:

$ ETCDCTL_API=3 bin/etcdctl snapshot restore snapshot.db --name infra1 --initial-cluster infra1=http://127.0.0.1:2380 --initial-cluster-token etcd-cluster-1   --initial-advertise-peer-urls http://127.0.0.1:2380

Create the 1st member of a new cluster using the restored data-dir infra1.etcd:

$ bin/etcd --data-dir=infra1.etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:2380 --initial-advertise-peer-urls http://127.0.0.1:2380 --initial-cluster 'infra1=http://127.0.0.1:2380' --initial-cluster-state new

Add a 2nd member to the cluster:

$ ETCDCTL_API=3 bin/etcdctl member add infra2 --peer-urls=http://127.0.0.1:22380

Start etcd server for 2nd member:

$ bin/etcd --data-dir=infra2.etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster 'infra1=http://127.0.0.1:2380,infra2=http://127.0.0.1:22380' --initial-cluster-state existing

Kill and restart the etcd server for the 2nd member:

^C
$ bin/etcd --data-dir=infra2.etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster 'infra1=http://127.0.0.1:2380,infra2=http://127.0.0.1:22380' --initial-cluster-state existing

The 2nd member will exit with the following error:

2017-04-27 16:10:25.544472 C | etcdmain: database file (bin/infra2.etcd/member/snap/db index 4) does not match with snapshot (index 5).

@hasbro17
Contributor

The snapshot that I used came from writing one key-value pair to a single-member etcd cluster:

$ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 put foo1 bar1
OK
$ ETCDCTL_API=3 etcdctl --endpoints http://127.0.0.1:2379 snapshot save snapshot.db

@heyitsanthony
Contributor

@hasbro17 that looks like a separate issue. There's no etcdctl snapshot or membership reconfiguration involved in this one; here the problem is that etcd isn't syncing with raft correctly. Open another issue?

@hasbro17
Contributor

@heyitsanthony done #7834

fanminshi added a commit to fanminshi/etcd that referenced this issue May 4, 2017
In the case that a follower receives a snapshot from the leader
and crashes before renaming xxx.snap.db to db, restarting the
follower results in loading the old db. This causes an index
mismatch between the snap metadata index and the consistent index
from the db.

The PR fixes the above on init of etcdserver through:

1. check if xxx.snap.db (xxx == snapshot.Metadata.Index) exists.
2. rename xxx.snap.db to db if it exists.
3. load the backend again with the new db file.

FIXES etcd-io#7628
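
For illustration, a minimal sketch of the recovery step described above; this is not the actual etcd code, and the helper name and file layout are assumptions (the index value is the snapshot index from this issue's logs):

```go
// Sketch of the init-time recovery described above: if <snapshot index>.snap.db
// exists in the snap directory, rename it to "db" so the backend is opened on
// the database that matches the snapshot.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// recoverSnapshotDB promotes <snapIndex>.snap.db to snap/db if it exists.
// Returns true if a rename happened and the backend should be reopened.
func recoverSnapshotDB(snapDir string, snapIndex uint64) (bool, error) {
	snapDB := filepath.Join(snapDir, fmt.Sprintf("%016x.snap.db", snapIndex))
	if _, err := os.Stat(snapDB); err != nil {
		return false, nil // nothing to recover; keep the existing db
	}
	// rename is atomic on POSIX filesystems: a crash here leaves either the
	// old db or the new one, never a half-written file.
	if err := os.Rename(snapDB, filepath.Join(snapDir, "db")); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	// 0x7b5080 == 8081536, the snapshot index from the logs above.
	ok, err := recoverSnapshotDB("/data/etcd/member/snap", 0x7b5080)
	fmt.Println("recovered:", ok, "err:", err)
	// On ok == true the server would load the backend again from snap/db.
}
```
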
fanminshi added a commit to fanminshi/etcd that referenced this issue May 4, 2017
…nap files

In the case that a follower receives a snapshot from the leader
and crashes before renaming xxx.snap.db to db, but after the
snapshot has been persisted to .wal and .snap, restarting the
follower results in loading the old db with the new .wal and .snap.
This causes an index mismatch between the snap metadata index
and the consistent index from the db.

This PR forces an ordering where saving/renaming the db must
happen before the snapshot is persisted to the wal and snap files.
This ensures that the db file can never be newer than the wal and snap files,
and hence guarantees the invariant snapshot.Metadata.Index <= db.ConsistentIndex()
in NewServer() when checking the validity of the db and snap files.

FIXES etcd-io#7628
fanminshi added a commit to fanminshi/etcd that referenced this issue May 8, 2017
…ap files

In the case that a follower receives a snapshot from the leader
and crashes before renaming xxx.snap.db to db, but after the
snapshot has been persisted to .wal and .snap, restarting the
follower results in loading the old db with the new .wal and .snap.
This causes an index mismatch between the snap metadata index
and the consistent index from the db.

This PR forces an ordering where saving/renaming the db must
happen after the snapshot is persisted to the wal and snap files.
This guarantees that the wal and snap files are newer than the db.
On server restart, the etcd server checks whether the snap index > db consistent index;
if so, it attempts to load xxx.snap.db where xxx = snap index,
and panics if no such file exists.

FIXES etcd-io#7628
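
Again purely for illustration (the Backend interface and helper here are stand-ins, not etcd's API), a sketch of the restart-time validity check this commit describes, using the indexes from the original report:

```go
// Sketch of the validity check: if the latest snapshot's index is newer than
// the consistent index recorded in the db, the member must still have the
// fully received <index>.snap.db around; promote it, or give up (etcd panics).
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Backend stands in for the storage backend; ConsistentIndex is the raft
// index of the last entry applied to the db.
type Backend interface {
	ConsistentIndex() uint64
}

func checkAndRecoverDB(snapDir string, snapIndex uint64, be Backend) error {
	if snapIndex <= be.ConsistentIndex() {
		return nil // invariant holds: db is at least as new as the snapshot
	}
	snapDB := filepath.Join(snapDir, fmt.Sprintf("%016x.snap.db", snapIndex))
	if _, err := os.Stat(snapDB); err != nil {
		// db is older than the snapshot and no snapshot database is left:
		// this is the unrecoverable case reported in this issue.
		return fmt.Errorf("db index older than snapshot index %d and %s is missing: %v",
			snapIndex, snapDB, err)
	}
	return os.Rename(snapDB, filepath.Join(snapDir, "db"))
}

type fakeBackend uint64

func (f fakeBackend) ConsistentIndex() uint64 { return uint64(f) }

func main() {
	// Indexes taken from the original report: db at 7622690, snapshot at 8081536.
	err := checkAndRecoverDB("/data/etcd/member/snap", 8081536, fakeBackend(7622690))
	fmt.Println(err)
}
```
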
yudai pushed a commit to yudai/etcd that referenced this issue Oct 5, 2017