Replica becomes corrupted after being OOM killed #7957

cberner · 2017-05-19T18:19:17Z

Bug reporting

Etcd Version: 3.1.6
Hardware:
Cloud: Azure
Instance type: Standard_DS15_v2
Number of replicas: 3
Data disk: local temporary SSD (benchmarked with fio to have ~100us latency and ~10k sequential IOPS at 2k write size)

Description:
Twice (in the last week) one of our replicas has gone down and not recovered. The symptoms this time were pretty clear:
08:00) CPU usage triples, Etcd disk usage quadruples (due to a large number of snapshots being kept https://gist.github.com/cberner/8125c9c16eaa9ecd0c00f888882946bc), memory usage doubles and etcd process is OOM killed
After that there are a couple Go panics and the member never rejoins the cluster. Removing it, deleting its snapshots and re-adding it allowed us to recover.

I've attached what seems like the relevant section of the log, but also have the full log and a copy of the data directory, if you need more information. I've also attached Datadog screen shots from the host (note those times are in Pacific time, and the log is in UTC).

etcd_short.txt

heyitsanthony · 2017-05-19T20:08:56Z

Similar to #7628 but this panicking looks new:

2017-05-19 08:08:09.011587 W | snap: skipped unexpected non snapshot file 00000000cfdde752.snap.db
2017-05-19 08:08:09.011597 W | snap: skipped unexpected non snapshot file 00000000d0658a64.snap.db
2017-05-19 08:08:09.011602 W | snap: skipped unexpected non snapshot file 00000000d0536c45.snap.db
2017-05-19 08:08:09.011607 W | snap: skipped unexpected non snapshot file 00000000cd791e04.snap.db
2017-05-19 08:08:09.011620 W | snap: skipped unexpected non snapshot file 00000000cf52c7cc.snap.db
2017-05-19 08:08:09.011625 W | snap: skipped unexpected non snapshot file 00000000cfeeb8e2.snap.db
2017-05-19 08:08:09.011629 W | snap: skipped unexpected non snapshot file 00000000cfefda75.snap.db
2017-05-19 08:08:09.011633 W | snap: skipped unexpected non snapshot file 00000000cf211571.snap.db
2017-05-19 08:08:09.011637 W | snap: skipped unexpected non snapshot file 00000000cff109fc.snap.db
2017-05-19 08:08:09.011641 W | snap: skipped unexpected non snapshot file 00000000d075dc9d.snap.db
2017-05-19 08:08:09.011645 W | snap: skipped unexpected non snapshot file 00000000d03bed29.snap.db
2017-05-19 08:08:09.011650 W | snap: skipped unexpected non snapshot file 00000000d014bd1d.snap.db
2017-05-19 08:08:09.011654 W | snap: skipped unexpected non snapshot file 00000000cb22067f.snap.db
2017-05-19 08:08:09.011658 W | snap: skipped unexpected non snapshot file 00000000cf4a953c.snap.db
2017-05-19 08:08:09.011662 W | snap: skipped unexpected non snapshot file 00000000ccffbebe.snap.db
2017-05-19 08:08:09.011666 W | snap: skipped unexpected non snapshot file 00000000cf8e9f84.snap.db
2017-05-19 08:08:09.011670 W | snap: skipped unexpected non snapshot file 00000000cf82e4ae.snap.db
2017-05-19 08:08:09.011673 W | snap: skipped unexpected non snapshot file 00000000cfb16302.snap.db
2017-05-19 08:08:09.011677 W | snap: skipped unexpected non snapshot file 00000000d0a2bb4b.snap.db
2017-05-19 08:08:09.011682 W | snap: skipped unexpected non snapshot file 00000000d024336e.snap.db
2017-05-19 08:08:09.031211 I | etcdserver: recovered store from snapshot at index 3500325707
2017-05-19 08:08:09.031230 I | etcdserver: name = main
2017-05-19 08:08:09.031235 I | etcdserver: data dir = /mnt/azure_tempdisk/etcd2/data
2017-05-19 08:08:09.031240 I | etcdserver: member dir = /mnt/azure_tempdisk/etcd2/data/member
2017-05-19 08:08:09.031244 I | etcdserver: heartbeat = 100ms
2017-05-19 08:08:09.031249 I | etcdserver: election = 1000ms
2017-05-19 08:08:09.031253 I | etcdserver: snapshot count = 10000
2017-05-19 08:08:09.031265 I | etcdserver: advertise client URLs = https://0.etcd-azure-us-east-v2.sci.openai.org:2381
2017-05-19 08:08:09.031344 W | wal: ignored file 000000000000044e-00000000d09eda4c.wal.broken in wal
2017-05-19 08:08:10.576075 I | etcdserver: restarting member 475dd2e324774d9 in cluster 6c72d4e0abafee58 at commit index 3500910805
2017-05-19 08:08:10.733903 I | raft: 475dd2e324774d9 became follower at term 9005
2017-05-19 08:08:10.733980 I | raft: newRaft 475dd2e324774d9 [peers: [475dd2e324774d9,7d614fe5111d249d,afa07d330ba6276c], term: 9005, commit: 3500910805, applied: 3500325707, lastindex: 3500911234, lastterm: 9005]
2017-05-19 08:08:10.734307 I | etcdserver/api: enabled capabilities for version 3.1
2017-05-19 08:08:10.734339 I | etcdserver/membership: added member 475dd2e324774d9 [https://0.etcd-azure-us-east-v2.sci.openai.org:2380] to cluster 6c72d4e0abafee58 from store
2017-05-19 08:08:10.734353 I | etcdserver/membership: added member 7d614fe5111d249d [https://2.etcd-azure-us-east-v2.sci.openai.org:2380] to cluster 6c72d4e0abafee58 from store
2017-05-19 08:08:10.734365 I | etcdserver/membership: added member afa07d330ba6276c [https://1.etcd-azure-us-east-v2.sci.openai.org:2380] to cluster 6c72d4e0abafee58 from store
2017-05-19 08:08:10.734378 I | etcdserver/membership: set the cluster version to 3.1 from store
2017-05-19 08:08:11.648577 I | mvcc: restore compact to 130169190
2017-05-19 08:08:12.210374 I | mvcc: store.index: compact 130228338
2017-05-19 08:08:12.274411 I | mvcc: resume scheduled compaction at 130228338
panic: assertion failed: tx closed

goroutine 131 [running]:
panic(0xcbb000, 0xc43d90bff0)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt._assert(0xd6f900, 0xe446fb, 0x9, 0x0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/db.go:1026 +0xff
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Cursor).seek(0xc43e0bdaa0, 0x12d2032, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/cursor.go:155 +0x6f
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Bucket).Bucket(0xc42023a018, 0x12d2032, 0x3, 0x3, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/bucket.go:112 +0x108
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Tx).Bucket(0xc42023a000, 0x12d2032, 0x3, 0x3, 0x7f72cf7ee3f5)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/tx.go:101 +0x4f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).UnsafeRange(0xc4202760a0, 0x12d2032, 0x3, 0x3, 0xc43d91e500, 0x11, 0x11, 0xc43d90bfe0, 0x8, 0x8, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:88 +0x79
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).scheduleCompaction(0xc4201dc0d0, 0x7c32072, 0xc43db100c0, 0xc400000000)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_compaction.go:38 +0x2d3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).Compact.func2(0x7f754fabf590, 0xc42da3b840)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:312 +0xbb
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule.(*fifo).run(0xc42da3f020)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule/schedule.go:160 +0xdd
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule.NewFIFOScheduler
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule/schedule.go:71 +0x1e8

Lots of garbage db files in etcd-io#7957. Should purge.

lachie83 · 2017-05-22T23:33:40Z

@xiang90 This is the issue we discussed earlier on k8s slack

Lots of garbage db files in #7957. Should purge.

xiang90 · 2017-06-16T03:52:48Z

@lachie83

Do you get any chance to reproduce this issue?

heyitsanthony · 2017-07-27T01:09:18Z

Closing due to inactivity and no way to reproduce. The mvcc/backend/boltdb changes since 3.1 are significant enough that whatever is happening here may already be fixed.

Lots of garbage db files in etcd-io#7957. Should purge.

heyitsanthony added the type/bug label May 19, 2017

heyitsanthony mentioned this issue May 22, 2017

etcdserver: close mvcc.KV on init error path #7966

Merged

heyitsanthony added a commit to heyitsanthony/etcd that referenced this issue May 22, 2017

etcdserver: purge old snap.db files

ab16fa1

Lots of garbage db files in etcd-io#7957. Should purge.

heyitsanthony mentioned this issue May 22, 2017

etcdserver: purge old snap.db files #7967

Merged

gyuho pushed a commit that referenced this issue Jun 2, 2017

etcdserver: purge old snap.db files

f1a89a2

Lots of garbage db files in #7957. Should purge.

heyitsanthony closed this as completed Jul 27, 2017

yudai pushed a commit to yudai/etcd that referenced this issue Oct 5, 2017

etcdserver: purge old snap.db files

6f1071f

Lots of garbage db files in etcd-io#7957. Should purge.

This was referenced May 4, 2018

Cherry-picking PR 7967 to release-3.1 #9693

Merged

Cherry-picking PR 7967 to release-3.2 #9694

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replica becomes corrupted after being OOM killed #7957

Replica becomes corrupted after being OOM killed #7957

cberner commented May 19, 2017

heyitsanthony commented May 19, 2017

lachie83 commented May 22, 2017

xiang90 commented Jun 16, 2017

heyitsanthony commented Jul 27, 2017

Replica becomes corrupted after being OOM killed #7957

Replica becomes corrupted after being OOM killed #7957

Comments

cberner commented May 19, 2017

Bug reporting

heyitsanthony commented May 19, 2017

lachie83 commented May 22, 2017

xiang90 commented Jun 16, 2017

heyitsanthony commented Jul 27, 2017