Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replica becomes corrupted after being OOM killed #7957

Closed
cberner opened this issue May 19, 2017 · 4 comments
Closed

Replica becomes corrupted after being OOM killed #7957

cberner opened this issue May 19, 2017 · 4 comments
Labels

Comments

@cberner
Copy link

cberner commented May 19, 2017

Bug reporting

Etcd Version: 3.1.6
Hardware:
Cloud: Azure
Instance type: Standard_DS15_v2
Number of replicas: 3
Data disk: local temporary SSD (benchmarked with fio to have ~100us latency and ~10k sequential IOPS at 2k write size)

Description:
Twice (in the last week) one of our replicas has gone down and not recovered. The symptoms this time were pretty clear:
08:00) CPU usage triples, Etcd disk usage quadruples (due to a large number of snapshots being kept https://gist.github.com/cberner/8125c9c16eaa9ecd0c00f888882946bc), memory usage doubles and etcd process is OOM killed
After that there are a couple Go panics and the member never rejoins the cluster. Removing it, deleting its snapshots and re-adding it allowed us to recover.

I've attached what seems like the relevant section of the log, but also have the full log and a copy of the data directory, if you need more information. I've also attached Datadog screen shots from the host (note those times are in Pacific time, and the log is in UTC).
screen shot 2017-05-19 at 11 18 08 am
screen shot 2017-05-19 at 11 18 14 am

etcd_short.txt

@heyitsanthony
Copy link
Contributor

Similar to #7628 but this panicking looks new:

2017-05-19 08:08:09.011587 W | snap: skipped unexpected non snapshot file 00000000cfdde752.snap.db
2017-05-19 08:08:09.011597 W | snap: skipped unexpected non snapshot file 00000000d0658a64.snap.db
2017-05-19 08:08:09.011602 W | snap: skipped unexpected non snapshot file 00000000d0536c45.snap.db
2017-05-19 08:08:09.011607 W | snap: skipped unexpected non snapshot file 00000000cd791e04.snap.db
2017-05-19 08:08:09.011620 W | snap: skipped unexpected non snapshot file 00000000cf52c7cc.snap.db
2017-05-19 08:08:09.011625 W | snap: skipped unexpected non snapshot file 00000000cfeeb8e2.snap.db
2017-05-19 08:08:09.011629 W | snap: skipped unexpected non snapshot file 00000000cfefda75.snap.db
2017-05-19 08:08:09.011633 W | snap: skipped unexpected non snapshot file 00000000cf211571.snap.db
2017-05-19 08:08:09.011637 W | snap: skipped unexpected non snapshot file 00000000cff109fc.snap.db
2017-05-19 08:08:09.011641 W | snap: skipped unexpected non snapshot file 00000000d075dc9d.snap.db
2017-05-19 08:08:09.011645 W | snap: skipped unexpected non snapshot file 00000000d03bed29.snap.db
2017-05-19 08:08:09.011650 W | snap: skipped unexpected non snapshot file 00000000d014bd1d.snap.db
2017-05-19 08:08:09.011654 W | snap: skipped unexpected non snapshot file 00000000cb22067f.snap.db
2017-05-19 08:08:09.011658 W | snap: skipped unexpected non snapshot file 00000000cf4a953c.snap.db
2017-05-19 08:08:09.011662 W | snap: skipped unexpected non snapshot file 00000000ccffbebe.snap.db
2017-05-19 08:08:09.011666 W | snap: skipped unexpected non snapshot file 00000000cf8e9f84.snap.db
2017-05-19 08:08:09.011670 W | snap: skipped unexpected non snapshot file 00000000cf82e4ae.snap.db
2017-05-19 08:08:09.011673 W | snap: skipped unexpected non snapshot file 00000000cfb16302.snap.db
2017-05-19 08:08:09.011677 W | snap: skipped unexpected non snapshot file 00000000d0a2bb4b.snap.db
2017-05-19 08:08:09.011682 W | snap: skipped unexpected non snapshot file 00000000d024336e.snap.db
2017-05-19 08:08:09.031211 I | etcdserver: recovered store from snapshot at index 3500325707
2017-05-19 08:08:09.031230 I | etcdserver: name = main
2017-05-19 08:08:09.031235 I | etcdserver: data dir = /mnt/azure_tempdisk/etcd2/data
2017-05-19 08:08:09.031240 I | etcdserver: member dir = /mnt/azure_tempdisk/etcd2/data/member
2017-05-19 08:08:09.031244 I | etcdserver: heartbeat = 100ms
2017-05-19 08:08:09.031249 I | etcdserver: election = 1000ms
2017-05-19 08:08:09.031253 I | etcdserver: snapshot count = 10000
2017-05-19 08:08:09.031265 I | etcdserver: advertise client URLs = https://0.etcd-azure-us-east-v2.sci.openai.org:2381
2017-05-19 08:08:09.031344 W | wal: ignored file 000000000000044e-00000000d09eda4c.wal.broken in wal
2017-05-19 08:08:10.576075 I | etcdserver: restarting member 475dd2e324774d9 in cluster 6c72d4e0abafee58 at commit index 3500910805
2017-05-19 08:08:10.733903 I | raft: 475dd2e324774d9 became follower at term 9005
2017-05-19 08:08:10.733980 I | raft: newRaft 475dd2e324774d9 [peers: [475dd2e324774d9,7d614fe5111d249d,afa07d330ba6276c], term: 9005, commit: 3500910805, applied: 3500325707, lastindex: 3500911234, lastterm: 9005]
2017-05-19 08:08:10.734307 I | etcdserver/api: enabled capabilities for version 3.1
2017-05-19 08:08:10.734339 I | etcdserver/membership: added member 475dd2e324774d9 [https://0.etcd-azure-us-east-v2.sci.openai.org:2380] to cluster 6c72d4e0abafee58 from store
2017-05-19 08:08:10.734353 I | etcdserver/membership: added member 7d614fe5111d249d [https://2.etcd-azure-us-east-v2.sci.openai.org:2380] to cluster 6c72d4e0abafee58 from store
2017-05-19 08:08:10.734365 I | etcdserver/membership: added member afa07d330ba6276c [https://1.etcd-azure-us-east-v2.sci.openai.org:2380] to cluster 6c72d4e0abafee58 from store
2017-05-19 08:08:10.734378 I | etcdserver/membership: set the cluster version to 3.1 from store
2017-05-19 08:08:11.648577 I | mvcc: restore compact to 130169190
2017-05-19 08:08:12.210374 I | mvcc: store.index: compact 130228338
2017-05-19 08:08:12.274411 I | mvcc: resume scheduled compaction at 130228338
panic: assertion failed: tx closed

goroutine 131 [running]:
panic(0xcbb000, 0xc43d90bff0)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt._assert(0xd6f900, 0xe446fb, 0x9, 0x0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/db.go:1026 +0xff
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Cursor).seek(0xc43e0bdaa0, 0x12d2032, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/cursor.go:155 +0x6f
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Bucket).Bucket(0xc42023a018, 0x12d2032, 0x3, 0x3, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/bucket.go:112 +0x108
github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt.(*Tx).Bucket(0xc42023a000, 0x12d2032, 0x3, 0x3, 0x7f72cf7ee3f5)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/boltdb/bolt/tx.go:101 +0x4f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).UnsafeRange(0xc4202760a0, 0x12d2032, 0x3, 0x3, 0xc43d91e500, 0x11, 0x11, 0xc43d90bfe0, 0x8, 0x8, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:88 +0x79
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).scheduleCompaction(0xc4201dc0d0, 0x7c32072, 0xc43db100c0, 0xc400000000)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_compaction.go:38 +0x2d3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).Compact.func2(0x7f754fabf590, 0xc42da3b840)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:312 +0xbb
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule.(*fifo).run(0xc42da3f020)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule/schedule.go:160 +0xdd
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule.NewFIFOScheduler
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule/schedule.go:71 +0x1e8

heyitsanthony added a commit to heyitsanthony/etcd that referenced this issue May 22, 2017
Lots of garbage db files in etcd-io#7957. Should purge.
@lachie83
Copy link

@xiang90 This is the issue we discussed earlier on k8s slack

gyuho pushed a commit that referenced this issue Jun 2, 2017
Lots of garbage db files in #7957. Should purge.
@xiang90
Copy link
Contributor

xiang90 commented Jun 16, 2017

@lachie83

Do you get any chance to reproduce this issue?

@heyitsanthony
Copy link
Contributor

Closing due to inactivity and no way to reproduce. The mvcc/backend/boltdb changes since 3.1 are significant enough that whatever is happening here may already be fixed.

yudai pushed a commit to yudai/etcd that referenced this issue Oct 5, 2017
Lots of garbage db files in etcd-io#7957. Should purge.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

4 participants