Replica becomes corrupted after being OOM killed #7957
Similar to #7628, but this panic looks new:
heyitsanthony added a commit to heyitsanthony/etcd that referenced this issue (May 22, 2017): Lots of garbage db files in etcd-io#7957. Should purge.
@xiang90 This is the issue we discussed earlier on k8s slack
gyuho pushed a commit that referenced this issue (Jun 2, 2017): Lots of garbage db files in #7957. Should purge.
Have you had a chance to reproduce this issue?
Closing due to inactivity and no way to reproduce. The mvcc/backend/boltdb changes since 3.1 are significant enough that whatever is happening here may already be fixed.
yudai pushed a commit to yudai/etcd that referenced this issue (Oct 5, 2017): Lots of garbage db files in etcd-io#7957. Should purge.
This was referenced May 4, 2018
Bug reporting
Etcd Version: 3.1.6
Hardware:
Cloud: Azure
Instance type: Standard_DS15_v2
Number of replicas: 3
Data disk: local temporary SSD (benchmarked with fio to have ~100us latency and ~10k sequential IOPS at 2k write size)
Description:
Twice (in the last week) one of our replicas has gone down and not recovered. The symptoms this time were pretty clear:
08:00) CPU usage triples, etcd disk usage quadruples (due to a large number of snapshots being kept, see https://gist.github.com/cberner/8125c9c16eaa9ecd0c00f888882946bc), memory usage doubles, and the etcd process is OOM killed.
After that there are a couple of Go panics and the member never rejoins the cluster. Removing the member, deleting its snapshots, and re-adding it allowed us to recover.
I've attached what seems to be the relevant section of the log. I also have the full log and a copy of the data directory if you need more information. I've also attached Datadog screenshots from the host (note that those times are in Pacific time, while the log is in UTC).
etcd_short.txt