production: registration outage postmortem 9/19 #30399
Comments
I think the interesting difference on the registration cluster is the use of AWS EBS volumes with IOPS caps. As far as I know, we're not running test clusters in that configuration. We probably should.
On the class of EBS volume we were using here, the IO cap is 3 IOs per second per GB of storage. We were using about a quarter of the space on our 200GB volume and maxing out the IOPS. Before the outage, we were using about half of the IO budget, which suggests we'd be unable to use all the storage space on such a volume: you'd need to provision double the space just to get the necessary IO budget. Now that the empty-commit patch has been deployed, we're using just over a quarter of the IO budget (and still just under a quarter of the space). It looks like we're likely to hit the IO wall again just before we run out of space. We should keep an eye on our IO usage as a function of space usage so we can set guidelines appropriately for users deploying on configurations like this.
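Rough arithmetic (my numbers, read off the percentages above; the 3 IOPS/GB scaling is the gp2 baseline, and the cap follows provisioned size, not data stored):

```
cap:          3 IOPS/GB × 200 GB provisioned          ≈ 600 IOPS
before patch: ~300 IOPS at ~50 GB of data (≈6 IOPS per data-GB) → wall at ~100 GB
after patch:  ~160 IOPS at ~50 GB of data (≈3 IOPS per data-GB) → wall at ~190 GB
```

If those ratios hold, the pre-patch workload hits the cap at about half the volume and the post-patch workload just shy of full, which matches the "provision double the space to buy the IO budget" observation above.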
For what it's worth, we're ignoring our own docs' recommendation to use provisioned IOPS SSD-backed EBS volumes.
Ok, I think I finally got to the bottom of the persistent registration cluster unhappiness. The problem is that we have large, repetitive primary keys—some as many as 64 kilobytes in length. These keys compress extremely well; in one case, an SST that is 2+ GB uncompressed is only 100MB compressed.
Now, problematically, our RocksDB is configured to pin index and filter blocks in memory, outside of the block cache. (See this open issue: #7576.) Some quick back-of-the-envelope math suggests there are 4GB (!) of index blocks on register 1 right now:

```sh
# cockroach-benesch2 debug sst_dump properly calls `fflush(stdout)` before exiting
$ for f in *.sst; do ~/cockroach-benesch2 debug sst_dump --file=$f --show_summary --show_properties > $f.info; done
$ grep 'total index block size' *.info | cut -f3 -d: | paste -sd+ | bc
4064689786
```

The fix was literally as simple as deploying a binary with that setting flipped. Unfortunately this might be causing quite a bit of thrashing. The nightly backup is taking much longer than usual.
Note that the performance hit mentioned in #7576 was mild (a few percent), and I have an idea for how to mitigate that. But the effect on register might be more severe because we won't just be hitting the cache to access the index and filter blocks, but reading (and uncompressing) them from disk. Something else to investigate: newer RocksDB versions have an option that may help here.
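For concreteness, here's a minimal C++ sketch of the trade-off using RocksDB's `BlockBasedTableOptions`; the specific flags and cache size are illustrative, not necessarily what register is running:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch only: the two ways RocksDB can treat index and filter blocks.
rocksdb::Options IndexBlocksInBlockCache() {
  rocksdb::BlockBasedTableOptions table_opts;
  // Charge index and filter blocks to the block cache instead of holding them
  // in heap memory outside it; they become evictable, but a cache miss now
  // means a disk read (plus decompression) to bring them back.
  table_opts.cache_index_and_filter_blocks = true;
  // Keep L0 index/filter blocks resident so lookups against the newest files
  // don't pay that penalty.
  table_opts.pin_l0_filter_and_index_blocks_in_cache = true;
  table_opts.block_cache = rocksdb::NewLRUCache(1 << 30);  // 1 GiB, illustrative

  rocksdb::Options opts;
  opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return opts;
}
```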
Yeah, we caused some serious unhappiness after flipping that setting.
Yeah, @ajkr suggested this. Unfortunately it's too late on the registration cluster since these SSTs are already created, but it's definitely worth looking into to prevent this problem in new clusters. We could also investigate two-level indexes: https://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html.
We could enable that option and then force compact all of the sstables.
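At the RocksDB layer, "force compact all of the sstables" boils down to a full-range `CompactRange`; a rough sketch (in practice we'd drive this through Cockroach's own tooling, and the options would need to match the store's real configuration):

```cpp
#include <cassert>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Illustrative only: open a store and compact the entire key range so that
// existing SSTs get rewritten under whatever table options are now in effect.
void CompactEverything(const std::string& path) {
  rocksdb::Options opts;  // would have to match the store's real options
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, path, &db);
  assert(s.ok());

  rocksdb::CompactRangeOptions cro;
  // nullptr begin/end means the whole keyspace.
  s = db->CompactRange(cro, nullptr, nullptr);
  assert(s.ok());
  delete db;
}
```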
We also need to export/import the reg data anyway to change the type of a column in the primary key, so we were already planning on making new SSTs. We can do that in an offline cluster if we want to experiment with flags there.
Another useful option for large index blocks is a partitioned index (sketched below). More details: https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters
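Per that wiki page, the setup comes down to a couple of table options; a hedged C++ sketch (option names from the RocksDB docs, values illustrative):

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/table.h>

// Sketch: a two-level (partitioned) index plus partitioned filters, so only a
// small top-level index needs to stay resident while leaf partitions are paged
// through the block cache on demand.
rocksdb::BlockBasedTableOptions PartitionedIndexOptions() {
  rocksdb::BlockBasedTableOptions t;
  t.index_type = rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  t.partition_filters = true;  // requires full (not block-based) filters
  t.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10, /*use_block_based_builder=*/false));
  t.metadata_block_size = 4096;  // target size of each index/filter partition
  t.cache_index_and_filter_blocks = true;
  return t;
}
```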
Ok, great, next steps:
Yesterday we experienced an outage of our internal registration cluster, which we consider a production cluster. I'll hopefully have time to flesh this postmortem out further tomorrow. For now, the short summary is that the cluster found itself in a situation with many unquiesced replicas, each of which was periodically heartbeating. Due to two bugs, one in CockroachDB (#30398) and one in etcd/raft (etcd-io/etcd#10106), each of these heartbeats was causing an `fsync`. The AWS EBS volumes on which the registration cluster nodes run were capped at 600 IOPS; the storm of `fsync`s blew past this cap. The cost appears to have been strict throttling, resulting in many-second disk writes. This rendered all nodes permanently non-live, as node liveness heartbeats could never complete successfully, and so the cluster could not make progress.

The cluster was running v2.1-beta.20180910 at the time of the outage. It is now running a custom binary with etcd-io/etcd#10106 applied and has recovered.
I was able to reproduce a somewhat similar storm of non-quiescent replicas on my own three-node cluster by taking two nodes down at the same time:
After bringing one of the down nodes back online, two thirds of the replicas are unquiesced (perhaps the thirds whose leases were on the downed nodes?), and the cluster is only very slowly requiescing them, to the tune of 30/min. I thought we'd made progress on this front with #26911, but perhaps there is more to do.
/cc @bdarnell @nvanbenschoten @petermattis @dt