production: registration outage postmortem 9/19 #30399

Closed
benesch opened this issue Sep 19, 2018 · 11 comments

Comments

benesch commented Sep 19, 2018

Yesterday we experienced an outage of our internal registration cluster, which we consider a production cluster. I'll hopefully have time to flesh this postmortem out further tomorrow. For now, the short summary is that the cluster found itself in a situation with many unquiesced replicas, each of which was periodically heartbeating. Due to two bugs, one in CockroachDB (#30398) and one in etcd/raft (etcd-io/etcd#10106), each of these heartbeats caused an fsync. The AWS EBS volumes on which the registration cluster nodes run were capped at 600 IOPS; the storm of fsyncs blew past this cap. The result appears to have been strict throttling, with individual disk writes taking many seconds to complete. This rendered all nodes permanently non-live, as node liveness heartbeats could never complete successfully, and so the cluster could not make progress.

The cluster was running v2.1-beta.20180910 at the time of the outage. It is now running a custom binary with etcd-io/etcd#10106 applied and has recovered.

I was able to reproduce a somewhat similar storm of non-quiescent replicas on my own three-node cluster by taking two nodes down at the same time:

[image: graph of the non-quiescent replica count]

After bringing one of the downed nodes back online, two thirds of the replicas are unquiesced (perhaps those whose leases were on the downed nodes?), and the cluster is re-quiescing them only very slowly, at roughly 30/min. I thought we'd made progress on this front with #26911, but perhaps there is more to do.

/cc @bdarnell @nvanbenschoten @petermattis @dt

@petermattis (Collaborator)

I think the interesting difference on the registration cluster is the use of AWS EBS volumes with IOPS caps. As far as I know, we're not running test clusters in that configuration. We probably should.

@bdarnell (Contributor)

> I think the interesting difference on the registration cluster is the use of AWS EBS volumes with IOPS caps.

On the class of EBS volume we were using here (gp2), the IO cap is 3 IOPS per GB of storage. We were using about a quarter of the space on our 200GB volume and maxing out the IOPS. Before the outage we were already using about half of the IO budget while using only about a quarter of the space, which suggests we would hit the IO cap well before using all the storage on an EBS volume: you'd need to provision roughly double the space just to get the necessary IO budget.

Now that the empty-commit patch has been deployed, we're using just over a quarter of the IO budget (and still just under a quarter of the space). It looks like we're likely to hit the IO wall again just before we run out of space. We should keep an eye on our IO usage as a function of space usage so we can set guidelines appropriately for users deploying on configurations like this.
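Spelling out those fractions (rough numbers derived from the 3 IOPS/GB rate, the 200GB volume, and the approximate percentages above, not from billing data):

  IO budget:     200 GB × 3 IOPS/GB ≈ 600 IOPS
  before patch:  ~1/2 of budget ≈ 300 IOPS for ~50 GB of data ≈ 6 IOPS per GB of data
  after patch:   ~1/4 of budget ≈ 150-160 IOPS for ~50 GB of data ≈ 3 IOPS per GB of data

That last ratio sits right at gp2's 3 IOPS per GB provisioning rate, which is why IO and space now look like they will run out at roughly the same time.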

@a-robinson (Contributor)

For what it's worth, we're ignoring our own docs' recommendation to use provisioned IOPS SSD-backed EBS (the io1 type) rather than the gp2 type.

benesch commented Sep 21, 2018

Ok, I think I finally got to the bottom of the persistent registration cluster unhappiness. The problem is that we have large, repetitive primary keys, some as long as 64 kilobytes. These keys compress extremely well, resulting in SSTs that are, in one case, 2+ GB uncompressed but only about 100MB compressed:

Table Properties:
------------------------------
  # data blocks: 28278
  # entries: 3461846
  # range deletions: 0
  raw key size: 2083232043
  raw average key size: 601.769126
  raw value size: 17396331
  raw average value size: 5.025160
  data block size: 104867013
  index block size: 520682676
  filter block size: 0
  (estimated) table size: 625549689
  filter policy name: rocksdb.BuiltinBloomFilter
  column family ID: 0
  column family name: default
  comparator name: cockroach_comparator
  merge operator name: cockroach_merge_operator
  property collectors names: [TimeBoundTblPropCollectorFactory]
  SST file compression algo: Snappy
  creation time: 1537457152
  time stamp of earliest key: 0
  # deleted keys: 0
  # merge operands: 0
Raw user collected properties
------------------------------
  # crdb.ts.max: 0x15562510381C7372
  # crdb.ts.min: 0x14D2DAB77312C71B
  # rocksdb.block.based.table.index.type: 0x00000000
  # rocksdb.block.based.table.prefix.filtering: 0x31
  # rocksdb.block.based.table.whole.key.filtering: 0x30
  # rocksdb.deleted.keys: 0x00
  # rocksdb.merge.operands: 0x00
total number of files: 1
total number of data blocks: 28278
total data block size: 104867013
total index block size: 520682676
total filter block size: 0

Now, problematically, our RocksDB is configured to pin index and filter blocks in memory, outside of the block cache. (See this open issue: #7576.) Some quick back-of-the-envelope math suggests there are 4GB (!) of index blocks on register 1 right now:

# cockroach-benesch2 debug sst_dump properly calls `fflush(stdout)` before exiting
$ for f in *.sst; do ~/cockroach-benesch2 debug sst_dump --file=$f --show_summary --show_properties > $f.info; done
$ grep 'total index block size' *.info | cut -f3 -d: | paste -sd+ | bc
4064689786

The fix was literally as simple as deploying a binary with table_options.cache_index_and_filter_blocks = true, as suggested in #7576.
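For context, here's a minimal sketch of what that option change looks like at the RocksDB layer (plain C++ against the stock RocksDB API, not the actual cockroach code; the cache size and path are placeholders):

  #include <rocksdb/cache.h>
  #include <rocksdb/db.h>
  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  int main() {
    rocksdb::BlockBasedTableOptions table_options;
    // Store index and filter blocks in the block cache instead of pinning
    // them on the heap, so their memory footprint is bounded by the cache.
    table_options.cache_index_and_filter_blocks = true;
    table_options.block_cache = rocksdb::NewLRUCache(512 << 20);  // placeholder size

    rocksdb::Options options;
    options.create_if_missing = true;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example-db", &db);
    delete db;
    return s.ok() ? 0 : 1;
  }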

Unfortunately this might be causing quite a bit of thrashing. The nightly backup is taking much longer than usual.

@petermattis (Collaborator)

The performance hit from cache_index_and_filter_blocks was a few percent. The impact might be higher on register. We'll have to keep an eye on it.

Note that the performance hit mentioned in #7576 was mild (a few percent). I have an idea for how to mitigate that. But the effect on register might be more severe because we won't just be hitting the cache to access the index and filter blocks, but reading (and uncompressing) them from disk.

Something to investigate is that RocksDB now has an index_block_restart_interval option. I'm not sure when this was added, but I don't recall investigating it the last time I tuned RocksDB options. The default value is 1, which means that keys in index blocks are not prefix-compressed. If the keys in the index blocks share large prefixes, then setting that option to 2 or 4 could halve or quarter the size of the index blocks.
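For reference, a sketch of where that knob lives (C++ against the stock RocksDB API; the helper name is made up and the value 16 is only illustrative, matching the number floated later in this thread):

  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  rocksdb::Options MakeOptions() {
    rocksdb::BlockBasedTableOptions table_options;
    // Default is 1: every index entry is a restart point, so index keys are
    // stored in full with no prefix compression. A larger interval lets
    // consecutive index keys share prefixes, at the cost of a short linear
    // scan within each restart group on lookup.
    table_options.index_block_restart_interval = 16;  // illustrative value

    rocksdb::Options options;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
  }

Note this only affects newly written sstables; existing ones keep the index layout they were built with.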

benesch commented Sep 21, 2018

> The performance hit from cache_index_and_filter_blocks was a few percent. The impact might be higher on register. We'll have to keep an eye on it.

Yeah, we caused some serious unhappiness after flipping cache_index_and_filter_blocks on when we tried to run a backup concurrently with a ton of rebalancing activity. Wouldn't be surprised if this was cache thrashing caused by the new option. But this time reg just slowed down instead of OOMing, so I guess that's an improvement.

> Something to investigate is that RocksDB now has an index_block_restart_interval option. I'm not sure when this was added, but I don't recall investigating it the last time I tuned RocksDB options. The default value is 1, which means that keys in index blocks are not prefix-compressed. If the keys in the index blocks share large prefixes, then setting that option to 2 or 4 could halve or quarter the size of the index blocks.

Yeah, @ajkr suggested this. Unfortunately it's too late on the registration cluster since these SSTs are already created, but it's definitely worth looking into to prevent this problem in new clusters.

We could also investigate two-level indexes: https://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html.

@petermattis (Collaborator)

> Yeah, @ajkr suggested this. Unfortunately it's too late on the registration cluster since these SSTs are already created, but it's definitely worth looking into to prevent this problem in new clusters.

We could enable that option and then force compact all of the sstables.
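Roughly speaking, that would be a manual compaction over the entire keyspace after reopening with the new table options; a sketch against the RocksDB C++ API (the helper name is made up):

  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  // Rewrite every sstable so that tables written under the old options
  // (e.g. index_block_restart_interval=1) are rebuilt with the new ones.
  rocksdb::Status ForceFullCompaction(rocksdb::DB* db) {
    rocksdb::CompactRangeOptions opts;
    // Recompact the bottommost level even when it has nothing else to do;
    // otherwise the existing bottom-level sstables would be left untouched.
    opts.bottommost_level_compaction =
        rocksdb::BottommostLevelCompaction::kForce;
    // nullptr begin/end means the entire key range.
    return db->CompactRange(opts, nullptr, nullptr);
  }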

dt commented Sep 21, 2018

We also need to export/import the reg data anyway to change the type of a column in the primary key, so we were already planning on making new SSTs in an offline cluster; if we want to experiment with flags for that cluster, that would be the opportunity.

ajkr commented Sep 21, 2018

Another useful option for large index blocks is partitioned index (set BlockBasedTableOptions::index_type to kTwoLevelIndexSearch). That splits the index block so there's a root partition, which can be pinned in-memory, and many subpartitions, which are always stored in block cache. The size of the subpartitions is controlled by BlockBasedTableOptions::metadata_block_size. That option bounds the size of the allocations/block cache insertions that will be done as part of index reads.

More details: https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters
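A sketch of that configuration in C++ (the helper name and cache size are made up; see the wiki page above for the recommended settings):

  #include <rocksdb/cache.h>
  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  rocksdb::Options MakePartitionedIndexOptions() {
    rocksdb::BlockBasedTableOptions table_options;
    // Two-level (partitioned) index: a small top-level index points at many
    // index partitions, which flow through the block cache like data blocks.
    table_options.index_type =
        rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
    // Target size of each index partition (default value shown); this bounds
    // the size of the allocations/block cache insertions done on index reads.
    table_options.metadata_block_size = 4096;
    // Per the wiki's recommended settings, route index/filter blocks through
    // the block cache as well rather than pinning them on the heap.
    table_options.cache_index_and_filter_blocks = true;
    table_options.block_cache = rocksdb::NewLRUCache(512 << 20);  // placeholder

    rocksdb::Options options;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
  }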

benesch commented Sep 21, 2018

Ok, great, next steps:

  • See if we can get reg back on the mainline by plumbing the custom RocksDB option through cockroach start via something like --store=foo,rocksdb=block_based_table_factory={index_block_restart_interval=16}.

  • File an issue about investigating two-level indexes.

petermattis added the C-investigation label on Oct 14, 2018

github-actions bot commented Jun 5, 2021

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
