production: registration outage postmortem 9/19 #30399

Closed
benesch opened this issue Sep 19, 2018 · 11 comments

Comments

benesch commented Sep 19, 2018

Yesterday we experienced an outage of our internal registration cluster, which we consider a production cluster. I'll hopefully have time to flesh this postmortem out further tomorrow. For now, the short summary is that the cluster found itself in a situation with many unquiesced replicas, each of which was periodically heartbeating. Due to two bugs, one in CockroachDB (#30398) and one in etcd/raft (etcd-io/etcd#10106), each of these heartbeats caused an fsync. The AWS EBS volumes on which the registration cluster nodes run were capped at 600 IOPS; the storm of fsyncs blew past this cap. The result appears to have been strict throttling, with individual disk writes taking many seconds to complete. This rendered all nodes permanently non-live, as node liveness heartbeats could never complete successfully, and so the cluster could not make progress.

The cluster was running v2.1-beta.20180910 at the time of the outage. It is now running a custom binary with etcd-io/etcd#10106 applied and has recovered.

I was able to reproduce a somewhat similar storm of non-quiescent replicas on my own three-node cluster by taking two nodes down at the same time:

[image: graph of the non-quiescent replica count]

After bringing one of the downed nodes back online, two thirds of the replicas are unquiesced (perhaps those whose leases were on the downed nodes?), and the cluster is re-quiescing them only very slowly, at roughly 30/min. I thought we'd made progress on this front with #26911, but perhaps there is more to do.

/cc @bdarnell @nvanbenschoten @petermattis @dt

@petermattis (Collaborator)

I think the interesting difference on the registration cluster is the use of AWS EBS volumes with IOPS caps. As far as I know, we're not running test clusters in that configuration. We probably should.

@bdarnell (Contributor)

> I think the interesting difference on the registration cluster is the use of AWS EBS volumes with IOPS caps.

On the class of EBS volume we were using here (gp2), the IO cap is 3 IOPS per GB of storage. We were using about a quarter of the space on our 200GB volume and maxing out the IOPS. Before the outage we were already using about half of the IO budget while using only about a quarter of the space, which suggests we would hit the IO cap well before using all the storage on an EBS volume: you'd need to provision roughly double the space just to get the necessary IO budget.

Now that the empty-commit patch has been deployed, we're using just over a quarter of the IO budget (and still just under a quarter of the space). It looks like we're likely to hit the IO wall again just before we run out of space. We should keep an eye on our IO usage as a function of space usage so we can set guidelines appropriately for users deploying on configurations like this.
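Spelling out those fractions (rough numbers derived from the 3 IOPS/GB rate, the 200GB volume, and the approximate percentages above, not from billing data):

  IO budget:     200 GB × 3 IOPS/GB ≈ 600 IOPS
  before patch:  ~1/2 of budget ≈ 300 IOPS for ~50 GB of data ≈ 6 IOPS per GB of data
  after patch:   ~1/4 of budget ≈ 150-160 IOPS for ~50 GB of data ≈ 3 IOPS per GB of data

That last ratio sits right at gp2's 3 IOPS per GB provisioning rate, which is why IO and space now look like they will run out at roughly the same time.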

@a-robinson (Contributor)

For what it's worth, we're ignoring our own docs' recommendation to use provisioned IOPS SSD-backed EBS (the io1 type) rather than the gp2 type.

benesch commented Sep 21, 2018

Ok, I think I finally got to the bottom of the persistent registration cluster unhappiness. The problem is that we have large, repetitive primary keys, some as long as 64 kilobytes. These keys compress extremely well, resulting in SSTs that are, in one case, 2+ GB uncompressed but only about 100MB compressed:

Table Properties:
------------------------------
  # data blocks: 28278
  # entries: 3461846
  # range deletions: 0
  raw key size: 2083232043
  raw average key size: 601.769126
  raw value size: 17396331
  raw average value size: 5.025160
  data block size: 104867013
  index block size: 520682676
  filter block size: 0
  (estimated) table size: 625549689
  filter policy name: rocksdb.BuiltinBloomFilter
  column family ID: 0
  column family name: default
  comparator name: cockroach_comparator
  merge operator name: cockroach_merge_operator
  property collectors names: [TimeBoundTblPropCollectorFactory]
  SST file compression algo: Snappy
  creation time: 1537457152
  time stamp of earliest key: 0
  # deleted keys: 0
  # merge operands: 0
Raw user collected properties
------------------------------
  # crdb.ts.max: 0x15562510381C7372
  # crdb.ts.min: 0x14D2DAB77312C71B
  # rocksdb.block.based.table.index.type: 0x00000000
  # rocksdb.block.based.table.prefix.filtering: 0x31
  # rocksdb.block.based.table.whole.key.filtering: 0x30
  # rocksdb.deleted.keys: 0x00
  # rocksdb.merge.operands: 0x00
total number of files: 1
total number of data blocks: 28278
total data block size: 104867013
total index block size: 520682676
total filter block size: 0

Now, problematically, our RocksDB is configured to pin index and filter blocks in memory, outside of the block cache. (See this open issue: #7576.) Some quick back-of-the-envelope math suggests there are 4GB (!) of index blocks on register 1 right now:

# cockroach-benesch2 debug sst_dump properly calls `fflush(stdout)` before exiting
$ for f in *.sst; do ~/cockroach-benesch2 debug sst_dump --file=$f --show_summary --show_properties > $f.info; done
$ grep 'total index block size' *.info | cut -f3 -d: | paste -sd+ | bc
4064689786

The fix was literally as simple as deploying a binary with table_options.cache_index_and_filter_blocks = true, as suggested in #7576.
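For context, here's a minimal sketch of what that option change looks like at the RocksDB layer (plain C++ against the stock RocksDB API, not the actual cockroach code; the cache size and path are placeholders):

  #include <rocksdb/cache.h>
  #include <rocksdb/db.h>
  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  int main() {
    rocksdb::BlockBasedTableOptions table_options;
    // Store index and filter blocks in the block cache instead of pinning
    // them on the heap, so their memory footprint is bounded by the cache.
    table_options.cache_index_and_filter_blocks = true;
    table_options.block_cache = rocksdb::NewLRUCache(512 << 20);  // placeholder size

    rocksdb::Options options;
    options.create_if_missing = true;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example-db", &db);
    delete db;
    return s.ok() ? 0 : 1;
  }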

Unfortunately this might be causing quite a bit of thrashing. The nightly backup is taking much longer than usual.

@petermattis (Collaborator)

The performance hit from cache_index_and_filter_blocks was a few percent. The impact might be higher on register. We'll have to keep an eye on it.

Note that the performance hit mentioned in #7576 was mild (a few percent). I have an idea for how to mitigate that. But the effect on register might be more severe because we won't just be hitting the cache to access the index and filter blocks, but reading (and uncompressing) them from disk.

Something to investigate is that RocksDB now has an index_block_restart_interval option. I'm not sure when this was added, but I don't recall investigating it the last time I tuned RocksDB options. The default value is 1, which means that keys in index blocks are not prefix-compressed. If the keys in the index blocks share large prefixes, then setting that option to 2 or 4 could halve or quarter the size of the index blocks.
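For reference, a sketch of where that knob lives (C++ against the stock RocksDB API; the helper name is made up and the value 16 is only illustrative, matching the number floated later in this thread):

  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  rocksdb::Options MakeOptions() {
    rocksdb::BlockBasedTableOptions table_options;
    // Default is 1: every index entry is a restart point, so index keys are
    // stored in full with no prefix compression. A larger interval lets
    // consecutive index keys share prefixes, at the cost of a short linear
    // scan within each restart group on lookup.
    table_options.index_block_restart_interval = 16;  // illustrative value

    rocksdb::Options options;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
  }

Note this only affects newly written sstables; existing ones keep the index layout they were built with.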

benesch commented Sep 21, 2018

> The performance hit from cache_index_and_filter_blocks was a few percent. The impact might be higher on register. We'll have to keep an eye on it.

Yeah, we caused some serious unhappiness after flipping cache_index_and_filter_blocks on when we tried to run a backup concurrently with a ton of rebalancing activity. Wouldn't be surprised if this was cache thrashing caused by the new option. But this time reg just slowed down instead of OOMing, so I guess that's an improvement.

> Something to investigate is that RocksDB now has an index_block_restart_interval option. I'm not sure when this was added, but I don't recall investigating it the last time I tuned RocksDB options. The default value is 1, which means that keys in index blocks are not prefix-compressed. If the keys in the index blocks share large prefixes, then setting that option to 2 or 4 could halve or quarter the size of the index blocks.

Yeah, @ajkr suggested this. Unfortunately it's too late on the registration cluster since these SSTs are already created, but it's definitely worth looking into to prevent this problem in new clusters.

We could also investigate two-level indexes: https://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html.

@petermattis (Collaborator)

> Yeah, @ajkr suggested this. Unfortunately it's too late on the registration cluster since these SSTs are already created, but it's definitely worth looking into to prevent this problem in new clusters.

We could enable that option and then force compact all of the sstables.
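Roughly speaking, that would be a manual compaction over the entire keyspace after reopening with the new table options; a sketch against the RocksDB C++ API (the helper name is made up):

  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  // Rewrite every sstable so that tables written under the old options
  // (e.g. index_block_restart_interval=1) are rebuilt with the new ones.
  rocksdb::Status ForceFullCompaction(rocksdb::DB* db) {
    rocksdb::CompactRangeOptions opts;
    // Recompact the bottommost level even when it has nothing else to do;
    // otherwise the existing bottom-level sstables would be left untouched.
    opts.bottommost_level_compaction =
        rocksdb::BottommostLevelCompaction::kForce;
    // nullptr begin/end means the entire key range.
    return db->CompactRange(opts, nullptr, nullptr);
  }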

dt commented Sep 21, 2018

We also need to export/import the reg data anyway to change the type of a column in the primary key, so we were already planning on making new SSTs in an offline cluster; if we want to experiment with flags for that cluster, that would be the opportunity.

ajkr commented Sep 21, 2018

Another useful option for large index blocks is partitioned index (set BlockBasedTableOptions::index_type to kTwoLevelIndexSearch). That splits the index block so there's a root partition, which can be pinned in-memory, and many subpartitions, which are always stored in block cache. The size of the subpartitions is controlled by BlockBasedTableOptions::metadata_block_size. That option bounds the size of the allocations/block cache insertions that will be done as part of index reads.

More details: https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters
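A sketch of that configuration in C++ (the helper name and cache size are made up; see the wiki page above for the recommended settings):

  #include <rocksdb/cache.h>
  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  rocksdb::Options MakePartitionedIndexOptions() {
    rocksdb::BlockBasedTableOptions table_options;
    // Two-level (partitioned) index: a small top-level index points at many
    // index partitions, which flow through the block cache like data blocks.
    table_options.index_type =
        rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
    // Target size of each index partition (default value shown); this bounds
    // the size of the allocations/block cache insertions done on index reads.
    table_options.metadata_block_size = 4096;
    // Per the wiki's recommended settings, route index/filter blocks through
    // the block cache as well rather than pinning them on the heap.
    table_options.cache_index_and_filter_blocks = true;
    table_options.block_cache = rocksdb::NewLRUCache(512 << 20);  // placeholder

    rocksdb::Options options;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
  }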

benesch commented Sep 21, 2018

Ok, great, next steps:

  • See if we can get reg back on the mainline by plumbing the custom RocksDB option through cockroach start via something like --store=foo,rocksdb=block_based_table_factory={index_block_restart_interval=16}.

  • File an issue about investigating two-level indexes.

petermattis added the C-investigation label on Oct 14, 2018

github-actions bot commented Jun 5, 2021

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
