cache: block allocations during compaction cause expensive TLB shootdowns #2693

Closed
nvanbenschoten opened this issue Jun 28, 2023 · 1 comment

nvanbenschoten commented Jun 28, 2023

See Slack thread: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1687921559962319.

Perf profiles from a write-heavy workload show a concerning amount of CPU time spent performing TLB shootdowns (native_flush_tlb_multi), both under __madvise (which I believe is invoked by jemalloc via glibc, not by the Go runtime) and when page faulting in cache.newValue. NUMA alignment improves matters, but the problem remains severe (27% of CPU without alignment, 18% with alignment).

This is reproducible on AWS with the following steps:

roachprod create $USER-test -n6 --clouds=aws --aws-machine-type-ssd='i4i.8xlarge' --aws-enable-multiple-stores=true
roachprod stage $USER-test cockroach # master
roachprod start $USER-test:1-5 --env=COCKROACH_ROCKSDB_CONCURRENCY=12
roachprod sql $USER-test:1 -- -e='ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5'
roachprod run $USER-test:6 -- './cockroach workload init kv --splits=20 --scatter {pgurl:1-5}'
roachprod run $USER-test:6 -- './cockroach workload run kv --concurrency=1024 --max-rate=60000 --min-block-bytes=2048 --max-block-bytes=2048 --ramp=1m --target-compression-ratio=1.5 {pgurl:1-5}'

The leading theory is that placing uncompressed blocks in the block cache during compaction makes it likely that a block is freed by a different thread than the one that allocated it, which renders the tcache (jemalloc's per-thread cache) ineffective: the page is faulted back in on rewrite, and madvise triggers a cross-core TLB flush. To verify this, we should prototype a change that does not populate the block cache during compaction reads.
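
A rough sketch of what such a prototype could look like (the names and structure below are hypothetical illustrations, not Pebble's actual API): on a cache miss, a compaction read fills its own buffer and skips the cache insert, so the block is allocated and freed on the same thread rather than being handed off to the shared cache.

package main

// Cache stands in for the shared block cache.
type Cache struct {
    blocks map[string][]byte
}

func (c *Cache) Get(key string) ([]byte, bool) {
    b, ok := c.blocks[key]
    return b, ok
}

func (c *Cache) Set(key string, b []byte) {
    c.blocks[key] = b
}

// readBlock returns the block for key, reading it from disk on a miss. When
// forCompaction is true, the freshly read block is not inserted into the
// cache, so it stays local to the compaction and is freed where it was
// allocated.
func readBlock(c *Cache, key string, forCompaction bool, readFromDisk func(string) []byte) []byte {
    if b, ok := c.Get(key); ok {
        return b // cache hit: serve the cached copy in either case
    }
    b := readFromDisk(key)
    if !forCompaction {
        c.Set(key, b) // user-facing reads still populate the cache
    }
    return b
}

func main() {
    c := &Cache{blocks: map[string][]byte{}}
    disk := func(key string) []byte { return make([]byte, 4096) }

    _ = readBlock(c, "sst-a/block0", true, disk)  // compaction read: not cached
    _ = readBlock(c, "sst-b/block0", false, disk) // normal read: cached
}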


[perf profile screenshot: perf_aws_32]

@nvanbenschoten nvanbenschoten added this to Incoming in Storage via automation Jun 28, 2023
@jbowens jbowens moved this from Incoming to In Progress (this milestone) in Storage Jun 30, 2023
@jbowens jbowens self-assigned this Jun 30, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Jul 13, 2023
During compactions, avoid populating the block cache with input files' blocks.
These files will soon be removed from the LSM, making it less likely any
iterator will need to read these blocks. While Pebble uses a scan-resistant
block cache algorithm (ClockPRO), the act of inserting the blocks into the
cache increases contention on the block cache mutexes (cockroachdb#1997). This contention
has been observed to significantly contribute to tail latencies, both for reads
and for writes during memtable reservation. Additionally, although these blocks
may be soon replaced with more useful blocks due to ClockPRO's scan resistance,
they may be freed by a different thread inducing excessive TLB shootdowns
(cockroachdb#2693).

A compaction only requires a relatively small working set of buffers during its
scan across input sstables. In this commit, we introduce a per-compaction
BufferPool that is used to allocate buffers during cache misses. Buffers are
reused throughout the compaction and only freed to the memory allocator when
they're too small or the compaction is finished. This reduces pressure on the
memory allocator and the block cache.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 14, 2023
jbowens added a commit that referenced this issue Jul 16, 2023
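
The commit message quoted above describes this per-compaction BufferPool. A minimal, self-contained Go sketch of the idea (hypothetical names, omitting the freeing of too-small buffers, and not Pebble's actual implementation):

package main

import "fmt"

// BufferPool holds a small working set of byte slices that a single
// compaction reuses across cache misses.
type BufferPool struct {
    free [][]byte // buffers available for reuse
}

// Alloc returns a buffer of at least n bytes, reusing a pooled buffer when
// one is large enough and allocating a fresh one otherwise.
func (p *BufferPool) Alloc(n int) []byte {
    for i, b := range p.free {
        if cap(b) >= n {
            // Reuse: remove the buffer from the free list and hand it out.
            p.free = append(p.free[:i], p.free[i+1:]...)
            return b[:n]
        }
    }
    // No pooled buffer is large enough; allocate a new one.
    return make([]byte, n)
}

// Free returns a buffer to the pool so a later Alloc can reuse it.
func (p *BufferPool) Free(b []byte) {
    p.free = append(p.free, b)
}

// Release drops every pooled buffer, returning the memory to the allocator
// once the compaction is finished.
func (p *BufferPool) Release() {
    p.free = nil
}

func main() {
    var pool BufferPool
    b := pool.Alloc(4096) // cache miss: read a block into b
    pool.Free(b)          // done with the block; keep the buffer for reuse
    c := pool.Alloc(2048) // reuses the 4096-byte buffer
    fmt.Println(cap(c))   // prints 4096
    pool.Release()        // compaction finished: hand memory back to the allocator
}

Here every Alloc during the compaction checks the pool first, and memory is only returned to the allocator in Release at the end of the compaction, which is the reduction in allocator and block cache pressure the commit message calls out.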

jbowens commented Aug 7, 2023

I'm going to close this out. We're tentatively not backporting this change to 23.1, although we may reevaluate depending on customer requirements.

@jbowens jbowens closed this as completed Aug 7, 2023
Storage automation moved this from In Progress (this milestone) to Done Aug 7, 2023