cache: block allocations during compaction cause expensive TLB shootdowns #2693

Closed
nvanbenschoten opened this issue Jun 28, 2023 · 1 comment

nvanbenschoten commented Jun 28, 2023

See Slack thread: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1687921559962319.

Perf profiles from a write-heavy workload show a concerning amount of CPU time spent performing TLB shootdowns (native_flush_tlb_multi), both under __madvise (which I believe is invoked by jemalloc via glibc, not by the Go runtime) and when page faulting in cache.newValue. NUMA alignment improves matters, but the problem remains severe (27% of CPU without alignment, 18% with alignment).

This is reproducible on AWS with the following steps:

roachprod create $USER-test -n6 --clouds=aws --aws-machine-type-ssd='i4i.8xlarge' --aws-enable-multiple-stores=true
roachprod stage $USER-test cockroach # master
roachprod start $USER-test:1-5 --env=COCKROACH_ROCKSDB_CONCURRENCY=12
roachprod sql $USER-test:1 -- -e='ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5'
roachprod run $USER-test:6 -- './cockroach workload init kv --splits=20 --scatter {pgurl:1-5}'
roachprod run $USER-test:6 -- './cockroach workload run kv --concurrency=1024 --max-rate=60000 --min-block-bytes=2048 --max-block-bytes=2048 --ramp=1m --target-compression-ratio=1.5 {pgurl:1-5}'

The leading theory is that placing uncompressed blocks in the block cache during compaction makes it likely that a block is freed by a different thread than the one that allocated it, which renders the tcache (jemalloc's per-thread cache) ineffective: the page is faulted back in on rewrite, and madvise triggers a cross-core TLB flush. To verify this, we should prototype a change that does not populate the block cache during compaction reads.
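
A rough sketch of what such a prototype could look like (the names and structure below are hypothetical illustrations, not Pebble's actual API): on a cache miss, a compaction read fills its own buffer and skips the cache insert, so the block is allocated and freed on the same thread rather than being handed off to the shared cache.

package main

// Cache stands in for the shared block cache.
type Cache struct {
    blocks map[string][]byte
}

func (c *Cache) Get(key string) ([]byte, bool) {
    b, ok := c.blocks[key]
    return b, ok
}

func (c *Cache) Set(key string, b []byte) {
    c.blocks[key] = b
}

// readBlock returns the block for key, reading it from disk on a miss. When
// forCompaction is true, the freshly read block is not inserted into the
// cache, so it stays local to the compaction and is freed where it was
// allocated.
func readBlock(c *Cache, key string, forCompaction bool, readFromDisk func(string) []byte) []byte {
    if b, ok := c.Get(key); ok {
        return b // cache hit: serve the cached copy in either case
    }
    b := readFromDisk(key)
    if !forCompaction {
        c.Set(key, b) // user-facing reads still populate the cache
    }
    return b
}

func main() {
    c := &Cache{blocks: map[string][]byte{}}
    disk := func(key string) []byte { return make([]byte, 4096) }

    _ = readBlock(c, "sst-a/block0", true, disk)  // compaction read: not cached
    _ = readBlock(c, "sst-b/block0", false, disk) // normal read: cached
}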


[perf profile screenshot: perf_aws_32]

@nvanbenschoten nvanbenschoten added this to Incoming in Storage via automation Jun 28, 2023
@jbowens jbowens moved this from Incoming to In Progress (this milestone) in Storage Jun 30, 2023
@jbowens jbowens self-assigned this Jun 30, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Jul 13, 2023
During compactions, avoid populating the block cache with input files' blocks.
These files will soon be removed from the LSM, making it less likely any
iterator will need to read these blocks. While Pebble uses a scan-resistant
block cache algorithm (ClockPRO), the act of inserting the blocks into the
cache increases contention on the block cache mutexes (cockroachdb#1997). This contention
has been observed to significantly contribute to tail latencies, both for reads
and for writes during memtable reservation. Additionally, although these blocks
may be soon replaced with more useful blocks due to ClockPRO's scan resistance,
they may be freed by a different thread inducing excessive TLB shootdowns
(cockroachdb#2693).

A compaction only requires a relatively small working set of buffers during its
scan across input sstables. In this commit, we introduce a per-compaction
BufferPool that is used to allocate buffers during cache misses. Buffers are
reused throughout the compaction and only freed to the memory allocator when
they're too small or the compaction is finished. This reduces pressure on the
memory allocator and the block cache.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 14, 2023
jbowens added a commit that referenced this issue Jul 16, 2023
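
The commit message quoted above describes this per-compaction BufferPool. A minimal, self-contained Go sketch of the idea (hypothetical names, omitting the freeing of too-small buffers, and not Pebble's actual implementation):

package main

import "fmt"

// BufferPool holds a small working set of byte slices that a single
// compaction reuses across cache misses.
type BufferPool struct {
    free [][]byte // buffers available for reuse
}

// Alloc returns a buffer of at least n bytes, reusing a pooled buffer when
// one is large enough and allocating a fresh one otherwise.
func (p *BufferPool) Alloc(n int) []byte {
    for i, b := range p.free {
        if cap(b) >= n {
            // Reuse: remove the buffer from the free list and hand it out.
            p.free = append(p.free[:i], p.free[i+1:]...)
            return b[:n]
        }
    }
    // No pooled buffer is large enough; allocate a new one.
    return make([]byte, n)
}

// Free returns a buffer to the pool so a later Alloc can reuse it.
func (p *BufferPool) Free(b []byte) {
    p.free = append(p.free, b)
}

// Release drops every pooled buffer, returning the memory to the allocator
// once the compaction is finished.
func (p *BufferPool) Release() {
    p.free = nil
}

func main() {
    var pool BufferPool
    b := pool.Alloc(4096) // cache miss: read a block into b
    pool.Free(b)          // done with the block; keep the buffer for reuse
    c := pool.Alloc(2048) // reuses the 4096-byte buffer
    fmt.Println(cap(c))   // prints 4096
    pool.Release()        // compaction finished: hand memory back to the allocator
}

Here every Alloc during the compaction checks the pool first, and memory is only returned to the allocator in Release at the end of the compaction, which is the reduction in allocator and block cache pressure the commit message calls out.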

jbowens commented Aug 7, 2023

I'm going to close this out. We're tentatively not backporting this change to 23.1, although we may reevaluate depending on customer requirements.

@jbowens jbowens closed this as completed Aug 7, 2023
Storage automation moved this from In Progress (this milestone) to Done Aug 7, 2023