
db: investigate elevated batch commit tail latencies #2646

Closed · jbowens opened this issue Jun 16, 2023 · 3 comments

jbowens commented Jun 16, 2023

The following graphs are from one particular instance of this, but we've seen it across several test clusters including the 23.1 test cluster.

[Screenshot 2023-06-16 at 4 27 01 PM]
[Screenshot 2023-06-16 at 4 27 34 PM]

Internal discussion: https://cockroachlabs.slack.com/archives/C057ULDSKC0/p1686947338066129?thread_ts=1686924156.296339&cid=C057ULDSKC0


jbowens commented Jun 16, 2023

(Potentially) related to:

Also, ingests can be disruptive to batch commit tail latencies, although they are definitely not the only factor: we've seen elevated batch commit tail latencies in the absence of ingests.


jbowens commented Jun 21, 2023

As far as I can tell, there are two primary sources of tail latency:

  1. There's contention on d.commit.mu. Some batch commits block on acquiring d.commit.mu for 90ms:
slow batch commit: ‹(TotDur: 93.940495ms, SemaWait: 206ns, WALQueueWait: 0s, MemtblStall: 0s, L0Stall: 0s, WALRot: 0s, CommitWait: 489.115µs, MemtblApply: 10.924µs, Publish: 489.572µs, CommitMu: 93.436399ms, WALWrite: 1.094µs, DBMu: 124ns, PendingEnqueue: 308ns, RoomForWrite: 376ns, MemtblRot: 0s, MemtblPrep: 219ns)›

This seems to be driven by slow DB.makeRoomForWrite calls, which tend to be dominated by the memtable rotation:

slow make room for write: ‹(TotDur: 191.177358ms, SemaWait: 336ns, WALQueueWait: 0s, MemtblStall: 0s, L0Stall: 0s, WALRot: 607.996µs, CommitWait: 48ns, MemtblApply: 4.215µs, Publish: 797ns, CommitMu: 88ns, WALWrite: 2.49µs, DBMu: 91ns, PendingEnqueue: 227ns, RoomForWrite: 191.165895ms, MemtblRot: 190.55501ms, MemtblPrep: 347ns)›

I believe we're seeing contention on the block cache mutexes while reserving space for the memtables: #1997. These long memtable rotations keep DB.commit.mu locked throughout.

  2. Once a batch has been applied to the memtable, we need to wait (a) for the fsync to complete, if applicable, and (b) for the sequence number to be published:
slow batch commit: ‹(TotDur: 80.033189ms, SemaWait: 233ns, WALQueueWait: 0s, MemtblStall: 0s, L0Stall: 0s, WALRot: 0s, CommitWait: 80.027419ms, MemtblApply: 2.887µs, Publish: 80.027626ms, CommitMu: 343ns, WALWrite: 441ns, DBMu: 32ns, PendingEnqueue: 126ns, RoomForWrite: 284ns, MemtblRot: 0s, MemtblPrep: 170ns)›

I tried to check whether this might be due solely to the fsync by setting kv.raft_log.disable_synchronization_unsafe = true, and found that some batches still wait tens of milliseconds:

I230621 21:34:35.452617 1048 3@pebble/db.go:881 ⋮ [n2,s2,pebble] 5426  slow batch commit: ‹(TotDur: 57.603998ms, SemaWait: 172ns, WALQueueWait: 0s, MemtblStall: 0s, L0Stall: 0s, WALRot: 0s, CommitWait: 57.596621ms, MemtblApply: 5.669µs, Publish: 57.596777ms, CommitMu: 72ns, WALWrite: 196ns, DBMu: 38ns, PendingEnqueue: 119ns, RoomForWrite: 183ns, MemtblRot: 0s, MemtblPrep: 99ns)›

I wonder if there's contention on the visible sequence number in the publish loop:

		// We're responsible for publishing the sequence number for batch t, but
		// another concurrent goroutine might sneak in and publish the sequence
		// number for a subsequent batch. That's ok as all we're guaranteeing is
		// that the sequence number ratchets up.
		for {
			curSeqNum := p.env.visibleSeqNum.Load()
			newSeqNum := t.SeqNum() + uint64(t.Count())
			if newSeqNum <= curSeqNum {
				// t's sequence number has already been published.
				break
			}
			if p.env.visibleSeqNum.CompareAndSwap(curSeqNum, newSeqNum) {
				// We successfully published t's sequence number.
				break
			}
		}
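
To get a feel for how much that CAS ratchet can thrash under concurrency, here's a self-contained toy that mimics the loop above. This is an illustrative sketch, not Pebble's code; the goroutine count and per-batch key count are arbitrary, and the `casRetries` counter is added purely to surface lost races:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var (
		visibleSeqNum atomic.Uint64 // mirrors p.env.visibleSeqNum in the excerpt
		nextSeqNum    atomic.Uint64 // hands out [seqNum, seqNum+count) ranges
		casRetries    atomic.Int64  // CAS attempts that lost a race
	)
	const goroutines = 16
	const publishes = 1 << 16
	const count = 8 // keys per batch

	var wg sync.WaitGroup
	start := time.Now()
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < publishes; i++ {
				seqNum := nextSeqNum.Add(count) - count // this batch's first seqnum
				// Ratchet visibleSeqNum up to seqNum+count, as in the excerpt:
				// another goroutine may publish past us, which is fine because
				// the value only needs to ratchet upward.
				for {
					cur := visibleSeqNum.Load()
					newSeqNum := seqNum + count
					if newSeqNum <= cur {
						break // already covered by a later batch's publish
					}
					if visibleSeqNum.CompareAndSwap(cur, newSeqNum) {
						break
					}
					casRetries.Add(1)
				}
			}
		}()
	}
	wg.Wait()
	fmt.Printf("elapsed=%s casRetries=%d\n", time.Since(start), casRetries.Load())
}
```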

@jbowens jbowens self-assigned this Jun 22, 2023
@jbowens jbowens moved this from Incoming to In Progress (this milestone) in (Deprecated) Storage Jun 22, 2023

jbowens commented Jun 26, 2023

I played with a few tweaks to understand and reduce the tail latencies, including avoiding acquiring DB.mu in order to obtain a reference to the mutable memtable: b77606f...a891d5d
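
The general shape of that tweak, sketched below with hypothetical names (this is not the actual diff in b77606f...a891d5d), is to publish the mutable memtable through an atomic pointer so the commit path can load it without taking DB.mu:

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

// memTable stands in for Pebble's memtable type.
type memTable struct{ /* ... */ }

// DB holds only the fields relevant to the sketch.
type DB struct {
	mu sync.Mutex // still guards rotation and the rest of the DB state

	// mutable is the current mutable memtable, published atomically so
	// the commit path can read it lock-free.
	mutable atomic.Pointer[memTable]
}

// mutableMemtable returns the current mutable memtable without DB.mu.
func (d *DB) mutableMemtable() *memTable {
	return d.mutable.Load()
}

// rotateMemtable installs a replacement; callers hold d.mu, so there is a
// single writer. Concurrent commits observe either the old or the new
// memtable, never a torn value.
func (d *DB) rotateMemtable(next *memTable) {
	d.mutable.Store(next)
}
```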

It can be difficult to see past the noise, especially in the presence of write stalls (which can spike log commit latency dramatically) or around process start. The graph below uses a MIN aggregator to see past the spikes. It shows that these changes do help at the very tail, but the effect is more muted at p99.

[Screenshot 2023-06-26 at 11 58 06 AM: log commit latency, MIN aggregator]

For comparison, the WAL latency:
[Screenshot 2023-06-26 at 12 03 16 PM: WAL latency]

jbowens added a commit to jbowens/pebble that referenced this issue Jul 25, 2023
We've observed large allocations like the 64MB memtable allocation take 10ms+.
This can contribute to batch commit tail latencies by slowing down WAL/memtable
rotation during which the entire commit pipeline is stalled. This commit adapts
the memtable lifecycle to keep the most recent obsolete memtable around for the
next memtable allocation.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt    │               new.txt               │
                   │    sec/op    │   sec/op     vs base                │
RotateMemtables-24   105.56µ ± 1%   88.11µ ± 2%  -16.53% (p=0.000 n=25)

                   │   old.txt    │            new.txt             │
                   │     B/op     │     B/op      vs base          │
RotateMemtables-24   124.3Ki ± 0%   124.3Ki ± 0%  ~ (p=0.418 n=25)

                   │  old.txt   │              new.txt              │
                   │ allocs/op  │ allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   115.0 ± 0%  +0.88% (p=0.000 n=25)
```

Informs cockroachdb#2646.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 27, 2023
We've observed large allocations like the 64MB memtable allocation take 10ms+.
This can contribute to batch commit tail latencies by slowing down WAL/memtable
rotation during which the entire commit pipeline is stalled. This commit adapts
the memtable lifecycle to keep the most recent obsolete memtable around for use
as the next memtable.

This reduces the commit latency hiccup during a memtable rotation, and it also
reduces block cache mutex contention (cockroachdb#1997) by reducing the number of times we
must reserve memory from the block cache.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt   │               new.txt               │
                   │   sec/op    │   sec/op     vs base                │
RotateMemtables-24   120.7µ ± 2%   102.8µ ± 4%  -14.85% (p=0.000 n=25)

                   │   old.txt    │               new.txt               │
                   │     B/op     │     B/op      vs base               │
RotateMemtables-24   124.3Ki ± 0%   124.0Ki ± 0%  -0.27% (p=0.000 n=25)

                   │  old.txt   │              new.txt              │
                   │ allocs/op  │ allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   111.0 ± 0%  -2.63% (p=0.000 n=25)
```

Informs cockroachdb#2646.
jbowens added a commit that referenced this issue Jul 28, 2023
We've observed large allocations like the 64MB memtable allocation take 10ms+.
This can add latency to the WAL/memtable rotation critical section during which
the entire commit pipeline is stalled, contributing to batch commit tail
latencies. This commit adapts the memtable lifecycle to keep the most recent
obsolete memtable around for use as the next mutable memtable.

This reduces the commit latency hiccup during a memtable rotation, and it also
reduces block cache mutex contention (#1997) by reducing the number of times we
must reserve memory from the block cache.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt   │               new.txt               │
                   │   sec/op    │   sec/op     vs base                │
RotateMemtables-24   120.7µ ± 2%   102.8µ ± 4%  -14.85% (p=0.000 n=25)

                   │   old.txt    │               new.txt               │
                   │     B/op     │     B/op      vs base               │
RotateMemtables-24   124.3Ki ± 0%   124.0Ki ± 0%  -0.27% (p=0.000 n=25)

                   │  old.txt   │              new.txt              │
                   │ allocs/op  │ allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   111.0 ± 0%  -2.63% (p=0.000 n=25)
```

Informs #2646.
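
The recycling idea from the commit message can be sketched in a few lines. This is illustrative only: `memtableRecycler` and its methods are hypothetical names, and the real change also interacts with block cache reservations, which this sketch omits. Keeping at most one spare bounds steady-state overhead to a single extra memtable while removing the large allocation from the rotation critical section:

```go
package sketch

import "sync"

const memtableSize = 64 << 20 // 64MB, matching the allocation in the commit message

// memtableRecycler keeps the arena of the most recently retired memtable so
// the next rotation can reuse it instead of allocating a fresh one.
type memtableRecycler struct {
	mu    sync.Mutex
	spare []byte
}

// get returns an arena for a new memtable, preferring the recycled spare.
func (r *memtableRecycler) get() []byte {
	r.mu.Lock()
	defer r.mu.Unlock()
	if buf := r.spare; buf != nil {
		r.spare = nil
		return buf // reuse: no fresh 64MB allocation on the commit path
	}
	return make([]byte, memtableSize)
}

// put stashes a retired memtable's arena; keeping at most one spare bounds
// the steady-state overhead to one extra memtable of memory.
func (r *memtableRecycler) put(buf []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.spare = buf
}
```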
@jbowens jbowens closed this as completed Aug 14, 2023
(Deprecated) Storage automation moved this from In Progress (this milestone) to Done Aug 14, 2023