kvserver: excessive memory usage when receiving snapshots #73313

Open
erikgrinaker opened this issue Nov 30, 2021 · 17 comments
Labels
A-kv-replication (Relating to Raft, consensus, and coordination) · C-investigation (Further steps needed to qualify; C-label will change) · C-performance (Perf of queries or internals; solution not expected to change functional behavior) · O-support (Originated from a customer) · P-2 (Issues/test failures with a fix SLA of 3 months) · T-kv-replication (KV Replication Team)

Comments

erikgrinaker (Contributor) commented Nov 30, 2021

In a recent support escalation related to OOMs (see internal Slack thread), we noticed that an excessive amount of memory was being used by receiving Raft snapshots (snapshot-memory.pprof.gz):

[Screenshot: heap profile, 2021-11-30 16:53:24]

This 21.2.0 cluster was running 72 nodes with 8 stores each, and there was constant rebalancing churn due to a suspected allocator issue, so there were as many as 8 in-flight snapshots with many more queued up. As shown in the profile, we're allocating a large amount of memory when receiving the snapshot gRPC messages, and apparently holding onto it elsewhere in the system for longer than we should. (The backup memory usage in that profile is being investigated separately and can be ignored.)

We do not expect to see this amount of memory usage here. On the sender side, we send batches of KV pairs of around 256 KB, which is corroborated by the heap profile (buckets are at most 304 KB):

if bLen := int64(b.Len()); bLen >= kvSS.batchSize {
	if err := kvSS.sendBatch(ctx, stream, b); err != nil {
		return 0, err
	}
	// ...
}

On the receiver side, we limit the concurrency of inbound snapshots to 1 per store:

cleanup, err := s.reserveSnapshot(ctx, header)

We then use a multiSSTWriter to spool snapshots to disk, with a chunk size of 512 KB:

msstw, err := newMultiSSTWriter(ctx, kvSS.scratch, keyRanges, kvSS.sstChunkSize)

And then simply put the KV pair into the SST writer:

if err := msstw.Put(ctx, key, batchReader.Value()); err != nil {
	// ...
}

Most likely, something inside or adjacent to Pebble ends up holding onto the key and/or value we're passing into Put here. However, attempts to reproduce this by simply feeding batches into Store.receiveSnapshot() have proven fruitless so far.

Jira issue: CRDB-11535

erikgrinaker added the C-performance, A-kv-replication, and T-kv-replication labels on Nov 30, 2021
erikgrinaker added this to Incoming in Replication via automation on Nov 30, 2021
erikgrinaker moved this from Incoming to Backlog in Replication on Nov 30, 2021
erikgrinaker added the C-investigation label on Nov 30, 2021
erikgrinaker (Contributor, Author) commented:

This one is proving difficult to track down. I've traced all uses of the returned SnapshotRequest, and in particular, of the KVBatch byte slice and the key/value sub-slices. However, the key/value slices all end up being copied before they're stored anywhere, and the request itself should be eligible for garbage collection on the next kvBatchSnapshotStrategy.Receive loop iteration.
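
For illustration, here's a minimal sketch of the retention hazard this audit looks for (hypothetical names, not CockroachDB code): storing a sub-slice of a large incoming message pins the message's entire backing array, while storing an explicit copy retains only the bytes that are needed.

package main

import "fmt"

// retained simulates a long-lived structure (e.g. a cache or queue) that the
// audit would check for stored sub-slices.
var retained [][]byte

// storeReference stores a sub-slice of msg. The stored slice shares msg's
// backing array, so the entire message buffer (e.g. a ~256 KB snapshot batch)
// stays reachable for as long as the sub-slice does.
func storeReference(msg []byte) {
	retained = append(retained, msg[16:48])
}

// storeCopy copies the bytes into a fresh allocation, so msg becomes eligible
// for GC as soon as the caller drops it.
func storeCopy(msg []byte) {
	retained = append(retained, append([]byte(nil), msg[16:48]...))
}

func main() {
	batch := make([]byte, 256<<10) // stand-in for a received gRPC batch
	storeReference(batch)          // pins all 256 KiB
	storeCopy(batch)               // pins only 32 bytes
	fmt.Println(len(retained))     // 2
}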

erikgrinaker (Contributor, Author) commented Nov 30, 2021

I wonder if this is simply because the memory isn't being GCed fast enough -- I'm not sure if the heap dump includes memory that's eligible for GC but not yet GCed. The heap profile above was taken at 09:22:52.207. We have MemStats dumps taken at 09:22:32.478 and 09:22:53.508:

NumGC here shows 31518 and 31520 respectively, so 2 GC cycles in 20 seconds. The LastGC fields are consistent with this. There's also about 55 GB heap memory currently allocated (these machines have 128 GB memory), so allocating ~10GB heap memory isn't necessarily that big of a deal. It isn't entirely unthinkable that we receive and process snapshots taking up ~10 GB memory in 10-20 seconds, considering the high rebalancing rate, multiple stores, network buffers, etc.
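
For reference, a small sketch (not from the CockroachDB codebase) of the kind of MemStats comparison described above, printing NumGC, LastGC, and HeapAlloc so that two samples taken roughly 20 seconds apart show how many GC cycles ran in between:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// dumpMemStats prints the fields referenced above: the cumulative GC count,
// the time of the last completed GC, and the currently allocated heap bytes.
func dumpMemStats() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("NumGC=%d LastGC=%s HeapAlloc=%.1f GiB\n",
		ms.NumGC,
		time.Unix(0, int64(ms.LastGC)).Format(time.RFC3339Nano),
		float64(ms.HeapAlloc)/(1<<30))
}

func main() {
	dumpMemStats()
	time.Sleep(20 * time.Second)
	dumpMemStats() // NumGC advancing by ~2 matches the cadence seen above
}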

tbg (Member) commented Dec 1, 2021

That is a good thought; I hadn't considered how much RAM these machines have, which will make the GC much more lenient. Given how much we've investigated already without turning up an alternative explanation, I'm not sure there's much to do here until we see this on another cluster with less RAM, the theory being that we will never see it there due to more frequent GC pacing.

erikgrinaker (Contributor, Author) commented:

Yeah, I'm going to run a couple of GC/pprof experiments to see how it behaves.

erikgrinaker (Contributor, Author) commented:

I played around a bit with the GC; it appears that pprof heap profiles only include reachable allocations (i.e. not allocations that are eligible for GC but not yet collected). Script: heap_test.go.txt.

When reallocating 1 GB of memory inside a loop with 10 iterations, and verifying that no GC has taken place in the meantime, the profile only shows 1 GB of allocations even though we've allocated 10 GB across all iterations. NumGC and HeapAlloc confirm that none of it has been GCed: the latter shows 10 GB allocated (9 GB of which is unreachable and pending GC), yet only 1 GB turns up in the profile. These results hold even when I disable GC entirely with GOGC=off.
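
As a rough illustration (not the attached heap_test.go, just a sketch of the same idea): repeatedly replace a large allocation, then compare HeapAlloc against what the written heap profile reports.

package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	const size = 1 << 30 // 1 GB per iteration; scale down if memory is tight

	var buf []byte
	for i := 0; i < 10; i++ {
		// Replace the previous slice with a fresh allocation and touch it so
		// the pages are actually committed. The old slice becomes unreachable
		// but is not necessarily collected yet (or ever, with GOGC=off).
		buf = make([]byte, size)
		for j := 0; j < len(buf); j += 4096 {
			buf[j] = 1
		}
	}

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("NumGC=%d HeapAlloc=%.1f GiB\n", ms.NumGC, float64(ms.HeapAlloc)/(1<<30))

	// Per the observation above, the profile only accounts for reachable
	// memory (~1 GB here), even when HeapAlloc reports ~10 GB with GOGC=off.
	f, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.WriteHeapProfile(f); err != nil {
		panic(err)
	}
	runtime.KeepAlive(buf)
}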

This seems to indicate that we do in fact allocate and hold on to the snapshot memory allocations for longer than expected, although I do not understand where that happens since I've carefully audited the code. Perhaps there are some subtle details about memory profiles and GC that I'm missing, or I just didn't audit carefully enough.

Anyway, I think we've spent enough time on this already. I'm going to leave this until we see another instance of this on a different/smaller cluster.

erikgrinaker moved this from Backlog to 22.2 in Replication on Feb 25, 2022
dankinder commented Mar 29, 2022

Hi, I may be observing the same thing on one of our clusters, where we've had trouble maintaining stability. This is a 16-node v21.2.7 cluster; each node has 64 vCPU, 64 GB RAM, and 6 stores (900 GB SSD), running with --cache=0.05 --max-sql-memory=0.1.

The heap dumps I've captured don't show anything really interesting, but the metrics show a huge spike in CGo memory which OOMs the node:
[Screenshots: memory usage metrics, 2022-03-29 15:12:53 and 15:14:05]

It's hard to prove it is snapshot-related but I do see network activity and lease transfers happening before things go nuts:
[Screenshots: network activity and lease transfer metrics, 2022-03-29 15:14:27, 15:15:53, and 15:15:19]

Here's what I see in the logs right around 18:11, when the lease transfers start. The "skipping snapshot" lines occur well before and after this point, and there are lots of them. Note that another node (not the one that crashed) was recently re-added to the cluster and was receiving ranges from other nodes, so I expect that's why the send-snapshot timed out here.

I220329 18:11:05.069463 479699351 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s79,r122389/48:‹/Table/82/9/-749{8969…-5412…}›] 603790  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):64LEARNER
I220329 18:11:05.169173 479699052 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603791  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):49LEARNER
I220329 18:11:05.269695 479699354 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s82,r453738/20:‹/Table/75/11/20{78193…-80654…}›] 603792  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):25LEARNER
I220329 18:11:05.348228 479699213 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s84,r209766/55:‹/Table/82/9/-59{84112…-79271…}›] 603793  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s3):59LEARNER
I220329 18:11:05.569182 479697919 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s83,r504360/1:‹/Table/82/9/536{02296…-35683…}›] 603794  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):4LEARNER
I220329 18:11:05.569279 479700194 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s80,r466279/23:‹/Table/82/9/-28{41780…-36626…}›] 603795  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):28LEARNER
I220329 18:11:06.069476 479701407 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s79,r122389/48:‹/Table/82/9/-749{8969…-5412…}›] 603796  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):64LEARNER
I220329 18:11:06.167561 477562849 kv/kvserver/replica_command.go:1564 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603797  could not successfully add and upreplicate LEARNER replica(s) on [n1,s4], rolling back: ‹operation "send-snapshot" timed out after 1h0m0s›: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
I220329 18:11:06.168297 477562849 kv/kvserver/replica_command.go:2126 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603798  change replicas (add [] remove [(n1,s4):49LEARNER]): existing descriptor r195757:‹/Table/75/11/-646{7818859760271177/3793/583000273-6004820332880581/3793/834655436}› [(n5,s28):39, (n12,s71):48, (n14,s81):46, (n1,s4):49LEARNER, next=50, gen=1507]
I220329 18:11:06.169731 479701484 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603799  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):49LEARNER
I220329 18:11:06.179260 477562849 kv/kvserver/replica_raft.go:301 ⋮ [n14,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603800  proposing SIMPLE(r49) [(n1,s4):49LEARNER]: after=[(n5,s28):39 (n12,s71):48 (n14,s81):46] next=50
I220329 18:11:06.181561 477562849 kv/kvserver/replica_command.go:1830 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603801  rolled back LEARNER n1,s4 in r195757:‹/Table/75/11/-646{7818859760271177/3793/583000273-6004820332880581/3793/834655436}› [(n5,s28):39, (n12,s71):48, (n14,s81):46, (n1,s4):49LEARNER, next=50, gen=1507]
I220329 18:11:06.181849 477562849 kv/kvserver/replicate_queue.go:312 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603802  ‹operation "send-snapshot" timed out after 1h0m0s›: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
E220329 18:11:06.182047 477562849 kv/kvserver/queue.go:1088 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603803  ‹operation "replicate queue process replica 195757" timed out after 13m0s›: failed to replicate after 5 retries
I220329 18:11:06.270684 479703075 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s82,r453738/20:‹/Table/75/11/20{78193…-80654…}›] 603804  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):25LEARNER
I220329 18:11:06.348154 479703479 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s84,r209766/55:‹/Table/82/9/-59{84112…-79271…}›] 603805  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s3):59LEARNER
I220329 18:11:06.569683 479703779 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s80,r466279/23:‹/Table/82/9/-28{41780…-36626…}›] 603806  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):28LEARNER
I220329 18:11:06.569670 479702721 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s83,r504360/1:‹/Table/82/9/536{02296…-35683…}›] 603807  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):4LEARNER
I220329 18:11:07.069202 479705812 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s79,r122389/48:‹/Table/82/9/-749{8969…-5412…}›] 603808  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):64LEARNER
I220329 18:11:07.269738 479705798 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s82,r453738/20:‹/Table/75/11/20{78193…-80654…}›] 603809  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):25LEARNER
I220329 18:11:07.348126 479706144 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s84,r209766/55:‹/Table/82/9/-59{84112…-79271…}›] 603810  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s3):59LEARNER
I220329 18:11:07.469558 479705111 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s80,r466279/23:‹/Table/82/9/-28{41780…-36626…}›] 603811  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):28LEARNER
I220329 18:11:07.570257 479706291 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s83,r504360/1:‹/Table/82/9/536{02296…-35683…}›] 603812  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):4LEARNER
I220329 18:11:08.070123 479706190 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s79,r122389/48:‹/Table/82/9/-749{8969…-5412…}›] 603813  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):64LEARNER
I220329 18:11:08.269451 479707005 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s82,r453738/20:‹/Table/75/11/20{78193…-80654…}›] 603814  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):25LEARNER

We've been fighting this for a little while now, seeing OOMs or just latency spikes, and we thought it was caused by contention in our workload, since we see a small percentage of query restarts when things go bad. But having found this issue, I'm wondering if all of that was just a symptom (or red herring) of some unusual range/snapshot activity.

This cluster has 13k ranges, with most of the data in two tables that are each fairly simple and use a range size min/max of 256 MB/1 GB.

erikgrinaker (Contributor, Author) commented:

Thanks @dankinder -- the 1 hour snapshot timeout seems bad, and I think indicates some other underlying problem. Could you collect a debug.zip and submit it to https://upload.cockroachlabs.com/u/grinaker?

dankinder commented Mar 30, 2022

Yup I just uploaded it @erikgrinaker

erikgrinaker (Contributor, Author) commented:

Thanks, I'll have a look tomorrow!

dankinder commented:

Awesome. Btw I realized that my problem probably is not the same as this issue, since the issue you were investigating seemed to be memory use within the Go heap, not cgo/jemalloc. I've been taking heap profiles when I see the memory spike from 6GB up to 60GB+ and I don't see the heap mem really increase. The metrics also indicate it's cgo mem. If you want I can create a separate issue.

The only ways I'm seeing to capture a jemalloc profile are either to export MALLOC_CONF="prof:true" or to ln -s prof:true /etc/malloc.conf, and hope it captures one during a memory spike... I wish I could get one on demand. If you know of a better way, or anything else I should investigate, let me know...

dankinder commented:

I had no idea about admission control, I just found out about it from reading #76282 but I will be trying that out as well to see if it mitigates this issue.

erikgrinaker (Contributor, Author) commented:

Cool, worth a shot. I'm afraid I won't have a chance to look into this today, hope to get around to it tomorrow.

I believe only Pebble uses cgo allocations (but I could be wrong), so it might be something at the storage layer then.

erikgrinaker (Contributor, Author) commented:

Sorry for the delay, last few days have been hectic.

I've looked at a bunch of heap profiles from around the time when memory usage is high, but the heaps don't show anything drastic. They all have a fair amount of snapshot memory usage, both when memory usage is high and when it's normal, but it only accounts for something like 200 MB, which seems reasonable. E.g.:

[Screenshot: heap profile, 2022-04-05 15:15:45]

I do see a fair bit of snapshot activity though, much of which seems to be load-based rebalancing. You're running with snapshot transfer rates of 8 MB/s, which is fairly low. As long as your disks/network can handle it without interfering with the workload, I'd recommend bumping these to at least 32 MB/s, which is the current default (many users run with 256 MB/s).

  • kv.snapshot_rebalance.max_rate
  • kv.snapshot_recovery.max_rate

We've also seen a few instances where the rebalancer was too aggressive in moving replicas around; you can try increasing this to e.g. 0.1 and see if it has any effect:

  • kv.allocator.range_rebalance_threshold

We have an ongoing investigation into excessive Raft memory usage. Currently we're looking into command application rather than snapshots, and making some progress, but we should have a closer look at snapshots while we're at it (cc @tbg).

> the 1 hour snapshot timeout seems bad, and I think indicates some other underlying problem

This was misleading: #79424

dankinder commented:

Awesome, thanks for taking a look

erikgrinaker (Contributor, Author) commented:

@dankinder I don't know if this is directly related to the problems you're seeing, but you may want to follow along with #79752 and #79755. We've particularly seen this cause problems when there are a lot of SSTables flying around (due to bulk writes such as schema changes or imports) and stores struggle to keep up with the load, developing read amplification that throttles incoming SSTables long enough that they start piling up in the queues.

erikgrinaker added the O-support label on Oct 25, 2023
pav-kv added the P-2 label on Jan 23, 2024
pav-kv (Collaborator) commented Mar 21, 2024

Just to throw out an idea here: I wonder if the unexpected memory accumulation described in the original comment could be attributed to something like the common bug described in https://go.dev/blog/generic-slice-functions (slices can reference memory from the slots between len(slice) and cap(slice)). Sounds unlikely, but worth keeping in mind.
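
For context, here's a minimal sketch of that class of bug (my own example, assuming Go 1.21+ for clear and the slices package): the slots between len(s) and cap(s) still hold pointers, so values dropped by shrinking a slice can stay reachable.

package main

import (
	"fmt"
	"slices"
)

func main() {
	bufs := make([][]byte, 0, 3)
	for i := 0; i < 3; i++ {
		bufs = append(bufs, make([]byte, 1<<20)) // three 1 MiB buffers
	}

	// Shrink the slice. Before Go 1.22, slices.Delete did not zero the freed
	// tail slot, so the backing array still pointed at the third buffer and
	// kept it alive even though len(bufs) == 2.
	bufs = slices.Delete(bufs, 2, 3)
	fmt.Println(len(bufs), cap(bufs)) // 2 3

	// Explicitly clearing the unused capacity (or copying into a fresh slice)
	// drops the hidden reference and lets the buffer be collected.
	clear(bufs[len(bufs):cap(bufs)])
}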

erikgrinaker (Contributor, Author) commented:

We've had (and fixed) a few issues like that previously. Definitely plausible.
