bulk: memory monitor sst batching during RESTORE #64938
Conversation
Force-pushed from b1ac789 to 539de95
This should be ready for a review. I've tried a few different approaches and settled on this one, but I'm curious about your thoughts here. The main difficulty I'm running into is that the sstable writer in Pebble does the actual writing, which makes it hard to do the accounting from the batcher. Some approaches that I've tried here:
Force-pushed from a5548db to d0d67e0
pkg/storage/sst_writer.go (Outdated)
// AccountedMemFile wraps a MemFile so that it can account for the memory it
// uses in the given monitor if one is provided.
type AccountedMemFile struct {
does this feel a little out of place in pkg/storage? AFAIK, nothing else in storage uses monitors. I think we could put this in bulk where it is used instead.
pkg/storage/sst_writer.go (Outdated)
newCap := f.MemFile.Cap()
log.Infof(f.ctx, "growing by %d", int64(newCap-oldCap))
if err := f.memAcc.Grow(f.ctx, int64(newCap-oldCap)); err != nil {
    return n, err
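For context, a minimal sketch of what the full wrapper might look like, assuming MemFile exposes Cap() and an io.Writer-style Write as the excerpt suggests; the field names come from the diff, everything else is illustrative:

package storage

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/mon"
)

// AccountedMemFile wraps a MemFile so that it can account for the memory it
// uses in the given monitor if one is provided.
type AccountedMemFile struct {
    MemFile
    ctx    context.Context
    memAcc *mon.BoundAccount
}

func (f *AccountedMemFile) Write(p []byte) (int, error) {
    oldCap := f.MemFile.Cap()
    n, err := f.MemFile.Write(p)
    if err != nil {
        return n, err
    }
    // Only reserve more memory when the underlying buffer actually
    // reallocated and grew its capacity.
    if newCap := f.MemFile.Cap(); newCap > oldCap {
        if err := f.memAcc.Grow(f.ctx, int64(newCap-oldCap)); err != nil {
            return n, err
        }
    }
    return n, nil
}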
Crashing is obviously worse than erring here, sure, but ideally I'd say we could do even better and make flushes budget-aware instead?

Total strawman that I haven't thought through: have Write flip a cheap-to-check bool. Move this grow attempt to the adding loop, inside a check of that bool, and trigger a flush if it fails.

We could even avoid the check-after-growing-cap case that way, I think, by determining if the next write could push it over: e.g. if blocksize > cap-len, then we can suspect the next write will grow cap, so we could go ahead and ask for more now, and flush if we get denied. If we pad that by some trailer size, then I think we don't need to worry about flushing the trailer causing a realloc to grow cap, though I don't know if we can predict that perfectly since it'll depend on index size. Hmm.
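A hedged sketch of that strawman, under the assumption that the batcher looks roughly like the sstBatcher below; needsGrow, growHint, and the flush callback are illustrative names, not the real API:

package bulk

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/mon"
)

// sstBatcher is a stand-in for the real batcher; only the fields this
// sketch needs are shown.
type sstBatcher struct {
    memAcc    *mon.BoundAccount
    needsGrow bool  // flipped by the file's Write whenever its cap grows
    growHint  int64 // how much the next reservation should request
}

// maybeReserve runs in the adding loop before each write. If the budget is
// denied, it flushes to release memory instead of erroring mid-write.
func (b *sstBatcher) maybeReserve(ctx context.Context, flush func(context.Context) error) error {
    if !b.needsGrow {
        return nil
    }
    b.needsGrow = false
    if err := b.memAcc.Grow(ctx, b.growHint); err != nil {
        // Budget denied: flush to shrink the buffer rather than erroring.
        return flush(ctx)
    }
    return nil
}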
I took a brief look at that approach and found that it coupled the batcher pretty tightly to sstable.Writer specifics, which I originally wanted to avoid. One option here could be to use a pessimistic estimate (e.g. 1.5x block size): AddMVCCKey can grow the account by this estimate if it detects that the next write will grow the buffer, and flush if that fails. If the Grow succeeds, we can shrink the account by our estimate and grow it by the actual amount the buffer's capacity increased on Write. That way the memory accounting stays up to date most of the time.
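A rough sketch of that flow; mon.BoundAccount's Grow and Shrink are real, but willGrow, blockSize, buf, writer, and flush are hypothetical members invented for illustration:

package bulk

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/mon"
)

// Variant batcher sketch for the pessimistic-estimate idea.
type sstBatcher struct {
    memAcc    *mon.BoundAccount
    writer    interface{ Put(key, value []byte) error }
    buf       interface{ Cap() int } // the underlying in-memory file
    blockSize int64
    willGrow  func(n int) bool // would writing n more bytes grow the buffer?
    flush     func(context.Context) error
}

func (b *sstBatcher) putAccounted(ctx context.Context, key, value []byte) error {
    if !b.willGrow(len(key) + len(value)) {
        return b.writer.Put(key, value)
    }
    est := b.blockSize * 3 / 2 // pessimistic 1.5x block-size estimate
    if err := b.memAcc.Grow(ctx, est); err != nil {
        // Denied: flush to free budget, then retry the reservation
        // against the now-empty buffer.
        if err := b.flush(ctx); err != nil {
            return err
        }
        if err := b.memAcc.Grow(ctx, est); err != nil {
            return err
        }
    }
    oldCap := b.buf.Cap()
    if err := b.writer.Put(key, value); err != nil {
        return err
    }
    // Swap the estimate for the actual growth so the account stays accurate.
    b.memAcc.Shrink(ctx, est)
    return b.memAcc.Grow(ctx, int64(b.buf.Cap()-oldCap))
}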
For completeness, we should probably also account for the sstable.Writer buffer of up to block-size, of which it'll make a copy during compression, so maybe add 2x block-size to our reservation. I don't know how to estimate the size of the accumulated index entries though, since it is all held in memory until we flush. Maybe we should just make up a number and reserve that much for overhead?

I guess ingestion SSTs will use decently small blocks, so we're looking at <100kb, and given the whole SST is batched to <100mb, the index probably isn't all that big either. Maybe we don't care about the Writer overhead.
Force-pushed from 693e62b to 7426ee1
Following our conversation from today, maybe this memory monitoring can be simplified, since it doesn't need to be that accurate. The SST buffer isn't typically that large and is comparable to the minimum buffer size of a buffering adder. I'm thinking we can just reserve maxFlushSize() (and possibly some overhead for the SST writers) up front. This would allow the workers to fail up front. We'll likely reserve more memory than needed, given that each block is compressed when flushed, but it seems like a reasonable upper bound. There are also other minor discrepancies between the DataSize and the actual number of bytes used, but I think it's a good estimate. We'd also need to periodically check if the maxFlushSize changed, but I think it's reasonable to potentially error out if the cluster setting was changed mid-restore and there wasn't enough memory in the cluster. Is there anything obvious that I missed here? If not, I'll start on that implementation.
I updated the approach here to align with what's described above. Since the buffer here is supposed to stay pretty small, and we frequently expect to fill it up, I think it makes sense to start off by trying to allocate the memory up front and failing if there isn't enough available. This is also the simplest approach to backport in combination with the parallelization changes. When combined with the parallelization, this could be extended by teaching the workers to back off if they're running into memory pressure, by re-queuing their work and reducing the size of the worker pool.
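A minimal sketch of that up-front reservation, assuming hypothetical maxFlushSize and writerOverhead values passed in by the caller (the real change would read the cluster setting):

package bulk

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/mon"
    "github.com/cockroachdb/errors"
)

type sstBatcher struct {
    memAcc   *mon.BoundAccount
    reserved int64 // bytes reserved up front in memAcc
}

// makeAccountedBatcher reserves the worst-case buffer size up front so a
// restore worker fails fast when the cluster is short on memory, rather
// than hitting the limit mid-flush.
func makeAccountedBatcher(ctx context.Context, memAcc *mon.BoundAccount, maxFlushSize, writerOverhead int64) (*sstBatcher, error) {
    reserved := maxFlushSize + writerOverhead
    if err := memAcc.Grow(ctx, reserved); err != nil {
        return nil, errors.Wrap(err, "not enough memory to buffer SSTs during restore")
    }
    return &sstBatcher{memAcc: memAcc, reserved: reserved}, nil
}

// maybeGrowReservation is the periodic re-check: if the max flush size grew
// mid-restore, reserve the difference or error out.
func (b *sstBatcher) maybeGrowReservation(ctx context.Context, maxFlushSize, writerOverhead int64) error {
    if want := maxFlushSize + writerOverhead; want > b.reserved {
        if err := b.memAcc.Grow(ctx, want-b.reserved); err != nil {
            return err
        }
        b.reserved = want
    }
    return nil
}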
Force-pushed from 5e2b7ea to 8154fd5
Force-pushed from 6fc0310 to f5f8559
This commit narrowly focuses on adding memory monitoring for the buffering portion of a RESTORE. SSTs are buffered during restore before being flushed with an AddSSTable request. The memory used to buffer the table is now accounted for under the bulk memory monitor. Restore will attempt to flush its buffer if it runs out of memory it can reserve. Release note (enterprise change): Data buffered during RESTORE is now counted towards the cluster's --sql-max-memory limit to help guard against OOMs.
Going to break this down starting with #67477
This commit narrowly focuses on adding memory monitoring for the
buffering portion of a RESTORE. SSTs are buffered during restore before
being flushed with an AddSSTable request. The memory used to buffer the
table is now accounted for under the bulk memory monitor.
Memory pressure becomes more likely once this buffering is parallelized
in #61942.
Memory used by iterators will be tracked in a separate PR.
Informs #64936.
Release note (enterprise change): Data buffered during RESTORE is now
counted towards the cluster's --sql-max-memory limit to help guard
against OOMs. If not enough memory is available, the RESTORE will fail.