
bulk: memory monitor sst batching during RESTORE #64938

Closed
wants to merge 1 commit

Conversation


@pbardea pbardea commented May 10, 2021

This commit narrowly focuses on adding memory monitoring for the
buffering portion of a RESTORE. SSTs are buffered during restore before
being flushed with an AddSSTable request. The memory used to buffer the
table is now accounted for under the bulk memory monitor.

Memory pressure becomes more likely once this buffering is parallelized
in #61942.

Memory used by iterators will be tracked in a separate PR.

Informs #64936.

Release note (enterprise change): Data buffered during RESTORE is now
counted towards the cluster's --sql-max-memory limit to help guard
against OOMs. If not enough memory is available, the RESTORE will fail.


@pbardea pbardea force-pushed the sst-batcher-mem branch 5 times, most recently from b1ac789 to 539de95 Compare May 17, 2021 19:38
@pbardea pbardea changed the title bulk: memory monitor RESTORE bulk: memory monitor sst batching during RESTORE May 17, 2021
@pbardea pbardea requested review from adityamaru and dt May 18, 2021 14:41
@pbardea pbardea marked this pull request as ready for review May 18, 2021 14:41
@pbardea pbardea requested a review from a team May 18, 2021 14:41

pbardea commented May 18, 2021

This should be ready for a review. I've tried a few different approaches and settled on this one, but I'm curious about your thoughts here. The main difficulty I'm running into is that the allocations happen inside Pebble's sstable writer rather than in the batcher itself. Some approaches that I've tried:

  1. Just reserve enough memory, up to maxSize, upon Reset. There are a couple of ways this isn't exactly correct: if the user increases the buffer size during batching, we won't account for the new buffer size until that limit is reached, and we also don't fully account for all allocations done by the writer.
  2. Account for the memory used by each Put() into the writer. This also doesn't fully account for the allocations done by the writer under the sst batcher, but it has the advantage that it would hit the memory limit if the buffer limit was changed mid-ingestion.
  3. (The current approach) Keep track of the allocations directly in the MemFile that the SST batcher writes to. Here we can properly account for the allocations that are actually happening, and we free them when the file is closed. This seems to be okay for the RESTORE use case, but when splitting an SSTable we hold references to the contents of these files after they are closed; that memory should be separately accounted for by the batcher. I wanted feedback in case there is something obviously wrong with this approach, since it doesn't look like we have precedent for buffers that account for their own memory (see the sketch below).
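A rough sketch of approach 3, assuming a bytes.Buffer-backed in-memory file and the mon.BoundAccount API; the type and field names here are illustrative rather than the PR's actual code:

```go
package bulk

import (
	"bytes"
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/mon"
)

// accountedBuffer is an in-memory file whose backing buffer reports
// capacity growth to a memory account as it happens.
type accountedBuffer struct {
	ctx    context.Context
	memAcc *mon.BoundAccount
	buf    bytes.Buffer
}

// Write appends p and grows the account by however much the underlying
// buffer's capacity actually increased.
func (f *accountedBuffer) Write(p []byte) (int, error) {
	oldCap := f.buf.Cap()
	n, err := f.buf.Write(p)
	if err != nil {
		return n, err
	}
	if newCap := f.buf.Cap(); newCap > oldCap {
		if err := f.memAcc.Grow(f.ctx, int64(newCap-oldCap)); err != nil {
			return n, err
		}
	}
	return n, nil
}

// Close releases the reservation when the file is closed, i.e. when the
// restore is done with the buffered SST.
func (f *accountedBuffer) Close() error {
	f.memAcc.Shrink(f.ctx, int64(f.buf.Cap()))
	f.buf = bytes.Buffer{}
	return nil
}
```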

@pbardea pbardea force-pushed the sst-batcher-mem branch 3 times, most recently from a5548db to d0d67e0 Compare May 18, 2021 22:42

// AccountedMemFile wraps a MemFile so that it can account for the memory it uses
// in the given monitor if one is provided.
type AccountedMemFile struct {
Member

does this feel a little out of place in pkg/storage? AFAIK, nothing else in storage uses monitors. I think we could put this in bulk where it is used instead.

pkg/kv/bulk/sst_batcher.go
	newCap := f.MemFile.Cap()
	log.Infof(f.ctx, "growing by %d", int64(newCap-oldCap))
	if err := f.memAcc.Grow(f.ctx, int64(newCap-oldCap)); err != nil {
		return n, err
Member

Crashing is obviously worse than erroring out here, sure, but ideally I'd say we could do even better and make flushes budget-aware instead?

Total strawman that I haven't thought through: have Write flip a cheap-to-check bool. Move this grow attempt to the adding loop, inside a check of that bool, and trigger a flush if it fails.

We could even avoid the check-after-growing-cap case that way, I think, by determining if the next write could push it over e.g. if blocksize > cap-len, then we can suspect next write will grow cap, so we could go ahead and ask for more now and flush if we get denied. If we pad that by some trailer size, then I think we don't need to worry about flushing the trailer causing a realloc to grow cap, though I don't know if we can predict that perfectly since it'll depend on index size. Hmm.
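A minimal sketch of that strawman, under the same assumptions as the sketch above; budgetedBatcher, its fields, and doFlush are hypothetical stand-ins for the real sst_batcher state:

```go
package bulk

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/mon"
)

// budgetedBatcher carries only the state needed to illustrate the idea:
// the file's Write flips a cheap bool when its capacity grows, and the
// key-adding loop checks that bool, asks the monitor for the difference,
// and flushes instead of erroring if the request is denied.
type budgetedBatcher struct {
	ctx         context.Context
	memAcc      *mon.BoundAccount
	sizeChanged bool  // flipped by the file's Write when its cap grows
	unaccounted int64 // capacity growth observed since the last Grow
}

// maybeGrowOrFlush is called from the adding loop on every key.
func (b *budgetedBatcher) maybeGrowOrFlush() error {
	if !b.sizeChanged {
		return nil // cheap-to-check fast path
	}
	b.sizeChanged = false
	if err := b.memAcc.Grow(b.ctx, b.unaccounted); err != nil {
		// Budget denied: flush what we have rather than failing the restore.
		return b.doFlush()
	}
	b.unaccounted = 0
	return nil
}

func (b *budgetedBatcher) doFlush() error {
	// Placeholder: the real batcher would issue an AddSSTable request here
	// and then release its reservation.
	return nil
}
```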

Contributor Author

I took a brief look at that approach and found that it quickly coupled the batcher tightly to sstable.Writer specifics, which I originally wanted to avoid. One option here could be to use a pessimistic estimate (e.g. 1.5x the block size): AddMVCCKey can grow the account by this estimate if it detects that the next write will grow the buffer, and flush if that fails. If the Grow succeeds, we can shrink the account by our estimate and grow it by the amount the buffer actually increased on Write. That way the memory accounting stays up to date most of the time.
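A sketch of that pessimistic-estimate idea under the same bytes.Buffer/mon.BoundAccount assumptions; blockSize, errFlushNeeded, and the flat key/value handling are illustrative, not the batcher's real code:

```go
package bulk

import (
	"bytes"
	"context"
	"errors"

	"github.com/cockroachdb/cockroach/pkg/util/mon"
)

// blockSize is an assumed ingestion block size; the real value comes from
// the sstable writer options.
const blockSize = 32 << 10

// errFlushNeeded tells the caller to flush and then retry the key.
var errFlushNeeded = errors.New("memory budget exhausted; flush needed")

// addWithEstimate reserves ~1.5x the block size before a write that looks
// like it will grow the buffer, and asks for a flush if that reservation is
// denied. On success it swaps the estimate for the actual capacity growth so
// the accounting stays accurate most of the time.
func addWithEstimate(ctx context.Context, acc *mon.BoundAccount, buf *bytes.Buffer, kv []byte) error {
	if buf.Cap()-buf.Len() >= len(kv) {
		// The write fits in existing capacity; nothing new to account for.
		_, err := buf.Write(kv)
		return err
	}
	estimate := int64(blockSize + blockSize/2) // pessimistic 1.5x block size
	if err := acc.Grow(ctx, estimate); err != nil {
		return errFlushNeeded
	}
	oldCap := buf.Cap()
	if _, err := buf.Write(kv); err != nil {
		return err
	}
	// Replace the estimate with the actual capacity growth.
	acc.Shrink(ctx, estimate)
	return acc.Grow(ctx, int64(buf.Cap()-oldCap))
}
```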


dt commented May 19, 2021

For completeness, we should probably also account for the sstable.Writer buffer of up to block-size, of which it'll make a copy during compression, so maybe add 2x block-size to our reservation. I don't know how to estimate the size of the accumulated index entries though, since they're all held in memory until we flush. Maybe we should just make up a number and reserve that much for overhead?


dt commented May 19, 2021

I guess ingestion SSTs will use decently small blocks so we're looking at <100kb, and given the whole SST is batched to <100mb, the index probably isn't all that big either. Maybe we don't care about the Writer overhead.

@pbardea pbardea force-pushed the sst-batcher-mem branch 2 times, most recently from 693e62b to 7426ee1 Compare May 20, 2021 14:25

pbardea commented May 24, 2021

Following our conversation from today, maybe this memory monitoring can be simplified since it doesn't need to be that accurate. The SST buffer isn't typically that large and is comparable to the minimum buffer size of a buffering adder.

I'm thinking we can just reserve maxFlushSize() (and possibly some overhead for the sst writers) up front. This would allow the workers to fail up front. We'll likely reserve more memory than needed, given that each block is compressed when flushed, but it seems like a reasonable upper bound. There are also other minor discrepancies between the DataSize and the actual number of bytes used, but I think it's a good estimate.

We'd also need to periodically check if the maxFlushSize changed, but I think it's reasonable to potentially error out if the cluster setting was changed mid-restore and there wasn't enough memory in the cluster.

Is there anything obvious that I missed here? If not I'll start on that implementation.
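A minimal sketch of that up-front reservation, where maxFlushSize and writerOverhead are hypothetical stand-ins for the cluster-setting-derived flush size and some padding for sstable.Writer buffers:

```go
package bulk

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/mon"
)

// writerOverhead is an assumed amount of slack for the sstable.Writer's own
// block and index buffers.
const writerOverhead = 1 << 20

// reserveUpFront grabs the whole expected buffer size before any data is
// buffered, so a restore worker fails fast if the memory isn't available.
// It returns the amount reserved so later resizes can be computed.
func reserveUpFront(ctx context.Context, acc *mon.BoundAccount, maxFlushSize int64) (int64, error) {
	reserved := maxFlushSize + writerOverhead
	return reserved, acc.Grow(ctx, reserved)
}

// maybeResize re-checks the limit periodically (e.g. on each buffer reset) in
// case the cluster setting was changed mid-restore; it errors out if the
// larger reservation can't be satisfied.
func maybeResize(ctx context.Context, acc *mon.BoundAccount, reserved, maxFlushSize int64) (int64, error) {
	if want := maxFlushSize + writerOverhead; want > reserved {
		if err := acc.Grow(ctx, want-reserved); err != nil {
			return reserved, err
		}
		return want, nil
	}
	return reserved, nil
}
```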


pbardea commented May 24, 2021

I updated the approach here to align with what's described above. Since the buffer here is supposed to stay pretty small and we frequently expect to fill it up, I think it makes sense to start by trying to allocate the memory up front and failing if there isn't enough available. This is also the simplest to backport in combination with the parallelization changes.

When combined with the parallelization, this could be extended by teaching the workers to back off if they're running into memory pressure by re-queuing their work and reducing the size of the worker pool.

@pbardea pbardea requested a review from dt May 24, 2021 20:45
@pbardea pbardea force-pushed the sst-batcher-mem branch 3 times, most recently from 5e2b7ea to 8154fd5 Compare May 27, 2021 12:15
@pbardea pbardea mentioned this pull request May 27, 2021
@pbardea pbardea force-pushed the sst-batcher-mem branch 2 times, most recently from 6fc0310 to f5f8559 Compare June 2, 2021 12:21
This commit narrowly focuses on adding memory monitoring for the
buffering portion of a RESTORE. SSTs are buffered during restore before
being flushed with an AddSSTable request. The memory used to buffer the
table is now accounted for under the bulk memory monitor.

Restore will attempt to flush its buffer if it runs out of memory it can
reserve.

Release note (enterprise change): Data buffered during RESTORE is now
counted towards the cluster's --sql-max-memory limit to help guard
against OOMs.

pbardea commented Jul 13, 2021

Going to break this down starting with #67477

@pbardea pbardea closed this Jul 13, 2021