bulk: memory monitor sst batching during RESTORE #64938
Conversation
Force-pushed from b1ac789 to 539de95
This should be ready for a review. I've tried a few different approaches and settled on this one, but I'm curious about your thoughts here. The main difficulty I'm running into is that the sstable writer in Pebble does the actual writing, which makes it hard to do the accounting from the batcher. Some approaches that I've tried here:
Force-pushed from a5548db to d0d67e0
pkg/storage/sst_writer.go (Outdated)
// AccountedMemFile wraps a MemFile so that it can account for the memory it
// uses in the given monitor if one is provided.
type AccountedMemFile struct {
does this feel a little out of place in pkg/storage? AFAIK, nothing else in storage uses monitors. I think we could put this in bulk where it is used instead.
pkg/storage/sst_writer.go (Outdated)
newCap := f.MemFile.Cap()
log.Infof(f.ctx, "growing by %d", int64(newCap-oldCap))
if err := f.memAcc.Grow(f.ctx, int64(newCap-oldCap)); err != nil {
    return n, err
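For context, a minimal sketch of what the full wrapper might look like, assuming MemFile exposes Cap() and an io.Writer-style Write as the excerpt suggests; the field names come from the diff, everything else is illustrative:

package storage

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/mon"
)

// AccountedMemFile wraps a MemFile so that it can account for the memory it
// uses in the given monitor if one is provided.
type AccountedMemFile struct {
    MemFile
    ctx    context.Context
    memAcc *mon.BoundAccount
}

func (f *AccountedMemFile) Write(p []byte) (int, error) {
    oldCap := f.MemFile.Cap()
    n, err := f.MemFile.Write(p)
    if err != nil {
        return n, err
    }
    // Only reserve more memory when the underlying buffer actually
    // reallocated and grew its capacity.
    if newCap := f.MemFile.Cap(); newCap > oldCap {
        if err := f.memAcc.Grow(f.ctx, int64(newCap-oldCap)); err != nil {
            return n, err
        }
    }
    return n, nil
}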
Crashing is obviously worse than erring here, sure, but ideally I'd say we could do even better and make flushes budget-aware instead?

Total strawman that I haven't thought through: have Write flip a cheap-to-check bool. Move this grow attempt to the adding loop, inside a check of that bool, and trigger a flush if it fails.

We could even avoid the check-after-growing-cap case that way, I think, by determining if the next write could push it over: e.g. if blocksize > cap-len, then we can suspect the next write will grow cap, so we could go ahead and ask for more now, and flush if we get denied. If we pad that by some trailer size, then I think we don't need to worry about flushing the trailer causing a realloc to grow cap, though I don't know if we can predict that perfectly since it'll depend on index size. Hmm.
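A hedged sketch of that strawman, under the assumption that the batcher looks roughly like the sstBatcher below; needsGrow, growHint, and the flush callback are illustrative names, not the real API:

package bulk

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/mon"
)

// sstBatcher is a stand-in for the real batcher; only the fields this
// sketch needs are shown.
type sstBatcher struct {
    memAcc    *mon.BoundAccount
    needsGrow bool  // flipped by the file's Write whenever its cap grows
    growHint  int64 // how much the next reservation should request
}

// maybeReserve runs in the adding loop before each write. If the budget is
// denied, it flushes to release memory instead of erroring mid-write.
func (b *sstBatcher) maybeReserve(ctx context.Context, flush func(context.Context) error) error {
    if !b.needsGrow {
        return nil
    }
    b.needsGrow = false
    if err := b.memAcc.Grow(ctx, b.growHint); err != nil {
        // Budget denied: flush to shrink the buffer rather than erroring.
        return flush(ctx)
    }
    return nil
}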
I took a brief look at that approach and found that it coupled the batcher pretty tightly to sstable.Writer specifics, which I originally wanted to avoid. One option here could be to use a pessimistic estimate (e.g. 1.5x block size): AddMVCCKey can grow the account by this estimate if it detects that the next write will grow the buffer, and flush if that fails. If the Grow succeeds, we can shrink the account by our estimate and grow it by the actual amount the buffer's capacity increased on Write. That way the memory accounting stays up to date most of the time.
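A rough sketch of that flow; mon.BoundAccount's Grow and Shrink are real, but willGrow, blockSize, buf, writer, and flush are hypothetical members invented for illustration:

package bulk

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/mon"
)

// Variant batcher sketch for the pessimistic-estimate idea.
type sstBatcher struct {
    memAcc    *mon.BoundAccount
    writer    interface{ Put(key, value []byte) error }
    buf       interface{ Cap() int } // the underlying in-memory file
    blockSize int64
    willGrow  func(n int) bool // would writing n more bytes grow the buffer?
    flush     func(context.Context) error
}

func (b *sstBatcher) putAccounted(ctx context.Context, key, value []byte) error {
    if !b.willGrow(len(key) + len(value)) {
        return b.writer.Put(key, value)
    }
    est := b.blockSize * 3 / 2 // pessimistic 1.5x block-size estimate
    if err := b.memAcc.Grow(ctx, est); err != nil {
        // Denied: flush to free budget, then retry the reservation
        // against the now-empty buffer.
        if err := b.flush(ctx); err != nil {
            return err
        }
        if err := b.memAcc.Grow(ctx, est); err != nil {
            return err
        }
    }
    oldCap := b.buf.Cap()
    if err := b.writer.Put(key, value); err != nil {
        return err
    }
    // Swap the estimate for the actual growth so the account stays accurate.
    b.memAcc.Shrink(ctx, est)
    return b.memAcc.Grow(ctx, int64(b.buf.Cap()-oldCap))
}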
For completeness, we should probably also account for the sstable.Writer buffer of up to block-size, of which it'll make a copy during compression, so maybe add 2x block-size to our reservation. I don't know how to estimate the size of the accumulated index entries though, since it is all held in memory until we flush. Maybe we should just make up a number and reserve that much for overhead?

I guess ingestion SSTs will use decently small blocks, so we're looking at <100kb, and given the whole SST is batched to <100mb, the index probably isn't all that big either. Maybe we don't care about the Writer overhead.
Force-pushed from 693e62b to 7426ee1
Following our conversation from today, maybe this memory monitoring can be simplified, since it doesn't need to be that accurate. The SST buffer isn't typically that large and is comparable to the minimum buffer size of a buffering adder. I'm thinking we can just reserve maxFlushSize() (and possibly some overhead for the SST writers) up front. This would allow the workers to fail up front. We'll likely reserve more memory than needed, given that each block is compressed when flushed, but it seems like a reasonable upper bound. There are also other minor discrepancies between the DataSize and the actual number of bytes used, but I think it's a good estimate. We'd also need to periodically check if the maxFlushSize changed, but I think it's reasonable to potentially error out if the cluster setting was changed mid-restore and there wasn't enough memory in the cluster. Is there anything obvious that I missed here? If not, I'll start on that implementation.
I updated the approach here to align with what's described above. Since the buffer here is supposed to stay pretty small, and we frequently expect to fill it up, I think it makes sense to start off by trying to allocate the memory up front and failing if there isn't enough available. This is also the simplest approach to backport in combination with the parallelization changes. When combined with the parallelization, this could be extended by teaching the workers to back off if they're running into memory pressure, by re-queuing their work and reducing the size of the worker pool.
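A minimal sketch of that up-front reservation, assuming hypothetical maxFlushSize and writerOverhead values passed in by the caller (the real change would read the cluster setting):

package bulk

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/mon"
    "github.com/cockroachdb/errors"
)

type sstBatcher struct {
    memAcc   *mon.BoundAccount
    reserved int64 // bytes reserved up front in memAcc
}

// makeAccountedBatcher reserves the worst-case buffer size up front so a
// restore worker fails fast when the cluster is short on memory, rather
// than hitting the limit mid-flush.
func makeAccountedBatcher(ctx context.Context, memAcc *mon.BoundAccount, maxFlushSize, writerOverhead int64) (*sstBatcher, error) {
    reserved := maxFlushSize + writerOverhead
    if err := memAcc.Grow(ctx, reserved); err != nil {
        return nil, errors.Wrap(err, "not enough memory to buffer SSTs during restore")
    }
    return &sstBatcher{memAcc: memAcc, reserved: reserved}, nil
}

// maybeGrowReservation is the periodic re-check: if the max flush size grew
// mid-restore, reserve the difference or error out.
func (b *sstBatcher) maybeGrowReservation(ctx context.Context, maxFlushSize, writerOverhead int64) error {
    if want := maxFlushSize + writerOverhead; want > b.reserved {
        if err := b.memAcc.Grow(ctx, want-b.reserved); err != nil {
            return err
        }
        b.reserved = want
    }
    return nil
}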
Force-pushed from 5e2b7ea to 8154fd5
Force-pushed from 6fc0310 to f5f8559
This commit narrowly focuses on adding memory monitoring for the buffering portion of a RESTORE. SSTs are buffered during restore before being flushed with an AddSSTable request. The memory used to buffer the table is now accounted for under the bulk memory monitor. Restore will attempt to flush its buffer if it runs out of memory it can reserve. Release note (enterprise change): Data buffered during RESTORE is now counted towards the cluster's --sql-max-memory limit to help guard against OOMs.
Going to break this down starting with #67477
This commit narrowly focuses on adding memory monitoring for the
buffering portion of a RESTORE. SSTs are buffered during restore before
being flushed with an AddSSTable request. The memory used to buffer the
table is now accounted for under the bulk memory monitor.
Memory pressure becomes more likely once this buffering is parallelized
in #61942.
Memory used by iterators will be tracked in a separate PR.
Informs #64936.
Release note (enterprise change): Data buffered during RESTORE is now
counted towards the cluster's --sql-max-memory limit to help guard
against OOMs. If not enough memory is available, the RESTORE will fail.