go: store/nbs: table_reader: getManyAtOffsetsWithReadFunc: Stop unbounded I/O parallelism in GetMany implementation. #91
When we do things like push, pull, or (soon-to-be) garbage collection, we have large sets of Chunk addresses that we pass into `ChunkStore#GetMany` and then go off and process. Clients largely try to control memory overhead and pipeline depth by passing in a buffered channel of an appropriate size. The expectation is that the implementation of `GetMany` will have an amount of data in flight at any given time that is in some reasonable way proportional to the channel size.

In the current implementation, there is unbounded concurrency on the read destination allocations and the reads themselves, with one goroutine spawned for each byte range we want to read. This results in massive (virtual) heap utilization, unreasonable I/O parallelism, and context-switch thrashing when pushing or pulling large repos.
This is a small PR to change the concurrency paradigm inside `getManyAtOffsetsWithReadFunc` so that we only have 4 concurrent dispatched reads per `table_reader` instance at a time.

This is still not the behavior we actually want. Ideally, the bound on in-flight reads would apply to the chunk store as a whole (shared across all `tableReader`s), and not depend on the number of `tableReader`s which happen to back the chunk store. I'm landing this as a big incremental improvement over the status quo.

Here are some non-reproducible one-shot test results from a test program. The test program walks the entire chunk graph, assembles every chunk address, and then does a `GetManyCompressed` on every chunk address and copies the contents to `/dev/null`. It was run on a ~10GB (compressed) data set.

Before:
After:
On these runs, sys time, wall-clock time, VM page reclaims, and virtual memory used all improved substantially.
Very open to feedback and discussion of potential performance regressions here, but I think this is an incremental win for now.