
go: store/nbs: table_reader: getManyAtOffsetsWithReadFunc: Stop unbounded I/O parallelism in GetMany implementation. #91

Merged
merged 3 commits into master on Sep 26, 2019

Conversation

@reltuk (Contributor) commented Sep 26, 2019

When we do things like push, pull, or (soon-to-be) garbage collection, we have large sets of Chunk addresses that we pass into ChunkStore#GetMany and then go off and process. Clients largely try to control memory overhead and pipeline depth by passing in a buffered channel of an appropriate size. The expectation is that the GetMany implementation will keep an amount of data in flight at any given time that is in some reasonable way proportional to the channel size.
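For illustration, here is a minimal sketch of that contract from the caller's side. The names (Chunk, consume, the 128-chunk buffer) are made up for this example; the real store/nbs API differs in detail:

package example

import "context"

type Chunk struct{ data []byte }

// Illustrative stand-in for the real ChunkStore API.
type ChunkStore interface {
	// GetMany streams each found chunk into |found| as it is read.
	GetMany(ctx context.Context, addrs [][20]byte, found chan<- *Chunk) error
}

func consume(ctx context.Context, cs ChunkStore, addrs [][20]byte) error {
	// The buffer size is the caller's knob: the expectation is that
	// GetMany keeps roughly this many chunks in flight, no more.
	found := make(chan *Chunk, 128)

	errc := make(chan error, 1)
	go func() {
		defer close(found)
		errc <- cs.GetMany(ctx, addrs, found)
	}()

	for c := range found {
		_ = c // process the chunk; GetMany blocks while |found| is full
	}
	return <-errc
}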

In the current implementation, there is unbounded concurrency on the read destination allocations and the reads themselves, with one goroutine spawned for each byte range we want to read. This results in massive (virtual) heap utilization, unreasonable I/O parallelism, and context-switch thrashing in large repo push/pull situations.

This is a small PR that changes the concurrency paradigm inside getManyAtOffsetsWithReadFunc so that we only have 4 concurrent dispatched reads per table_reader instance at a time; a sketch of the pattern follows the list below.

This is still not the behavior we actually want.

  • I/O concurrency should be configurable at the ChunkStore layer (or eventually per-device backing a set of tableReaders), and not depend on the number of tableReaders which happen to back the chunk store.
  • Memory overhead is still not correctly bounded here, since read ahead batches are allowed to grow to arbitrary sizes. Reasonable bounds on memory overhead should be configurable at the ChunkStore layer.
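To make the new paradigm concrete, here is a self-contained sketch of the bounded-dispatch pattern this PR adopts (readBatch and dispatchReads are illustrative names, not the exact nbs internals):

package example

import "sync"

const ioParallelism = 4 // fixed per-table_reader bound introduced by this PR

type readBatch struct{ offset, length int64 }

// dispatchReads queues batches on a channel and lets a fixed pool of
// workers perform the reads, so at most ioParallelism reads are in
// flight at once instead of one goroutine per byte range.
func dispatchReads(batches []readBatch, read func(readBatch)) {
	work := make(chan readBatch)

	var wg sync.WaitGroup
	wg.Add(ioParallelism)
	for i := 0; i < ioParallelism; i++ {
		go func() {
			defer wg.Done()
			for b := range work {
				read(b)
			}
		}()
	}

	for _, b := range batches {
		work <- b
	}
	close(work)
	wg.Wait()
}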

I'm landing this as a big incremental improvement over the status quo. Below are some non-reproducible, one-shot results from a test program. The program walks the entire chunk graph, assembles every chunk address, and then calls GetManyCompressed on every chunk address, copying the contents to /dev/null. It was run against a ~10GB (compressed) data set.
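For context, the core of that harness looks roughly like the sketch below (the real test.go is not included in this PR; drain and its parameters are illustrative):

package example

import (
	"io"
	"os"
)

// drain copies every streamed chunk to /dev/null, mirroring the
// shape of the benchmark described above.
func drain(getMany func(found chan<- io.Reader), depth int) error {
	devNull, err := os.OpenFile(os.DevNull, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer devNull.Close()

	found := make(chan io.Reader, depth)
	go func() {
		defer close(found)
		getMany(found)
	}()

	for cc := range found {
		if _, err := io.Copy(devNull, cc); err != nil {
			return err
		}
	}
	return nil
}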

Before:

$ /usr/bin/time -l -- go run test.go
...
MemStats: Sys: 16628128568
      161.29 real        67.29 user       456.38 sys
5106425856  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
  10805008  page reclaims
     23881  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         8  signals received
    652686  voluntary context switches
  21071339  involuntary context switches

After:

$ /usr/bin/time -l -- go run test.go
...
MemStats: Sys: 4590759160
       32.17 real        30.53 user        29.62 sys
4561879040  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
   1228770  page reclaims
     67100  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
        14  signals received
    456898  voluntary context switches
   2954503  involuntary context switches

On these runs, sys time, wall-clock time, VM page reclaims, and virtual memory usage are all improved pretty substantially.

Very open to feedback and discussion of potential performance regressions here, but I think this is an incremental win for now.

Aaron Son added 2 commits September 26, 2019 13:27
…nded I/O parallelism in GetMany implementation.
@reltuk reltuk requested a review from bheni September 26, 2019 22:38
@bheni (Contributor) left a comment

lgtm. One suggestion


wg.Add(ioParallelism)
for i := 0; i < ioParallelism; i++ {
	go readBatches()
}

I like to see my wg.Done() calls close to my wg.Add calls. Makes me feel like I'm not missing something. So I would do:

wg.Add(ioParallelism)
for i := 0; i < ioParallelism; i++ {
    go func() {
        defer wg.Done()
        readBatches()
    }()
}

Then remove the wg.Done from the top of your readBatches func.

…f channel close and wait group done calls.
@reltuk reltuk merged commit 9456d37 into master Sep 26, 2019
@Hydrocharged Hydrocharged deleted the aaron/nbs-io-parallelism-fixes branch November 11, 2019 21:30