
Improve file read performance #105

Merged: 2 commits merged into containerd:master from ktock:fs-p on Jun 4, 2020

Conversation

@ktock (Member) commented Jun 1, 2020

The benchmark in #101 showed that file read performance is low, especially on random and parallel reads. This PR includes two patches to mitigate this problem and improve file read performance.

The first patch squashes HTTP Range Request headers on large prefetches. The current filesystem implementation splits the prefetch range into many small, mostly neighbouring chunks and lists them in a single HTTP Range Request header without enough squashing. Sometimes this leads to a large HTTP header and the prefetch request fails. This patch solves the problem by aggressively squashing neighbouring/overlapping chunks in the HTTP header as much as possible.
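
A rough sketch of the squashing idea, for illustration only (types and names are assumptions, not the patch's actual code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// region is an inclusive byte range [b, e].
type region struct{ b, e int64 }

// squash merges regions that overlap or directly neighbour each other so
// that the resulting Range header stays small.
func squash(regs []region) []region {
	if len(regs) == 0 {
		return nil
	}
	sort.Slice(regs, func(i, j int) bool { return regs[i].b < regs[j].b })
	merged := []region{regs[0]}
	for _, r := range regs[1:] {
		last := &merged[len(merged)-1]
		if r.b <= last.e+1 { // neighbouring or overlapping: extend the last region
			if r.e > last.e {
				last.e = r.e
			}
			continue
		}
		merged = append(merged, r)
	}
	return merged
}

// rangeHeader renders the squashed regions as an HTTP Range header value.
func rangeHeader(regs []region) string {
	merged := squash(regs)
	parts := make([]string, len(merged))
	for i, r := range merged {
		parts[i] = fmt.Sprintf("%d-%d", r.b, r.e)
	}
	return "bytes=" + strings.Join(parts, ", ")
}

func main() {
	// Chunks 0-99, 100-199 and 150-249 collapse into a single range entry.
	fmt.Println(rangeHeader([]region{{0, 99}, {150, 249}, {100, 199}}))
	// Output: bytes=0-249
}
```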

The second one contains the following optimizations for caches:

  • minimizing byte slice allocations on the execution path of file read by leveraging sync.Pool (see the sketch after this list),
  • minimizing memory copies and disk I/O by allowing partial ranges of blobs to be fetched from the cache, and
  • minimizing the locked region in the cache.
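
As an illustration of the first point, a minimal sketch of buffer reuse with sync.Pool; the buffer size and helper names are assumptions, not the actual code in this PR:

```go
package main

import (
	"fmt"
	"sync"
)

const bufSize = 512 * 1024 // assumed buffer size; the real chunk size may differ

// bufPool hands out reusable byte slices so the read path does not allocate
// a fresh buffer for every file read.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, bufSize) },
}

// withBuffer borrows a buffer for the duration of fn and returns it afterwards.
func withBuffer(fn func(b []byte)) {
	b := bufPool.Get().([]byte)
	defer bufPool.Put(b)
	fn(b)
}

func main() {
	withBuffer(func(b []byte) {
		fmt.Println("borrowed buffer of", len(b), "bytes")
	})
}
```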

fio results

Measured on a GitHub Actions ubuntu-18.04 runner.

fio conf

bs=4k
rw=randread
size=256M
numjobs=4

master branch

See https://github.com/ktock/stargz-snapshotter/actions/runs/121204873 for more details

| image type | fio summary | memory consumption (in KB, ps output just after the fio completion) |
|---|---|---|
| legacy (ext4+overlayfs) | READ: bw=32.6MiB/s (34.2MB/s), 8343KiB/s-8363KiB/s (8543kB/s-8564kB/s), io=1024MiB (1074MB), run=31346-31422msec | VSZ: 873880, RSS: 32040 |
| stargz | READ: bw=1330KiB/s (1362kB/s), 333KiB/s-333KiB/s (341kB/s-341kB/s), io=1024MiB (1074MB), run=788154-788231msec | VSZ: 3204508, RSS: 1925984 |
| eStargz | READ: bw=1571KiB/s (1609kB/s), 393KiB/s-479KiB/s (402kB/s-490kB/s), io=1024MiB (1074MB), run=547678-667481msec | VSZ: 3813632, RSS: 1994172 |

PR branch

See https://github.com/ktock/stargz-snapshotter/actions/runs/121255820 for more details.
Bandwidth (and latency) are improved compared to the master branch.

| image type | fio summary | memory consumption (in KB, ps output just after the fio completion) |
|---|---|---|
| legacy (ext4+overlayfs) | READ: bw=32.6MiB/s (34.2MB/s), 8345KiB/s-8353KiB/s (8545kB/s-8553kB/s), io=1024MiB (1074MB), run=31385-31414msec | VSZ: 800168, RSS: 32316 |
| stargz | READ: bw=16.9MiB/s (17.8MB/s), 4337KiB/s-4545KiB/s (4441kB/s-4654kB/s), io=1024MiB (1074MB), run=57681-60445msec | VSZ: 1618592, RSS: 261152 |
| eStargz | READ: bw=83.8MiB/s (87.9MB/s), 20.0MiB/s-22.2MiB/s (21.0MB/s-23.3MB/s), io=1024MiB (1074MB), run=11523-12217msec | VSZ: 1453484, RSS: 185632 |

NOTE: The stargz snapshotter waits for prefetch completion for 10s by default. But in the fio+eStargz benchmark, the prefetch target is large (256MB * 4 files + the fio binary), so it doesn't finish within 10s and the container starts before the prefetch completes. If we configure this timeout to be longer (e.g. 30s?), the bandwidth should be even better.

cc: @sequix

@ktock mentioned this pull request Jun 1, 2020

@sequix left a comment:

Great job! I think this is a great place for us to discuss the cache. Besides what I have commented in the code, consider:

  1. Based on our previous conversation in "Run KinD example got memory leak" #88:

    "Yes, we cache layer data in chunk granularity and have caches for some Readers used in this snapshotter."

    Won't it lead to many small files in the cache directory and eventually use up all the inodes?

  2. Why do we need a ref-cnt LRU cache? From my point of view, there is still a global lock involved there.

cache/cache.go Outdated
}

// Cache the opened file for future use. If "direct" option is specified, this
// won't be done. This option is useful for priventing file cache from being
typo: preventing

cache/cache.go Outdated
fmt.Printf("Warning: failed to write cache: %d(wrote)/%d(expected): %v\n",
n, len(p), err)
want := b2.Len()
if _, err := io.CopyN(f, b2, int64(want)); err != nil {
In order to minimize inconsistency, it should write to a temp file, then mv the temp file to the dest file.
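
A minimal sketch of the suggested temp-file-then-rename approach (names are assumptions, not the snapshotter's actual code):

```go
package cache

import (
	"io"
	"os"
	"path/filepath"
)

// writeCacheFile writes r to a temporary file in the destination directory
// and only then renames it into place, so a crash mid-write never leaves a
// partially written cache entry visible under dest.
func writeCacheFile(dest string, r io.Reader) error {
	tmp, err := os.CreateTemp(filepath.Dir(dest), "cache-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless after a successful rename
	if _, err := io.Copy(tmp, r); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), dest) // atomic on the same filesystem
}
```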

cache/cache.go Outdated
// won't be done. This option is useful for priventing file cache from being
// polluted by data that won't be accessed immediately.
if opt.direct || !dc.fileCache.add(key, f) {
f.Close()
Will this f be closed at some point in the future? If not, this fd will remain open until the layer is unmounted.
That amounts to an fd leak for a long-running program.

ktock (Member Author) replied:

The reference count for this object is decreased when the done() callback is called.

https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/cache/cache.go#L132

and if the ref count becomes 0 (meaning nobody refers to this object), the pre-defined finalization (i.e. closing the file) is invoked.

https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/cache/cache.go#L315-L318

https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/cache/cache.go#L92-L94
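
For illustration, a minimal sketch of this done()-callback/ref-count pattern; the types and names here are made up, see the linked code for the actual implementation:

```go
package cache

import (
	"os"
	"sync"
)

// refCountedFile illustrates the pattern described above: every user holds a
// reference, and the finalizer (closing the file) runs only when the last
// done() callback brings the count back to zero.
type refCountedFile struct {
	mu   sync.Mutex
	refs int
	f    *os.File
}

// acquire registers a new user and returns the done() callback that releases
// the reference; calling it more than once is harmless.
func (r *refCountedFile) acquire() (done func()) {
	r.mu.Lock()
	r.refs++
	r.mu.Unlock()

	var once sync.Once
	return func() {
		once.Do(func() {
			r.mu.Lock()
			r.refs--
			last := r.refs == 0
			r.mu.Unlock()
			if last {
				r.f.Close() // finalization: nobody refers to the object anymore
			}
		})
	}
}
```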

return
// TODO: should we cache the entire file data on memory?
// but making I/O (possibly huge) on every fetching
// might be costly.
I think we should cache everything to disk and use memory (with a limited size) as a cache of the disk (caches are supposed to be layered, just like CPU L1-L2-L3 caches -> memory -> disk -> remote server).
This is easily achieved in the prefetch scenario, but not in the normal scenario.
In the normal scenario, a writeback is needed.

ktock (Member Author) replied:

The problem here is that copying the entire chunk file from disk to memory on every access can be very costly. (NOTE: the client of the cache doesn't always query the entire chunk file. From this perspective, the new FetchAt API in this patch lets it request partial contents, which reduces unnecessary I/O and leads to the performance improvement shown above.) While writing this patch, I tried tackling this TODO by copying chunk file contents to memory in the background (in a goroutine), but this resulted in noisy I/O and lower performance.

Either way, I think we can keep this out of scope for this PR and tackle it in another PR (contributions are welcome 😄).
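
To illustrate the partial-fetch idea, a minimal sketch; the real FetchAt signature in this PR may differ, and the names here are assumptions:

```go
package cache

import "os"

// fetchAt reads only the requested bytes from the cached chunk file instead
// of loading the whole chunk into memory.
func fetchAt(path string, offset int64, p []byte) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return f.ReadAt(p, offset) // copies only len(p) bytes at the given offset
}
```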

// When Direct option is specified for FetchAt and Add methods, these operation
// won't use on-memory caches. When you know that the targeting value won't be
// used immediately, you can privent the limited space of on-memory caches from
// being polluted by these unimportant values.
The snapshotter, as a basic piece of infrastructure, cannot tell the difference between images, and users have little willingness to tune these parameters. We should let the image speak; just keeping the most-recently-used data in memory is enough (LRU or ring buffer).

ktock (Member Author) replied:

This option isn't for snapshotter users but for the filesystem (so it is used only internally).

https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/stargz/fs.go#L294
https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/stargz/fs.go#L299

The filesystem fetches layer contents in the background and stores all chunks of layer data in the cache. This is good from the availability perspective, but are these chunks accessed again immediately? Maybe not. Chunks accessed via FUSE reads are more likely to be accessed again soon. So we have priorities among cached contents. We have limited space in the on-memory LRU cache, so we don't want prioritized chunks (e.g. ones accessed via FUSE reads) to be evicted to make room for less-important chunks (e.g. ones fetched in the background). This patch lets the filesystem tell the cache that "this chunk I'm adding is less important" using the Direct option, which leads to the performance improvement shown above.
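
To illustrate how such an option can be plumbed through, a minimal sketch using a functional option; the names Option, Direct and Add here are assumptions, not the snapshotter's actual API:

```go
package cache

// addConfig holds per-call behaviour for Add.
type addConfig struct {
	direct bool // skip the on-memory cache and go straight to disk
}

type Option func(*addConfig)

// Direct marks the value as unlikely to be read again soon, so it should not
// evict hotter entries from the limited on-memory cache.
func Direct() Option {
	return func(c *addConfig) { c.direct = true }
}

type memoryBackedCache struct {
	memory map[string][]byte // stand-in for the LRU memory cache; initialized by a constructor (omitted)
}

// Add stores the value, keeping it in memory only when Direct is not given.
func (c *memoryBackedCache) Add(key string, p []byte, opts ...Option) {
	var cfg addConfig
	for _, o := range opts {
		o(&cfg)
	}
	if !cfg.direct {
		c.memory[key] = p // hot data stays in memory
	}
	// Either way, the data would also be written to the disk cache here.
}
```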

@ktock (Member Author) commented Jun 1, 2020

Thanks for your comments!

  1. Based on our previous conversation in "Run KinD example got memory leak" #88:
    Won't it lead to many small files in the cache directory and eventually use up all the inodes?

Partial layer contents are pulled lazily, so if we don't have fixed offsets for chunks, it is hard to reuse them. For example, if a process on the filesystem requires the range 3-6, there is no guarantee that another process accesses the same range; it could be 2-4 or 5-7 or whatever. Making an HTTP request every time a different (but possibly overlapping) range is required is costly. We should pre-define fixed offsets like 0-4, 5-9, ... and download 0-9 on the first access to 3-6, which avoids duplicated range requests on future 2-4 or 5-7 accesses (a rough sketch of this alignment follows). But I believe we have better ways to do it. The reason why we store each chunk as a file is that this was one of the simplest ways to do it as an initial implementation of the cache.
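
For concreteness, a minimal sketch of aligning an arbitrary request to fixed chunk boundaries; the 5-byte chunk size just mirrors the 0-4/5-9 example above and is not the real chunk size:

```go
package main

import "fmt"

const chunkSize int64 = 5 // e.g. chunks at offsets 0-4, 5-9, ...

// alignRange expands [off, off+length) outwards to chunk boundaries so that
// different overlapping requests map onto the same cacheable chunks.
func alignRange(off, length int64) (alignedOff, alignedLen int64) {
	start := (off / chunkSize) * chunkSize
	end := ((off + length + chunkSize - 1) / chunkSize) * chunkSize
	return start, end - start
}

func main() {
	off, length := alignRange(3, 4) // a request for bytes 3-6
	fmt.Println(off, length)        // 0 10: download chunks 0-4 and 5-9
}
```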

2. Why do we need a ref-cnt LRU cache? From my point of view, there is still a global lock involved there.

I think having an on-memory cache itself helps avoid slow disk I/O, but I believe we might have better ways to do this. This patch wraps groupcache/lru to make it reference-aware, keeping the locked region as small as possible (a rough sketch of the wrapping idea follows). This is because groupcache/lru is agnostic about the ref count of each cached object and can suddenly evict the oldest object (which triggers the defined finalization, such as closing the file) even if it is still being used by someone. Ideas are very welcome if you have lock-free ways of doing this.
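
A minimal sketch of the reference-aware wrapping idea, not the actual implementation in this PR; structure and names are illustrative:

```go
package cache

import (
	"sync"

	"github.com/golang/groupcache/lru"
)

// entry is a cached object plus the bookkeeping needed to defer its
// finalizer until it is both evicted from the LRU and unreferenced.
type entry struct {
	refs     int
	evicted  bool
	finalize func() // e.g. close the backing file
}

// refCache wraps groupcache/lru so that eviction alone never finalizes an
// object somebody is still using. All state is guarded by mu, which is held
// as briefly as possible.
type refCache struct {
	mu    sync.Mutex
	cache *lru.Cache
}

func newRefCache(maxEntries int) *refCache {
	c := &refCache{cache: lru.New(maxEntries)}
	c.cache.OnEvicted = func(key lru.Key, value interface{}) {
		// Invoked by the wrapped LRU when it drops an entry; in this sketch
		// every call into the wrapped cache happens under c.mu, so e is protected.
		e := value.(*entry)
		e.evicted = true
		if e.refs == 0 {
			e.finalize() // nobody is using it, finalize right away
		}
	}
	return c
}

// get returns the entry together with a release callback (call it exactly
// once). The finalizer runs only once the entry has been evicted and the
// last reference is released.
func (c *refCache) get(key string) (*entry, func(), bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.cache.Get(key)
	if !ok {
		return nil, nil, false
	}
	e := v.(*entry)
	e.refs++
	release := func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		e.refs--
		if e.refs == 0 && e.evicted {
			e.finalize()
		}
	}
	return e, release, true
}
```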

@sequix commented Jun 2, 2020

Now I have a full grasp of the read path:

                            +----------+      +-------------+
                            | registry |      |    fuse     |
                            +----+-----+      +------+------+
                                 |                   ^
                                 v                   |
                            +----+-----+      +------+--------+  fulfill fuse read requests
     merge region           | fetcher  |      | reader.Reader |  lru cache decompressed chunks
     minimize requests      +----+-----+      +------+--------+  both in memory and disk
                                 |                   ^
                                 v                   |
                            +----+-----+      +------+------+  decompress gzipped chunks
lru cache gzipped chunks    |   blob   +----->+ crfs.Reader |  provide file-level access
both in memory and disk     +----------+      +-------------+  no cache at all

And the problems to tackle to achieve better performance:

  1. minimize network load in the fetcher;
  2. merge blob and crfs.Reader, or eliminate crfs.Reader, in order to have a single cache layer;
  3. the single cache should be as simple as possible (lock-free is best).

My naive thoughts:

  1. the fetcher should fetch fixed-size chunks (ideally the OS page size); it seems this is how things are done now;
  2. migrate crfs.Reader here to get that single cache;
  3. a ring-buffer-based cache; it can also keep hot data in memory like LRU, but with less locking.

This PR LGTM. If you feel good about my thoughts above, I'll make a PR this week.

@ktock changed the title from "Improve file read performance" to "[WIP] Improve file read performance" on Jun 2, 2020
@ktock (Member Author) commented Jun 2, 2020

@sequix

Now I have a full grasp of the read path

Great! Thanks a lot!

  • minimize network load in the fetcher;
  • merge blob and crfs.Reader, or eliminate crfs.Reader, in order to have a single cache layer;
  • the single cache should be as simple as possible (lock-free is best).

Yes, all of them are still problems.

the fetcher should fetch fixed-size chunks (ideally the OS page size); it seems this is how things are done now;

Though I don't know what the best chunk size is, fixed-size chunks are currently effective for performance.

migrate crfs.Reader here to get that single cache.

I agree with this.

a ring-buffer-based cache; it can also keep hot data in memory like LRU, but with less locking.

Though we should eventually rely on benchmarking to evaluate how beneficial a ring-buffer-based cache would be, I think this is worth trying.

There is room for discussion about which chunks should be stored in memory:

  • Storing all accessed chunks in memory: ideal, but this can cause a performance problem (Improve file read performance #105 (comment)), though we can possibly mitigate it by carefully tuning the chunk size.
  • Using the on-memory cache as a temporary space for chunks whose write to disk is in progress: this PR takes this approach. It works, but maybe we have better ways.

@ktock requested a review from AkihiroSuda June 2, 2020 06:57
@ktock changed the title from "[WIP] Improve file read performance" to "Improve file read performance" on Jun 2, 2020
ktock added 2 commits June 4, 2020 13:26
During prefetch, the current filesystem implementation chunks the prefetching range
into many small, mostly neighbouring chunks and lists them in a single HTTP
Range Request header without enough squashing. Sometimes this leads to a large
HTTP header and the request fails. This commit solves this by aggressively
squashing neighbouring/overlapping chunks in HTTP headers, as much as
possible.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Benchmarking in the community showed that the file read
performance is low, especially on random and parallel reads.

This commit solves this by the following fixes:
- minimizing the occurrence of slice allocation in the execution path of file
  read, leveraging sync.Pool,
- minimizing the memory copy and disk I/O by allowing partial ranges
  of blobs to be fetched from the cache, and
- minimizing the locked region in the cache.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
@ktock (Member Author) commented Jun 4, 2020

@AkihiroSuda PTAL:pray:

@AkihiroSuda merged commit 8f92e35 into containerd:master on Jun 4, 2020
@ktock deleted the fs-p branch June 4, 2020 08:19