Improve file read performance #105
Conversation
Great job! I think this is a great place for us to discuss the cache. Besides what I have commented in the code, consider:
1. Based on our previous conversation in "Run KinD example got memory leak" #88:
> Yes, we cache layer data in chunk granularity and have caches for some Readers used in this snapshotter.
Won't this lead to many small files in the cache directory and eventually consume all the inodes?
2. Why do we need a ref-cnt LRU cache? From my point of view, there is still a global lock involved there.
cache/cache.go
Outdated
}

// Cache the opened file for future use. If "direct" option is specified, this
// won't be done. This option is useful for priventing file cache from being
typo: preventing
cache/cache.go
Outdated
fmt.Printf("Warning: failed to write cache: %d(wrote)/%d(expected): %v\n",
	n, len(p), err)
want := b2.Len()
if _, err := io.CopyN(f, b2, int64(want)); err != nil {
In order to minimize inconsistency, it should write to a temp file, then mv the temp file to the dest file.
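For illustration, a minimal sketch of that write-to-temp-then-rename pattern in Go (the function and variable names are illustrative, not this PR's actual code):

```go
package cache

import (
	"io"
	"os"
	"path/filepath"
)

// writeCacheAtomically writes the cache entry to a temporary file in the
// same directory as dest, then renames it into place. Rename is atomic on
// POSIX filesystems when both paths are on the same volume, so readers
// never observe a partially written cache file.
func writeCacheAtomically(dest string, r io.Reader, size int64) error {
	tmp, err := os.CreateTemp(filepath.Dir(dest), "tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename succeeds
	if _, err := io.CopyN(tmp, r, size); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), dest)
}
```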
cache/cache.go
Outdated
// won't be done. This option is useful for priventing file cache from being
// polluted by data that won't be accessed immediately.
if opt.direct || !dc.fileCache.add(key, f) {
	f.Close()
Will this f be deleted at some point in the future? If not, this fd will remain open until the layer is unmounted.
This amounts to an fd leak in a long-running program.
The reference count for this object is decreased when the done() callback is called, and if the ref count becomes 0 (meaning nobody refers to this object), a pre-defined finalization (i.e. closing the file) is invoked.
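For clarity, a minimal sketch of that ref-count-with-finalizer scheme (illustrative names, not the snapshotter's actual implementation):

```go
package cache

import (
	"os"
	"sync"
)

// refCountedFile keeps a cached file open while anyone is using it and
// closes it only when the last user signals completion via done().
type refCountedFile struct {
	mu   sync.Mutex
	refs int
	f    *os.File
}

// acquire increments the reference count before handing out the file.
func (e *refCountedFile) acquire() *os.File {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.refs++
	return e.f
}

// done is the callback each user invokes when finished. When the count
// reaches zero, the pre-defined finalization (closing the file) runs.
func (e *refCountedFile) done() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.refs--
	if e.refs == 0 {
		e.f.Close()
	}
}
```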
return
// TODO: should we cache the entire file data on memory?
// but making I/O (possibly huge) on every fetching
// might be costly.
I think we should cache everything to disk and use memory (with a limited size) as a cache of the disk (caches are supposed to be layered, just like CPU L1-L2-L3 cache -> memory -> disk -> remote server).
This will be easy to achieve in the prefetch scenario, but not in the normal scenario. In the normal scenario, a write-back is needed.
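For illustration, a hedged sketch of that layered lookup (a bounded memory LRU in front of the disk cache, with the remote server as the last tier); the types and the use of hashicorp/golang-lru here are assumptions, not this repository's actual code:

```go
package cache

import (
	"os"
	"path/filepath"

	lru "github.com/hashicorp/golang-lru"
)

// layeredCache checks a bounded on-memory LRU first, then the on-disk
// cache directory, promoting disk hits into memory. On a total miss the
// caller would fall back to the remote server (the last tier).
type layeredCache struct {
	mem *lru.Cache // L1: bounded memory
	dir string     // L2: disk
}

func (c *layeredCache) Get(key string) ([]byte, bool) {
	if v, ok := c.mem.Get(key); ok {
		return v.([]byte), true
	}
	p, err := os.ReadFile(filepath.Join(c.dir, key))
	if err != nil {
		return nil, false // miss: fetch from the remote server
	}
	c.mem.Add(key, p) // promote to memory on a disk hit
	return p, true
}
```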
The problem here is that copying the entire chunk file from disk to memory on every access can be very costly. (NOTE: the client of the cache doesn't always query the entire chunk file. From this perspective, the new FetchAt API in this patch allows clients to specify partial contents, which reduces unnecessary I/O and leads to the performance improvement shown above.) While writing this patch, I tackled this TODO by copying chunk file contents to memory in the background (in a goroutine), but this resulted in noisy I/O and lower performance.
Either way, I think we can keep this out of scope for this PR and tackle it in another PR (contributions are welcome 😄).
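For illustration, a hedged sketch of how a FetchAt-style partial read avoids copying the whole chunk file (the signature and names are assumed from the discussion, not the PR's verbatim API):

```go
package cache

import (
	"os"
	"path/filepath"
)

// directoryCache is a stand-in type for this sketch.
type directoryCache struct {
	directory string
}

// FetchAt fills p with up to len(p) bytes of the cached chunk for key,
// starting at offset, so callers reading a small range don't pay for a
// full-chunk copy from disk to memory.
func (dc *directoryCache) FetchAt(key string, offset int64, p []byte) (int, error) {
	f, err := os.Open(filepath.Join(dc.directory, key))
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return f.ReadAt(p, offset) // read only the requested range
}
```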
// When Direct option is specified for FetchAt and Add methods, these operation
// won't use on-memory caches. When you know that the targeting value won't be
// used immediately, you can privent the limited space of on-memory caches from
// being polluted by these unimportant values.
The snapshotter, as a piece of basic infrastructure, cannot tell the difference between images, and users have little willingness to tune these parameters. We should let the image speak; keeping the most-recently used data in memory is enough (LRU or ring buffer).
This option isn't for snapshotter users but for the filesystem (so it's used only internally).
https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/stargz/fs.go#L294
https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/stargz/fs.go#L299
The filesystem fetches layer contents in the background and stores all chunks of layer data in the cache. This is good from an availability perspective, but are these chunks accessed again immediately? Maybe not. Chunks accessed via FUSE reads are more likely to be accessed again soon. So we have priorities among cached contents. We have limited space in the LRU memory cache, and we don't want prioritized chunks (e.g. ones accessed via FUSE reads) to be evicted to store less-important chunks (e.g. ones fetched in the background). So this patch enables the filesystem to tell the cache that "this chunk I'm adding is less important" using the Direct option, which leads to the performance improvement shown above.
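For illustration, a hedged sketch of how such a Direct option could gate the on-memory cache via functional options (names follow the discussion but are not quoted from the patch):

```go
package cache

import "sync"

type cacheOpt struct{ direct bool }

// Option configures a single cache operation.
type Option func(*cacheOpt)

// Direct marks the operation as low priority: the data is still persisted
// to disk, but it skips the on-memory cache so background-fetched chunks
// cannot evict chunks that were just served to FUSE reads.
func Direct() Option {
	return func(o *cacheOpt) { o.direct = true }
}

type memBackedCache struct {
	mu  sync.Mutex
	mem map[string][]byte // stand-in for a bounded LRU
}

func (dc *memBackedCache) Add(key string, p []byte, opts ...Option) {
	var opt cacheOpt
	for _, o := range opts {
		o(&opt)
	}
	// ... persist p to the cache directory here ...
	if opt.direct {
		return // background-fetched chunk: keep it off the memory cache
	}
	dc.mu.Lock()
	dc.mem[key] = p
	dc.mu.Unlock()
}
```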
Thanks for your comments!
Partial layer contents are lazily pulled, so if we don't have fixed offsets for chunks it might be hard to reuse them. For example, if a process on the filesystem requires the range 3-6, there is no guarantee another process will access the same range. It could be 2-4 or 5-7 or whatever. Making an HTTP request every time a different (but possibly overlapping) range is required is costly. We should pre-define fixed offsets like 0-4, 5-9, ... and download 0-9 on the first access to 3-6, which helps avoid duplicated range requests on future 2-4 or 5-7 accesses. But I believe we have better ways to do it. The reason we store each chunk as a file is that this was one of the simplest ways to do this as an initial implementation of the cache.
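For illustration, a small sketch of the fixed-offset alignment described above, assuming a chunk size of 5 bytes to match the 0-4/5-9 example (real chunk sizes are much larger):

```go
package main

import "fmt"

const chunkSize = 5 // illustrative; matches the 0-4, 5-9 example

// alignRange expands the byte range [off, off+size) to whole chunk
// boundaries, so a first access to bytes 3-6 downloads bytes 0-9 and
// later accesses to 2-4 or 5-7 hit the same cached chunks.
func alignRange(off, size int64) (alignedOff, alignedSize int64) {
	alignedOff = (off / chunkSize) * chunkSize
	end := off + size
	alignedEnd := ((end + chunkSize - 1) / chunkSize) * chunkSize
	return alignedOff, alignedEnd - alignedOff
}

func main() {
	off, size := alignRange(3, 4) // bytes 3-6
	fmt.Println(off, size)        // 0 10: download bytes 0-9
}
```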
I think having the on-memory cache itself helps avoid slow disk I/O, but I believe we might have better ways to do this. This patch wraps
Now I have a full grasp of the read path.
And the problems to tackle to achieve better performance:
My naive thoughts:
This PR LGTM. If you feel good about my thoughts above, I'll make a PR this week.
Great! Thanks a lot!
Yes, all of them are still problems.
Though I don't know what the best chunk size is, fixed-size chunks are currently effective for performance.
I agree with this.
Though we should eventually rely on benchmarking to evaluate how beneficial a ring-buffer-based cache is, I think this is worth trying. There is room for discussion in terms of which chunks should be stored in memory.
During prefetch, the current filesystem implementation splits the prefetching range into many small, mostly neighbouring chunks and lists them in a single HTTP Range Request header without enough squashing. Sometimes this leads to a large HTTP header and the request fails. This commit solves this by aggressively squashing neighbouring/overlapping chunks in HTTP headers, as much as possible. Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Throughout the benchmarking in the community, it turned out that file read performance is low, especially on random and parallel reads. This commit solves this with the following fixes:
- minimizing the occurrence of slice allocation in the execution path of file reads, leveraging sync.Pool,
- minimizing memory copies and disk I/O by allowing a partial range of a blob to be fetched from the cache, and
- minimizing the locked region in the cache.
Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
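As a hedged illustration of the sync.Pool technique the commit message mentions (the buffer size and names are assumptions, not the PR's code):

```go
package cache

import "sync"

const bufSize = 64 * 1024 // illustrative; assume chunk-sized read buffers

// bufPool reuses read buffers across file reads instead of allocating a
// fresh slice per read, cutting allocations (and GC pressure) on the hot
// path of random and parallel reads. A pointer is stored to avoid an
// extra allocation when putting the slice back into the pool.
var bufPool = sync.Pool{
	New: func() interface{} { b := make([]byte, bufSize); return &b },
}

// withBuffer runs read with a pooled buffer and returns it afterwards.
func withBuffer(read func(p []byte) (int, error)) (int, error) {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp)
	return read(*bp)
}
```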
@AkihiroSuda PTAL :pray:
Throughout the benchmark on #101, it turned out that file read performance is low, especially on random and parallel reads. This PR includes two patches to mitigate this problem and improve file read performance.
The first patch squashes the HTTP Range Request header on large prefetches. The current filesystem implementation splits the prefetching range into many small, mostly neighbouring chunks and lists them in a single HTTP Range Request header without enough squashing. Sometimes this leads to a large HTTP header and the prefetch request fails. This patch solves this by aggressively squashing neighbouring/overlapping chunks in HTTP headers, as much as possible.
The second one contains the following optimizations for caches:
fio results
(measured on a GitHub Actions ubuntu-18.04 runner)
fio conf
master branch
See https://github.com/ktock/stargz-snapshotter/actions/runs/121204873 for more details
(ps output just after the fio completion)
READ: bw=32.6MiB/s (34.2MB/s), 8343KiB/s-8363KiB/s (8543kB/s-8564kB/s), io=1024MiB (1074MB), run=31346-31422msec
READ: bw=1330KiB/s (1362kB/s), 333KiB/s-333KiB/s (341kB/s-341kB/s), io=1024MiB (1074MB), run=788154-788231msec
READ: bw=1571KiB/s (1609kB/s), 393KiB/s-479KiB/s (402kB/s-490kB/s), io=1024MiB (1074MB), run=547678-667481msec
PR branch
See https://github.com/ktock/stargz-snapshotter/actions/runs/121255820 for more details.
The bandwidth (and latency) are improved compared to the master branch.
(ps output just after the fio completion)
READ: bw=32.6MiB/s (34.2MB/s), 8345KiB/s-8353KiB/s (8545kB/s-8553kB/s), io=1024MiB (1074MB), run=31385-31414msec
READ: bw=16.9MiB/s (17.8MB/s), 4337KiB/s-4545KiB/s (4441kB/s-4654kB/s), io=1024MiB (1074MB), run=57681-60445msec
READ: bw=83.8MiB/s (87.9MB/s), 20.0MiB/s-22.2MiB/s (21.0MB/s-23.3MB/s), io=1024MiB (1074MB), run=11523-12217msec
NOTE: The stargz snapshotter waits for prefetch completion for 10s by default. But in the fio+eStargz benchmark, the prefetch target is large (256MB * 4 files + the fio binary), so it doesn't finish within 10s and the container starts before prefetch completes. If we configure a longer timeout (e.g. 30s?), the BW should be even better.
cc: @sequix