
Improve file read performance #105

Merged: 2 commits merged into containerd:master from ktock:fs-p on Jun 4, 2020

Conversation

@ktock (Member) commented Jun 1, 2020

The benchmark in #101 showed that file read performance is low, especially on random and parallel reads. This PR includes two patches to mitigate this problem and improve file read performance.

The first patch squashes HTTP Range Request headers on large prefetches. The current filesystem implementation splits the prefetch range into many small, mostly neighbouring chunks and lists them in a single HTTP Range Request header without enough squashing. Sometimes this leads to a large HTTP header and the prefetch request fails. This patch solves the problem by aggressively squashing neighbouring/overlapping chunks in the HTTP header as much as possible.
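
A rough sketch of the squashing idea, for illustration only (types and names are assumptions, not the patch's actual code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// region is an inclusive byte range [b, e].
type region struct{ b, e int64 }

// squash merges regions that overlap or directly neighbour each other so
// that the resulting Range header stays small.
func squash(regs []region) []region {
	if len(regs) == 0 {
		return nil
	}
	sort.Slice(regs, func(i, j int) bool { return regs[i].b < regs[j].b })
	merged := []region{regs[0]}
	for _, r := range regs[1:] {
		last := &merged[len(merged)-1]
		if r.b <= last.e+1 { // neighbouring or overlapping: extend the last region
			if r.e > last.e {
				last.e = r.e
			}
			continue
		}
		merged = append(merged, r)
	}
	return merged
}

// rangeHeader renders the squashed regions as an HTTP Range header value.
func rangeHeader(regs []region) string {
	merged := squash(regs)
	parts := make([]string, len(merged))
	for i, r := range merged {
		parts[i] = fmt.Sprintf("%d-%d", r.b, r.e)
	}
	return "bytes=" + strings.Join(parts, ", ")
}

func main() {
	// Chunks 0-99, 100-199 and 150-249 collapse into a single range entry.
	fmt.Println(rangeHeader([]region{{0, 99}, {150, 249}, {100, 199}}))
	// Output: bytes=0-249
}
```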

The second one contains the following optimizations for caches:

  • minimizing byte slice allocations on the execution path of file read by leveraging sync.Pool (see the sketch after this list),
  • minimizing memory copies and disk I/O by allowing partial ranges of blobs to be fetched from the cache, and
  • minimizing the locked region in the cache.
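
As an illustration of the first point, a minimal sketch of buffer reuse with sync.Pool; the buffer size and helper names are assumptions, not the actual code in this PR:

```go
package main

import (
	"fmt"
	"sync"
)

const bufSize = 512 * 1024 // assumed buffer size; the real chunk size may differ

// bufPool hands out reusable byte slices so the read path does not allocate
// a fresh buffer for every file read.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, bufSize) },
}

// withBuffer borrows a buffer for the duration of fn and returns it afterwards.
func withBuffer(fn func(b []byte)) {
	b := bufPool.Get().([]byte)
	defer bufPool.Put(b)
	fn(b)
}

func main() {
	withBuffer(func(b []byte) {
		fmt.Println("borrowed buffer of", len(b), "bytes")
	})
}
```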

fio results

Measured on a GitHub Actions ubuntu-18.04 runner.

fio conf

bs=4k
rw=randread
size=256M
numjobs=4

master branch

See https://github.com/ktock/stargz-snapshotter/actions/runs/121204873 for more details

| image type | fio summary | memory consumption (in KB, ps output just after the fio completion) |
|---|---|---|
| legacy (ext4+overlayfs) | READ: bw=32.6MiB/s (34.2MB/s), 8343KiB/s-8363KiB/s (8543kB/s-8564kB/s), io=1024MiB (1074MB), run=31346-31422msec | VSZ: 873880, RSS: 32040 |
| stargz | READ: bw=1330KiB/s (1362kB/s), 333KiB/s-333KiB/s (341kB/s-341kB/s), io=1024MiB (1074MB), run=788154-788231msec | VSZ: 3204508, RSS: 1925984 |
| eStargz | READ: bw=1571KiB/s (1609kB/s), 393KiB/s-479KiB/s (402kB/s-490kB/s), io=1024MiB (1074MB), run=547678-667481msec | VSZ: 3813632, RSS: 1994172 |

PR branch

See https://github.com/ktock/stargz-snapshotter/actions/runs/121255820 for more details.
Bandwidth (and latency) are improved compared to the master branch.

| image type | fio summary | memory consumption (in KB, ps output just after the fio completion) |
|---|---|---|
| legacy (ext4+overlayfs) | READ: bw=32.6MiB/s (34.2MB/s), 8345KiB/s-8353KiB/s (8545kB/s-8553kB/s), io=1024MiB (1074MB), run=31385-31414msec | VSZ: 800168, RSS: 32316 |
| stargz | READ: bw=16.9MiB/s (17.8MB/s), 4337KiB/s-4545KiB/s (4441kB/s-4654kB/s), io=1024MiB (1074MB), run=57681-60445msec | VSZ: 1618592, RSS: 261152 |
| eStargz | READ: bw=83.8MiB/s (87.9MB/s), 20.0MiB/s-22.2MiB/s (21.0MB/s-23.3MB/s), io=1024MiB (1074MB), run=11523-12217msec | VSZ: 1453484, RSS: 185632 |

NOTE: The stargz snapshotter waits for prefetch completion for 10s by default. But in the fio+eStargz benchmark, the prefetch target is large (256MB * 4 files + the fio binary), so it doesn't finish within 10s and the container starts before the prefetch completes. If we configure this timeout to be longer (e.g. 30s?), the bandwidth should be even better.

cc: @sequix

@ktock mentioned this pull request Jun 1, 2020

@sequix left a comment:

Great job! I think this is a great place for us to discuss the cache. Besides what I have commented in the code, consider:

  1. Based on our previous conversation in "Run KinD example got memory leak" #88:

    "Yes, we cache layer data in chunk granularity and have caches for some Readers used in this snapshotter."

    Won't it lead to many small files in the cache directory and eventually use up all the inodes?

  2. Why do we need a ref-cnt LRU cache? From my point of view, there is still a global lock involved there.

cache/cache.go Outdated
}

// Cache the opened file for future use. If "direct" option is specified, this
// won't be done. This option is useful for priventing file cache from being
typo: preventing

cache/cache.go Outdated
fmt.Printf("Warning: failed to write cache: %d(wrote)/%d(expected): %v\n",
n, len(p), err)
want := b2.Len()
if _, err := io.CopyN(f, b2, int64(want)); err != nil {
In order to minimize inconsistency, it should write to a temp file, then mv the temp file to the dest file.
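
A minimal sketch of the suggested temp-file-then-rename approach (names are assumptions, not the snapshotter's actual code):

```go
package cache

import (
	"io"
	"os"
	"path/filepath"
)

// writeCacheFile writes r to a temporary file in the destination directory
// and only then renames it into place, so a crash mid-write never leaves a
// partially written cache entry visible under dest.
func writeCacheFile(dest string, r io.Reader) error {
	tmp, err := os.CreateTemp(filepath.Dir(dest), "cache-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless after a successful rename
	if _, err := io.Copy(tmp, r); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), dest) // atomic on the same filesystem
}
```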

cache/cache.go Outdated
// won't be done. This option is useful for priventing file cache from being
// polluted by data that won't be accessed immediately.
if opt.direct || !dc.fileCache.add(key, f) {
f.Close()
Will this f be closed at some point in the future? If not, this fd will remain open until the layer is unmounted.
That amounts to an fd leak for a long-running program.

ktock (Member Author) replied:

The reference count for this object is decreased when the done() callback is called.

https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/cache/cache.go#L132

and if the ref count becomes 0 (meaning nobody refers to this object), the pre-defined finalization (i.e. closing the file) is invoked.

https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/cache/cache.go#L315-L318

https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/cache/cache.go#L92-L94
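
For illustration, a minimal sketch of this done()-callback/ref-count pattern; the types and names here are made up, see the linked code for the actual implementation:

```go
package cache

import (
	"os"
	"sync"
)

// refCountedFile illustrates the pattern described above: every user holds a
// reference, and the finalizer (closing the file) runs only when the last
// done() callback brings the count back to zero.
type refCountedFile struct {
	mu   sync.Mutex
	refs int
	f    *os.File
}

// acquire registers a new user and returns the done() callback that releases
// the reference; calling it more than once is harmless.
func (r *refCountedFile) acquire() (done func()) {
	r.mu.Lock()
	r.refs++
	r.mu.Unlock()

	var once sync.Once
	return func() {
		once.Do(func() {
			r.mu.Lock()
			r.refs--
			last := r.refs == 0
			r.mu.Unlock()
			if last {
				r.f.Close() // finalization: nobody refers to the object anymore
			}
		})
	}
}
```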

return
// TODO: should we cache the entire file data on memory?
// but making I/O (possibly huge) on every fetching
// might be costly.
I think we should cache everything to disk and use memory (with a limited size) as a cache of the disk (caches are supposed to be layered, just like CPU L1-L2-L3 caches -> memory -> disk -> remote server).
This is easily achieved in the prefetch scenario, but not in the normal scenario.
In the normal scenario, a writeback is needed.

ktock (Member Author) replied:

The problem here is that copying the entire chunk file from disk to memory on every access can be very costly. (NOTE: the client of the cache doesn't always query the entire chunk file. From this perspective, the new FetchAt API in this patch lets it request partial contents, which reduces unnecessary I/O and leads to the performance improvement shown above.) While writing this patch, I tried tackling this TODO by copying chunk file contents to memory in the background (in a goroutine), but this resulted in noisy I/O and lower performance.

Either way, I think we can keep this out of scope for this PR and tackle it in another PR (contributions are welcome 😄).
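
To illustrate the partial-fetch idea, a minimal sketch; the real FetchAt signature in this PR may differ, and the names here are assumptions:

```go
package cache

import "os"

// fetchAt reads only the requested bytes from the cached chunk file instead
// of loading the whole chunk into memory.
func fetchAt(path string, offset int64, p []byte) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return f.ReadAt(p, offset) // copies only len(p) bytes at the given offset
}
```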

// When Direct option is specified for FetchAt and Add methods, these operation
// won't use on-memory caches. When you know that the targeting value won't be
// used immediately, you can privent the limited space of on-memory caches from
// being polluted by these unimportant values.
The snapshotter, as a basic piece of infrastructure, cannot tell the difference between images, and users have little willingness to tune these parameters. We should let the image speak; just keeping the most-recently-used data in memory is enough (LRU or ring buffer).

ktock (Member Author) replied:

This option isn't for snapshotter users but for the filesystem (so it is used only internally).

https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/stargz/fs.go#L294
https://github.com/ktock/stargz-snapshotter/blob/33f7d6fed1279b668af21e15dbf1573451c88221/stargz/fs.go#L299

The filesystem fetches layer contents in the background and stores all chunks of layer data in the cache. This is good from the availability perspective, but are these chunks accessed again immediately? Maybe not. Chunks accessed via FUSE reads are more likely to be accessed again soon. So we have priorities among cached contents. We have limited space in the on-memory LRU cache, so we don't want prioritized chunks (e.g. ones accessed via FUSE reads) to be evicted to make room for less-important chunks (e.g. ones fetched in the background). This patch lets the filesystem tell the cache that "this chunk I'm adding is less important" using the Direct option, which leads to the performance improvement shown above.
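
To illustrate how such an option can be plumbed through, a minimal sketch using a functional option; the names Option, Direct and Add here are assumptions, not the snapshotter's actual API:

```go
package cache

// addConfig holds per-call behaviour for Add.
type addConfig struct {
	direct bool // skip the on-memory cache and go straight to disk
}

type Option func(*addConfig)

// Direct marks the value as unlikely to be read again soon, so it should not
// evict hotter entries from the limited on-memory cache.
func Direct() Option {
	return func(c *addConfig) { c.direct = true }
}

type memoryBackedCache struct {
	memory map[string][]byte // stand-in for the LRU memory cache; initialized by a constructor (omitted)
}

// Add stores the value, keeping it in memory only when Direct is not given.
func (c *memoryBackedCache) Add(key string, p []byte, opts ...Option) {
	var cfg addConfig
	for _, o := range opts {
		o(&cfg)
	}
	if !cfg.direct {
		c.memory[key] = p // hot data stays in memory
	}
	// Either way, the data would also be written to the disk cache here.
}
```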

@ktock (Member Author) commented Jun 1, 2020

Thanks for your comments!

  1. Based on our previous conversation in "Run KinD example got memory leak" #88:
    Won't it lead to many small files in the cache directory and eventually use up all the inodes?

Partial layer contents are pulled lazily, so if we don't have fixed offsets for chunks, it is hard to reuse them. For example, if a process on the filesystem requires the range 3-6, there is no guarantee that another process accesses the same range; it could be 2-4 or 5-7 or whatever. Making an HTTP request every time a different (but possibly overlapping) range is required is costly. We should pre-define fixed offsets like 0-4, 5-9, ... and download 0-9 on the first access to 3-6, which avoids duplicated range requests on future 2-4 or 5-7 accesses (a rough sketch of this alignment follows). But I believe we have better ways to do it. The reason why we store each chunk as a file is that this was one of the simplest ways to do it as an initial implementation of the cache.
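
For concreteness, a minimal sketch of aligning an arbitrary request to fixed chunk boundaries; the 5-byte chunk size just mirrors the 0-4/5-9 example above and is not the real chunk size:

```go
package main

import "fmt"

const chunkSize int64 = 5 // e.g. chunks at offsets 0-4, 5-9, ...

// alignRange expands [off, off+length) outwards to chunk boundaries so that
// different overlapping requests map onto the same cacheable chunks.
func alignRange(off, length int64) (alignedOff, alignedLen int64) {
	start := (off / chunkSize) * chunkSize
	end := ((off + length + chunkSize - 1) / chunkSize) * chunkSize
	return start, end - start
}

func main() {
	off, length := alignRange(3, 4) // a request for bytes 3-6
	fmt.Println(off, length)        // 0 10: download chunks 0-4 and 5-9
}
```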

2. Why do we need a ref-cnt LRU cache? From my point of view, there is still a global lock involved there.

I think having an on-memory cache itself helps avoid slow disk I/O, but I believe we might have better ways to do this. This patch wraps groupcache/lru to make it reference-aware, keeping the locked region as small as possible (a rough sketch of the wrapping idea follows). This is because groupcache/lru is agnostic about the ref count of each cached object and can suddenly evict the oldest object (which triggers the defined finalization, such as closing the file) even if it is still being used by someone. Ideas are very welcome if you have lock-free ways of doing this.
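
A minimal sketch of the reference-aware wrapping idea, not the actual implementation in this PR; structure and names are illustrative:

```go
package cache

import (
	"sync"

	"github.com/golang/groupcache/lru"
)

// entry is a cached object plus the bookkeeping needed to defer its
// finalizer until it is both evicted from the LRU and unreferenced.
type entry struct {
	refs     int
	evicted  bool
	finalize func() // e.g. close the backing file
}

// refCache wraps groupcache/lru so that eviction alone never finalizes an
// object somebody is still using. All state is guarded by mu, which is held
// as briefly as possible.
type refCache struct {
	mu    sync.Mutex
	cache *lru.Cache
}

func newRefCache(maxEntries int) *refCache {
	c := &refCache{cache: lru.New(maxEntries)}
	c.cache.OnEvicted = func(key lru.Key, value interface{}) {
		// Invoked by the wrapped LRU when it drops an entry; in this sketch
		// every call into the wrapped cache happens under c.mu, so e is protected.
		e := value.(*entry)
		e.evicted = true
		if e.refs == 0 {
			e.finalize() // nobody is using it, finalize right away
		}
	}
	return c
}

// get returns the entry together with a release callback (call it exactly
// once). The finalizer runs only once the entry has been evicted and the
// last reference is released.
func (c *refCache) get(key string) (*entry, func(), bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.cache.Get(key)
	if !ok {
		return nil, nil, false
	}
	e := v.(*entry)
	e.refs++
	release := func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		e.refs--
		if e.refs == 0 && e.evicted {
			e.finalize()
		}
	}
	return e, release, true
}
```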

@sequix commented Jun 2, 2020

Now I have a full grasp of the read path:

                            +----------+      +-------------+
                            | registry |      |    fuse     |
                            +----+-----+      +------+------+
                                 |                   ^
                                 v                   |
                            +----+-----+      +------+--------+  fulfill fuse read requests
     merge region           | fetcher  |      | reader.Reader |  lru cache decompressed chunks
     minimize requests      +----+-----+      +------+--------+  both in memory and disk
                                 |                   ^
                                 v                   |
                            +----+-----+      +------+------+  decompress gzipped chunks
lru cache gzipped chunks    |   blob   +----->+ crfs.Reader |  provide file-level access
both in memory and disk     +----------+      +-------------+  no cache at all

And the problems to tackle to achieve better performance:

  1. minimize network load in the fetcher;
  2. merge blob and crfs.Reader, or eliminate crfs.Reader, in order to have a single cache layer;
  3. the single cache should be as simple as possible (lock-free is best).

My naive thoughts:

  1. the fetcher should fetch fixed-size chunks (ideally the OS page size); it seems this is how things are done now;
  2. migrate crfs.Reader here to get that single cache;
  3. a ring-buffer-based cache; it can also keep hot data in memory like LRU, but with less locking.

This PR LGTM. If you feel good about my thoughts above, I'll make a PR this week.

@ktock changed the title from "Improve file read performance" to "[WIP] Improve file read performance" on Jun 2, 2020
@ktock (Member Author) commented Jun 2, 2020

@sequix

Now I have a full grasp of the read path

Great! Thanks a lot!

  • minimize network load in the fetcher;
  • merge blob and crfs.Reader, or eliminate crfs.Reader, in order to have a single cache layer;
  • the single cache should be as simple as possible (lock-free is best).

Yes, all of them are still problems.

the fetcher should fetch fixed-size chunks (ideally the OS page size); it seems this is how things are done now;

Though I don't know what the best chunk size is, fixed-size chunks are currently effective for performance.

migrate crfs.Reader here to get that single cache.

I agree with this.

a ring-buffer-based cache; it can also keep hot data in memory like LRU, but with less locking.

Though we should eventually rely on benchmarking to evaluate how beneficial a ring-buffer-based cache would be, I think this is worth trying.

There is room for discussion about which chunks should be stored in memory:

  • Storing all accessed chunks in memory: ideal, but this can cause a performance problem (Improve file read performance #105 (comment)), though we can possibly mitigate it by carefully tuning the chunk size.
  • Using the on-memory cache as a temporary space for chunks whose write to disk is in progress: this PR takes this approach. It works, but maybe we have better ways.

@ktock requested a review from AkihiroSuda June 2, 2020 06:57
@ktock changed the title from "[WIP] Improve file read performance" to "Improve file read performance" on Jun 2, 2020
ktock added 2 commits June 4, 2020 13:26
During prefetch, the current filesystem implementation chunks the prefetching range
into many small, mostly neighbouring chunks and lists them in a single HTTP
Range Request header without enough squashing. Sometimes this leads to a large
HTTP header and the request fails. This commit solves this by aggressively
squashing neighbouring/overlapping chunks in HTTP headers, as much as
possible.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Benchmarking in the community showed that the file read
performance is low, especially on random and parallel reads.

This commit solves this by the following fixes:
- minimizing the occurrence of slice allocation in the execution path of file
  read, leveraging sync.Pool,
- minimizing the memory copy and disk I/O by allowing partial ranges
  of blobs to be fetched from the cache, and
- minimizing the locked region in the cache.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
@ktock (Member Author) commented Jun 4, 2020

@AkihiroSuda PTAL:pray:

@AkihiroSuda merged commit 8f92e35 into containerd:master on Jun 4, 2020
@ktock deleted the fs-p branch June 4, 2020 08:19