layer resolver: Avoid many cache misses occur when many pullings of images happen #600

ktock · 2022-01-21T01:43:02Z

When filesystem.Mount is called for the first layer of an image, stargz-snapshotter resolves all layers of that image in parallel and caches the metadata of these resolved layers in an LRU cache. When filesystem.Mount is later called for the neighbouring layers of that image, the cached layers can be reused to speed up the mounts.

However, when many images are pulled in parallel, layers are quickly evicted from the LRU cache by other image pulls, and many cache misses occur. This results in high resource consumption (e.g. fds), many duplicated requests to the registry, etc.

This commit solves this by using a TTL-based cache instead of an LRU cache. The TTL cache has no size limit and evicts each element when its TTL expires. This avoids the quick evictions and cache misses described above when many images are pulled in parallel. When filesystem.Mount is called for one layer of an image, mounts of the neighbouring layers follow soon after. Once the mount of a layer completes, reuse of that layer is managed on the snapshotter's side rather than by the filesystem, so we don't need to cache the layer for a long time. A TTL-based cache should therefore be a better choice than an LRU cache here.
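To illustrate the idea, here is a minimal sketch of a TTL-based cache in Go. It is not the code from this PR: the names `ttlCache`, `newTTLCache`, `Add`, `Get` and `demo` are hypothetical, values are plain strings instead of resolved layers, and expiry is checked lazily on lookup with an injectable clock, which is an assumption made to keep the example deterministic.

```go
package main

import (
	"sync"
	"time"
)

// entry is a cached value together with the instant at which it expires.
type entry struct {
	value    string
	expireAt time.Time
}

// ttlCache is a hypothetical, simplified TTL cache. It has no size limit:
// adding an entry never evicts another one. Entries only go away when
// their own TTL has passed.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock (assumption, for testability)
	entries map[string]entry
}

func newTTLCache(ttl time.Duration, now func() time.Time) *ttlCache {
	return &ttlCache{ttl: ttl, now: now, entries: make(map[string]entry)}
}

// Add stores value under key and stamps it with an expiry time.
func (c *ttlCache) Add(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = entry{value: value, expireAt: c.now().Add(c.ttl)}
}

// Get returns the value if it is present and not yet expired.
// Expired entries are dropped lazily at lookup time.
func (c *ttlCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expireAt) {
		delete(c.entries, key)
		return "", false
	}
	return e.value, true
}

// demo looks up one entry before and after its TTL using a fake clock.
func demo() (hitBeforeTTL, hitAfterTTL bool) {
	clock := time.Unix(0, 0)
	c := newTTLCache(10*time.Second, func() time.Time { return clock })
	c.Add("layer-a", "resolved-a")
	_, hitBeforeTTL = c.Get("layer-a")
	clock = clock.Add(11 * time.Second) // advance past the TTL
	_, hitAfterTTL = c.Get("layer-a")
	return hitBeforeTTL, hitAfterTTL
}
```

In this sketch an entry survives for its whole TTL no matter how many other entries are added, which is the property that matters when many images are pulled at once.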

max fd consumption comparison

The following client command mounts 7 images (74 snapshots) in parallel.

TARGETS="gcc:10.2.0 golang:1.12.9 jenkins:2.60.3 node:13.13.0 php:7.3.8 python:3.9 tomcat:10.0.0-jdk15-openjdk-buster"
for I in $TARGETS ; do
  ctr-remote i rpull --plain-http registry2:5000/${I}-esgz &
done

branch	max fd consumption
main	381
This PR	211

Each snapshot consumes 1 fd for FUSE and at most 2 fds for registry connections.
So stargz-snapshotter is expected to consume around 3 * number_of_snapshots fds (plus miscellaneous fds for the local cache, etc.).
The main branch consumes far more fds than expected because of the many cache misses and duplicated layer resolves.
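As a rough check of that estimate, the arithmetic can be written out as below. The function name `expectedMaxFDs` is hypothetical, and the result is only an approximate upper bound: it ignores the miscellaneous fds and assumes every snapshot holds both registry connections open.

```go
package main

// expectedMaxFDs estimates the fds used by the given number of snapshots:
// 1 fd for FUSE plus at most 2 fds for registry connections per snapshot.
// Miscellaneous fds (local cache, etc.) are not counted.
func expectedMaxFDs(snapshots int) int {
	const fuseFDs = 1
	const maxRegistryFDs = 2
	return snapshots * (fuseFDs + maxRegistryFDs)
}
```

For the 74 snapshots in the run above this gives 3 * 74 = 222. The 211 measured with this PR is in line with that estimate, while the 381 measured on main is well above it.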

AkihiroSuda · 2022-01-24T17:32:30Z

util/cacheutil/ttlcache_test.go

+	}
+	if evicted[0] != key1 {
+		t.Errorf("1st content %q must be evicted but got %q", key1, evicted[0])
+		return


Can we just use t.Fatalf instead of t.Errorf + return?

Thank you for the review. Fixed.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>

ktock force-pushed the cachettl branch 2 times, most recently from 87f4b21 to 8a480a6 Compare January 21, 2022 02:00

AkihiroSuda reviewed Jan 24, 2022


Avoid many cache misses occur when many pullings of images happen

16166d7

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>

ktock force-pushed the cachettl branch from 8a480a6 to 16166d7 Compare January 24, 2022 23:58

AkihiroSuda approved these changes Jan 25, 2022


AkihiroSuda merged commit 38baee4 into containerd:main Jan 25, 2022

ktock deleted the cachettl branch February 25, 2022 03:01

ktock mentioned this pull request Mar 2, 2022

Issue pulling in a busy host - too many open files #526

Closed

zhiyuanGH mentioned this pull request Apr 24, 2023

Is there an explicit way of clearing cache? #1213

Open
