layer resolver: Avoid many cache misses occur when many pullings of images happen #600

Merged
merged 1 commit into from
Jan 25, 2022

@ktock ktock commented Jan 21, 2022

When filesystem.Mount is called for the first layer in an image, stargz-snapshotter resolves all layers in that image in parallel and caches the metadata of these resolved layers in an LRU cache. When filesystem.Mount is then called for the neighbouring layers of that image, the cached layers can be reused to speed up the mounts.

However, when many images are pulled in parallel, layers are quickly evicted from the LRU cache by other image pulls, and many cache misses occur. This results in high resource consumption (e.g. fds), many duplicated requests to the registry, etc.
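
The eviction problem described above can be illustrated with a small simulation. This is only a sketch: the `lru` type, `countMisses`, and the capacities and layer counts are hypothetical values chosen for illustration, not the snapshotter's actual implementation or configuration.

```go
package main

import (
	"container/list"
	"fmt"
)

// lru is a hypothetical fixed-size LRU set of keys, for illustration only.
type lru struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func newLRU(capacity int) *lru {
	return &lru{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Add inserts key, evicting the least recently used key when over capacity.
func (c *lru) Add(key string) {
	if e, ok := c.items[key]; ok {
		c.order.MoveToFront(e)
		return
	}
	c.items[key] = c.order.PushFront(key)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(string))
	}
}

// Get reports whether key is cached and marks it as recently used.
func (c *lru) Get(key string) bool {
	e, ok := c.items[key]
	if ok {
		c.order.MoveToFront(e)
	}
	return ok
}

// countMisses caches every layer of every image up front (as the first Mount
// of each image does), then looks each layer up again (as the later Mounts do)
// and counts how many lookups miss.
func countMisses(capacity, images, layersPerImage int) int {
	c := newLRU(capacity)
	for i := 0; i < images; i++ {
		for l := 0; l < layersPerImage; l++ {
			c.Add(fmt.Sprintf("image%d/layer%d", i, l))
		}
	}
	misses := 0
	for i := 0; i < images; i++ {
		for l := 0; l < layersPerImage; l++ {
			if !c.Get(fmt.Sprintf("image%d/layer%d", i, l)) {
				misses++
			}
		}
	}
	return misses
}

func main() {
	fmt.Println(countMisses(100, 7, 10)) // capacity covers all 70 layers: prints 0
	fmt.Println(countMisses(20, 7, 10))  // only the last 20 layers survive: prints 50
}
```

In this simplified model a miss is only counted; in the real system each miss would trigger another resolve, which is where the extra fds and registry requests come from.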

This commit solves the problem by using a TTL-based cache instead of an LRU cache. The TTL cache has no size limit and evicts each element based on its TTL. This avoids the quick evictions and cache misses described above when many images are pulled in parallel. When filesystem.Mount is called for a layer of an image, mounts of the neighbouring layers follow shortly after. Once a layer's mount completes, reuse of that layer is managed by the snapshotter rather than by the filesystem, so the layer doesn't need to be cached for a long time. A TTL-based cache should therefore be a better choice than an LRU cache here.

max fd consumption comparison

The following client command mounts 7 images (74 snapshots) in parallel.

TARGETS="gcc:10.2.0 golang:1.12.9 jenkins:2.60.3 node:13.13.0 php:7.3.8 python:3.9 tomcat:10.0.0-jdk15-openjdk-buster"
for I in $TARGETS ; do
  ctr-remote i rpull --plain-http registry2:5000/${I}-esgz &
done

| branch | max fd consumption |
|---|---|
| main | 381 |
| This PR | 211 |

Each snapshot consumes 1 fd for FUSE and at most 2 fds for registry connections.
So stargz-snapshotter is expected to consume around 3 * number_of_snapshots fds (plus miscellaneous fds for the local cache, etc.).
The main branch consumes many more fds than expected because of the many cache misses and duplicated layer resolves.
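
As a rough check of the estimate above, the arithmetic for the 74 snapshots in this test can be written out. The constant and function names are hypothetical, and miscellaneous fds are left out.

```go
package main

import "fmt"

const (
	fuseFDsPerSnapshot        = 1 // one fd for FUSE per snapshot
	maxRegistryFDsPerSnapshot = 2 // at most two fds for registry connections
)

// estimateMaxFDs returns the upper-bound estimate from the description,
// excluding miscellaneous fds (local cache, etc.).
func estimateMaxFDs(snapshots int) int {
	return snapshots * (fuseFDsPerSnapshot + maxRegistryFDsPerSnapshot)
}

func main() {
	fmt.Println(estimateMaxFDs(74)) // prints 222
}
```

The 211 fds measured with this PR are close to that estimate of 222, while the 381 fds measured on main are well above it.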

@ktock ktock force-pushed the cachettl branch 2 times, most recently from 87f4b21 to 8a480a6 Compare January 21, 2022 02:00
	}
	if evicted[0] != key1 {
		t.Errorf("1st content %q must be evicted but got %q", key1, evicted[0])
		return
Member


Can we just use t.Fatalf instead of t.Errorf + return?

Member Author


Thank you for the review. Fixed.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>