fix: allow fast-cache hits on custom BlobSource #6211

Merged
merged 1 commit into from
Dec 6, 2023

Conversation

jedevc
Member

@jedevc jedevc commented Dec 4, 2023

Fixes #6163 🎉 (cc @gerhard @marcosnils)

By prefixing the returned CacheKey with "session:" we activated the code path in buildkit that randomizes the digest to ensure that we don't get fast-cache hits (since session ids can be reused). See https://github.com/moby/buildkit/blob/06c971ffb4d3873207fe8ff7672026a718784060/solver/llbsolver/ops/source.go#L91-L94.

However, unlike LocalSources, BlobSources are not tied to a specific session (after initializing them), so we actually want to avoid this behavior.

With this change, digests aren't randomized, so we can get the fast-cache hits.
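
To make the description above concrete, here is a minimal sketch of the prefix check. It assumes that the "randomization" amounts to relabeling the key's sha256 digest with a random: prefix, which is how the later comments in this thread talk about it; it is not a copy of the linked buildkit code. The function name sourceDigest and the cacheType constant are hypothetical, and only the Go standard library is used rather than buildkit's digest package.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// cacheType is a placeholder namespace mixed into every source digest.
// The exact string buildkit uses is not taken from this thread.
const cacheType = "source.example"

// sourceDigest hashes a source cache key. For keys that start with
// "session:" it swaps the "sha256:" prefix for "random:", so later stages
// can recognise the record and treat it specially.
func sourceDigest(key string) string {
	sum := sha256.Sum256([]byte(cacheType + ":" + key))
	dgst := "sha256:" + hex.EncodeToString(sum[:])
	if strings.HasPrefix(key, "session:") {
		dgst = "random:" + strings.TrimPrefix(dgst, "sha256:")
	}
	return dgst
}
```

In this model, a key such as "session:abc" ends up with a random:-prefixed digest, while a key without the prefix keeps its sha256: digest, which is the effect this change relies on.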

@jedevc jedevc added this to the v0.9.4 milestone Dec 4, 2023
@jedevc jedevc requested a review from sipsma December 4, 2023 18:12
Member Author

jedevc commented Dec 4, 2023

Again, not 100% sure how to test this

Contributor

sipsma commented Dec 4, 2023

@jedevc I agree it's possible that this is involved somehow, but I'm not yet fully connecting the dots as to how it would explain it.

The reason we switched to the session: prefix is that we don't want it to be included in remote cache exports (either the standard buildkit upstream caching or our cache service). AFAIK that's the only place where that random: prefix gets interpreted in buildkit (worth a double check).

The problem in #6163 was that the caching all worked as expected when only local cache was being used (which would include the blob source and associated session:/random: stuff), but once either buildkit upstream remote caching or the cache service was used, caching stopped working as expected and we'd get misses.

The expected behavior would have been that even though remote caching is being used and that remote cache doesn't have layers for the session: prefixed op (either blob:// or local://), the local cache is still checked for those and gets a match.

I can't currently see why those local cache hits would disappear just when a remote cache source is added. Again, totally possible the session: prefix is involved somehow, but not clear yet to me how or if it's something else entirely.


For repro-ing/testing, the integ tests in core/integration/remotecache_test.go may be helpful; you could hopefully use those to repro the original case.

@jedevc jedevc marked this pull request as draft December 4, 2023 19:26
Contributor

sipsma commented Dec 4, 2023

One thing worth checking too would be if this can be repro'd in a buildkit integ test (i.e. one of the client/client_test.go ones). Basically, see if use of a local:// source as a mounted directory results in cache misses when remote cache is used. The outcome there, whether repro-able or not, would help focus in on where to look next.

(just an unsolicited idea 😅, obviously do whatever debugging process you feel is best)

Member Author

jedevc commented Dec 5, 2023

Right, sorry, this is a bit of an unstructured comment.

> The problem in #6163 was that the caching all worked as expected when only local cache was being used (which would include the blob source and associated session:/random: stuff), but once either buildkit upstream remote caching or the cache service was used, caching stopped working as expected and we'd get misses.

Hm, I think I've missed some context then. My understanding of #6163 is:

  • Local cache works entirely correctly
  • Local cache with the caching service (or an upstream buildkit cache exporter) works entirely correctly
  • Just the caching service (or an upstream buildkit cache exporter) with no local cache (because we're in a clean state from a newly spawned engine) does not cache correctly.

> The reason we switched to the session: prefix is that we don't want it to be included in remote cache exports (either the standard buildkit upstream caching or our cache service). AFAIK that's the only place where that random: prefix gets interpreted in buildkit (worth a double check).

Aha, ok this was a subtlety that I'd missed (small aside - the git blame for this entire file goes back to #5757, which is a 35k line diff. It's almost impossible to track the history of any specific functionality through that, and so the justification for this and the fact that it changed is kind of lost. I wonder if for some PRs it might make more sense to rebase+merge instead of squash+merge - I read an interesting piece on this recently: https://gist.github.com/mitchellh/319019b1b8aac9110fcfb1862e0c97fb).

This is kind of strange though. Even with session: used, I still end up seeing the mounted directory ending up in the remote cache - I'm genuinely not quite sure why.

> I can't currently see why those local cache hits would disappear just when a remote cache source is added

Not quite sure on this myself, but, a side-note that I think is relevant:

  • WithMountedDirectory creates mounts on the future ExecOps that are run. These are not added with llb.Readonly, so we hit https://github.com/moby/buildkit/blob/06c971ffb4d3873207fe8ff7672026a718784060/solver/llbsolver/ops/exec.go#L242-L244. This means that there is no content-based slow cache run for any of these - this is the expected behavior with llb.Local. This is why this doesn't repro super nicely with an upstream test like suggested here: fix: allow fast-cache hits on custom BlobSource #6211 (comment).

    Local sources can only be matched with a content-based cache (because of the session:/random: behavior). This means that the only way to cache any local:// source then is to either use it as an input to a FileOp/similar that uses content-based caching unconditionally or to use it as an llb.Readonly non-root mount in an ExecOp.

    We can test that this is relevant by modifying our dagger code to always add llb.Readonly for testing (which led me to the unrelated realization in fix: ensure readonly option is correctly propagated #6209) - if we do this, then the issue in 🐞 WithMountedDirectory invalidating remote cache #6163 goes away - this is because by setting llb.Readonly on the mount we allow for the content-based caching, which means that we can match on any blob with the same content (even when they have different cache keys because of session:/random:). However, this isn't a good solution, since llb.Readonly makes the filesystem read-only :)

    LLB also has llb.ForceNoOutput to enable the slow cache here... which would be really nice - but we actually do need the output, since we use it. If I disable that use and set llb.ForceNoOutput, 🐞 WithMountedDirectory invalidating remote cache #6163 is resolved.

    ...long tangent here, but the TL;DR is that all of the above enables the slow-cache - which points to the issue actually being related to failing to match the fast cache.
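
The fast-cache versus slow-cache distinction above can be sketched as a toy lookup. This is only a model built from the rules stated in this comment (content-based matching is available for a read-only non-root mount, or when the mount's output is discarded); the types mount and cache and the function contentCacheAllowed are hypothetical and do not mirror buildkit's solver.

```go
package main

// mount captures the mount properties that matter for this discussion.
type mount struct {
	readonly bool // mounted read-only (as with llb.Readonly)
	root     bool // mounted at "/"
	noOutput bool // output discarded (as with llb.ForceNoOutput)
}

// contentCacheAllowed models the rule described above: the content-based
// (slow) cache is only consulted for read-only non-root mounts or for
// mounts whose output is not kept.
func contentCacheAllowed(m mount) bool {
	return (m.readonly && !m.root) || m.noOutput
}

// cache is a toy store with a definition-keyed (fast) index and a
// content-keyed (slow) index.
type cache struct {
	fast map[string]string
	slow map[string]string
}

// lookup tries the fast key first and falls back to the content key only
// when the mount qualifies for content-based caching.
func (c cache) lookup(fastKey, contentKey string, m mount) (string, bool) {
	if v, ok := c.fast[fastKey]; ok {
		return v, true
	}
	if contentCacheAllowed(m) {
		v, ok := c.slow[contentKey]
		return v, ok
	}
	return "", false
}
```

In this model, a read-write mount whose fast key changed between runs always misses, even when identical content is stored under the slow index, whereas the same lookup with readonly or noOutput set finds it.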

Back to the question - why does this not work with a remote cache? Because session: means that we use a noopRecord, and so we don't match the fast cache (but we are still able to match the slow cache in some cases).


The mysteries remaining for me are:

  • Why do we seem to upload the mounted directory contents? session: should mean we don't do this.
  • I expect buildkit to be able to match partial cache chains here - even if the local source isn't cached, things on top of that should be able to be cached, but that doesn't appear to be happening...

Contributor

sipsma commented Dec 5, 2023

> Hm, I think I've missed some context then. My understanding of #6163 is:

Oh it's possible I misunderstood too. @marcosnils can you confirm whether the problem only happens when the local cache is empty and remote cache is enabled? Or does it happen when there is local cache from previous runs and the remote cache is enabled?

> Aha, ok this was a subtlety that I'd missed (small aside - the git blame for this entire file goes back to #5757, which is a 35k line diff. It's almost impossible to track the history of any specific functionality through that, and so the justification for this and the fact that it changed is kind of lost.

Ah sorry 😞 That PR in particular was just bad because it was very hard to split up into smaller chunks, but an unsquashed merge would have also been a mess.

In general I'm a fan of comments to clarify non-obvious things like this, but that's also my bad then since I should have left a comment on what was going on there.

Of course always feel free to just ping me if something is not obvious from the history, but I'll try to keep this in mind more going forward (both comments and using git merge when commits are very cleanly separated).

> This is why this doesn't repro super nicely with an upstream test like suggested here

I'm not following this, you just wouldn't set ReadOnly or ForceNoOutput when creating the mount in LLB in the integ test, right? Those are not the defaults in LLB or anything.

> Local sources can only be matched with a content-based cache (because of the session:/random: behavior).

Do you have any more insight into the exact machinations here? I am familiar w/ the whole ReadOnly/ForceNoOutput behavior around content-based caching, but random: specifically making it impossible for fast cache matches to be hit is new to me.

Basically the two places I see in buildkit where random: gets interpreted are:

  1. In the remote cache chain code, where random: records are skipped (as we know about), here
  2. When creating the solver cache manager's cache key for a root (i.e. source) vertex, here
    • But AFAIK all this does is retain the random: prefix in place of sha256:; it doesn't actually change any behavior

I'm just missing the place where this would prevent there from being a fast cache map relative to any other llb op.
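
For illustration, the first item above (random: records being skipped when building the remote cache chain) could look roughly like the filter below. This is a hypothetical sketch, not buildkit's exporter: the record type, its fields and the exportable function are made up for this example.

```go
package main

import "strings"

// record is a hypothetical cache record: its own digest plus the digest of
// the record it was built on top of (empty for a source).
type record struct {
	digest string
	parent string
}

// exportable returns the records a remote exporter would keep if it simply
// skipped every record whose digest carries the "random:" prefix.
func exportable(records []record) []record {
	var out []record
	for _, r := range records {
		if strings.HasPrefix(r.digest, "random:") {
			continue // session-scoped source: never exported
		}
		out = append(out, r)
	}
	return out
}
```

Note that in this toy model a record stacked on top of a skipped source is still exported, and it still names the skipped digest as its parent.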

> This means that the only way to cache any local:// source then is to either use it as an input to a FileOp/similar that uses content-based caching unconditionally

Right so some more background info if you aren't already aware:

  1. We internally do an llb copy of the local import: https://github.com/sipsma/dagger/blob/96b73cf711dea3b28b47923e4bb3ce5350e8d537/engine/buildkit/filesync.go#L68-L68
    • This is mostly so that we allow the local source snapshot to remain mutable and thus re-usable for subsequent imports, but has an influence on the caching too
    • We've also pretty much always done that, not a new thing with blob://
  2. However, we do the blob:// creation after that, so then we end up re-using a session: op directly again: https://github.com/sipsma/dagger/blob/96b73cf711dea3b28b47923e4bb3ce5350e8d537/engine/buildkit/filesync.go#L133-L133
  3. There was a previous issue in the same sort of area as all this that resulted in a fix and regression test, as described here https://github.com/sipsma/dagger/blob/96b73cf711dea3b28b47923e4bb3ce5350e8d537/core/integration/remotecache_test.go#L123-L123

No precise point to be made here, just making sure you're aware of all that

> the TL;DR is that all of the above enables the slow-cache - which points to the issue actually being related to failing to match the fast cache.

If this is indeed what it comes down to and buildkit considers the current behavior expected, I wonder if one option would be a tiny upstream change that allows LLB users to force use of slow content based caching even when read-only and forcenooutput are not set? Could be very trivial and provided the performance hit is not overly noticeable, totally fine for us.

Member Author

jedevc commented Dec 5, 2023

> Do you have any more insight into the exact machinations here? I am familiar w/ the whole ReadOnly/ForceNoOutput behavior around content-based caching, but random: specifically making it impossible for fast cache matches to be hit is new to me.

Ah, that's definitely me being unclear. I was referring to upstream's local sources, where it's the randomized session id here that means that the fast cache is effectively ignored (because the generated cache key contains it).

However, digging further it definitely appears that buildkit is doing something odd - if I fix the session ID to be constant with a hack (otherwise it's randomized, and then it doesn't even match with local cache), then the fast-cache works, but only with the local cache. In an example kind of like:

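	// Pin the base image by digest so the definition is identical across runs.
	img := llb.Image("alpine:latest@sha256:34871e7290500828b39e22294660bee86d966bc0017544e848dd9a255cdf59e0")
	img = img.Run(llb.Args([]string{"apk", "add", "curl"})).Root()

	// An expensive step with a read-write (non-Readonly) local mount.
	run := img.Run(llb.Args([]string{"/bin/sh", "-c", "sleep 10 && echo foo > /foo"}))
	run.AddMount("/mount", llb.Local("context"))
	img = run.Root()

	// Marshal the final state into a definition to hand to the solver.
	dt, err := img.Marshal(context.TODO(), llb.LinuxAmd64)
	if err != nil {
		panic(err)
	}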

(this is why it's tricky to write an integration test with a local source - you'd need to fix the session id, which is how you'd emulate what our blob source does by setting a unique session: key)

In this kind of scenario, random: prefixes don't get pushed to a remote cache. What's super weird is that we actually are pushing the layer containing just /foo to the cache - but then it doesn't actually seem possible to match on it - even with an identical definition.

I'm not 100% sure what the right behavior for upstream would be, but it feels like it should either:

  • Match the /foo layer to cache the expensive sleep step (which is what we want)
  • Don't ever produce the /foo layer since it would never match

Note that this scenario isn't actually possible with the dockerfile implementation, since it can often use content caching and because the local:// source is always randomized with session:. It feels like because we don't do either of those, we've fallen right into an edge case that has always been upstream.

Contributor

marcosnils commented Dec 6, 2023

> @marcosnils can you confirm whether the problem only happens when the local cache is empty and remote cache is enabled? Or does it happen when there is local cache from previous runs and the remote cache is enabled?

yes, the problem only happens when the local cache is empty and remote cache is enabled. Otherwise, it works as intended as local cache is used.

Contributor

sipsma commented Dec 6, 2023

> yes, the problem only happens when the local cache is empty and remote cache is enabled. Otherwise, it works as intended as local cache is used.

Good to know, thanks for clarifying. That definitely makes more sense with Justin's findings so far; it seems that there may be a general issue w/ buildkit not being able to have a remote cache match w/ a local:// or blob:// source unless it's specifically mounted ReadOnly or ForceNoOutput, neither of which are things users can do w/ dagger today (and that in general they shouldn't be forced to do even if we exposed them as options).

Also been dm'ing w/ Justin about this, he found a relatively straightforward upstream change that might fix this general problem, but we need some more confirmation. And even if it doesn't work out for whatever reason we have more options available too.

@jedevc jedevc force-pushed the fix-randomized-blob-cache-id branch from 7af6f88 to 1e68b78 Compare December 6, 2023 11:58
By prefixing the returned CacheKey with "session:" we activated the code
path in buildkit that randomizes the digest to ensure that we don't get
fast-cache hits (since session ids can be reused).

However, unlike LocalSources, BlobSources are not tied to a specific
session (after initializing them), so we actually want to avoid this
behavior. With this change, digests aren't randomized, so we can get the
fast-cache hits.

However, this isn't fully correct - see the comments for details.

Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc force-pushed the fix-randomized-blob-cache-id branch from 1e68b78 to 14c5e83 Compare December 6, 2023 12:08
Member Author

jedevc commented Dec 6, 2023

Ok, so to summarize:

  • We have an upstream buildkit issue where stable random: IDs will not ever match any definition-based fast cache - I'll follow this up later today, and open a proposed fix.
  • We have a dagger issue where we upload the host directory into the remote cache because of the use of the llb.Copy.

I think there's a good argument for merging this PR change anyways (with a lot of comments and TODOs explaining).

> The reason we switched to the session: prefix is that we don't want it to be included in remote cache exports (either the standard buildkit upstream caching or our cache service). AFAIK that's the only place where that random: prefix gets interpreted in buildkit (worth a double check).

I think that because of the llb.Copy thing we're doing today, this change doesn't cause any extra content to be uploaded to the cache (from where we are today) - it also fixes a real-world issue that we can solve when we release tomorrow.

Longer term we need to not upload the llb.Copy cache - this is maybe a bit tricky in its own right, but then we also need to solve the random: caching issue that I'll follow up with later.

@sipsma any objections to taking this as a temporary hot-fix?

@jedevc jedevc marked this pull request as ready for review December 6, 2023 14:27
Contributor

@sipsma sipsma left a comment


LGTM! double checked and it does seem like the other previous fix in #5885 should cover us in terms of the issues around laziness, so I agree this is good for a hotfix pending upstream fixes 🎉

@jedevc jedevc merged commit a789dbe into dagger:main Dec 6, 2023
82 checks passed
Contributor

marcosnils commented Dec 6, 2023

🎉 thx everyone for the hard work on this one. Looking forward to seeing what we can do to surface some of these things so we can provide more visibility to Dagger users 👐

@jedevc jedevc deleted the fix-randomized-blob-cache-id branch December 6, 2023 17:51
Successfully merging this pull request may close these issues.

🐞 WithMountedDirectory invalidating remote cache
3 participants