-
Notifications
You must be signed in to change notification settings - Fork 557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: allow fast-cache hits on custom BlobSource #6211
Conversation
Again, not 100% sure how to test this 馃 |
@jedevc I agree it's possible that this is involved somehow, but I'm not yet fully connecting the dots as to how it would explain it. The reason we switched to the The problem in #6163 was that the caching all worked as expected when only local cache was being used (which would include the blob source and associated The expected behavior would have been that even though remote caching is being used and that remote cache doesn't have layers for the I can't currently see why those local cache hits would disappear just when a remote cache source is added. Again, totally possible the For repro-ing/testing, the integ tests in |
One thing worth checking too would be if this can be repro'd in a buildkit integ test (i.e. one of the (just an unsolicited idea 馃槄, obviously do whatever debugging process you feel is best) |
Right, sorry, this is a bit of an unstructured comment.
Hm, I think I've missed some context then. My understanding of #6163 is:
Aha, ok this was a subtlety that I'd missed (small aside - the git blame for this entire file goes back to #5757, which is a 35k line diff. It's almost impossible to track the history of any specific functionality through that, and so the justification for this and the fact that it changed is kind of lost. I wonder if for some PRs it might make more sense to rebase+merge instead of squash+merge - I read an interesting piece on this recently: https://gist.github.com/mitchellh/319019b1b8aac9110fcfb1862e0c97fb). This is kind of strange though. Even with
Not quite sure on this myself, but, a side-note that I think is relevant:
Back to the question - why does this not work with a remote cache? Because The mysteries remaining for me are:
|
Oh it's possible I misunderstood too. @marcosnils can you confirm whether the problem only happens when the local cache is empty and remote cache is enabled? Or does it happen when there is local cache from previous runs and the remote cache is enabled?
Ah sorry 馃槥 That PR in particular was just bad because it was very hard to split up into smaller chunks, but unsquashed merged would have also been a mess. In general I'm a fan of comments to clarify non-obvious things like this, but that's also my bad then since I should have left a comment on what was going on there. Of course always feel free to just ping me if something is not obvious from the history, but I'll try to keep this in mind more going forward (both comments and using git merge when commits are very cleanly separated).
I'm not following this, you just wouldn't set ReadOnly or ForceNoOutput when creating the mount in LLB in the integ test, right? Those are not the defaults in LLB or anything.
Do you have any more insight into the exact machinations here? I am familiar w/ the whole ReadOnly/ForceNoOutput behavior around content-based caching, but Basically the two places I see in buildkit where
I'm just missing the place where this would prevent there from being a fast cache map relative to any other llb op.
Right so some more background info if you aren't already aware:
No precise point to be made here, just making sure you're aware of all that
If this is indeed what it comes down to and buildkit considers the current behavior expected, I wonder if one option would be a tiny upstream change that allows LLB users to force use of slow content based caching even when read-only and forcenooutput are not set? Could be very trivial and provided the performance hit is not overly noticeable, totally fine for us. |
Ah, that's definitely me being unclear. I was referring to upstreams local sources, where it's the randomized session id here that means that the fast cache is effectively ignored (because the generated cache key contains it). However, digging further it definitely appears that buildkit is doing something odd - if I fix the session ID to be constant with a hack (otherwise it's randomized, and then it doesn't even match with local cache), then the fast-cache works, but only with the local cache. In an example kind of like: img := llb.Image("alpine:latest@sha256:34871e7290500828b39e22294660bee86d966bc0017544e848dd9a255cdf59e0")
img = img.Run(llb.Args([]string{"apk", "add", "curl"})).Root()
run := img.Run(llb.Args([]string{"/bin/sh", "-c", "sleep 10 && echo foo > /foo"}))
run.AddMount("/mount", llb.Local("context"))
img = run.Root()
dt, err := img.Marshal(context.TODO(), llb.LinuxAmd64)
if err != nil {
panic(err)
} (this is why it's tricky to do an integration test with local source - you'd need to fix the session id, which is how to emulate what our blob source is doing and setting a unique In this kind of scenario, I'm not 100% sure what the right behavior for upstream would be, but it feels like it should either:
Note that this scenario isn't actually possible with the dockerfile implementation (since we can often use content caching), and because the |
yes, the problem only happens when the local cache is empty and remote cache is enabled. Otherwise, it works as intended as local cache is used. |
Good to know, thanks for clarifying. That definitely makes more sense with Justin's findings so far; it seems that there may be a general issue w/ buildkit not being able to have a remote cache match w/ a Also been dm'ing w/ Justin about this, he found a relatively straightforward upstream change that might fix this general problem, but we need some more confirmation. And even if it doesn't work out for whatever reason we have more options available too. |
7af6f88
to
1e68b78
Compare
By prefixing the returned CacheKey with "session:" we activated the code path in buildkit that randomizes the digest to ensure that we don't get fast-cast hits (since session ids can be reused). However, unlike LocalSources, BlobSources are not tied to a specific session (after initializing them), so we actually want to avoid this behavior. With this change, digests aren't randomized, so we can get the fast-cache hits. However, this isn't fully correct - see the comments for details. Signed-off-by: Justin Chadwell <me@jedevc.com>
1e68b78
to
14c5e83
Compare
Ok, so to summarize:
I think there's a good argument for merging this PR change anyways (with a lot of comments and TODOs explaining.
I think that because of the Longer term we need to not upload the @sipsma any objections to taking this as a temporary hot-fix? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! double checked and it does seem like the other previous fix in #5885 should cover us in terms of the issues around laziness, so I agree this is good for a hotfix pending upstream fixes 馃帀
馃帀 thx everyone of the hard work on this one. Looking forward to see what we can do to surface some of these things so we can provide more visibility to Dagger users 馃憪 |
Fixes #6163 馃帀 (cc @gerhard @marcosnils)
By prefixing the returned CacheKey with "session:" we activated the code path in buildkit that randomizes the digest to ensure that we don't get fast-cast hits (since session ids can be reused). See https://github.com/moby/buildkit/blob/06c971ffb4d3873207fe8ff7672026a718784060/solver/llbsolver/ops/source.go#L91-L94.
However, unlike LocalSources, BlobSources are not tied to a specific session (after initializing them), so we actually want to avoid this behavior.
With this change, digests aren't randomized, so we can get the fast-cache hits.