fix: per-agent snapshot clones for git mirrors#3789
Conversation
db1d936 to
1ef8708
Compare
1ef8708 to
3491f9e
Compare
3491f9e to
afc407b
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: afc407b351
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
afc407b to
8aea398
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8aea3980ab
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
8aea398 to
c5aebaf
Compare
There was a problem hiding this comment.
💡 Codex Review
agent/internal/job/checkout.go
Line 410 in c5aebaf
In the cache-miss branch, updateGitMirror clones mirrorDir and returns it directly instead of routing through snapshotMirror. That means the first build that populates a mirror (with CleanCheckout=true) still performs checkout/submodule clones against the mutable shared mirror, so a concurrent job can fetch/gc that mirror and invalidate alternates during this build—the same churn failure this change is meant to avoid. Returning a per-agent snapshot here would close that remaining race window.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
c5aebaf to
204fab0
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 204fab0334
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
2830ee2 to
fd1d36b
Compare
zhming0
left a comment
There was a problem hiding this comment.
I can understand the code change, I mainly have questions. I think it'd be better if we describe the failure domain and explain how this fix will change that.
| if !e.CleanCheckout { | ||
| return mirrorDir, nil | ||
| } |
There was a problem hiding this comment.
Is this fix only applicable to clean checkout?
There was a problem hiding this comment.
If clean checkout is disabled, the agent will try to reuse an existing checkout in later jobs for the same repo. That checkout could exist forever. If we used snapshots for that case, the checkout will refer to that snapshot created when the checkout was cloned, potentially forever. So we can't delete the snapshot, or the checkout would be corrupted. Since the snapshots and mirror will drift as the mirror is updated over time, the snapshots (that we can't delete) will start taking space. Plus if the mirror is updated, the snapshot won't be updated, so any update to the checkout will redundantly fetch from upstream - so we'd want more logic to avoid updating the mirror unless the checkout is a fresh clone.
There was a problem hiding this comment.
I see the rationale now though I wonder if it means a missed opportunities for all ephemeral agents?
There was a problem hiding this comment.
Perhaps... though I'm not entirely sure what you mean.
We can enable snapshotting for more cases later on?
de4d40a to
d796ca5
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: de4d40a600
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
zhming0
left a comment
There was a problem hiding this comment.
Looks good to me. Only have a question around if it can be used more broadly.
d796ca5 to
2128514
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2128514bba
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
2128514 to
05ba7c0
Compare
05ba7c0 to
169b5ee
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 169b5eed1b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Automatically remove it during teardown | ||
| e.cleanupDirs = append(e.cleanupDirs, snapshotDir) |
There was a problem hiding this comment.
Preserve snapshot lifetime across split job phases
snapshotMirror always queues the snapshot for teardown via cleanupDirs, but tearDown runs at the end of each executor invocation, including checkout-only runs. In phase-split environments (checkout and command in separate containers), this deletes the snapshot before the command phase starts; with git-mirror-checkout-mode=reference (the default), the checkout still relies on that snapshot through alternates, so later git operations in the command phase can fail with missing-object errors. This affects jobs with git mirrors + clean checkout enabled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actually a good point - guess we should disable snapshots in split-phase scenarios.
Description
Fix the Big Git Mirrors Problem a different way.
Context
https://linear.app/buildkite/issue/PS-1577
cc #2208
Changes
When updating a git mirror, the update now creates a snapshot of the mirror, which is just a bare clone in a neighbouring directory, before releasing the update file lock. Done this way, the snapshot should share the underlying Git objects via hardlinks, so it can be made quite quickly and with negligible extra storage (at least until the mirror starts drifting). The checkout (a reference clone) will thus not be affected if the mirror's object start drifting.
This only applies if clean checkout is enabled, because that makes it easy to decide when to clean up each snapshot. It's recommended to use clean checkouts with git mirrors. We can figure out the other cases later.
Also refactored
umasklogic from internal/artifact into internal/osutil to share with the new logic here.Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Disclosures / Credits
I did not use AI tools to create this PR - aside from Codex reviewing it (whether or not I want it to, but all changes in response to Codex's comments were still written by me).