Skip to content

fix(lfs): extend GitHub token refresh buffer + add LFS fetch timeout#322

Merged
jrobotham-square merged 3 commits into
mainfrom
extend-token-buffer-add-lfs-timeout
Jun 1, 2026
Merged

fix(lfs): extend GitHub token refresh buffer + add LFS fetch timeout#322
jrobotham-square merged 3 commits into
mainfrom
extend-token-buffer-add-lfs-timeout

Conversation

@jrobotham-square
Copy link
Copy Markdown
Contributor

@jrobotham-square jrobotham-square commented Jun 1, 2026

Problem

LFS snapshot jobs have been failing with repeated Bad credentials from the
GitHub LFS batch API. Individual jobs have been observed running for 12–26
hours
before something finally killed them:

git lfs fetch: Fetching reference refs/heads/main
batch response: Bad credentials
batch response: Bad credentials
... x6
error: failed to fetch some objects from '.../info/lfs'
: exit status 2, elapsed=25h58m40s

Root cause

GitHub App installation tokens have a fixed 1 h server-side TTL (not
configurable). The TokenManager cache served tokens until they had only
RefreshBuffer = 5m of validity remaining. That meant an LFS fetch starting
near a cache boundary could be handed a token with as little as ~5 minutes of
life left. If the fetch ran longer than that, the token expired mid-flight.
git-lfs's batch endpoint then retried with the same expired token (it's
baked into credential.helper as a literal), turning every retry into a 401,
and there was no per-subprocess timeout to stop the retry storm.

Fix

Two coordinated changes that make lfsFetchTimeout < RefreshBuffer an
airtight invariant:

  1. internal/githubapp/config.go — bump RefreshBuffer from 5m to
    30m. Every token handed out now has at least 30 m of validity remaining.

  2. internal/strategy/git/snapshot.go — wrap git lfs fetch in
    context.WithTimeout(ctx, 25*time.Minute). Bounds the retry-storm
    runaway and guarantees no subprocess can outlive the token baked into
    its environment.

The 25 m < 30 m invariant is the airtight piece: the longest possible LFS
subprocess is shorter than the minimum remaining validity of any token it
could be handed, so a "Bad credentials" from token expiry should now be
impossible by construction on the LFS path.

Why not a refresh-mid-subprocess approach?

That alternative (file-backed credential helper + background refresh
goroutine) is much more invasive — new shell-form helper, temp-file
lifecycle, refresh goroutine, symlink-safety, ~150 LOC + tests. It only
buys us coverage of subprocesses that legitimately run longer than 1 h,
which our metrics don't show.

@jrobotham-square jrobotham-square requested a review from a team as a code owner June 1, 2026 01:12
@jrobotham-square jrobotham-square requested review from stuartwdouglas and removed request for a team June 1, 2026 01:12
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d42735e155

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/strategy/git/snapshot.go
LFS snapshot jobs were observed failing with repeated 'Bad credentials'
from the GitHub LFS batch API, with single jobs churning for 12-26 hours
before giving up.

Root cause: a GitHub App installation token has a fixed 1 h server-side
TTL, and the TokenManager cache served tokens until they had only 5 m of
validity remaining. An LFS fetch that started near a cache boundary and
ran longer than 5 m would exhaust its token mid-flight; git-lfs's
internal batch retries then re-used the same expired token, producing a
retry storm that ran until something else killed the subprocess.

Two coordinated changes:

- internal/githubapp/config.go: RefreshBuffer 5m -> 30m. Every token
  handed out now has at least 30 m of validity remaining.

- internal/gitclone/manager.go, internal/strategy/git/snapshot.go: new
  LFSFetchTimeout (default 25m) wrapping 'git lfs fetch'. Bounds the
  retry-storm runaway and keeps subprocess lifetime below RefreshBuffer
  so a baked-in token can't expire mid-fetch.

The 25m < 30m invariant is the airtight piece: the longest possible
LFS subprocess (25 m) is shorter than the minimum remaining validity
of any token it could be handed (30 m).

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5
@jrobotham-square jrobotham-square force-pushed the extend-token-buffer-add-lfs-timeout branch from d42735e to 9575ac8 Compare June 1, 2026 01:15
git-lfs spawns transfer helpers that inherit our stdout/stderr pipes, so
killing only the top-level git process can leave CombinedOutput blocked
indefinitely. Same pattern used by manager.go for clone/fetch.

Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5
Co-authored-by: Amp <amp@ampcode.com>
@jrobotham-square
Copy link
Copy Markdown
Contributor Author

Good catch — fixed in 1cb340a. Applied the same Setpgid + Cancel-with-syscall.Kill(-pid, SIGKILL) pattern the clone/fetch paths in manager.go already use, so the 25 m deadline reliably tears down git-lfs and its transfer helpers.

@jrobotham-square jrobotham-square merged commit 82f252d into main Jun 1, 2026
8 checks passed
@jrobotham-square jrobotham-square deleted the extend-token-buffer-add-lfs-timeout branch June 1, 2026 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants