Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix detection of missing commits on checkout #2322

Merged
merged 1 commit into from Aug 28, 2023

Conversation

goodspark
Copy link
Contributor

I upgraded from 3.31 to 3.52 and started seeing 'reference is not a tree' errors in CI. Something in the Buildkite code was not catching this properly and causing the agent to recreate the repo.

In my case, this was being caused by force pushes to branches with open PRs happening soon after a previous push.

$ git fetch -v --prune -- origin refs/pull/123/head
...
From github.com:my/repo
 * branch                    refs/pull/123/head -> FETCH_HEAD
$ git checkout -f 7890000000
fatal: reference is not a tree: 7890000000
⚠️ Warning: Checkout failed! checking out commit "7890000000": exit status 128 (Attempt 1/3 Retrying in 2s)
$ git clone -v -- git@github.com:my/repo.git .
Cloning into '.'...
...
$ git clean -fdqx -e /.local
$ git fetch -v --prune -- origin refs/pull/123/head

Based on #2286, I assume something similar is happening - the output of the process isn't being properly captured.

I upgraded from 3.31 to 3.52 and started seeing 'reference is not a
tree' errors in CI. Something in the Buildkite code was not catching
this properly and causing the agent to recreate the repo.

In my case, this was being caused by force pushes to branches with open
PRs happening soon after a previous push.

```
$ git fetch -v --prune -- origin refs/pull/123/head
...
From github.com:my/repo
 * branch                    refs/pull/123/head -> FETCH_HEAD
$ git checkout -f 7890000000
fatal: reference is not a tree: 7890000000
⚠️ Warning: Checkout failed! checking out commit "7890000000": exit status 128 (Attempt 1/3 Retrying in 2s)
$ git clone -v -- git@github.com:my/repo.git .
Cloning into '.'...
...
$ git clean -fdqx -e /.local
$ git fetch -v --prune -- origin refs/pull/123/head
```

Based on buildkite#2286, I assume
something similar is happening - the output of the process isn't being
properly captured.
Copy link
Contributor

@triarius triarius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @goodspark. I had a play around on v3.31, and I doubt the old code ever detected the error 🤷‍♂️ But I could only simulate git writing this error message, I could not reproduce it, even when I force pushed on PRs builds.

So it seems there is some regression in the agent between v3.31 and now that causes git to get into this state. So, to help us debug that, can you link us to some builds - use support@buildkite.com if you would like you keep the links private.

@triarius triarius merged commit b2f45d2 into buildkite:main Aug 28, 2023
1 check passed
@goodspark goodspark deleted the fix-missing-commit branch August 28, 2023 15:32
@goodspark
Copy link
Contributor Author

The old code detected the error fine, as this often happened for me and the job would just exit with an error instead of trying to clean and reclone the repo (which, given the size of the repo I'm working with, started causing time outs and job backups because of increased job times). There is no corrupted state IMO - this is just caused by force pushes to a branch I think.

@triarius
Copy link
Contributor

triarius commented Aug 29, 2023

Thanks for the links to the logs, and the additional context. I think I see what's going on.

This code in v3.31:

agent/bootstrap/git.go

Lines 47 to 54 in eade03a

if err = sh.Run("git", commandArgs...); err != nil {
if strings.HasPrefix(err.Error(), `fatal: reference is not a tree: `) {
return &gitError{error: err, Type: gitErrorCheckoutReferenceIsNotATree}
}
return &gitError{error: err, Type: gitErrorCheckout}
}
return nil

was not (and I suspect ever) detecting the that git exited with an error and printed "fatal: reference is not a tree: ". It did not return a gitError of type gitErrorCheckoutReferenceIsNotATree, but would instead return a one type gitErrorCheckout.

But, the code in

agent/bootstrap/bootstrap.go

Lines 1052 to 1063 in eade03a

if ge, ok := err.(*gitError); ok {
switch ge.Type {
// These types can fail because of corrupted checkouts
case gitErrorClone:
case gitErrorClean:
case gitErrorCleanSubmodules:
// do nothing, this will fall through to destroy the checkout
default:
return err
}
}

did not wipe the checkout directory for either gitErrorCheckout or gitErrorCheckoutReferenceIsNotATree. So in v3.31 you were seeing that it did not wipe the checkout directory when you force pushed.

This situation changed in v3.48 which included this PR: https://github.com/buildkite/agent/pull/2137/files which introduced a new error type, gitCheckoutRetryClean which would cause the checkout directory to be wiped. This is the code:

agent/bootstrap/git.go

Lines 58 to 70 in f7a91c2

if err := sh.Run(ctx, "git", commandArgs...); err != nil {
if strings.HasPrefix(err.Error(), "fatal: reference is not a tree: ") {
return &gitError{error: err, Type: gitErrorCheckoutReferenceIsNotATree}
}
// 128 is extremely broad, but it seems permissions errors, network unreachable errors etc, don't result in it
if exitErr := new(exec.ExitError); errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
return &gitError{error: err, Type: gitErrorCheckoutRetryClean}
}
return &gitError{error: err, Type: gitErrorCheckout}
}
return nil

As you can see, it was intended that an exit code of 128 with fatal: reference is not a tree: would be returned earlier if it was detected. But at the time, we were not aware that the mechanism to detect it did not work.

This PR will fix the detection mechanism, so I think the previous behaviour is restored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants