Bors starts same batch multiple times concurrently, corrupts master branch, enqueues merged PRs again #875
This is mostly a follow-up to this forum post about an incident that happened with our production Bors deployment.
We are running Bors with
We received complaints from developers that their merged changes were not visible in the
Looking at the affected PRs in GitHub, we saw that Bors had commented with "Your PR was involved in a batch that had a merge conflict...", but we were unable to find the corresponding failed batch. We also found that the batch that ran after Batch A (let's call it Batch B; it reports a status of "Cancelled") contained the first 6 PRs from Batch A.
After investigating the squashed commits produced by Bors, we found that many of the commits included changes from other PRs. The unexpected changes were a mix of "merging" other PR commits and "reverting" other PR commits. After the final squashed commit for this Batch, more than half of the PRs in Batch A were technically "reverted" on our master branch.
After looking at the logs, we can see log messages that indicate 2 concurrent executions of
We've found that
We suspect that two
There can be a "long time" between when Bors picks the next waiting batch at the start of the poll handler and when the batch is marked as running towards the end of
We observe each iteration of the
This means that any other
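To make the suspected window concrete, here is a minimal, language-neutral sketch in Python. It is not Bors code: `Batch`, `pick_waiting`, and `claim` are hypothetical names invented for illustration. It shows two poll handlers that both pick the same waiting batch before either marks it as running, and how a check-and-set claim step would let only one of them proceed.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    # Hypothetical stand-in for a batch record; not the Bors schema.
    id: int
    state: str = "waiting"


def pick_waiting(batches):
    """Return the first batch still marked as waiting, or None."""
    for batch in batches:
        if batch.state == "waiting":
            return batch
    return None


def claim(batch):
    """Check-and-set: succeed only if the batch is still waiting.

    In a real system this step would have to be atomic, for example a
    conditional update inside a database transaction (an assumption here).
    """
    if batch.state != "waiting":
        return False
    batch.state = "running"
    return True


batches = [Batch(1)]

# Two handlers both pick before either one marks the batch as running.
first = pick_waiting(batches)
second = pick_waiting(batches)

# Without a guard, both handlers would go on to start the same batch.
unguarded_starts = [b.id for b in (first, second) if b is not None]

# With the guard, only the first claim succeeds.
guarded_starts = [b.id for b in (first, second) if claim(b)]
```

The longer the gap between picking and marking, the more likely a second handler observes the batch as still waiting. Whether a guard of this kind is the right fix for Bors is an open question in this thread.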
We are keen to fix this issue, but are unsure of how best to proceed. We have some ideas like:
How should we go about fixing this in the most idiomatic Bors/Elixir way?
Use the second one.
I can start out by pointing out a few other things:
A GenServer, such as Batcher, is single-threaded. Bors-NG should only ever have one Batcher per repository at a time, so it should not be possible for multiple things to run at once. If that invariant is being violated, it might be the cause of your bug. Check whether your logs include the ID of the GenServer, to rule out the possibility that you've got duplicate batchers.
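As a rough analogy for the single-threaded claim above, the following Python sketch models one process draining a mailbox. It is not how GenServer is implemented, and `Mailbox` is a hypothetical name. Because a single worker handles messages one at a time, two handlers can never overlap inside it; overlap would require a second worker, which corresponds to the duplicate-batcher case.

```python
import queue
import threading
import time


class Mailbox:
    """One worker thread that handles queued messages strictly in order."""

    def __init__(self):
        self._queue = queue.Queue()
        self.active = 0        # handlers currently running
        self.max_active = 0    # highest overlap ever observed
        self.handled = []      # messages in the order they were processed
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def send(self, message):
        self._queue.put(message)

    def _loop(self):
        while True:
            message = self._queue.get()
            if message is None:  # sentinel: stop the worker
                break
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            time.sleep(0.01)     # simulate a slow handler
            self.handled.append(message)
            self.active -= 1

    def stop(self):
        self._queue.put(None)
        self._thread.join()


mailbox = Mailbox()
for i in range(5):
    mailbox.send(f"poll-{i}")
mailbox.stop()
```

After `stop()` returns, `mailbox.max_active` is 1 and the messages appear in `mailbox.handled` in the order they were sent, even though each handler is slow.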
We are pretty sure they are running on the same GenServer (190), but would love to get your thoughts.
These are the log lines scoped to the time of our incident and on the 5 calls to
New lines were added after the "Commit Sha" lines to highlight the end of one
@notriddle Many thanks for the commit. We've been running this and may have detected another incident (attached are the logs).
We're still not 100% sure what to expect, but we were not expecting to see lines generated by
After spending some time learning more about GenServer and process naming (from https://www.brianstorti.com/process-registry-in-elixir/), we are wondering if it would be valuable to use the