
Bors starts same batch multiple times concurrently, corrupts master branch, enqueues merged PRs again #875

Open
noizwaves opened this issue Feb 13, 2020 · 9 comments

@noizwaves (Contributor) commented Feb 13, 2020

This is mostly a follow-up to this forum post about an incident that happened with our production Bors deployment.

Setup

We are running Bors with use_squash_merge = true

Problem

We received complaints from developers that their merged changes were not visible in the master branch, despite their PRs being marked as "Merged by Bors". The affected PRs were in Batch A, which reports a status of "Succeeded".

Looking at the affected PRs in GitHub, we saw that Bors had commented with "Your PR was involved in a batch that had a merge conflict...", but we were unable to find the corresponding failed batch. We also found that the batch that ran after A (call it Batch B; it reports a status of "Cancelled") contained the first 6 PRs from Batch A.

After investigating the squashed commits produced by Bors, we found that many of them included changes from other PRs. The unexpected changes were a mix of "merging" other PRs' commits and "reverting" other PRs' commits. After the final squashed commit for this batch, more than half of the PRs in Batch A were effectively "reverted" on our master branch.

Cause

In the logs, we can see messages that indicate two concurrent executions of Batcher.start_waiting_batch/1. Looking up the call stack, we see that this function is called from Batcher.poll_/1, which handles :poll messages.

We've found that :poll messages are sent under the following conditions:

  1. every 30 minutes, by Batcher.handle_info/2
  2. after certain batch state changes (via Batcher.poll_after_delay/1 calls), namely:
    1. when a batch fails to start due to a merge conflict
    2. when a batch successfully builds
    3. when a batch build fails

We suspect that two :poll messages were sent in close succession, causing the same waiting batch to be started concurrently; the resulting race condition corrupted the temporary staging branch.
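For context, here is a minimal sketch (hypothetical module names and delay values, not the actual bors-ng source) of how the two :poll sources above could both be armed with Process.send_after/3, which is why two messages can land close together:

```elixir
defmodule Sketch.Batcher do
  use GenServer

  @poll_interval :timer.minutes(30)

  def init(state) do
    Process.send_after(self(), :poll, @poll_interval)
    {:ok, state}
  end

  # Source 1: the periodic 30-minute timer, which re-arms itself on every poll.
  def handle_info(:poll, state) do
    state = poll_(state)
    Process.send_after(self(), :poll, @poll_interval)
    {:noreply, state}
  end

  # Source 2: an extra timer armed after certain batch state changes (the
  # delay here is an assumption). Nothing cancels or deduplicates it against
  # the periodic timer, so two :poll messages can arrive in close succession.
  def poll_after_delay(state) do
    Process.send_after(self(), :poll, :timer.seconds(1))
    state
  end

  # Placeholder for the real poll/batch-starting logic.
  defp poll_(state), do: state
end
```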

Root Cause

There can be a "long time" between when Bors picks the next waiting batch at the start of the poll handler and when the batch is marked as running towards the end of start_waiting_batch here. Log timestamps indicate this gap was at least 2 minutes during our incident.

We observed that each iteration of the staging.tmp2 squash-and-merge loop took 15 seconds.

This means that any other :poll message received during this window would have started merging the same batch again.
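A simplified sketch of that window (hypothetical helper names, not the bors-ng implementation): the batch record stays "waiting" for the whole slow squash/merge phase, so any :poll handled in the meantime selects it again:

```elixir
defmodule Sketch.RaceWindow do
  # Simplified sketch of the race window described above; helper names are
  # illustrative only.

  def handle_info(:poll, project_id) do
    case next_waiting_batch(project_id) do
      nil ->
        :ok

      batch ->
        # Minutes of GitHub API work happen here: squash each PR, push
        # staging.tmp / staging.tmp2, push staging ...
        merge_and_push_to_staging(batch)

        # ... and only now does the batch become "running". Any :poll handled
        # in the meantime saw the same batch as "waiting" and started merging
        # it a second time.
        mark_running(batch)
    end

    {:noreply, project_id}
  end

  # Stubs so the sketch compiles; the real work lives in bors-ng.
  defp next_waiting_batch(_project_id), do: nil
  defp merge_and_push_to_staging(batch), do: batch
  defp mark_running(batch), do: batch
end
```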

Next Steps

We are keen to fix this issue, but are unsure of how best to proceed. We have some ideas like:

  • including a random seed in temp staging branch names to prevent cross-execution corruption
  • adding a new state "Starting" that represents when the squashing/merging is happening (in between "Waiting" and "Running"); see the sketch below
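A minimal sketch of that second idea, with hypothetical field and function names rather than the real bors-ng schema: the batch is claimed atomically before the slow work starts, so a concurrent :poll skips it:

```elixir
defmodule Sketch.StartingState do
  # Sketch only: claim the batch before doing the slow squash/merge work so
  # that a concurrent :poll no longer sees it as waiting.

  def start_waiting_batch(batch) do
    case claim_batch(batch, from: :waiting, to: :starting) do
      {:ok, batch} ->
        merge_and_push_to_staging(batch)   # the slow part
        mark_running(batch)                # :starting -> :running

      {:error, :already_claimed} ->
        # Another :poll got here first; skip instead of double-merging.
        :ok
    end
  end

  # Stubs so the sketch compiles; in bors-ng these would presumably be Ecto
  # updates guarded by the current state (or similar).
  defp claim_batch(batch, from: _from, to: _to), do: {:ok, batch}
  defp merge_and_push_to_staging(batch), do: batch
  defp mark_running(batch), do: batch
end
```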

How should we go about fixing this in the most Bors/Elixir-idiomatic way?

@notriddle (Member) commented Feb 13, 2020

adding a new state "Starting" that represents when the squashing/merging is happening (in between "Waiting" and "Running")

Use the second one.

The staging.tmp branch name is hardcoded into a bunch of people's config files, so we really don't want to change it.

@noizwaves (Contributor, Author) commented Feb 13, 2020

@notriddle thanks for the quick reply! If you have other approaches or solutions to this, we'd love to hear them too.

@notriddle (Member) commented Feb 13, 2020

I can start out by pointing out a few other things:

A GenServer, such as Batcher, is single-threaded. Bors-NG should only ever have one Batcher per repository at once, so it should not actually be possible for multiple things to run at the same time, but this might be the cause of your bug. Check whether your logs include the ID of the GenServer, to rule out the possibility that you've got duplicate Batchers.

However, the :poll timers themselves act about the same as something like setTimeout in JavaScript, meaning there can be any number of timers in flight at one time for the same server. If more than one timer goes off at once, they will queue up (again, the same way setTimeout in Node does).
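A small self-contained example of that timer behaviour (illustrative only, not bors-ng code): two Process.send_after/3 timers for the same process simply queue two :poll messages, which the GenServer then handles one at a time:

```elixir
defmodule Sketch.TimerQueue do
  use GenServer
  # Each Process.send_after/3 call creates an independent timer, so several
  # :poll messages can be in flight for the same GenServer at once.

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok)

  def init(:ok) do
    # Two timers armed for (almost) the same moment.
    Process.send_after(self(), :poll, 100)
    Process.send_after(self(), :poll, 100)
    {:ok, 0}
  end

  def handle_info(:poll, count) do
    # Both messages are queued in the mailbox and handled sequentially by
    # this single process, never in parallel.
    IO.puts("handling :poll ##{count + 1}")
    {:noreply, count + 1}
  end
end
```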

@noizwaves (Contributor, Author) commented Feb 13, 2020

We are pretty sure they are running on the same GenServer (190), but would love to get your thoughts.

These are the log lines scoped to the time of our incident and to the 5 calls to Logger.info in Batcher.start_waiting_merged_batch:
bors-ng_2020-02-06_223000_2020-02-06_225900.log.

Newlines were added after the "Commit Sha" lines to highlight the end of one Enum.reduce cycle.

@notriddle (Member) commented Feb 14, 2020

I'm not sure that's actually the PID (the PID normally has the format <123.123.123>). I'm going to commit this; it'll help. #876

@noizwaves (Contributor, Author) commented Feb 21, 2020

@notriddle Many thanks for the commit. We've been running this and may have detected another incident (logs attached).

We're still not 100% sure what to expect, but we were not expecting to see lines generated by <0.387.0> and <0.388.0> interleaved like this.

bors-ng_2020-02-14_201143_2020-02-21_201143.log

@noizwaves (Contributor, Author) commented Feb 21, 2020

After spending some time learning more about GenServer and process naming (from https://www.brianstorti.com/process-registry-in-elixir/), we are wondering if it would be valuable to use the {:via, module, term} name registration to prevent a second Batcher process from being started.
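A minimal sketch of what that could look like, here backed by Elixir's built-in Registry with illustrative names (bors-ng might use its own registry module instead, as discussed below):

```elixir
defmodule Sketch.RegisteredBatcher do
  use GenServer
  # {:via, module, term} registration: a second start_link/1 for the same
  # project_id returns {:error, {:already_started, pid}} instead of spawning
  # a duplicate Batcher.

  def start_link(project_id) do
    GenServer.start_link(__MODULE__, project_id, name: via(project_id))
  end

  defp via(project_id),
    do: {:via, Registry, {Sketch.BatcherRegistry, project_id}}

  def init(project_id), do: {:ok, project_id}
end

# Somewhere in the supervision tree:
#   {Registry, keys: :unique, name: Sketch.BatcherRegistry}
```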

@noizwaves (Contributor, Author) commented Feb 21, 2020

@notriddle it looks like {:via, module, term} requires a "Registry" module as well, and BorsNG.Worker.Batcher.Registry could already have all the required state to fulfill the contract :)
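For reference, a toy Agent-backed module (illustrative only) exporting the four callbacks a {:via, module, term} registry module must implement; whether BorsNG.Worker.Batcher.Registry's existing state can back this contract is the open question:

```elixir
defmodule Sketch.ToyViaRegistry do
  # Toy implementation of the {:via, module, term} callback contract, backed
  # by an Agent holding a %{key => pid} map. Illustrative only.

  def start_link do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Must return :yes if the name was registered, :no if it is already taken.
  def register_name(key, pid) do
    Agent.get_and_update(__MODULE__, fn map ->
      if Map.has_key?(map, key),
        do: {:no, map},
        else: {:yes, Map.put(map, key, pid)}
    end)
  end

  def unregister_name(key),
    do: Agent.update(__MODULE__, &Map.delete(&1, key))

  # Must return the registered pid, or :undefined if nothing is registered.
  def whereis_name(key),
    do: Agent.get(__MODULE__, &Map.get(&1, key, :undefined))

  def send(key, msg) do
    case whereis_name(key) do
      :undefined ->
        exit({:badarg, {key, msg}})

      pid ->
        Kernel.send(pid, msg)
        pid
    end
  end
end
```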

@notriddle (Member) commented Feb 22, 2020

Yeah, I'd happily accept a pull request that switched it over to using that.

bors bot added a commit that referenced this issue Feb 22, 2020
Merge #883
883: Validate that there is only one batcher every time we poll r=notriddle a=notriddle

Part of #875

Co-authored-by: Michael Howell <michael@notriddle.com>