Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crash that caused by multiple "bors merge" commands before PR statuses pass #841

Merged
merged 1 commit into from Jan 6, 2020

Conversation

@ansonlc
Copy link
Contributor

ansonlc commented Jan 3, 2020

Context

We found this bug when people type "bors merge" multiple times before the PR statuses pass, and it causes bors to crash.

It is common for our developers to initiate a bors merge before all PR statuses have passed. Sometimes, they will invoke a second (or third) bors merge command on the same PR, whilst bors is still polling from the initial command.

When the statuses do pass, then we observe the following crash:

{%Ecto.InvalidChangesetError{
   action: :insert,
   changeset: #Ecto.Changeset<
     action: :insert,
     changes: %{
       batch_id: 10,
       patch_id: 44,
       reviewer: "some-dev"
     },
     errors: [patch_id: {"has already been taken", []}],
     data: #BorsNG.Database.LinkPatchBatch<>,
     valid?: false
   >
 },
 [
   {Ecto.Repo.Schema, :insert!, 4,
    [file: 'lib/ecto/repo/schema.ex', line: 128]},
   {BorsNG.Worker.Batcher, :run, 2,
    [file: 'lib/worker/batcher.ex', line: 245]},
   {BorsNG.Worker.Batcher, :handle_info, 2,
    [file: 'lib/worker/batcher.ex', line: 199]},
   {:gen_server, :try_dispatch, 4,
    [file: 'gen_server.erl', line: 637]},
   {:gen_server, :handle_msg, 6,
    [file: 'gen_server.erl', line: 711]},
   {:proc_lib, :init_p_do_apply, 3,
    [file: 'proc_lib.erl', line: 249]}
 ]}

We believe this is because the :prerun_poll loop is executing multiple times concurrently.

Steps to reproduce in dev

We were able to reproduce the error locally (via iex -S mix phx.server) by running the code steps described in steps_to_reproduce.md.txt

Proposed fix

Inside of the handle_info({:prerun_poll, timeout, args}, proj_id) loop, we do an additional check, to see if the patch is already running. This means the second (, third, etc) loops can finish gracefully and not invoking Batcher.run.

We were unable to do the check in do_handle_cast({:reviewed, patch_id, reviewer}, _project_id), because we do not know a way to check if a :prerun_poll loop is running in elixir.

Questions

  • is there way know if :prerun_poll is running when we enter do_handle_cast({:reviewed, patch_id, reviewer}, _project_id)?
  • is there a more elegant way to fix this issue?
…before PR statuses pass

We add an additional check in ":prerun_poll" to see if it is already running review
@notriddle

This comment has been minimized.

Copy link
Member

notriddle commented Jan 4, 2020

The elegant way to fix this would be to change the database schema. Right now, the status of a batch that is in its poll loop is the same as its status when it's in the background: :waiting.

If we added a database representation of the state, like call it :preflight, then it could be solved now elegantly.

@noizwaves

This comment has been minimized.

Copy link
Contributor

noizwaves commented Jan 6, 2020

Hey @notriddle, thank you for the feedback! We are definitely in favor of a more elegant solution when it is an option.

We're having some trouble seeing how the proposed addition of a Batch :preflight status
could be implemented. The original bug occurs when leaving the Patch preflight loop - at this point in the code, a Patch is not associated with a Batch, so this new status could not be checked.

It looks like what we need to store this new status of would be the Patch itself. But this model doesn't have a status field currently.

Any guidance you can provide would be greatly appreciated.

@notriddle

This comment has been minimized.

Copy link
Member

notriddle commented Jan 6, 2020

Yeah, that's true. If you're running into unexpected problems, I'm fine with merging this as-is. It's not that inelegant, adding a new patch column has lots of downsides, and, most importantly, we'd need to add a bunch of infrastructure to resume the preflight loop if the bot is restarted in the middle of it.

bors r+

bors bot added a commit that referenced this pull request Jan 6, 2020
Merge #841
841: Fix crash that caused by multiple "bors merge" commands before PR statuses pass r=notriddle a=ansonlc

## Context
We found this bug when people type "bors merge" multiple times before the PR statuses pass, and it causes bors to crash.

It is common for our developers to initiate a `bors merge` before all PR statuses have passed. Sometimes, they will invoke a second (or third) `bors merge` command on the same PR, whilst bors is still polling from the initial command.

When the statuses do pass, then we observe the following crash:
```
{%Ecto.InvalidChangesetError{
   action: :insert,
   changeset: #Ecto.Changeset<
     action: :insert,
     changes: %{
       batch_id: 10,
       patch_id: 44,
       reviewer: "some-dev"
     },
     errors: [patch_id: {"has already been taken", []}],
     data: #BorsNG.Database.LinkPatchBatch<>,
     valid?: false
   >
 },
 [
   {Ecto.Repo.Schema, :insert!, 4,
    [file: 'lib/ecto/repo/schema.ex', line: 128]},
   {BorsNG.Worker.Batcher, :run, 2,
    [file: 'lib/worker/batcher.ex', line: 245]},
   {BorsNG.Worker.Batcher, :handle_info, 2,
    [file: 'lib/worker/batcher.ex', line: 199]},
   {:gen_server, :try_dispatch, 4,
    [file: 'gen_server.erl', line: 637]},
   {:gen_server, :handle_msg, 6,
    [file: 'gen_server.erl', line: 711]},
   {:proc_lib, :init_p_do_apply, 3,
    [file: 'proc_lib.erl', line: 249]}
 ]}
```

We believe this is because the :prerun_poll loop is executing multiple times concurrently.

## Steps to reproduce in dev
We were able to reproduce the error locally (via `iex -S mix phx.server`) by running the code steps described in  [steps_to_reproduce.md.txt](https://github.com/bors-ng/bors-ng/files/4021099/steps_to_reproduce.md.txt)

## Proposed fix
Inside of the `handle_info({:prerun_poll, timeout, args}, proj_id)` loop, we do an additional check, to see if the patch is already running. This means the second (, third, etc) loops can finish gracefully and not invoking `Batcher.run`.

We were unable to do the check in `do_handle_cast({:reviewed, patch_id, reviewer}, _project_id)`, because we do not know a way to check if a `:prerun_poll` loop is running in elixir.

## Questions
- is there way know if `:prerun_poll` is running when we enter `do_handle_cast({:reviewed, patch_id, reviewer}, _project_id)`?
- is there a more elegant way to fix this issue?

Co-authored-by: Chao Lin <linchaoc@gmail.com>
@bors

This comment has been minimized.

Copy link
Contributor

bors bot commented Jan 6, 2020

Build succeeded

@bors bors bot merged commit d4b69be into bors-ng:master Jan 6, 2020
3 checks passed
3 checks passed
Community-TC (pull_request) TaskGroup: success
Details
Travis CI - Pull Request Build Passed
Details
bors Build succeeded
Details
@ansonlc ansonlc deleted the Gusto:cl-fix-multiple-bors-merge-command-bug branch Jan 6, 2020
notriddle added a commit to bors-ng/bors-ng.github.io that referenced this pull request Feb 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.