Deadlock produced in Postgres when running migrations in parallel with a migration task containing CREATE INDEX CONCURRENTLY #1654
What version of Flyway are you using?
Reproduced in 4.1.0 and on master. Affects 4.1.0 and later.
Which client are you using? (Command-line, Java API, Maven plugin, Gradle plugin, SBT plugin, ANT tasks)
What database are you using (type & version)?
Reproduced on Postgres 9.5.x
What operating system are you using?
Linux / Mac
What did you do?
(Please include the content causing the issue, any relevant configuration settings, and the command you ran)
I attempted to create an index concurrently within a migration task. The migration task ran at startup in a multi-node service. One thread's pending acquisition of the advisory lock Flyway creates deadlocked against the in-progress migration on another thread, which was creating the index. I have reproduced the issue in your test suite with some modifications; the changes can be found on a branch in my fork, link below.
What did you expect to see?
As per the FAQ in the documentation, Flyway should be able to handle multiple nodes attempting to run migration tasks concurrently:
What did you see instead?
A deadlock and failure of the migration task. As the migration cannot run inside a transaction when it contains CREATE INDEX CONCURRENTLY, the failure leaves the database in an inconsistent state (with an index marked as invalid).
I am more than happy to provide a PR with a fix, but I'm unsure as to what the best way to fix this is. I figured I would create a bug with tests to reproduce the issue and we could discuss from there.
This is the actual error which occurs in Flyway:
The error in the Postgres logs (effectively the same information):
I haven't tested Postgres 9.6.x yet. In terms of whether this is even fixable, I think that the strategy for locking a schema for migration would need to be changed. You may be able to switch to using
I have pushed another commit to my branch which contains the test reproducing the issue. It switches out pg_advisory_lock for pg_try_advisory_lock with a naive polling implementation. The tests now pass.
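For illustration, here is a minimal sketch of that naive polling approach. Since `pg_try_advisory_lock` returns a boolean immediately instead of blocking, the caller can poll it in a loop. The supplier below stands in for the JDBC call (i.e. executing `SELECT pg_try_advisory_lock(...)` and reading the boolean result); the class name, method names, timeout, and poll interval are all illustrative, not Flyway's actual API.

```java
import java.util.function.BooleanSupplier;

// Sketch of polling pg_try_advisory_lock with a fixed interval and an overall
// timeout. tryAcquire stands in for the non-blocking JDBC lock attempt.
public final class AdvisoryLockPoller {

    /**
     * Polls tryAcquire until it succeeds or timeoutMillis elapses.
     * Returns true if the lock was acquired, false on timeout.
     */
    static boolean acquireWithPolling(BooleanSupplier tryAcquire,
                                      long timeoutMillis,
                                      long pollIntervalMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!tryAcquire.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // give up; the caller decides how to fail
            }
            Thread.sleep(pollIntervalMillis);
        }
        return true;
    }
}
```

Because each poll returns to the connection between attempts, a node waiting for the lock never sits inside a blocking `pg_advisory_lock` call that `CREATE INDEX CONCURRENTLY` on another node would have to wait out, which is what breaks the deadlock.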
If you are happy with me pursuing this approach, I'm happy to clean up the branch and create a PR.
@jhinch Thanks for investigating. Sounds great! Wouldn't it be better to spin indefinitely though? With your current implementation if a migration on a node that has acquired the lock takes more than 5 seconds other nodes will start failing due to being unable to acquire the lock. That doesn't sound right.
Yes, that part of my branch is very naive and was something I wanted to clean up. I think there are two options. Either spin indefinitely or have a configurable overall timeout. There would also be the question of whether it should have some sort of backoff while it waits (linear or exponential).
I'm happy to code up the solution with an indefinite spin. The interrupt exception on the sleep can be used to terminate the process if the consumer would prefer it to time out.
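A sketch of what that indefinite-spin variant with exponential backoff might look like (all names and backoff parameters are illustrative, not a proposed Flyway API). There is no overall timeout: the loop spins until the lock is acquired, and the `InterruptedException` from `sleep()` propagates, so a consumer that prefers a timeout can interrupt the thread to abort the wait.

```java
import java.util.function.BooleanSupplier;

// Sketch of an indefinite spin with capped exponential backoff. tryAcquire
// again stands in for the non-blocking pg_try_advisory_lock attempt.
public final class IndefiniteLockSpin {

    /** Spins until tryAcquire succeeds; returns the number of attempts made. */
    static int acquireIndefinitely(BooleanSupplier tryAcquire)
            throws InterruptedException {
        long backoffMillis = 50;              // initial poll interval
        final long maxBackoffMillis = 5_000;  // cap the exponential growth
        int attempts = 1;
        while (!tryAcquire.getAsBoolean()) {
            Thread.sleep(backoffMillis);
            backoffMillis = Math.min(backoffMillis * 2, maxBackoffMillis);
            attempts++;
        }
        return attempts;
    }
}
```

Capping the backoff keeps a long-held lock from pushing the poll interval so high that waiting nodes react slowly once the lock frees up.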