Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshotting and sync could cause a dead lock TranslogWriter #18481

Merged
merged 2 commits into from May 20, 2016

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented May 20, 2016

#18360 introduced an extra lock in order to allow writes while syncing the translog. This caused a potential deadlock (see [1]) with snapshotting code where we first acquire the instance lock, followed by a sync (which acquires the syncLock). However, the sync logic acquires the syncLock first, followed by the instance lock.

I considered solving this by not syncing the translog on snapshot - I think we can get away with just flushing it. That however will create subtleties around snapshoting and whether operations in them are persisted. I opted instead to have slightly uglier code with nest synchronized, where the scope of the change is contained to the TranslogWriter class alone. If people feel differently, I can take the other approach.

[1]

  2> "elasticsearch[node_s1][index][T#1]" ID=8144 BLOCKED on org.elasticsearch.index.translog.TranslogWriter@7527aa29 owned by "elasticsearch[node_s1][generic][T#2]" ID=8079
  2>    at org.elasticsearch.index.translog.TranslogWriter.syncUpTo(TranslogWriter.java:225)
  2>    - blocked on org.elasticsearch.index.translog.TranslogWriter@7527aa29
  2>    - locked java.lang.Object@46936c8a

  2> "elasticsearch[node_s1][generic][T#2]" ID=8079 BLOCKED on java.lang.Object@46936c8a owned by "elasticsearch[node_s1][index][T#1]" ID=8144
  2>    at org.elasticsearch.index.translog.TranslogWriter.syncUpTo(TranslogWriter.java:219)
  2>    - blocked on java.lang.Object@46936c8a
  2>    at org.elasticsearch.index.translog.TranslogWriter.sync(TranslogWriter.java:151)
  2>    at org.elasticsearch.index.translog.TranslogWriter.newSnapshot(TranslogWriter.java:201)
  2>    - locked org.elasticsearch.index.translog.TranslogWriter@7527aa29

 elastic#18360 introduced an extra lock in order to allow writes while syncing the translog. This caused a potential deadlock with snapshotting code where we first acquire the instance lock, followed by a sync (which acquires the syncLock). However, the sync logic acquires the syncLock first, followed by the instance lock.

I considered solving this by not syncing the translog on snapshot - I think we can get away with just flushing it. That however will create subtleties around snapshoting and whether operations in them are persisted. I opted instead to have slightly uglier code with nest synchronized, where the scope of the change is contained to the TranslogWriter class alone.
@bleskes
Copy link
Contributor Author

bleskes commented May 20, 2016

@s1monw can you take a look?

@@ -19,6 +19,7 @@

package org.elasticsearch.index.translog;

import com.carrotsearch.randomizedtesting.annotations.Repeat;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aye - missing a checkbox in Intellij :D

@jasontedor
Copy link
Member

LGTM.

@bleskes bleskes merged commit 34ef530 into elastic:master May 20, 2016
@bleskes bleskes deleted the translog_snapshot_deadlock branch May 20, 2016 10:56
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v5.0.0-alpha3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants