New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translog base flushes can be disabled after replication relocation or slow recovery #15830

Closed
wants to merge 2 commits into
base: 2.x
from

Conversation

Projects
None yet
3 participants
@bleskes
Member

bleskes commented Jan 7, 2016

#10624 decoupled translog flush from ongoing recoveries. In the process, the translog creation was delayed to moment the engine is created (during recovery, after copying files from the primary). On the other side, TranslogService, in charge of translog based flushes, starts a background checker as soon as the shard is allocated. That checker performs it's first check after 5s expected the translog to be there. However, if the file copying phase of the recovery takes >5s (likely!) or local recovery is slow, the check can run into an exception and never recover. The end result is that the translog based flush is completely disabled.

Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing.

Also - this is already fixed in master, where we don't have this background check,

Closes #15814

Translog base flushes can be disabled after replication relocation or…
… slow recovery

#10624 decoupled translog flush from ongoing recoveries. In the process, the translog creation was delayed to moment the engine is created (during recovery, after copying files from the primary). On the other side, TranslogService, in charge of translog based flushes, starts a background checker as soon as the shard is allocated. That checker performs it's first check after 5s expected the translog to be there. However, if the file copying phase of the recovery takes >5s (likely!) or local recovery is slow, the check can run into an exception and never recover. The end result is that the translog based flush is completely disabled.

Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing.

Closes #15814
@@ -118,37 +124,64 @@ private TimeValue computeNextInterval() {
return new TimeValue(interval.millis() + (ThreadLocalRandom.current().nextLong(interval.millis())));
}
private class TranslogBasedFlush implements Runnable {
// public for testing
public TranslogBasedFlush createTranslogBasedFlush() {

This comment has been minimized.

@s1monw

s1monw Jan 7, 2016

Contributor

just make the ctor public?

This comment has been minimized.

@bleskes
@Override
public void run() {
public void onFailure(Throwable t) {
if (t instanceof EsRejectedExecutionException) {

This comment has been minimized.

@s1monw

s1monw Jan 7, 2016

Contributor

instead of this check just override public void onRejection(Throwable t)

This comment has been minimized.

@bleskes

bleskes Jan 7, 2016

Member

this also to catch rejections coming from the flush thread pool. See #asyncFlushAndReschedule . I can wrap things there. Probably clearer...

return new TranslogBasedFlush();
}
public class TranslogBasedFlush extends AbstractRunnable {
private volatile long lastFlushTime = System.currentTimeMillis();

This comment has been minimized.

@s1monw

s1monw Jan 7, 2016

Contributor

we could add a volatile boolean here to indicate that we are either running or are scheduled and assert on that when we close the shard? just and idea to ensure we don't miss anything along the lines?

This comment has been minimized.

@bleskes

bleskes Jan 7, 2016

Member

assert we are scheduled? I can check. I think there is a future somewhere...

This comment has been minimized.

@bleskes

bleskes Jan 7, 2016

Member

didn't really find a way to do this without changing too much in the semantics of the class and I'm also not sure how much it will buy us. Please ping to clarify, if need be.

IndexService test = indicesService.indexService("test");
IndexShard shard = test.shard(0);
shard.close("Unexpected close", false);
shard.state = IndexShardState.CREATED;

This comment has been minimized.

@s1monw

s1monw Jan 7, 2016

Contributor

evil :) I wonder if we should do something more elaborate here like:

IndicesService indicesService = getInstanceFromNode(IndicesService.class);
IndexService test = indicesService.indexService("test");
IndexShard shard = test.shard(0);
ShardRouting routing = new ShardRouting(shard.routingEntry());
test.removeShard(0, "b/c boaz says so");
IndexShard newShard = test.createShard(routing);
DiscoveryNode localNode = new DiscoveryNode("foo", DummyTransportAddress.INSTANCE, Version.CURRENT);
newShard.recovering("for testing", RecoveryState.Type.REPLICA, localNode);

TranslogService translogService = new TranslogService(....)
assertTrue(checker.maybeFlushAndReschedule());

This comment has been minimized.

@bleskes

bleskes Jan 7, 2016

Member

yeah, different kind of evil. I stole this one from another test here. Can try.. not sure if it adds much - except maybe that we can check more states like before the recovering call and also store/snapshot recovery etc. Tried to keep it simple :)

@s1monw

This comment has been minimized.

Contributor

s1monw commented Jan 7, 2016

left some comments - thanks boaz

@s1monw s1monw added the blocker label Jan 7, 2016

@bleskes

This comment has been minimized.

Member

bleskes commented Jan 7, 2016

@s1monw thanks. Pushed an update

@s1monw

This comment has been minimized.

Contributor

s1monw commented Jan 7, 2016

this looks great LGTM

bleskes added a commit that referenced this pull request Jan 7, 2016

Translog base flushes can be disabled after replication relocation or…
… slow recovery

Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing.

Closes #15814
Closes #15830

bleskes added a commit that referenced this pull request Jan 7, 2016

Translog base flushes can be disabled after replication relocation or…
… slow recovery

Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing.

Closes #15814
Closes #15830

bleskes added a commit that referenced this pull request Jan 7, 2016

Translog base flushes can be disabled after replication relocation or…
… slow recovery

Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing.

Closes #15814
Closes #15830

bleskes added a commit that referenced this pull request Jan 7, 2016

Translog base flushes can be disabled after replication relocation or…
… slow recovery

Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing.

Closes #15814
Closes #15830
@bleskes

This comment has been minimized.

Member

bleskes commented Jan 7, 2016

I pushed this to all the branched. Thanks @s1monw

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment