Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail engine without marking it as corrupt when recovering on SharedFS #11933

Closed
wants to merge 2 commits into from

Conversation

@areek
Copy link
Contributor

commented Jun 30, 2015

Currently when an engine is failed, it is marked as corrupted regardless of
the failure type. This change marks the engine as corrupted only when the failure
is caused by an actual index corruption. When an engine is failed for other
reasons, the engine is only closed without removing the shard state.

closes #11788

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jun 30, 2015

LGTM

@bleskes

This comment has been minimized.

Copy link
Member

commented Jun 30, 2015

I went through this and I think things are a bit confusing. We now sometimes leave a corruption marker the engine's index.engine.Engine#failEngine and we sometimes do it in .IndexShard.ShardEngineFailListener#onFailedEngine (which does not override the previous one, but you have to check store.markStoreCorrupted() to know that).

Also the listener signature is a bit confusing when used out side of the indexShard (do we need to mark it as corrupted in the IndicesClusterService class or not?) :

onFailedEngine(ShardId shardId, String reason, boolean markAsCorrupted, @Nullable Throwable failure)

I suggest the following:

Change index.engine.Engine#failEngine(java.lang.String, boolean, java.lang.Throwable) to always listen to it's markAsCorrupted parameter (don't introspect the exception type).

Change index.engine.Engine#failEngine(java.lang.String, java.lang.Throwable) to introspect the throwable it got and set the markAsCorrupted flag accordingly when calling the failEngine overload.

Remove the logic to set the corruption marker in index.shard.IndexShard.ShardEngineFailListener listener and also the markAsCorrupted from the listener.

@bleskes
bleskes reviewed Jun 30, 2015
View changes
core/src/main/java/org/elasticsearch/index/shard/IndexShard.java Outdated
/**
* Fails the shard without marking it as corrupted
*/
public void softFailShard(String reason, Throwable e) {

This comment has been minimized.

Copy link
@bleskes

bleskes Jun 30, 2015

Member

can we just make it an overload of failShard with a boolean, like in the engine? . I don't know what soft means :)

Also if we do that, let's change failShard(string reason, throwable e) to just call failShard(reason, true, e) ...

@areek areek force-pushed the areek:fix/soft_engine_fail branch Jun 30, 2015
@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jun 30, 2015

Updated PR:

  • IndexShard/Engine#failEngine(String reason, Throwable failure) marks the engine as corrupted
  • IndexShard/Engine#failEngine(String reason, boolean markAsCorrupted, Throwable failure) allows optional corruption marking
  • failEngine(String reason, boolean markAsCorrupted, Throwable failure) is only used in SharedFSRecoverySourceHandler
@bleskes
bleskes reviewed Jul 1, 2015
View changes
core/src/main/java/org/elasticsearch/index/engine/Engine.java Outdated
@@ -538,6 +548,13 @@ public void failEngine(String reason, Throwable failure) {
logger.warn("failed engine [{}]", failure, reason);
// we must set a failure exception, generate one if not supplied
failedEngine = failure;
if (markAsCorrupted && markedAsCorrupted == false) {

This comment has been minimized.

Copy link
@bleskes

bleskes Jul 1, 2015

Member

what what ? it took me 10s seconds to seed the ed in marked but also, why do need to do it twice? can't we just always do it in the finally clause?

This comment has been minimized.

Copy link
@areek

areek Jul 1, 2015

Author Contributor

My bad, now we just do it in the finally clause

@bleskes
bleskes reviewed Jul 1, 2015
View changes
core/src/main/java/org/elasticsearch/index/engine/Engine.java Outdated
try {
store.markStoreCorrupted(ExceptionsHelper.unwrapCorruption(failure));
store.markStoreCorrupted(unwrapCorruption("failed engine (reason: [" + reason + "])", failure));

This comment has been minimized.

Copy link
@bleskes

bleskes Jul 1, 2015

Member

I think we can skip the unwrapCorruption and always use the IOException which captures the reason as well.

This comment has been minimized.

Copy link
@areek

areek Jul 1, 2015

Author Contributor

done

@bleskes
bleskes reviewed Jul 1, 2015
View changes
core/src/main/java/org/elasticsearch/index/shard/IndexShard.java Outdated
@@ -746,6 +746,14 @@ public void failShard(String reason, Throwable e) {
engine().failEngine(reason, e);
}

/**
* Fails the shard, <code>markAsCorrupted</code> determines if
* the engine is marked as corrupted

This comment has been minimized.

Copy link
@bleskes

bleskes Jul 1, 2015

Member

"will be marked as corrupted" ?

This comment has been minimized.

Copy link
@areek

areek Jul 1, 2015

Author Contributor

changed

@@ -746,6 +746,14 @@ public void failShard(String reason, Throwable e) {
engine().failEngine(reason, e);

This comment has been minimized.

Copy link
@bleskes

bleskes Jul 1, 2015

Member

can we use failShard(reason, true, e) here? will be clear what it does. Also change the java docs please.

This comment has been minimized.

Copy link
@areek

areek Jul 1, 2015

Author Contributor

changed.

}
for (Engine.FailedEngineListener listener : delegates) {
try {
listener.onFailedEngine(shardId, reason, failure);

This comment has been minimized.

Copy link
@bleskes

bleskes Jul 1, 2015

Member

nice :)

@bleskes

This comment has been minimized.

Copy link
Member

commented Jul 1, 2015

left some minor comments... thx @areek

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 1, 2015

Updated PR:

  • incorporate feedback
  • hitting OOM will not mark the engine as corrupt
@bleskes

This comment has been minimized.

Copy link
Member

commented Jul 2, 2015

LGTM!

@areek areek force-pushed the areek:fix/soft_engine_fail branch Jul 2, 2015
@kimchy

This comment has been minimized.

Copy link
Member

commented Jul 3, 2015

I have been thinking about it, what happens for cases where we are running out of disk space and such? The shard is not corrupted in such cases. Also, the commit now is different than what is described, right? We now almost always mark the shard as corrupted for any failure?

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jul 6, 2015

I have been thinking about it, what happens for cases where we are running out of disk space and such? The shard is not corrupted in such cases. Also, the commit now is different than what is described, right? We now almost always mark the shard as corrupted for any failure?

so I think it's good practice to tombstone a failed engine in general. I guess it's more a naming question rather than if we should actually should do it. I guess we should rename Store.markAsCorrupted into Store#markAsFailed. We can extend this entire procedure an make the store marker a real structure and add a reason and if it's fatal. so I think it's the right thing todo.

regarding the out of disk problem I guess we should fallback to deleting the shard state in that case.

@kimchy

This comment has been minimized.

Copy link
Member

commented Jul 7, 2015

I like the idea of marking a shard as failed and keeping a tombstone for the reason of failure, not calling it corruption. I guess we will need to know if it was because of corruption or not, we can have it as a woolen flag on mark as failed so we can keep it around when we load it?

I wonder if we should really remove the shard state at all? if it is marked, it is marked as failed, and can be a reason of corruption. If someone is running with 0 replicas, and they run out of disk space, the shard should be able to recover without external intervention?

@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2015

Why not mark shard as failed for all IOE and ISE thrown by the engine, except for "Out of disk space" exception? For OOM and "Out of disk space" exception, close the engine without removing the shard state, this way if there is any TransportException causing an engine failure (like recovery failure on SharedFS), it will just close the engine instead of marking it as failed. Thoughts?

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jul 7, 2015

I think we should for now only mark the shard as fail (corrupted) if it is really corrupted. the problem of having a broken engine even after a simple IO exception is from the days where we had no checkpoints in translog etc. so I think we can simply remove the boolean entirely and only mark as corrupted if it's really corrupted. all the otehr cases it's fine to just fail the engine.

@kimchy

This comment has been minimized.

Copy link
Member

commented Jul 7, 2015

@s1monw +1, I think this change is great, just remove the boolean corrupt flag, and use the throwable to decide if we want to write a corruption marker or not. We can do the other mark failed suggestions in a different change.

@areek areek force-pushed the areek:fix/soft_engine_fail branch Jul 7, 2015
…ess of

the failure type. This change marks the engine as corrupted only when the failure
is caused by an actual index corrruption. When an engine is failed for other
reasons, the engine is only closed without removing the shard state.

closes #11788
@areek areek force-pushed the areek:fix/soft_engine_fail branch to 5ba5725 Jul 7, 2015
@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2015

Updated PR:

  • mark engine as corrupted only when the index is corrupted
  • never delete shard state in case of engine failure, close the engine instead
  • close the engine if there is an "out of disk space" exception
@areek areek force-pushed the areek:fix/soft_engine_fail branch Jul 7, 2015
@bleskes
bleskes reviewed Jul 7, 2015
View changes
core/src/main/java/org/elasticsearch/index/engine/Engine.java Outdated
assert failure != null;
/**
* fail engine due to some error. the engine will also be closed.
* The engine is marked corrupted iff failure is caused by index corruption

This comment has been minimized.

Copy link
@bleskes

bleskes Jul 7, 2015

Member

engine -> underlying store.

@bleskes
bleskes reviewed Jul 7, 2015
View changes
core/src/main/java/org/elasticsearch/index/engine/Engine.java Outdated
@@ -557,7 +559,7 @@ protected boolean maybeFailEngine(String source, Throwable t) {
failEngine("corrupt file detected source: [" + source + "]", t);
return true;
} else if (ExceptionsHelper.isOOM(t)) {
failEngine("out of memory", t);
failEngine("out of memory detected source: [" + source + "]", t);

This comment has been minimized.

Copy link
@bleskes

bleskes Jul 7, 2015

Member

out of memory (source: [" + source + "])

@bleskes
bleskes reviewed Jul 7, 2015
View changes
core/src/main/java/org/elasticsearch/index/shard/IndexShard.java Outdated
@@ -741,7 +741,11 @@ public SnapshotIndexCommit snapshotIndex(boolean flushFirst) throws EngineExcept
}
}

public void failShard(String reason, Throwable e) {
/**
* Fails the shard and marks the engine as corrupted if

This comment has been minimized.

Copy link
@bleskes

bleskes Jul 7, 2015

Member

marks the shard store

@bleskes

This comment has been minimized.

Copy link
Member

commented Jul 7, 2015

LGTM. Thanks @areek

@areek areek force-pushed the areek:fix/soft_engine_fail branch to d63857e Jul 7, 2015
@areek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2015

merged to master (4849e76) and 1.x (cfd8909)

@areek areek closed this Jul 7, 2015
@kevinkluge kevinkluge removed the review label Jul 7, 2015
areek added a commit to areek/elasticsearch that referenced this pull request Jul 9, 2015
Currently when an engine fails the shard state file is no longer deleted elastic#11933
and the underlying store is only marked as corrupted for index corruption exceptions.
This means that the store can be opened, even after it failed with IOE, OOM exceptions.

It would be useful to persist the engine failures that are not due to corruption for
inspection, these can be exposed later through elastic#11545
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.