Delete shard content under lock #9083

Merged
merged 1 commit into from Jan 6, 2015

Projects

None yet

5 participants

@s1monw
Contributor
s1monw commented Dec 29, 2014

Once we delete the the index on a node we are closing all resources
and subsequently need to delete all shards contents from disk. Yet
this happens today under a lock (the shard lock) that needs to be
acquried in order to execute any operation on the shards data
path. We try to delete all the index meta-data once we acquired
all the shard lock but this operation can run into a timeout which causes
the index to remain on disk. Further, all shard data will be left on
disk if the timeout is reached.

This commit removes all the shards data just before the shard lock
is release as the last operation on a shard that belongs to a deleted
index.

supersedes #8608 & relates to #9009

@s1monw s1monw added the review label Dec 29, 2014
@rjernst rjernst commented on an outdated diff Dec 30, 2014
src/main/java/org/elasticsearch/env/NodeEnvironment.java
}
+ logger.trace("deleted shard {} directory, paths: [{}]", shardId, paths);
+ assert FileSystemUtils.exists(paths) == false;
+ IOUtils.rm(paths);
@rjernst
rjernst Dec 30, 2014 Member

Isn't this redundant since the exists check just returned false?

@rjernst rjernst commented on an outdated diff Dec 30, 2014
src/main/java/org/elasticsearch/index/IndexService.java
- shardInjector.getInstance(MergePolicyProvider.class).close();
- } catch (Throwable e) {
- logger.debug("[{}] failed to close merge policy provider", e, shardId);
- // ignore
- }
- try {
- shardInjector.getInstance(IndexShardGatewayService.class).close();
- } catch (Throwable e) {
- logger.debug("[{}] failed to close index shard gateway", e, shardId);
- // ignore
- }
+ closeShardInjector(reason, sId, shardInjector, indexShard);
+
+ logger.debug("[{}] closed (reason: [{}])", shardId, reason);
+ } catch (Throwable t) {
+ throw t;
@rjernst
rjernst Dec 30, 2014 Member

Why is the try/catch needed if just rethrowing the original?

@rjernst rjernst commented on an outdated diff Dec 30, 2014
src/main/java/org/elasticsearch/index/IndexService.java
+ }
+ try {
+ shardInjector.getInstance(IndexShardGatewayService.class).close();
+ } catch (Throwable e) {
+ logger.debug("[{}] failed to close index shard gateway", e, shardId);
+ // ignore
+ }
+ try {
+ // now we can close the translog
+ shardInjector.getInstance(Translog.class).close();
+ } catch (Throwable e) {
+ logger.debug("[{}] failed to close translog", e, shardId);
+ // ignore
+ }
+ try {
+ // now we can close the translog
@rjernst
rjernst Dec 30, 2014 Member

copied comment?

@rjernst
Member
rjernst commented Dec 30, 2014

LGTM, but my review was cursory as I am not very familiar with this code.

@dakrone dakrone commented on an outdated diff Jan 2, 2015
src/main/java/org/elasticsearch/index/IndexService.java
} catch (Throwable e) {
- logger.debug("[{}] failed to close translog", e, shardId);
- // ignore
+ logger.warn("[{}] failed to close store on shard deletion", e, shardId);
@dakrone
dakrone Jan 2, 2015 Member

Closing the store doesn't necessarily mean the shard is being deleted, I tested this and this codepath can happen when the index is closed, so I think this should be "failed to close store on shard closing"

@dakrone
dakrone Jan 2, 2015 Member

or "failed to close store on shard removal"

@dakrone dakrone commented on an outdated diff Jan 2, 2015
src/main/java/org/elasticsearch/index/IndexService.java
try {
- store.close();
- } catch (Throwable e) {
- logger.warn("[{}] failed to close store on shard deletion", e, shardId);
+ nodeEnv.deleteShardDirectorySafe(lock, indexSettings);
+ } catch (IOException e) {
+ logger.warn("{} failed to delete shard content", lock.getShardId());
@dakrone
dakrone Jan 2, 2015 Member

Can we log the exception here?

@dakrone dakrone commented on the diff Jan 2, 2015
src/main/java/org/elasticsearch/index/IndexService.java
@@ -163,8 +166,13 @@ public int numberOfShards() {
}
@Override
- public UnmodifiableIterator<IndexShard> iterator() {
- return shards.values().iterator();
+ public Iterator<IndexShard> iterator() {
@dakrone
dakrone Jan 2, 2015 Member

Since shards is an ImmutableMap, can this return an UnmodifiableIterator again to indicate that no one should be changing things?

@s1monw
s1monw Jan 2, 2015 Contributor

I disagree - I removed this weirdness here since we wanna code against interfaces here

@dakrone
dakrone Jan 2, 2015 Member

I don't understand, why would we want to allow someone to remove a shard via .remove() on the iterator?

@s1monw
s1monw Jan 2, 2015 Contributor

we dont' it throws an exception

@dakrone dakrone commented on the diff Jan 2, 2015
src/main/java/org/elasticsearch/index/IndexService.java
@@ -189,7 +201,7 @@ public IndexShard shardSafe(int shardId) throws IndexShardMissingException {
return indexShard;
}
- public ImmutableSet<Integer> shardIds() {
+ public Set<Integer> shardIds() {
@dakrone
dakrone Jan 2, 2015 Member

Same here, I think we should go back to the ImmutableSet, the types should keep people from thinking they can change things with these collections

@dakrone dakrone commented on an outdated diff Jan 2, 2015
src/main/java/org/elasticsearch/index/IndexService.java
}
+ }
+ }
+
+ private void close(ShardId shardId, Injector shardInjector, Class<? extends Closeable>... toClose) {
@dakrone
dakrone Jan 2, 2015 Member

Just to avoid name collision (since there's already a public close method), can we rename this closeWithInjector or innerClose or something like that?

@dakrone
dakrone Jan 2, 2015 Member

Also, can you add javadoc for this (just something mentioning that this swallows exceptions)

@dakrone dakrone commented on the diff Jan 2, 2015
src/main/java/org/elasticsearch/index/IndexService.java
}
- Injectors.close(injector);
+ }
+ }
+
+ private class StoreCloseListener implements Store.OnClose {
@dakrone
dakrone Jan 2, 2015 Member

Out of curiosity, why create a class for this instead of an anonymous class in IndexService capturing the local shardId? There are no other instantiations of this class other than the single one

@s1monw
s1monw Jan 2, 2015 Contributor

@kimchy asked for it since historically we had GC problems if we did that on a guice index level

@dakrone dakrone commented on an outdated diff Jan 2, 2015
...rg/elasticsearch/index/gateway/IndexShardGateway.java
- try {
- boolean waited = latch.await(waitForMappingUpdatePostRecovery.millis(), TimeUnit.MILLISECONDS);
- if (!waited) {
- logger.debug("waited for mapping update on master for [{}], yet timed out");
- }
- } catch (InterruptedException e) {
- logger.debug("interrupted while waiting for mapping update");
+ try {
+ boolean waited = latch.await(waitForMappingUpdatePostRecovery.millis(), TimeUnit.MILLISECONDS);
+ if (!waited) {
+ logger.debug("waited for mapping update on master for [{}], yet timed out");
@dakrone
dakrone Jan 2, 2015 Member

I know this is dependent on #9102 where I commented to fix this, but I wanted to comment just to make sure the logging {} doesn't sneak back in during rebase.

@dakrone
Member
dakrone commented Jan 2, 2015

Left some comments, also with this change the CloseableComponent class can be removed entirely

@s1monw
Contributor
s1monw commented Jan 2, 2015

Left some comments, also with this change the CloseableComponent class can be removed entirely

yeah I removed it now :) I actually missed that!

@s1monw
Contributor
s1monw commented Jan 2, 2015

@dakrone I pushed a new commits

@dakrone
Member
dakrone commented Jan 2, 2015

LGTM

@bleskes bleskes commented on an outdated diff Jan 5, 2015
src/main/java/org/elasticsearch/env/NodeEnvironment.java
- }
- logger.trace("deleted shard {} directory, paths: [{}]", shardId, paths);
- assert FileSystemUtils.exists(paths) == false;
+ try (ShardLock lock = shardLock(shardId)) {
+ deleteShardDirectorySafe(lock, indexSettings);
+ }
+ }
+
+ /**
+ * Deletes a shard data directory.
+ *
+ * @param lock the shards lock
+ * @throws IOException if an IOException occurs
+ */
+ public void deleteShardDirectorySafe(ShardLock lock, @IndexSettings Settings indexSettings) throws IOException {
+ assert indexSettings != ImmutableSettings.EMPTY;
@bleskes
bleskes Jan 5, 2015 Member

can we check and except if the lock is not acquired? Maybe also rename the method name to deleteShardDirectoryUnderLock, not sure. We should also document the fact that the shard is expected to be locked.

@bleskes bleskes commented on the diff Jan 5, 2015
src/main/java/org/elasticsearch/index/IndexService.java
}
+ @Override
+ public void handle(ShardLock lock) {
@bleskes
bleskes Jan 5, 2015 Member

can we add a comment/doc that this is called under the shard lock? will also be a good idea to assert on it.

@bleskes
Member
bleskes commented Jan 5, 2015

I like it. Much simpler. I think we can remove the timeout on delete introduced temporarily in #9009, but if we do so, we need to try to delete all the shard folders that are not locked, instead of trying to acquire all locks and then delete them all together (or not). This is needed for shards that were just relocated away (and their locks / in memory registration released) but not yet deleted from disk.

@s1monw
Contributor
s1monw commented Jan 6, 2015

I applied changes to your comments. Can you take another look?

I like it. Much simpler. I think we can remove the timeout on delete introduced temporarily in #9009, but if we do so, we need to try to delete all the shard folders that are not locked, instead of trying to acquire all locks and then delete them all together (or not). This is needed for shards that were just relocated away (and their locks / in memory registration released) but not yet deleted from disk.

can we do this in a different change?

@bleskes
Member
bleskes commented Jan 6, 2015

LGTM. thx.

can we do this in a different change?

I'll do it. OK.

@s1monw s1monw [CORE] Delete shard content under lock
Once we delete the the index on a node we are closing all resources
and subsequently need to delete all shards contents from disk. Yet
this happens today under a lock (the shard lock) that needs to be
acquried in order to execute any operation on the shards data
path. We try to delete all the index meta-data once we acquired
all the shard lock but this operation can run into a timeout which causes
the index to remain on disk. Further, all shard data will be left on
disk if the timeout is reached.

This commit removes all the shards data just before the shard lock
is release as the last operation on a shard that belongs to a deleted
index.
7ec8973
@s1monw s1monw merged commit 7ec8973 into elastic:master Jan 6, 2015

1 check passed

CLA Commit author is a member of Elasticsearch
Details
@clintongormley clintongormley added :Core and removed review labels Mar 19, 2015
@clintongormley clintongormley changed the title from [CORE] Delete shard content under lock to Delete shard content under lock Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment