
IndicesStore shouldn't try to delete index after deleting a shard #12487

Merged · 1 commit · Jul 27, 2015

Conversation

imotov (Contributor) commented Jul 27, 2015

When a node discovers shard content on disk which isn't used, we reach out to all other nodes that are supposed to have the shard active. Only once all of those have confirmed the shard as active, the shard has no unassigned copies, and no cluster state change has happened in the meantime do we go and delete the shard folder.
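For illustration, here is a minimal sketch of those three conditions folded into a single check; all names here are hypothetical and this is not the actual IndicesStore code:

public boolean safeToDeleteShardFolder(boolean allNodesConfirmedActive,
                                       int unassignedCopies,
                                       long clusterStateVersionAtCheck,
                                       long currentClusterStateVersion) {
    // Delete only if every node that should host the shard confirmed it as
    // active, no copy of the shard is unassigned, and the cluster state has
    // not changed since the check started.
    return allNodesConfirmedActive
            && unassignedCopies == 0
            && clusterStateVersionAtCheck == currentClusterStateVersion;
}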

Currently, after removing a shard, the IndicesStore checks with the indices service whether any shard of that index is still active on the node and, if not, tries to delete the entire index folder (unless on the master node, where we keep the index metadata around). This is wrong, as both that check and the protections in IndicesService.deleteIndexStore only make sure that there isn't any shard of that index in use; we may therefore erroneously delete other unused shard copies on disk without the proper safety guards described above.

Normally, this is not a problem, as the missing copy will be recovered from another shard copy on another node (although a shame). However, in extremely rare cases involving multiple node failures/restarts where no shard copy is available (i.e., the shard is red), there are race conditions which can cause all shard copies to be deleted.

Instead, we should base the decision to clean up an index folder on a check that the index directory is empty and contains no shards.
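As a rough sketch of that approach (plain java.nio here for illustration; the actual implementation would go through NodeEnvironment and its locks):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: base the cleanup decision on what is actually left on
// disk rather than on the in-memory shard map.
static boolean indexDirectoryIsEmpty(Path indexPath) throws IOException {
    try (DirectoryStream<Path> contents = Files.newDirectoryStream(indexPath)) {
        return !contents.iterator().hasNext();
    }
}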

Note: this PR is against the 1.6 branch.

dakrone (Member) commented Jul 27, 2015

LGTM

s1monw (Contributor) commented Aug 5, 2015

To me it seems that some of the safety that has been added here is still prone to concurrency issues. For instance, we check this.indices in canDeleteIndexContents, which can be concurrently modified by a createIndex operation, so I think there are still races in this code that could cause problems?

imotov (Contributor, Author) commented Aug 7, 2015

@s1monw to me it looked like all these calls are performed on the updateTask thread, so they shouldn't run into concurrency issues. What did I miss?

s1monw (Contributor) commented Aug 10, 2015

For instance:

public boolean canDeleteIndexContents(Index index, Settings indexSettings) {
    // Deletable only if the index is absent from the in-memory indices map
    // and this node stores data locally; shadow-replica indices are skipped.
    final Tuple<IndexService, Injector> indexServiceInjectorTuple = this.indices.get(index.name());
    if (IndexMetaData.isOnSharedFilesystem(indexSettings) == false) {
        if (indexServiceInjectorTuple == null && nodeEnv.hasNodeFile()) {
            return true;
        }
    } else {
        logger.trace("{} skipping index directory deletion due to shadow replicas", index);
    }
    return false;
}

This checks whether the index is present, but the index could be created concurrently, so we can potentially check before it's created but delete after its creation was successful?
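The worry, in other words, is a classic check-then-act race. A minimal sketch of one way to close it, with hypothetical names (this is not the change made in this PR):

private final Object deletionMutex = new Object(); // hypothetical guard

public void deleteIndexContentsIfUnused(Index index, Settings indexSettings) {
    // Without a shared guard, another thread may run createIndex() between the
    // canDeleteIndexContents() check and the deletion below, and we would
    // delete the folder of an index that has just become live again.
    synchronized (deletionMutex) {
        if (canDeleteIndexContents(index, indexSettings)) {
            deleteIndexStoreUnderMutex(index, indexSettings); // hypothetical delete path
        }
    }
}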

imotov (Contributor, Author) commented Aug 10, 2015

@s1monw concurrently on which thread?

s1monw (Contributor) commented Aug 10, 2015

Well, we can call this API from everywhere, so I wonder if we should synchronize all these operations? There are pending-deletes threads etc. that are kicked off, so I really think we should protect that.

bleskes (Contributor) commented Aug 13, 2015

+1 to not relying on these methods being called from the cluster state update thread. If I understand the concern correctly, it relates to an index being deleted and created concurrently. In that case I think the shard locking in NodeEnvironment will protect us from deleting data. However, I agree it would be nice to have clear concurrency semantics at this level of the code. Right now we sometimes synchronize and sometimes don't, which makes it hard to reason about. That is potentially a much bigger change, though, and I was trying to make the smallest intervention. @s1monw, if I misunderstood what you said and there is a concrete hole in the logic, can you open an issue so we won't forget?
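For reference, a rough sketch of the shard-locking pattern being alluded to, with hypothetical helper names (the real NodeEnvironment API may differ):

// Hold the per-shard lock across the whole check-and-delete, so a concurrent
// create/recovery that takes the same lock blocks the deletion.
try (ShardLock lock = nodeEnv.shardLock(shardId)) {
    if (shardDirectoryIsUnused(shardId)) {   // hypothetical check
        deleteShardDirectory(shardId);       // hypothetical delete
    }
} catch (IOException e) {
    // Lock could not be obtained: another operation owns the shard; skip deletion.
}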
