Shadow Replica indexes do not delete properly #17695

abeyad · 2016-04-12T22:16:10Z

This issue documents deletion problems with shadow replica indices that were found while working on #17265. A separate PR, #17638, which improves the naming of methods in the IndicesService, also contains new tests and assertions added to existing tests that reveal the issues below. These must be enabled as part of any PR that fixes the issues.

No. 1
The index file deletion logic triggered in IndicesService#deleteIndexStore(String reason, Index index, IndexSettings indexSettings) checks, before deleting any files, that the index is either not a shadow replica index or, if it is, that it has already been closed (so that no other nodes are holding resources to it). This check is too strict: if a shadow replica index is deleted without having been closed first, the index folder itself is not deleted and remains on the file system as an empty folder. So one of the issues that needs fixing is to ensure that index directories are deleted even on shadow replica index deletes. The following tests contain commented-out assertions that test this behavior once it is fixed:

IndexWithShadowReplicasIT#testIndexWithShadowReplicasCleansUp
IndexWithShadowReplicasIT#testShadowReplicaNaturalRelocation

Note that shared shard data is cleaned up properly in a shadow replica index that is not closed, as the shard data is deleted by the StoreCloseListener. This is verified in the tests with the assertPathHasBeenCleared assert.
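To make the check concrete, here is a minimal sketch of the guard described above, together with one possible relaxation that removes the index folder once it is empty. This is not the actual IndicesService code: the class name, both method names, and the "delete the folder if it is empty" approach are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch; none of these names exist in Elasticsearch.
final class ShadowReplicaDeleteSketch {

    // Mirrors the check described above: contents may be deleted only if the
    // index is not a shadow replica index, or if it was closed beforehand.
    static boolean mayDeleteIndexContents(boolean shadowReplicaIndex, boolean closedBefore) {
        return !shadowReplicaIndex || closedBefore;
    }

    // Assumed relaxation: even when the strict check fails, remove the index
    // folder if nothing is left inside it. Returns true if the folder was removed.
    static boolean deleteIndexFolderIfEmpty(Path indexFolder) throws IOException {
        if (!Files.isDirectory(indexFolder)) {
            return false; // nothing to do
        }
        try (Stream<Path> entries = Files.list(indexFolder)) {
            if (entries.findAny().isPresent()) {
                return false; // still holds data; leave it alone
            }
        }
        Files.delete(indexFolder);
        return true;
    }
}
```

With the strict check alone, a shadow replica index that was never closed (shadowReplicaIndex = true, closedBefore = false) is skipped entirely, which is how the empty folder ends up left behind.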

No. 2
The issue with deleting a shadow replica index that was previously closed is that all of the index and shard data are potentially deleted simultaneously by every node that receives the delete operation and invokes NodeEnvironment#deleteIndexDirectorySafe. This can lead to race conditions in which one node tries to delete a file that another node has already deleted, because both are walking the file system at the same time (using Lucene's IOUtils.rm). The failure is logged as a warning in IndicesService#deleteIndexStore(String reason, Index index, IndexSettings indexSettings), and the deletion is put on the pending queue.
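One way to picture a delete that tolerates this race is a recursive walk that treats "already gone" as success. The sketch below uses plain java.nio.file rather than Lucene's IOUtils.rm, and the class and method names are hypothetical. It illustrates the idea only and is not a proposed patch.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical sketch; not part of Elasticsearch or Lucene.
final class TolerantDeleteSketch {

    // Recursively deletes root, ignoring entries that another process
    // removed while this walk was in progress.
    static void deleteTreeIgnoringMissing(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.deleteIfExists(file); // no error if it is already gone
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
                if (exc instanceof NoSuchFileException) {
                    return FileVisitResult.CONTINUE; // vanished before we could read it
                }
                throw exc;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null && !(exc instanceof NoSuchFileException)) {
                    throw exc;
                }
                Files.deleteIfExists(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

Calling deleteTreeIgnoringMissing a second time on a tree that has already been removed returns without throwing, which is the behavior a late-arriving node would need.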

bleskes · 2016-04-13T07:36:42Z

> none of its index files are deleted

To be clear - shard level folders are deleted and the index metadata is deleted as well. The problem is that we leave an empty folder behind.

Re No. 2:

Do we have a specific issue with master nodes, or is it specifically about nodes that do not host a shard of the shadow index (i.e., master nodes but also some data nodes)?

abeyad · 2016-04-13T16:07:05Z

> To be clear - shard level folders are deleted and the index metadata is deleted as well. The problem is that we leave an empty folder behind.

Yes, that is correct. I changed the description to be more clear on this.

> Do we have a specific issue with master nodes, or is it specifically about nodes that do not host a shard of the shadow index (i.e., master nodes but also some data nodes)?

I think our tests show that this isn't specific to master nodes; it is a general issue with index folders not getting cleaned up. Since we are dropping the testDeletingIndexWithDedicatedMasterNodes test, I will remove this point from the description as well.

I believe the fundamental problems with shadow replica deletion are (1) these empty index folders being left behind, and (2) when a shadow replica index has been closed and then deleted, all nodes try to delete the same shared shard data folder. In the second case the first node succeeds, while the remaining nodes try to delete files that are already gone (with the potential for race conditions if multiple nodes are deleting at the same time). This logs a warning and puts the delete on the pending queue.

I will update the description of this issue accordingly.

bleskes · 2016-05-12T10:03:09Z

@abeyad was there any discussion on this? If not, we should pick it up and make a plan.

abeyad · 2016-05-12T13:31:39Z

@bleskes nothing further has been done on this. It's worth a discussion to see how best to handle the aforementioned scenarios.

abeyad · 2016-05-12T14:30:35Z

After discussing with Boaz: these issues are known and have been there for a long time. They don't cause incorrect behavior; they just prevent shadow replica indices from deleting cleanly. It remains to be discussed whether this is important to tackle and, if so, what the appropriate solutions are.

@bleskes @clintongormley FYI

dakrone · 2016-08-08T19:53:44Z

@abeyad @clintongormley how do we feel about this being a blocker for 5.0? It's currently marked as a blocker due to a // norelease in IndexWithShadowReplicasIT pointing to this issue

abeyad · 2016-08-08T20:16:57Z

I'm inclined to say these are not blockers, because they result only in incomplete cleanup (empty directories lying around) or in simultaneous deletes that are logged as such without tripping any errors. Also, adding the failed deletes to the pending queue just means they will be executed later and essentially be no-ops, because there will be nothing left to delete (the deletion was already done successfully by one of the nodes).

So, it's not great form, but given that shadow replicas aren't a ubiquitous feature and this issue doesn't actually cause incorrect behavior, I don't feel it's a blocker.

If @clintongormley agrees, I can remove the // norelease from the test.

@dakrone what do you think?

clintongormley · 2016-08-11T17:04:09Z

+1

abeyad · 2016-08-11T17:09:53Z

I removed the blocker label and removed the //norelease in 50b31ce

javanna · 2017-09-22T15:47:39Z

I would close this now that shadow replicas have been removed. I do not think we will go back to old branches and fix this problem.

abeyad added >bug :Shadow Replicas v5.0.0-alpha2 labels Apr 12, 2016

abeyad mentioned this issue Apr 13, 2016

Improvements to the IndicesService class #17638

Closed

abeyad added the blocker label Apr 14, 2016

abeyad mentioned this issue Apr 14, 2016

Race conditions in Index Deletion #17767

Closed

clintongormley added v5.0.0-alpha3 and removed v5.0.0-alpha2 labels Apr 26, 2016

abeyad added discuss and removed blocker labels May 12, 2016

clintongormley added v5.0.0-alpha4 and removed v5.0.0-alpha3 labels May 26, 2016

clintongormley added v5.0.0-alpha5 and removed v5.0.0-alpha4 labels Jun 22, 2016

clintongormley removed the discuss label Jul 8, 2016

clintongormley added v5.0.0-beta1 and removed v5.0.0-alpha5 labels Jul 29, 2016

dakrone added the blocker label Aug 8, 2016

abeyad removed the blocker label Aug 11, 2016

abeyad pushed a commit that referenced this issue Aug 11, 2016

Remove //norelease from IndexWithShadowReplicasIT test that checks asserts the indices directory is deleted on index deletion, as we are no longer considering it a blocker for releasing. Relates #17695

50b31ce

clintongormley added the v5.0.0 label Sep 14, 2016

clintongormley removed the v5.0.0-beta1 label Sep 14, 2016

clintongormley added v5.1.1 and removed v5.0.0 labels Oct 18, 2016

bleskes mentioned this issue Dec 7, 2016

Remove Shadow Replicas #22024

Closed

clintongormley added v5.1.2 and removed v5.1.1 labels Dec 7, 2016

clintongormley removed the v5.1.2 label Jan 12, 2017

javanna closed this as completed Sep 22, 2017
