
[RCI] Keep index routing table for closed indices #34108

Conversation

@tlrx (Member) commented Sep 27, 2018

Note: this pull request is against the replicated-closed-indices branch

This pull request changes the MetaDataIndexStateService class so that it no longer removes the routing table of closed indices; instead, it reinitializes the shard routings in an INITIALIZING state (with an INDEX_CLOSED unassigned reason). This way the primary shard allocator will take care of reallocating the shards to the nodes that already hold valid copies of the unassigned primaries, forcing the recreation of the shards on the data nodes. Thanks to #33903 the shards will be recreated using a NoOpEngine.
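
For illustration, here is a minimal, self-contained Java sketch of that idea. The types below are simplified stand-ins for Elasticsearch's routing classes, not the actual MetaDataIndexStateService code:

import java.util.ArrayList;
import java.util.List;

final class ClosedIndexRoutingSketch {

    enum ShardState { STARTED, INITIALIZING }
    enum UnassignedReason { INDEX_CLOSED }

    // Simplified stand-in for ShardRouting + UnassignedInfo.
    static final class ShardRoutingStub {
        final int shardId;
        final ShardState state;
        final UnassignedReason unassignedReason;

        ShardRoutingStub(int shardId, ShardState state, UnassignedReason unassignedReason) {
            this.shardId = shardId;
            this.state = state;
            this.unassignedReason = unassignedReason;
        }
    }

    // Instead of dropping the routing entries of an index being closed, rebuild them as
    // INITIALIZING with an INDEX_CLOSED reason, so the primary shard allocator can later
    // reassign each shard to a node that already holds a valid copy.
    static List<ShardRoutingStub> reinitializeForClose(List<ShardRoutingStub> current) {
        List<ShardRoutingStub> reinitialized = new ArrayList<>(current.size());
        for (ShardRoutingStub shard : current) {
            reinitialized.add(new ShardRoutingStub(shard.shardId, ShardState.INITIALIZING, UnassignedReason.INDEX_CLOSED));
        }
        return reinitialized;
    }
}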

This pull request also adds a removeIndices() method to the IndicesClusterStateService that detects when the state of an index has changed (CLOSE <-> OPEN) and closes the associated IndexService, forcing its recreation.
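
A rough sketch of that detection logic, again using simplified stand-ins rather than the real IndicesClusterStateService types:

import java.util.Map;

final class RemoveIndicesSketch {

    enum IndexState { OPEN, CLOSE }

    interface IndexServiceStub {
        void close(String reason);
    }

    // For every index service currently held on the node, check whether the index state
    // flipped between OPEN and CLOSE in the new cluster state; if so, close the service
    // so it (and its shards) will be recreated when the new state is applied.
    static void removeIndices(Map<String, IndexState> previousStates,
                              Map<String, IndexState> currentStates,
                              Map<String, IndexServiceStub> indexServices) {
        for (Map.Entry<String, IndexServiceStub> entry : indexServices.entrySet()) {
            String index = entry.getKey();
            IndexState previous = previousStates.get(index);
            IndexState current = currentStates.get(index);
            if (previous != null && current != null && previous != current) {
                entry.getValue().close("state changed from " + previous + " to " + current);
            }
        }
    }
}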

I created this pull request to get feedback on these two points. The PR also adds some necessary tests and adapts an important one, IndicesClusterStateServiceRandomUpdatesTests.

It also mutes a lot of tests (more than I expected...) that sometimes fail because:

  • the test expects that closing an index releases all shard resources (like the shard lock), and this is not true anymore
  • the test expects that closing an index removes the index routing table for the index, and this is not true anymore
  • the test expects that an index can be closed at any time, even if shards of the open index are still initializing
  • the test fails because the translog is not empty when the shard is recreated using a NoOpEngine

For the first two cases, the tests need to be adapted, and I'd like to do this on a per-test basis. I haven't done it in this PR to reduce the noise.

For the last two cases, the Close Index API needs to be reworked so that it only closes an index when its shards are active and fully in sync. This will be addressed in another PR.

Relates #33888

This commit adds a new NoOpEngine implementation based on the current ReadOnlyEngine. This new implementation uses an empty DirectoryReader with no segment readers and always returns 0 docs. The NoOpEngine is the default Engine created for IndexShards of closed indices. It expects an empty translog when it is instantiated.

Relates to elastic#33888
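
To illustrate the "empty reader, zero docs" behaviour described in the commit message, here is a small, self-contained Lucene example; it uses a MultiReader with no sub-readers as a stand-in for the empty DirectoryReader that NoOpEngine wraps:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

public final class NoOpReaderSketch {
    public static void main(String[] args) throws Exception {
        // A reader with no sub-readers, hence no segments and no documents.
        IndexReader emptyReader = new MultiReader();
        IndexSearcher searcher = new IndexSearcher(emptyReader);
        System.out.println("numDocs = " + emptyReader.numDocs());                    // prints 0
        System.out.println("hits    = " + searcher.count(new MatchAllDocsQuery()));  // prints 0
        emptyReader.close();
    }
}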
@tlrx added the >enhancement and :Distributed/Distributed labels on Sep 27, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@tlrx mentioned this pull request on Sep 27, 2018

@ywelsch (Contributor) left a comment:


I've left a few questions and suggestions. Core logic looks very good already.

* Removes the {@link IndexService} of indices whose state has changed.
* Closing the index services here will force them to be recreated later along with their shards.
*/
private void removeIndices(final ClusterChangedEvent event) {

ywelsch (Contributor):

Can we fold this into updateIndices? There we use the same iteration order and have the same lookup patterns, so I think we can save a lot of boilerplate and avoid iterating yet another time over all indices.

final AllocatedIndices.IndexRemovalReason reason =
indexMetaData != null && indexMetaData.getState() == IndexMetaData.State.CLOSE ? CLOSED : NO_LONGER_ASSIGNED;
AllocatedIndices.IndexRemovalReason reason = NO_LONGER_ASSIGNED;
if (indexMetaData != null) {

ywelsch (Contributor):

I'm not sure if it's worth the complexity of this code here just to provide a better message as to why an index service got removed. If you think it's useful, maybe factor the logic of determining the AllocatedIndices.IndexRemovalReason based on currentState and newState into a helper method so it can be reused by removeIndices
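
For illustration, a rough sketch of the helper suggested here, using simplified stand-ins for IndexMetaData.State and AllocatedIndices.IndexRemovalReason (the real signature would differ):

final class RemovalReasonSketch {

    enum IndexState { OPEN, CLOSE }
    enum IndexRemovalReason { CLOSED, NO_LONGER_ASSIGNED }

    // Derive the removal reason from the previous and new index state, so the same
    // logic can be shared by removeIndices() and updateIndices().
    static IndexRemovalReason removalReason(IndexState currentState, IndexState newState) {
        if (currentState != IndexState.CLOSE && newState == IndexState.CLOSE) {
            return IndexRemovalReason.CLOSED; // the index was just closed
        }
        return IndexRemovalReason.NO_LONGER_ASSIGNED;
    }
}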


@Before
public void setUpService() {
AllocationService allocationService = createAllocationService(Settings.EMPTY);

ywelsch (Contributor):

you can also use this method without needing to subclass ESAllocationTestCase.

CloseIndexRequest closeIndexRequest = new CloseIndexRequest(state.metaData().index(index).getIndex().getName());
state = cluster.closeIndices(state, closeIndexRequest);
IndexMetaData indexMetaData = state.metaData().index(index);
if (state.routingTable().allShards(index).stream().allMatch(ShardRouting::started)) {

ywelsch (Contributor):

why restrict the test like this?

OpenIndexRequest openIndexRequest = new OpenIndexRequest(state.metaData().index(index).getIndex().getName());
state = cluster.openIndices(state, openIndexRequest);
IndexMetaData indexMetaData = state.metaData().index(index);
// Do not reopen an index that was just closed

ywelsch (Contributor):

why not?

final Index index = indexService.index();
// do not start or fail shards of indices that were just closed or reopened, because
// they are still initializing and we must wait for the cluster state to be applied
// on node before starting or failing them

ywelsch (Contributor):

I'm confused. The cluster state was already applied on the node, no? I don't understand the extra restriction here.

@@ -141,6 +141,7 @@ public void testHiddenFiles() throws IOException {
assertThat(e, hasToString(containsString(expected)));
}

@AwaitsFix(bugUrl = "")

ywelsch (Contributor):

what's wrong with this test?

@tlrx force-pushed the replicated-closed-indices branch 2 times, most recently from 95cfc1e to c2a3ec2, on November 8, 2018
@tlrx force-pushed the replicated-closed-indices branch 5 times, most recently from 49ce196 to fbbeff4, on January 15, 2019
@tlrx force-pushed the replicated-closed-indices branch 5 times, most recently from a0296fc to e53a9be, on January 30, 2019
@tlrx (Member, Author) commented Jan 30, 2019

Closed in favor of #38024

@tlrx closed this on Jan 30, 2019
@tlrx deleted the keep-index-routing-table-for-closed-indices branch on July 1, 2019