Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Close Index API] Refactor MetaDataIndexStateService #36354

Merged

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Dec 7, 2018

Note: this pull request will be merged in the close-index-api-refactoring branch

The pull request changes how indices are closed in the MetaDataIndexStateService. It now uses a 3 steps process where writes are blocked on indices to be closed, then some verifications are done on shards using the TransportVerifyShardBeforeCloseAction added in #36249, and finally indices states are moved to CLOSE and their routing tables removed.

The closing process also takes care of

  • using the pre-7.0 way to close indices if the cluster contains mixed version of nodes and a node does not support the TransportVerifyShardBeforeCloseAction
  • closing unassigned indices (ie indices with all primary shards unassigned)

Somes tests required to be adapted, as well as the Freeze action.

This pull request will have two immediate follow ups:

  • handling the task ids so that we can identify the freeze parent task that trigged a close task which triggered a verify shard before close task
  • change the Close Index API response format from AcknowledgedResponse to a custom CloseIndexResponse that provides more information about which indices were closed and why the verification shard failed for non closed indices.

@tlrx tlrx added >enhancement v7.0.0 :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Dec 7, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done an initial pass. I'm wondering a bit about the testing coverage here. I think we need some tests with concurrent close requests while indices are for example being deleted, or while shards are being unassigned, check idempotence etc. Do you consider adding an identity to the closed block as a follow-up?

return;
} else if (indexRoutingTable.allPrimaryShardsUnassigned()) {
logger.debug("index {} has been blocked before closing but is now unassigned, ignoring", index);
listener.onResponse(new AcknowledgedResponse(true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should do this check per IndexShardRoutingTable as well, otherwise we fail if some primary shards are unassigned while others are not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I pushed dbe7b2c168c31b2f740679acd15449fdd1511694

assertAcked(client().admin().indices().prepareClose("test-idx-some", "test-idx-all").execute().actionGet());
assertAcked(client().admin().indices().prepareClose("test-idx-all"));

AcknowledgedResponse closeIndexResponse = client().admin().indices().prepareClose("test-idx-some").setTimeout("3s").get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one should ack as well, see my comment above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this way we keep the previous behavior. I pushed 9ac483656a9d824b42090dac0044283cc68c8fa3

if (response.isAcknowledged()) {
toggleFrozenSettings(concreteIndices, request, listener);
} else {
listener.onResponse(new FreezeResponse(false, false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add TODO here that we should also extend the FreezeResponse object with info about failed closing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO added in a239b47a4dc0f52cd1f950eb2937fa3ee3f0d8ef

if (request.indices() == null || request.indices().length == 0) {
throw new IllegalArgumentException("Index name is required");
}
initiateClosing(request.indices(), request.masterNodeTimeout(), request.ackTimeout(), listener);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do the chaining at the top-level? Instead of having step1 calling step2 calling step3, I think it's better to have these as independent steps, and assemble them here at the top-level. The advantage is also that we can just mock out an intermediary step, making it easier to implement the closing in ClusterStateChanges as well as for testing these steps in isolation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 4d2e90f1fcc977166828d2a8120267c2edd9654e

@tlrx
Copy link
Member Author

tlrx commented Dec 7, 2018

@ywelsch Thanks for the review. I addressed this first bunch of comments if you want to add more feedback.

I'm wondering a bit about the testing coverage here. I think we need some tests with concurrent close requests while indices are for example being deleted, or while shards are being unassigned, check idempotence etc.

This is something I plan to do but as follow ups PRs once the Close Index API is refactored + stabilized in this feature branch. I don't plan to merge this PR until we all agree on the API, tests etc.

Do you consider adding an identity to the closed block as a follow-up?

To be honest this is something I put aside for now. Do you think this is a blocker/mandatory for this refactoring before merging the feature branch into master?

@tlrx
Copy link
Member Author

tlrx commented Dec 10, 2018

@ywelsch I added more tests, I'd be happy if you can have a look at them - and at other changes. Thanks

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tlrx. I like this better after the refactoring. I've left some more comments.

// start some concurrent indexing threads
for (final String index : indices) {
if (randomBoolean()) {
final Thread thread = new Thread(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a utility class called BackgroundIndexer which might be of help here, see testRelocationWhileIndexingRandom.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I didn't know about this one.

@tlrx
Copy link
Member Author

tlrx commented Dec 12, 2018

@ywelsch I updated the code according to your comments. Can you have another look please?

I also added some unit tests.

throw new ElasticsearchException("bulk request failure, id: ["
+ bulkItemResponse.getFailure().getId() + "] message: "
+ bulkItemResponse.getFailure().getMessage());
failures.add(bulkItemResponse.getFailure().getCause());
Copy link
Member Author

@tlrx tlrx Dec 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change this so that BackgroundIndexer.totalIndexedDocs() returns the effective number of docs indexed and also reports all bulk item failures (and does not stop after the first failure).

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


// If the cluster is in a mixed version that does not support the shard close action,
// we use the previous way to close indices and directly close them without sanity checks
final boolean useDirectClose = currentState.nodes().getMinNodeVersion().before(Version.V_7_0_0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there mixed version cluster tests that check index closing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The :qa:mixed-cluster tests runs the core REST tests and those tests contains some open/close YAML tests that fail without this logic

@tlrx tlrx mentioned this pull request Dec 13, 2018
50 tasks
@tlrx tlrx force-pushed the refactor-metadatainexstateservice branch from 67c4af8 to 4ebfbea Compare December 13, 2018 14:58
@s1monw
Copy link
Contributor

s1monw commented Dec 13, 2018

woohoo!

@tlrx
Copy link
Member Author

tlrx commented Dec 13, 2018

Build is green (elasticsearch-ci-1, elasticsearch-ci-2), I'm merging into the feature branch.

@tlrx tlrx merged commit 8e5dd20 into elastic:close-index-api-refactoring Dec 13, 2018
@tlrx tlrx deleted the refactor-metadatainexstateservice branch December 13, 2018 16:36
@tlrx
Copy link
Member Author

tlrx commented Dec 13, 2018

Thanks @ywelsch

tlrx added a commit that referenced this pull request Jan 14, 2019
This commit backports to 6.x of the Close Index API refactoring.

It cherry-picks the following commits from master:
3ca885e [Close Index API] Add TransportShardCloseAction for pre-closing verifications (#36249)
8e5dd20 [Close Index API] Refactor MetaDataIndexStateService (#36354)
7372529 [Tests] Reduce randomization in CloseWhileRelocatingShardsIT (#36694)
103c4d4 [Close Index API] Mark unavailable shard copy as stale during verification (#36755)
1959388 [Close Index API] Propagate tasks ids between Freeze, Close and Verify(#36630)
e149b08 [Close Index API] Add unique UUID to ClusterBlock (#36775)
dc371ef [Tests] Fix ReopenWhileClosingIT with correct min num shards

The following two commits were needed to adapt the change to 6.x:
ef6ae69 [Close Index API] Adapt MetaDataIndexStateServiceTests after merge
21b7653 [Tests] Adapt CloseIndexIT tests for 6.x

Related to #33888
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants