Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds tombstones to cluster state for index deletions #17265

Closed

Conversation

Projects
None yet
6 participants
@abeyad
Copy link
Contributor

abeyad commented Mar 23, 2016

Previously, we would determine index deletes in the cluster state by
comparing the index metadatas between the current cluster state and the
previous cluster state and decipher which ones were missing (the missing
ones are deleted indices). This led to a situation where a node that went
offline and rejoined the cluster could potentially cause dangling indices to
be imported which should have been deleted, because when a node rejoins,
its previous cluster state does not contain reliable state.

This commit introduces the notion of index tombstones in the cluster
state, where we are explicit about which indices have been deleted.
In the case where the previous cluster state is not useful for index metadata
comparisons, a node now determines which indices are to be deleted based
on these tombstones in the cluster state. There is also functionality to
purge the tombstones after exceeding a certain amount.

Closes #16358
Closes #17435

@abeyad

This comment has been minimized.

Copy link
Contributor Author

abeyad commented Mar 23, 2016

@bleskes @ywelsch Its a WIP, still have some functionality to implement and tests to write, but just in case you want to give a quick glance beforehand to see if its on the right track.

@abeyad abeyad force-pushed the abeyad:feature/tombstone-deleted-indices branch Mar 23, 2016

@bleskes

View changes

core/src/main/java/org/elasticsearch/cluster/ClusterChangedEvent.java Outdated
IndexTombstone tombstone = cursor.value;
// we should only try to delete indices that have tombstones added since
// the last time we processed cluster state
if (tombstone.getClusterVersion() > previousVersion) {

This comment has been minimized.

Copy link
@bleskes

bleskes Mar 23, 2016

Member

I think we should compare the previous tombstone and the current one and generate a delta. Don't try to be overly smart.

@bleskes

This comment has been minimized.

Copy link
Member

bleskes commented Mar 23, 2016

Hi @abeyad . Thanks for picking it up. I think this can be done in a single tombstone class which is basically a queue of deleted Index object. new entries are always added at the end. Trimming is always done at the beginning. Every time you add an entry the class automatically captures the current time (both in millis and in nanos) and add it to an internal key class. Internally we can assert semantics like "every index appears once". That class can also have methods to do trimming (both on time and size).

Does it make sense?

@abeyad

This comment has been minimized.

Copy link
Contributor Author

abeyad commented Mar 23, 2016

@bleskes ++ on queue of deleted objects for ease of insertion and trimming from the front. The map made it easier to assert "every index appears once" semantics, but I can separate the internal representation from what is serialized.

Every time you add an entry the class automatically captures the current time (both in millis and in nanos) and add it to an internal key class.

I'm not clear on this - I figured we would need the current time on each entry (hence creating the IndexTombstone class to represent each entry). I'm not sure exactly what you mean by the adding current time to an internal key class.

@s1monw

View changes

core/src/main/java/org/elasticsearch/cluster/metadata/IndexTombstone.java Outdated
private static final String INDEX_NAME_KEY = "indexName";
private static final String DELETE_DATE_KEY = "deleteDate";
private static final String CLUSTER_VERSION_KEY = "clusterVersion";
private static final ObjectParser<IndexTombstone.Builder, Void> TOMBSTONE_PARSER = new ObjectParser<>("indexTombstone");

This comment has been minimized.

Copy link
@s1monw

s1monw Mar 23, 2016

Contributor

YAY

@s1monw

View changes

core/src/main/java/org/elasticsearch/cluster/metadata/IndexTombstone.java Outdated
final public class IndexTombstone extends AbstractDiffable<IndexTombstone> implements ToXContent {

public static final IndexTombstone PROTO = new IndexTombstone(null, null, 0L, 0L);
private static final String INDEX_UUID_KEY = "indexUUID";

This comment has been minimized.

Copy link
@s1monw

s1monw Mar 23, 2016

Contributor

can we use _ case instead of camelCase?

@s1monw

View changes

core/src/main/java/org/elasticsearch/cluster/metadata/IndexTombstone.java Outdated
TOMBSTONE_PARSER.declareLong(IndexTombstone.Builder::clusterVersion, new ParseField(CLUSTER_VERSION_KEY));
}

private final String indexUUID;

This comment has been minimized.

Copy link
@s1monw

s1monw Mar 23, 2016

Contributor

can this just be Index instead of uuid and name?

@s1monw

View changes

core/src/main/java/org/elasticsearch/cluster/metadata/IndexTombstone.java Outdated
}

/** A builder for building IndexTombstone objects. */
public static class Builder {

This comment has been minimized.

Copy link
@s1monw

s1monw Mar 23, 2016

Contributor

the builder can be private no? I mean others can just use a ctor? and then we also don't need getters but only setters? and they don't need to return this -- you can tell how much I like builders

This comment has been minimized.

Copy link
@abeyad

abeyad Mar 23, 2016

Author Contributor

The issue is, if we use the builder to create an IndexTombstone, then it needs to be accessible to the MetaDataDeleteIndexService where it is created.

@s1monw

View changes

core/src/main/java/org/elasticsearch/cluster/metadata/IndexTombstone.java Outdated
}

public IndexTombstone build() {
assert indexUUID != null : "indexUUID must be set";

This comment has been minimized.

Copy link
@s1monw

s1monw Mar 23, 2016

Contributor

these should all be real checks in the ew IndexTombstone(indexUUID, indexName, deleteDate, clusterVersion); ctor

@s1monw

View changes

core/src/main/java/org/elasticsearch/cluster/metadata/MetaData.java Outdated
@@ -165,6 +168,7 @@ public static void registerPrototype(String type, Custom proto) {
private final ImmutableOpenMap<String, IndexMetaData> indices;
private final ImmutableOpenMap<String, IndexTemplateMetaData> templates;
private final ImmutableOpenMap<String, Custom> customs;
private final ImmutableOpenMap<String, IndexTombstone> indexTombstones;

This comment has been minimized.

Copy link
@s1monw

s1monw Mar 23, 2016

Contributor

do they need to be keyed by the index name? I think it can just be a List of tombstones?

@s1monw

View changes

core/src/main/java/org/elasticsearch/cluster/metadata/MetaData.java Outdated
@@ -915,6 +946,29 @@ public Builder customs(ImmutableOpenMap<String, Custom> customs) {
return this;
}

public Builder put(final IndexTombstone tombstone) {

This comment has been minimized.

Copy link
@s1monw

s1monw Mar 23, 2016

Contributor

I think we should just have one way of passing this? all the syntactic sugar is unneedd?

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Mar 23, 2016

Hi @abeyad . Thanks for picking it up. I think this can be done in a single tombstone class which is basically a queue of deleted Index object. new entries are always added at the end. Trimming is always done at the beginning. Every time you add an entry the class automatically captures the current time (both in millis and in nanos) and add it to an internal key class. Internally we can assert semantics like "every index appears once". That class can also have methods to do trimming (both on time and size).

I think the current design is OK. It's really a value object and doesn't contain logic. It has the serializaiton and deserialization in there which is good. It can also implement comparable which is then taking the time into account. I also think we shouldn't mix datastructure that is on the clusterstate and representation.

Regarding a queue, I think we should just stick with a simple list we can sort once it's modified and ensure in the Clusterstate ctor that is in-fact sorted but keep it simple.

I also think we might even go without pruning in thirst PR and do the pruning as a followup? It can block a lot of good progress. There are a lot of open questions related to this and for how long we keep there tombstones, I think we should try to keep them for as long as possible but the question of how long is very hard to answer.

@abeyad abeyad force-pushed the abeyad:feature/tombstone-deleted-indices branch 7 times, most recently Mar 25, 2016

@abeyad abeyad changed the title WIP: Adds tombstones to cluster state for index deletions Adds tombstones to cluster state for index deletions Mar 28, 2016

@abeyad abeyad force-pushed the abeyad:feature/tombstone-deleted-indices branch Mar 28, 2016

@abeyad

View changes

core/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java Outdated
indicesService.deleteClosedIndex("closed index no longer part of the metadata", metaData, event.state());
} else {
indexSettings = null;
}

This comment has been minimized.

Copy link
@abeyad

abeyad Mar 28, 2016

Author Contributor

@bleskes I had to change the logic here, because in the case of a node restarting, its previous state will not contain the index metadata for an index that was deleted while it was offline, so if the index metadata for the deleted index (part of the tombstones in the cluster state) is not in the previous cluster state, I try to read it off disk. I had to introduce the MetaStateService as a dependency in this class in order to do that. If the index metadata could not also be read off of disk, that means the index was both created and deleted while the node was offline, so there is nothing to do.

I am unsure if this is the best approach, so would appreciate your feedback.

This comment has been minimized.

Copy link
@bleskes

bleskes Mar 29, 2016

Member

Yeah, I see what you mean. If the node is restarted while the index was deleted it may have no previous state. I think what you do is the right thing, but we can structure it in a slightly cleaner way:

if (idxService != null) {
  // delete in memory index
                  deleteIndex(index, "index no longer part of the metadata");

    // ackNodeIndexDeleted
} else if (previousState.metaData().hasIndex(index)) { < --- needs a variant the checks a uuid
  // deleted index which wasn't assigned to local node (the closed index is very misleading below) 
  indicesService.deleteClosedIndex("closed index no longer part of the metadata", metaData, event.state());
  // ack the index deletion 
} else if (indicesService.canDeleteIndex(Index)) { <-- which should also checks for the folder existence like   
  // load metadata from file and delete it
}

wdyt?

This comment has been minimized.

Copy link
@abeyad

abeyad Mar 29, 2016

Author Contributor

That makes sense, I will formulate the logic in that manner.

This comment has been minimized.

Copy link
@abeyad

abeyad Mar 29, 2016

Author Contributor

Only issue with

else if (indicesService.canDeleteIndex(Index))

is that canDeleteIndexContents requires the IndexSettings, which can't be created until the IndexMetaData is loaded, so all that logic will need to go in an else block..


assertThat(client().prepareGet("test", "type1", "1").execute().actionGet().isExists(), equalTo(true));
}

This comment has been minimized.

Copy link
@abeyad

abeyad Mar 28, 2016

Author Contributor

@bleskes @s1monw I removed this test as it was failing for this PR, but I wanted to validate the logic with you before making a decision on how to proceed. In the new scenario of explicit index tombstones in the cluster state, I do not see why a dangling index should be imported in this case (hence, the reason I deleted the test for the time being). In this test, first an index named test is created and a document is indexed into it. This index will have some UUID, lets call it X. Then the non master node goes offline, and test with UUID X is deleted. Then, an index named test2 is created with a UUID, lets call it Y, and this new index is assigned an alias of test. Then, the non master node is restarted. At this point, when it is restarted, it receives cluster state from master with a tombstone for index test of UUID X. So the non master node sees that it has this index metadata on disk and deletes it. Therefore, there is nothing to import at the dangling index stage, even when the alias for test2 is dropped. I'm not sure why we would want that index to be imported as test in the first place, as it was explicitly deleted and only test2 should exist.

That was my logic when thinking that this test is no longer valid, but I could well be wrong and am not sure, so would appreciate any validation either way.

This comment has been minimized.

Copy link
@bleskes

bleskes Mar 28, 2016

Member

The test seems to be there to make sure dangling indices are not imported if you have an alias with the same name as the index, but are imported once the alias is removed. I think that's a valid test. The problem is in the way the test was creating a dangling index by delete one when a node is offline. This doesn't work anymore indeed (hurrah!) . We need to fix the test to work harder (change uuids on the index before importing?)

This comment has been minimized.

Copy link
@abeyad

abeyad Mar 28, 2016

Author Contributor

@bleskes The IndicesClusterStateService#clusterChanged (which is where deleted indices are applied) gets called before GatewayMetaState#clusterChanged (which is where process dangling indices happens). Now that we have the tombstones specifying an exact name/uuid index to delete, those tombstones will get processed and indices on disk will get deleted before any attempts at dangling indices. Its not possible for a dangling index with the same name as an alias to be laying around at the import dangling index phase unless the tombstones got wiped out from the cluster state, because that same index would have had to be deleted before assigning an alias to another index with the same name. So the only way it seems possible to have a dangling index that conflicts with an alias name is the following scenario:

  1. Master M and data node D are in the cluster.
  2. An index named test is created.
  3. D goes offline.
  4. M deletes index test.
  5. M goes offline and has its data directory wiped out.
  6. M restarts with no indices, no tombstones, and a brand new cluster UUID.
  7. M creates index test2 with alias test.
  8. D starts back up, at which point it receives cluster state from master indicating there are no index tombstones (so nothing to delete), and when it tries to import the dangling indices, there is a naming conflict with the alias and it should fail. However, if we drop the alias, then we should be able to import the dangling index.

So perhaps thats the scenario I should test for?

This scenario also made me realize that we still have the same issue described in #16358 because if the CS gets wiped out and master no longer has the index tombstone for a shadow replica index, then the node that went offline will still try to import the dangling shadow replica index when it restarts, and encounter the same exception in trying to access the deleted shard data.

This comment has been minimized.

Copy link
@bleskes

bleskes Apr 6, 2016

Member

Can we revisit this test? as said before, I think it's a valid test. The problem is in how it generates dangling indices.

This comment has been minimized.

Copy link
@bleskes

bleskes Apr 6, 2016

Member

I see it was replaced with testDanglingIndicesWithAliasConflict , sorry for the noise

@abeyad

This comment has been minimized.

Copy link
Contributor Author

abeyad commented Mar 28, 2016

@bleskes @s1monw This PR is ready for the next round of review. I left two specific comments that need special attention, please, as I was unsure of the proper route to take:
https://github.com/elastic/elasticsearch/pull/17265/files#r57590545
https://github.com/elastic/elasticsearch/pull/17265/files#r57592054

@abeyad abeyad force-pushed the abeyad:feature/tombstone-deleted-indices branch Mar 29, 2016

private final List<Tombstone> tombstones;

private IndexGraveyard(final List<Tombstone> list) {
tombstones = Collections.unmodifiableList(list);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

I think that we want a null check here?

This comment has been minimized.

Copy link
@abeyad

abeyad Apr 25, 2016

Author Contributor

Its a private constructor, only called from the Builder, would an assert be more appropriate?

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Its a private constructor, only called from the Builder, would an assert be more appropriate?

An assert is fine.


@Override
public int hashCode() {
return Objects.hash(tombstones);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Sometimes the simplicity of calling Objects#hash is worth it, but in this case it's simple enough to just compute the hash code directly as tombstones.hashCode and avoid the array allocation?

This comment has been minimized.

Copy link
@abeyad

abeyad Apr 25, 2016

Author Contributor

Agreed, the class had more members which is why I used the Objects#hash, but now that it only has one member, it makes sense.

static {
TOMBSTONE_PARSER.declareObject(Tombstone.Builder::index, Index::parseIndex, new ParseField(INDEX_KEY));
TOMBSTONE_PARSER.declareLong(Tombstone.Builder::deleteDateInMillis, new ParseField(DELETE_DATE_IN_MILLIS_KEY));
TOMBSTONE_PARSER.declareString((b,s) -> {}, new ParseField(DELETE_DATE_KEY));

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Nit: (b,s) -> (b, s)

}

static BiFunction<XContentParser, Void, Tombstone> getParser() {
return (p,c) -> TOMBSTONE_PARSER.apply(p, c).build();

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Nit: (p,c) -> (p, c)

private static final String DELETE_DATE_IN_MILLIS_KEY = "delete_date_in_millis";
private static final String DELETE_DATE_KEY = "delete_date";
private static final ObjectParser<Tombstone.Builder, Void> TOMBSTONE_PARSER =
new ObjectParser<>("tombstoneEntry", Tombstone.Builder::new);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

I think I'd prefer it if the construction was moved to be alongside the initialization in the class initializer.

public static final IndexGraveyard PROTO = new IndexGraveyard(new ArrayList<>());
public static final String TYPE = "index-graveyard";
private static final ParseField TOMBSTONES_FIELD = new ParseField("tombstones");
private static final ObjectParser<List<Tombstone>, Void> GRAVEYARD_PARSER = new ObjectParser<>("index_graveyard", ArrayList::new);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

I think I'd prefer it if the construction was moved to be alongside the initialization in the class initializer.

return false;
}
@SuppressWarnings("unchecked")
Tombstone that = (Tombstone) other;

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Can we put this on the same line as the suppression?


public IndexGraveyard indexGraveyard() {
@SuppressWarnings("unchecked")
IndexGraveyard graveyard = (IndexGraveyard) getCustom(IndexGraveyard.TYPE);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Can we move this to be on the same line as the suppression?


private final String name;
private final String uuid;

public Index(String name, String uuid) {
Objects.requireNonNull(name);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

I think we can simplify the assignment of name to this.name = Objects.requireNonNull(name).intern();


private final String name;
private final String uuid;

public Index(String name, String uuid) {
Objects.requireNonNull(name);
Objects.requireNonNull(uuid);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

I think we can simplify the assignment of uuid to this.uuid = Objects.requireNonNull(uuid).intern();

}

private void deleteIndexStoreWithPredicate(final String reason, final Index index, final IndexSettings indexSettings,
final IndexDeletionPredicate predicate) throws IOException {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Can we use a clearer name than predicate?

@@ -1073,4 +1118,13 @@ public void onRemoval(RemovalNotification<IndicesRequestCache.Key, IndicesReques

}

@FunctionalInterface
interface IndexDeletionPredicate extends BiFunction<Index, IndexSettings, Boolean> {
Boolean apply(Index index, IndexSettings indexSettings);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Is the boxing necessary? Can we just not extend BiFunction and use a primitive boolean?

* Can only be invoked on a cluster where each node has been configured with shared data
* paths and the other necessary settings for shadow replicas.
*/
public final void createShadowReplicaIndex(final String name, final Path dataPath, final int numReplicas) {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Can we move this to where it's used rather than adding it to ESIntegTestCase?

This comment has been minimized.

Copy link
@abeyad

abeyad Apr 25, 2016

Author Contributor

The reason I put it there is because that's typically where one would go to see if such a function is already available to them and if not, potentially recreate one in their own test without knowing it exists in another IT test. I thought this was one of those "generally useful" functions that could be used elsewhere, but if its not, I can move it out.

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

If it becomes generally useful in a future PR, we can catch it on review. 😄

* Add a deleted index to the list of tombstones in the cluster state.
*/
public Builder addTombstone(final Index index) {
tombstones.add(new Tombstone(index, System.currentTimeMillis()));

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

I think it'd be cleaner if the Builder captured the current time once and just set all the tombstones for that builder to have the same time (rather than the tombstones for a single delete request that contains multiple indices having slightly different timestamps).

This comment has been minimized.

Copy link
@abeyad

abeyad Apr 25, 2016

Author Contributor

Great point.

@SuppressWarnings("unchecked")
final IndexGraveyard old = (IndexGraveyard) previous;
if (removedCount > old.tombstones.size()) {
throw new IllegalStateException("IndexGraveyardDiff cannot remove " + removedCount + " entries from " +

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

We have a pretty common format of wrapping variables in log messages in brackets. Can we do the same here?

stateIndices.add(iter.next().getIndex());
}
final int numDel;
switch (deletionQuantity) {

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

I'd prefer to see some braces on these case blocks.

case DELETE_ALL: numDel = stateIndices.size(); break;
case DELETE_NONE: numDel = 0; break;
case DELETE_RANDOM: numDel = randomIntBetween(0, Math.max(stateIndices.size() - 1, 0)); break;
default: throw new IllegalArgumentException("Unhandled mode " + deletionQuantity);

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Since this is a test, can we just assert false : deletionQuantity?

This comment has been minimized.

Copy link
@abeyad

abeyad Apr 25, 2016

Author Contributor

You could, but then numDel won't necessarily be fully initialized (b/c the assert may not be turned on), so the compiler complains, unless we add some dummy numDel value for the default case, which I don't see as ideal.

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Got it. Let's throw an AssertionError then?

@jasontedor

This comment has been minimized.

Copy link
Member

jasontedor commented Apr 25, 2016

@abeyad I think that the high-level concepts have been ironed out, but I left some feedback on coding details.

@abeyad abeyad force-pushed the abeyad:feature/tombstone-deleted-indices branch Apr 25, 2016

Ali Beyad

@abeyad abeyad force-pushed the abeyad:feature/tombstone-deleted-indices branch to 751f5a8 Apr 25, 2016

@@ -175,6 +177,7 @@ public IndexGraveyard readFrom(final StreamInput in) throws IOException {
final public static class Builder {
private List<Tombstone> tombstones;
private int numPurged = -1;
private long currentTime = System.currentTimeMillis();

This comment has been minimized.

Copy link
@jasontedor

jasontedor Apr 25, 2016

Member

Can it be final?

Ali Beyad
@jasontedor

This comment has been minimized.

Copy link
Member

jasontedor commented Apr 25, 2016

LGTM. Great work @abeyad.

@abeyad

This comment has been minimized.

Copy link
Contributor Author

abeyad commented Apr 25, 2016

@jasontedor thank you for all the valuable feedback!

@abeyad abeyad closed this in d39eb2d Apr 25, 2016

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented May 2, 2016

@abeyad looks like these settings still need to be documented?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.