
Bootstrap a new history_uuid when force allocating a stale primary #33432

Merged
merged 6 commits into elastic:master from new_history_uuid_stale_primary on Sep 8, 2018

Conversation

dnhatn
Member

@dnhatn dnhatn commented Sep 5, 2018

This commit ensures that we bootstrap a new history_uuid when force
allocating a stale primary. A stale primary should never be the source
of an operation-based recovery to another shard which exists before the
forced-allocation.

Closes #26712
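For context, a minimal Lucene-level sketch of what "bootstrapping a new history_uuid" means (illustrative only, not the PR's actual code): the shard keeps its segments, but the commit user-data entry stored under Engine.HISTORY_UUID_KEY ("history_uuid") is replaced with a fresh UUID, so the force-allocated copy can no longer be treated as sharing operation history with older copies.

import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BootstrapHistoryUuidSketch {

    // Stamps a fresh "history_uuid" into the latest Lucene commit's user data,
    // marking this copy's history as disjoint from any other existing copy.
    static void bootstrapNewHistory(Directory dir) throws IOException {
        // Carry over whatever user data the last commit already had...
        Map<String, String> userData = new HashMap<>(SegmentInfos.readLatestCommit(dir).getUserData());
        // ...but overwrite the history marker with a brand-new UUID.
        userData.put("history_uuid", UUID.randomUUID().toString());

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer()) // analyzer is irrelevant, nothing is indexed
            .setOpenMode(IndexWriterConfig.OpenMode.APPEND)                      // keep the existing segments
            .setCommitOnClose(false);
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            writer.setLiveCommitData(userData.entrySet());
            writer.commit();
        }
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            bootstrapNewHistory(dir);
        }
    }
}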

@dnhatn dnhatn added the >enhancement, :Distributed/Recovery, v7.0.0, and v6.5.0 labels Sep 5, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn added the review label Sep 5, 2018
Contributor

@bleskes bleskes left a comment


LGTM. Nice that we can do this now via store with little hassle.

@@ -107,25 +112,68 @@ public int hashCode() {
}

/**
* recovery from an existing on-disk store or a fresh copy
* Recovery from a refresh copy
Contributor


refresh -> fresh?


Set<String> newHistoryUUIds = Arrays.stream(client().admin().indices().prepareStats("test").clear().get().getShards())
.map(shard -> shard.getCommitStats().getUserData().get(Engine.HISTORY_UUID_KEY)).collect(Collectors.toSet());
assertThat(newHistoryUUIds, everyItem(not(isIn(historyUUIDs))));
Contributor


maybe assert that all new history uuids are the same, just for fun?
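A sketch of that extra assertion (hypothetical addition, assuming Hamcrest's hasSize matcher is statically imported like the other matchers in this test): since newHistoryUUIds is a Set, "all shards report the same new history uuid" is equivalent to the set containing exactly one element.

assertThat(newHistoryUUIds, hasSize(1)); // a single distinct value means every shard copy got the same new history UUID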

@dnhatn
Member Author

dnhatn commented Sep 8, 2018

Thanks @bleskes for reviewing.

@dnhatn dnhatn merged commit 94e4cb6 into elastic:master Sep 8, 2018
@dnhatn dnhatn deleted the new_history_uuid_stale_primary branch September 8, 2018 23:30
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 9, 2018
* master:
  CORE: Make Pattern Exclusion Work with Aliases (elastic#33518)
  Reverse logic for CCR license checks (elastic#33549)
  Add latch countdown on failure in CCR license tests (elastic#33548)
  HLRC: Add put stored script support to high-level rest client (elastic#31323)
  Create temporary directory if needed in CCR test
  Add license checks for auto-follow implementation (elastic#33496)
  Bootstrap a new history_uuid when force allocating a stale primary (elastic#33432)
  INGEST: Remove Outdated TODOs (elastic#33458)
  Logging: Clean up skipping test
  Logging: Skip test if it'd fail
  CRUD: AwaitsFix entire wait_for_refresh close test
  Painless: Add Imported Static Method (elastic#33440)
dnhatn added a commit that referenced this pull request Sep 10, 2018
…33432)

This commit ensures that we bootstrap a new history_uuid when force
allocating a stale primary. A stale primary should never be the source
of an operation-based recovery to another shard which exists before the
forced-allocation.

Closes #26712
dnhatn added a commit that referenced this pull request Sep 10, 2018
@ywelsch
Contributor

ywelsch commented Sep 10, 2018

Note that if the node crashes between shard initialization and shard recovery, the allocation id (both in the in-sync set and on the node's local shard copy) will be adjusted but the history UUID will not, which is unsafe. I would have preferred to put the history UUID into the index metadata (at a per-shard level, similar to the in-sync allocation ids), and then realign the shard's history UUID with the one in the index metadata, avoiding the above problem.

@bleskes
Contributor

bleskes commented Sep 10, 2018

@ywelsch that's a good insight. If we put the history uuid in the index metadata, I think we need to work out its exact semantics and how it correlates to the Lucene index version (or maybe we want to move to storing it in the state file together with the history uuid). I think we can also consider another alternative - only add the allocation id of the stale copy to the in-sync set once it has been recovered (potentially clearing the in-sync set when the command is run). I also wonder if we should do the same for recovery from snapshot (if I read the code correctly, I think we have the same problem).

@ywelsch
Contributor

ywelsch commented Sep 19, 2018

only add the allocation id of the stale copy to the in-sync set once it has been recovered (potentially clearing the in-sync set when the command is run)

I would be OK with activating it only after it has been recovered, but I would also like for the allocate stale primary command to immediately take effect (in terms of resetting the in-sync set). An empty in-sync set currently has the semantics of "allocate a fresh shard copy" (i.e. not an existing one), so that's not an option. We could just make it a singleton set with a random allocation id or keep the set as is as long as the newly allocated primary has not been activated.

@bleskes
Contributor

bleskes commented Sep 19, 2018

I would also like for the allocate stale primary command to immediately take effect (in terms of resetting the in-sync set).

Agreed.

We could just make it a singleton set with a random allocation id

I first thought this was a hack, but then I thought that maybe we should set it to [ "_forced_allocation" ] which will increase the transparency of what happened until we get back to "normal" (i.e., the primary was started)

@ywelsch
Contributor

ywelsch commented Sep 19, 2018

maybe we should set it to [ "_forced_allocation"

works for me.
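To make the transitions agreed on above concrete, here is a toy, stand-alone sketch (illustrative only; the marker value and method names are placeholders, not Elasticsearch internals): the command resets the in-sync set to a visible placeholder right away, and the placeholder is swapped for the real allocation id once the forced primary has started.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Toy model of the in-sync allocation-id set around a forced stale-primary allocation.
class InSyncSetSketch {

    // Hypothetical placeholder; keeps the set non-empty, since an empty set means "allocate a fresh copy".
    static final String FORCED_ALLOCATION_MARKER = "_forced_allocation";

    // Effect of the allocate-stale-primary command: reset the set immediately, but leave a visible marker.
    static Set<String> onAllocateStalePrimaryCommand() {
        return Collections.singleton(FORCED_ALLOCATION_MARKER);
    }

    // Once the forced primary has recovered and started, replace the marker with its real allocation id.
    static Set<String> onPrimaryStarted(Set<String> inSyncIds, String primaryAllocationId) {
        Set<String> updated = new HashSet<>(inSyncIds);
        updated.remove(FORCED_ALLOCATION_MARKER);
        updated.add(primaryAllocationId);
        return updated;
    }
}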

vladimirdolzhenko pushed a commit to vladimirdolzhenko/elasticsearch that referenced this pull request Sep 28, 2018
vladimirdolzhenko added a commit that referenced this pull request Nov 7, 2018
removes fake allocation id after recovery is done

Relates to #33432
vladimirdolzhenko added a commit that referenced this pull request Nov 7, 2018
removes fake allocation id after recovery is done

Relates to #33432

(cherry picked from commit f789d49)
pgomulka pushed a commit to pgomulka/elasticsearch that referenced this pull request Nov 13, 2018
…4140)

removes fake allocation id after recovery is done

Relates to elastic#33432