Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Fix logic for moving .ml-state-write alias from legacy to new #69039

Conversation

droberts195
Copy link
Contributor

@droberts195 droberts195 commented Feb 16, 2021

When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

  1. We were not considering the possibility that one or more
    existing .ml-state* indices might be hidden.
  2. If multiple jobs tried to create a .ml-state-000001 index
    simultaneously all but the first would fail. We accounted
    for this, but then did not follow up with the correct alias
    update request for those index creation requests that
    failed. This could cause all but one of the jobs starting
    up on the node to spuriously fail.

Both these problems are fixed by this PR.

Fixes #68925

Generally we use the lenientExpandOpen indices options
when searching our internal indices. However, there are
places where we are not searching but checking existence,
and in these places we should not be ignoring unavailable
indices. Issue elastic#68925 reveals the sorts of things that can
go wrong when decisions are made that should be based on
existance of our internal indices but are instead based on
current availability. Eventually the unavailable index will
become available again, and this will cause problems if we
created another index or alias on the assumption it did not
exist at all.

Fixes elastic#68925
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Feb 16, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@przemekwitek przemekwitek self-requested a review February 16, 2021 14:19
Copy link
Contributor

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

As well as switching off the "ignore unavailable" option
it also switches to throwing an exception when non-wildcard
indices don't exist.
@droberts195 droberts195 changed the title [ML] Use strict expansion for internal index existence checks [ML] Include closed indices for internal index existence checks Feb 16, 2021
The original approach was wrong.

We _do_ want to ignore unavailable indices from the point
of view of throwing exceptions, but we want to include closed
indices.
@droberts195 droberts195 changed the title [ML] Include closed indices for internal index existence checks [ML] Fix logic for moving .ml-state-write alias from legacy to new Feb 19, 2021
@droberts195 droberts195 merged commit bc46fc0 into elastic:master Feb 19, 2021
@droberts195 droberts195 deleted the existence_checks_should_include_unavailable branch February 19, 2021 14:43
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Feb 19, 2021
When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more
   existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index
   simultaneously all but the first would fail.  We accounted
   for this, but then did not follow up with the correct alias
   update request for those index creation requests that
   failed.  This could cause all but one of the jobs starting
   up on the node to spuriously fail.

Both these problems are fixed by this PR.

Backport of elastic#69039
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Feb 19, 2021
When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more
   existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index
   simultaneously all but the first would fail.  We accounted
   for this, but then did not follow up with the correct alias
   update request for those index creation requests that
   failed.  This could cause all but one of the jobs starting
   up on the node to spuriously fail.

Both these problems are fixed by this PR.

Backport of elastic#69039
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Feb 19, 2021
When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more
   existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index
   simultaneously all but the first would fail.  We accounted
   for this, but then did not follow up with the correct alias
   update request for those index creation requests that
   failed.  This could cause all but one of the jobs starting
   up on the node to spuriously fail.

Both these problems are fixed by this PR.

Backport of elastic#69039
droberts195 added a commit that referenced this pull request Feb 19, 2021
…69279)

When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more
   existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index
   simultaneously all but the first would fail.  We accounted
   for this, but then did not follow up with the correct alias
   update request for those index creation requests that
   failed.  This could cause all but one of the jobs starting
   up on the node to spuriously fail.

Both these problems are fixed by this PR.

Backport of #69039
droberts195 added a commit that referenced this pull request Feb 19, 2021
…69282)

When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more
   existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index
   simultaneously all but the first would fail.  We accounted
   for this, but then did not follow up with the correct alias
   update request for those index creation requests that
   failed.  This could cause all but one of the jobs starting
   up on the node to spuriously fail.

Both these problems are fixed by this PR.

Backport of #69039
droberts195 added a commit that referenced this pull request Feb 19, 2021
…69280)

When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more
   existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index
   simultaneously all but the first would fail.  We accounted
   for this, but then did not follow up with the correct alias
   update request for those index creation requests that
   failed.  This could cause all but one of the jobs starting
   up on the node to spuriously fail.

Both these problems are fixed by this PR.

Backport of #69039
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] Possibility to end up with .ml-state-write alias on multiple indices
5 participants