[ML] Fix logic for moving .ml-state-write alias from legacy to new #69039

droberts195 · 2021-02-16T14:07:21Z

When multiple jobs start up together on a node following
an upgrade, each one of them will trigger a check that the
.ml-state* indices are as expected and the .ml-state-write
alias points to the correct index.

There were a couple of flaws in the logic:

1. We were not considering the possibility that one or more
   existing .ml-state* indices might be hidden.
2. If multiple jobs tried to create a .ml-state-000001 index
   simultaneously all but the first would fail. We accounted
   for this, but then did not follow up with the correct alias
   update request for those index creation requests that
   failed. This could cause all but one of the jobs starting
   up on the node to spuriously fail.

Both these problems are fixed by this PR.
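
The second flaw can be illustrated with a minimal Python sketch. It uses a hypothetical in-memory `FakeCluster` rather than the Elasticsearch client, and the names `create_index`, `point_write_alias`, `ensure_state_index` and `start_jobs` are assumptions made for illustration, not the code changed in this PR. The point it shows is that a caller which loses the index-creation race should treat "already exists" as acceptable and still issue the alias update.

```python
import threading


class IndexAlreadyExists(Exception):
    """Raised by the fake cluster when an index name is already taken."""


class FakeCluster:
    """Hypothetical in-memory stand-in for cluster state (not a real client)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.indices = set()
        self.aliases = {}  # alias name -> index name

    def create_index(self, name):
        with self._lock:
            if name in self.indices:
                raise IndexAlreadyExists(name)
            self.indices.add(name)

    def point_write_alias(self, alias, index):
        # Idempotent: repeating the same update leaves the alias unchanged.
        with self._lock:
            self.aliases[alias] = index


def ensure_state_index(cluster, index=".ml-state-000001", alias=".ml-state-write"):
    """Create the index if needed, then always follow up with the alias update."""
    try:
        cluster.create_index(index)
    except IndexAlreadyExists:
        # Another job won the race. That is fine, but we must not stop here:
        # the alias update below still has to happen for this caller too.
        pass
    cluster.point_write_alias(alias, index)
    return cluster.aliases[alias]


def start_jobs(cluster, n_jobs=5):
    """Simulate several jobs opening at once; return each job's outcome."""
    results = [None] * n_jobs

    def run(i):
        try:
            results[i] = ensure_state_index(cluster)
        except Exception as exc:  # a spurious failure would show up here
            results[i] = exc

    threads = [threading.Thread(target=run, args=(i,)) for i in range(n_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch every simulated job ends with the alias pointing at the same index, whichever job actually created it.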

Fixes #68925

Generally we use the lenientExpandOpen indices options when searching our internal indices. However, there are places where we are not searching but checking existence, and in these places we should not be ignoring unavailable indices. Issue elastic#68925 reveals the sorts of things that can go wrong when decisions that should be based on the existence of our internal indices are instead based on their current availability. Eventually the unavailable index will become available again, and this will cause problems if we created another index or alias on the assumption that it did not exist at all. Fixes elastic#68925

elasticmachine · 2021-02-16T14:07:24Z

Pinging @elastic/ml-core (Team:ML)

przemekwitek

LGTM

droberts195 added >bug :ml Machine learning v8.0.0 v7.12.0 v7.11.2 labels Feb 16, 2021

elasticmachine added the Team:ML (Meta label for the ML team) label Feb 16, 2021

przemekwitek self-requested a review February 16, 2021 14:19

przemekwitek approved these changes Feb 16, 2021

We get bitten by multiple meanings of "strict"

4e5d90f

As well as switching off the "ignore unavailable" option it also switches to throwing an exception when non-wildcard indices don't exist.

droberts195 changed the title ~~[ML] Use strict expansion for internal index existence checks~~ [ML] Include closed indices for internal index existence checks Feb 16, 2021

droberts195 added 2 commits February 16, 2021 15:59

Revert "We get bitten by multiple meanings of "strict""

f915109

This reverts commit 4e5d90f.

Fix approach

0b48f0b

The original approach was wrong. We _do_ want to ignore unavailable indices from the point of view of throwing exceptions, but we want to include closed indices.
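
A small Python sketch of the distinction drawn in this commit message. The `resolve` function and its `expand_closed`, `expand_hidden` and `ignore_unavailable` parameters are hypothetical, a simplified model loosely based on the idea of indices options, and not the Elasticsearch API. It shows an existence check that stays lenient about missing names (no exception) yet still sees closed and hidden indices.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Index:
    name: str
    closed: bool = False
    hidden: bool = False


def resolve(indices, pattern, *, expand_closed=False, expand_hidden=False,
            ignore_unavailable=True):
    """Return names matching `pattern` under the given (hypothetical) options."""
    is_wildcard = "*" in pattern
    matches = []
    for idx in indices:
        if not fnmatchcase(idx.name, pattern):
            continue
        if idx.closed and not expand_closed:
            continue
        if idx.hidden and not expand_hidden:
            continue
        matches.append(idx.name)
    # Only a concrete (non-wildcard) name can trigger an error, and only
    # when the caller asked not to ignore unavailable indices.
    if not matches and not is_wildcard and not ignore_unavailable:
        raise LookupError(f"no such index [{pattern}]")
    return sorted(matches)


def state_index_exists(indices):
    """Existence check: lenient about errors, but counts closed and hidden."""
    return bool(resolve(indices, ".ml-state*", expand_closed=True,
                        expand_hidden=True, ignore_unavailable=True))
```

With one closed legacy index and one hidden index, an open-only expansion in this model reports nothing, which is the situation in which a second index or alias could be created by mistake.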

williamrandolph added v7.13.0 and removed v7.12.0 labels Feb 18, 2021

droberts195 added the v7.12.0 label Feb 19, 2021

Merge branch 'master' into existence_checks_should_include_unavailable

2b4847a

droberts195 changed the title ~~[ML] Include closed indices for internal index existence checks~~ [ML] Fix logic for moving .ml-state-write alias from legacy to new Feb 19, 2021

Most important piece of the fix

01f93c7

droberts195 merged commit bc46fc0 into elastic:master Feb 19, 2021

droberts195 deleted the existence_checks_should_include_unavailable branch February 19, 2021 14:43

droberts195 mentioned this pull request Feb 19, 2021

[ML] Fix logic for moving .ml-state-write alias from legacy to new #69279

Merged

droberts195 mentioned this pull request Feb 19, 2021

[ML] Fix logic for moving .ml-state-write alias from legacy to new #69280

Merged

droberts195 mentioned this pull request Feb 19, 2021

[ML] Fix logic for moving .ml-state-write alias from legacy to new #69282

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
