Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] UpgradeClusterClientYamlTestSuiteIT test {p0=upgraded_cluster/30_ml_jobs_crud/Test open old jobs} failing #79636

Closed
droberts195 opened this issue Oct 21, 2021 · 5 comments · Fixed by #79731
Assignees
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@droberts195
Copy link
Contributor

In this failure the .ml-state-write alias did not exist after upgrading a 7.3.2 cluster to 7.16.0. This is a bit worrying - have we done anything recently that would mean we don't reliably set up the .ml-state-write alias when upgrading from a version that didn't have it?

Build scan:
https://gradle-enterprise.elastic.co/s/2uxstqtvl3e2c/tests/:x-pack:qa:rolling-upgrade:v7.3.2%23upgradedClusterTest/org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT/test%20%7Bp0=upgraded_cluster%2F30_ml_jobs_crud%2FTest%20open%20old%20jobs%7D

Reproduction line:
./gradlew ':x-pack:qa:rolling-upgrade:v7.3.2#upgradedClusterTest' -Dtests.class="org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT" -Dtests.method="test {p0=upgraded_cluster/30_ml_jobs_crud/Test open old jobs}" -Dtests.seed=BD58859DCE7F0251 -Dtests.bwc=true -Dtests.locale=fi -Dtests.timezone=America/Guatemala -Druntime.java=8

Applicable branches:
7.16

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT&tests.test=test%20%7Bp0%3Dupgraded_cluster/30_ml_jobs_crud/Test%20open%20old%20jobs%7D

Failure excerpt:

java.lang.AssertionError: Failure at [upgraded_cluster/30_ml_jobs_crud:110]: field [] doesn't have a true value
Expected: not a string equal to "false" ignoring case
     but: was "false"

  at __randomizedtesting.SeedInfo.seed([BD58859DCE7F0251:350CBA4760836FA9]:0)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:462)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:439)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at java.lang.Thread.run(Thread.java:748)

  Caused by: java.lang.AssertionError: field [] doesn't have a true value
  Expected: not a string equal to "false" ignoring case
       but: was "false"

    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
    at org.junit.Assert.assertThat(Assert.java:956)
    at org.elasticsearch.test.rest.yaml.section.IsTrueAssertion.doAssert(IsTrueAssertion.java:47)
    at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:65)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:455)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:439)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at java.lang.Thread.run(Thread.java:748)

@droberts195 droberts195 added :ml Machine learning >test-failure Triaged test failures from CI labels Oct 21, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Oct 21, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Copy link
Contributor Author

The server side logs suggest it did exist:

bash-3.2$ grep ml-state-write v7.3.2-?/logs/v7.3.2.log
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:25:06,894][DEBUG][o.e.x.m.j.p.JobResultsProvider] [v7.3.2-0] Found 0 hits for [[.ml-state, .ml-state-write]]
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:25:07,681][TRACE][o.e.x.m.j.p.JobResultsPersister] [v7.3.2-0] [ml-snapshots-upgrade-job] ES API CALL: to index .ml-state-write with ID [ml-snapshots-upgrade-job_quantiles]
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:25:10,818][TRACE][o.e.x.m.j.p.JobResultsPersister] [v7.3.2-0] [ml-snapshots-upgrade-job] ES API CALL: to index .ml-state-write with ID [ml-snapshots-upgrade-job_quantiles]
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:25:42,669][INFO ][o.e.x.c.m.u.MlIndexAndAlias] [v7.3.2-0] About to create first concrete index [.ml-state-000001] with alias [.ml-state-write]
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:25:42,739][INFO ][o.e.x.c.m.u.MlIndexAndAlias] [v7.3.2-0] About to move write alias [.ml-state-write] from index [.ml-state] to index [.ml-state-000001]

I think the problem is possibly that all the shards of the .ml-state-000001 index were stale at the time the alias check was done:

bash-3.2$ grep ml-state-000001 v7.3.2-?/logs/v7.3.2.log
v7.3.2-1/logs/v7.3.2.log:[2021-10-21T07:25:42,677][INFO ][o.e.c.m.MetaDataCreateIndexService] [v7.3.2-1] [.ml-state-000001] creating index, cause [api], templates [.ml-state], shards [1]/[1], mappings [_doc]
v7.3.2-1/logs/v7.3.2.log:[2021-10-21T07:25:42,773][INFO ][o.e.c.r.a.AllocationService] [v7.3.2-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-state-000001][0]] ...]).
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:25:42,669][INFO ][o.e.x.c.m.u.MlIndexAndAlias] [v7.3.2-0] About to create first concrete index [.ml-state-000001] with alias [.ml-state-write]
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:25:42,739][INFO ][o.e.x.c.m.u.MlIndexAndAlias] [v7.3.2-0] About to move write alias [.ml-state-write] from index [.ml-state] to index [.ml-state-000001]
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:26:58,085][INFO ][o.e.i.s.IndexShard       ] [v7.3.2-0] [.ml-state-000001][0] primary-replica resync completed with 0 operations
v7.3.2-0/logs/v7.3.2.log:[2021-10-21T07:27:22,070][WARN ][o.e.c.r.a.AllocationService] [v7.3.2-0] [.ml-state-000001][0] marking unavailable shards as stale: [xg16nId1Qy6zX2mgLrTXqw]

The test that asserts the alias exists opens two jobs before that. It seems that we are not waiting for .ml-state-000001 to be available before opening jobs.

@droberts195
Copy link
Contributor Author

It should have been checked by this:

// Make sure the state index and alias exist
ActionListener<Boolean> resultsMappingUpdateHandler = ActionListener.wrap(
ack -> AnomalyDetectorsIndex.createStateIndexAndAliasIfNecessary(client, clusterState, expressionResolver, masterNodeTimeout,
stateAliasHandler),
e -> closeHandler.accept(e, true)
);

@benwtrent benwtrent self-assigned this Oct 21, 2021
@benwtrent
Copy link
Member

There was a giant refactor on how Index metadata is accessed (changing how aliases and indices are gathered internally). #79080

This is something to keep in mind if other things start failing. But since this change is very encompassing, I would expect something other than ML to get a failure :/.

Still digging.

@benwtrent
Copy link
Member

It seems that we are not waiting for .ml-state-000001 to be available before opening jobs.

We definitely are not. We specifically wait for yellow for the new annotations index, but do nothing of the sort for .ml-state-000001.

I think if we were to wait for something for .ml-state-000001 green is the only thing that makes sense. yellow does nothing really as the job wouldn't need to read a previously stored state (its a brand new index), but it may need to write new state to the index (which could fail if its newly created and the shards are all stale).

benwtrent added a commit that referenced this issue Oct 26, 2021
In tests and actual usage, it is possible that one job creates the .ml-state-write and another starts immediately afterwards, sees that the index is created, and moves on. But, what this means, is that the second job could blast past the check and the job starts/stops/etc. all with the .ml-state-write alias pointing to an index that is not even readable.

This commit waits for the index to be yellow before continuing opening the job.

closes: #79636
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Oct 26, 2021
In tests and actual usage, it is possible that one job creates the .ml-state-write and another starts immediately afterwards, sees that the index is created, and moves on. But, what this means, is that the second job could blast past the check and the job starts/stops/etc. all with the .ml-state-write alias pointing to an index that is not even readable.

This commit waits for the index to be yellow before continuing opening the job.

closes: elastic#79636
elasticsearchmachine pushed a commit that referenced this issue Oct 26, 2021
In tests and actual usage, it is possible that one job creates the .ml-state-write and another starts immediately afterwards, sees that the index is created, and moves on. But, what this means, is that the second job could blast past the check and the job starts/stops/etc. all with the .ml-state-write alias pointing to an index that is not even readable.

This commit waits for the index to be yellow before continuing opening the job.

closes: #79636
lockewritesdocs pushed a commit to lockewritesdocs/elasticsearch that referenced this issue Oct 28, 2021
In tests and actual usage, it is possible that one job creates the .ml-state-write and another starts immediately afterwards, sees that the index is created, and moves on. But, what this means, is that the second job could blast past the check and the job starts/stops/etc. all with the .ml-state-write alias pointing to an index that is not even readable.

This commit waits for the index to be yellow before continuing opening the job.

closes: elastic#79636
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants