[CI] DatafeedJobsRestIT multiple tests failing due to unfinished tasks #105239

alex-spies · 2024-02-07T12:54:23Z

All of these tests fail with an assertion error thrown in clearMlState (DatafeedJobsRestIT, line 1705):

testLookbackOnlyWithKeywordMultiField
testLookbackOnlyWithRuntimeFields
testLookbackOnlyGivenAggregationsWithDateHistogram
testLookbackOnlyWithTextMultiField
testLookbackOnlyGivenAggregationsWithHistogram
testForceDeleteWhileDatafeedIsRunning
testLookbackWithPipelineBucketAgg

In all cases, there are 2 pending tasks

    java.lang.AssertionError: 2 active tasks found:
    cluster:admin/xpack/ml/job/close    vdFO27UiR8-yA205mgAQaQ:25140 -                            transport  1707306145310 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
    cluster:admin/xpack/ml/job/close[n] vdFO27UiR8-yA205mgAQaQ:25144 vdFO27UiR8-yA205mgAQaQ:25140 transport  1707306145323 11:42:25 1.7m        127.0.0.1 javaRestTest-1

The tasks seem to be exactly the same in all the tests; they could be from a preceding test.

Build scan:
https://gradle-enterprise.elastic.co/s/iyd5l5a5elixk/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT/testLookbackOnlyWithKeywordMultiField

Reproduction line:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT.testLookbackOnlyWithKeywordMultiField" -Dtests.seed=8BE1A8358268DF09 -Dtests.locale=ar-SA -Dtests.timezone=Etc/GMT+10 -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
No

Failure history:
Failure dashboard for org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT#testLookbackOnlyWithKeywordMultiField

Failure excerpt:

java.lang.AssertionError: 2 active tasks found:
cluster:admin/xpack/ml/job/close    vdFO27UiR8-yA205mgAQaQ:25140 -                            transport  1707306145310 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
cluster:admin/xpack/ml/job/close[n] vdFO27UiR8-yA205mgAQaQ:25144 vdFO27UiR8-yA205mgAQaQ:25140 transport  1707306145323 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
 expected:<0> but was:<2>

  at __randomizedtesting.SeedInfo.seed([8BE1A8358268DF09:A3EF966D97D472]:0)
  at org.junit.Assert.fail(Assert.java:89)
  at org.junit.Assert.failNotEquals(Assert.java:835)
  at org.junit.Assert.assertEquals(Assert.java:647)
  at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForPendingTasks$5(ESRestTestCase.java:625)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1255)
  at org.elasticsearch.test.rest.ESRestTestCase.waitForPendingTasks(ESRestTestCase.java:596)
  at org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT.clearMlState(DatafeedJobsRestIT.java:1705)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2024-02-07T12:54:55Z

Pinging @elastic/ml-core (Team:ML)

danielmitterdorfer · 2024-02-12T10:24:30Z

Another instance of this test failure is here: https://gradle-enterprise.elastic.co/s/cohmjlmusxiwk

Relates elastic#105239

danielmitterdorfer · 2024-02-12T10:37:20Z

I've muted the entire suite in #105387.

Relates #105239

Relates elastic#105239

droberts195 · 2024-02-13T15:37:43Z

Another instance of this test failure is here: https://gradle-enterprise.elastic.co/s/cohmjlmusxiwk

For this particular failure, the first failed test is testInsufficientSearchPrivilegesOnPutWithJob. It's the tasks left over from this that cause all the subsequent failures.

The tasks that are left behind are "Close Job" tasks from the feature reset at the end of the test.

The logs from that test are:

[2024-02-12T06:05:18,268][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] before test	
[2024-02-12T06:05:18,271][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] initializing REST clients against [http://[::1]:33405, http://127.0.0.1:43365/, http://[::1]:44817, http://127.0.0.1:35533/, http://[::1]:33307, http://127.0.0.1:34623/]	
[2024-02-12T06:05:18,326][WARN ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] This test is running on the legacy test framework; historical features from production code will not be available. You need to port the test to the new test plugins in order to use historical features from production code. If this is a legacy feature used only in tests, you can add it to a test-only FeatureSpecification such as org.elasticsearch.test.rest.RestTestLegacyFeatures.	
[2024-02-12T06:05:53,485][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] There are still tasks running after this test that might break subsequent tests [cluster:admin/xpack/ml/job/close, cluster:admin/xpack/ml/job/close[n], health-node[c]].	
[2024-02-12T06:05:53,486][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] after test

The corresponding server-side logs are:

[2024-02-12T10:05:20,144][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Starting machine learning feature reset
[2024-02-12T10:05:20,163][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.security-7/wN4X6LmhQ5aRUiTplHSv6Q] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-annotations-000001/TZRK7zHRTh29VCkufU6rkw] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-notifications-000002/tlCYHYjtRbuULe3IyJHeLw] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-anomalies-shared/zEHUsV2VQPigXP1Bg6ePdA] deleting index
[2024-02-12T10:05:20,268][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-config/jkJoq62lQB-igtiqWsGzIQ] deleting index
[2024-02-12T10:05:20,306][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Finished machine learning feature reset
[2024-02-12T10:05:53,292][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Starting machine learning feature reset
[2024-02-12T10:05:53,371][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Finished machine learning feature reset

It's interesting that there are actually two feature resets there.

Of the 3 nodes in the native multi-node tests, the node with the outstanding tasks is iov4A-EPQ2-2Qwgemj_22w, which is javaRestTest-2. The coordinating node for the feature resets in testInsufficientSearchPrivilegesOnPutWithJob is javaRestTest-1.

One thing that's strange about testInsufficientSearchPrivilegesOnPutWithJob having a "Close Job" call hang is that the whole point of the test is that it doesn't successfully create a job. So there should be nothing to close. The likely explanation is that the job that hangs "Close Job" actually comes from the previous test and somehow wasn't cleaned up correctly after that test.

The job that's likely to blame is job-for-start-datafeed-timeout which comes from DatafeedJobsIT.testStartDatafeed_GivenTimeout_Returns408. That is a test from a different suite to the one that failed here. DatafeedJobsIT is using the transport Java classes and DatafeedJobsRestIT is using the REST API. The differing cleanup was the subject of #49582, and may be part of the problem here.

Another thing I've just noticed is that the suite that was muted was DatafeedJobsIT, and not DatafeedJobsRestIT which had the failures. Given that the suspicion is that a test from DatafeedJobsIT is the root cause of the failures here that might actually be fortuitous. It will be interesting to see if muting DatafeedJobsIT stops further failures in DatafeedJobsRestIT.

alex-spies added >test-failure Triaged test failures from CI Team:ML Meta label for the ML team :ml Machine learning labels Feb 7, 2024

elasticsearchmachine added the blocker label Feb 7, 2024

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Feb 12, 2024

Mute DatafeedJobsIT

76e3050

Relates elastic#105239

danielmitterdorfer mentioned this issue Feb 12, 2024

Mute DatafeedJobsIT #105387

Merged

elasticsearchmachine pushed a commit that referenced this issue Feb 12, 2024

Mute DatafeedJobsIT (#105387)

7f1e81d

Relates #105239

martijnvg pushed a commit to martijnvg/elasticsearch that referenced this issue Feb 12, 2024

Mute DatafeedJobsIT (elastic#105387)

e23e2f3

Relates elastic#105239

mark-vieira removed the blocker label Apr 24, 2024

elasticsearchmachine added the needs:risk Requires assignment of a risk label (low, medium, blocker) label Apr 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] DatafeedJobsRestIT multiple tests failing due to unfinished tasks #105239

[CI] DatafeedJobsRestIT multiple tests failing due to unfinished tasks #105239

alex-spies commented Feb 7, 2024

elasticsearchmachine commented Feb 7, 2024

danielmitterdorfer commented Feb 12, 2024

danielmitterdorfer commented Feb 12, 2024

droberts195 commented Feb 13, 2024

[CI] DatafeedJobsRestIT multiple tests failing due to unfinished tasks #105239

[CI] DatafeedJobsRestIT multiple tests failing due to unfinished tasks #105239

Comments

alex-spies commented Feb 7, 2024

elasticsearchmachine commented Feb 7, 2024

danielmitterdorfer commented Feb 12, 2024

danielmitterdorfer commented Feb 12, 2024

droberts195 commented Feb 13, 2024