Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] DatafeedJobsRestIT multiple tests failing due to unfinished tasks #105239

Open
alex-spies opened this issue Feb 7, 2024 · 4 comments
Open
Labels
:ml Machine learning needs:risk Requires assignment of a risk label (low, medium, blocker) Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@alex-spies
Copy link
Contributor

All of these tests fail with an assertion error thrown in clearMlState (DatafeedJobsRestIT, line 1705):

  • testLookbackOnlyWithKeywordMultiField
  • testLookbackOnlyWithRuntimeFields
  • testLookbackOnlyGivenAggregationsWithDateHistogram
  • testLookbackOnlyWithTextMultiField
  • testLookbackOnlyGivenAggregationsWithHistogram
  • testForceDeleteWhileDatafeedIsRunning
  • testLookbackWithPipelineBucketAgg

In all cases, there are 2 pending tasks

    java.lang.AssertionError: 2 active tasks found:
    cluster:admin/xpack/ml/job/close    vdFO27UiR8-yA205mgAQaQ:25140 -                            transport  1707306145310 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
    cluster:admin/xpack/ml/job/close[n] vdFO27UiR8-yA205mgAQaQ:25144 vdFO27UiR8-yA205mgAQaQ:25140 transport  1707306145323 11:42:25 1.7m        127.0.0.1 javaRestTest-1 

The tasks seem to be exactly the same in all the tests; they could be from a preceding test.

Build scan:
https://gradle-enterprise.elastic.co/s/iyd5l5a5elixk/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT/testLookbackOnlyWithKeywordMultiField

Reproduction line:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT.testLookbackOnlyWithKeywordMultiField" -Dtests.seed=8BE1A8358268DF09 -Dtests.locale=ar-SA -Dtests.timezone=Etc/GMT+10 -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
No

Failure history:
Failure dashboard for org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT#testLookbackOnlyWithKeywordMultiField

Failure excerpt:

java.lang.AssertionError: 2 active tasks found:
cluster:admin/xpack/ml/job/close    vdFO27UiR8-yA205mgAQaQ:25140 -                            transport  1707306145310 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
cluster:admin/xpack/ml/job/close[n] vdFO27UiR8-yA205mgAQaQ:25144 vdFO27UiR8-yA205mgAQaQ:25140 transport  1707306145323 11:42:25 1.7m        127.0.0.1 javaRestTest-1 
 expected:<0> but was:<2>

  at __randomizedtesting.SeedInfo.seed([8BE1A8358268DF09:A3EF966D97D472]:0)
  at org.junit.Assert.fail(Assert.java:89)
  at org.junit.Assert.failNotEquals(Assert.java:835)
  at org.junit.Assert.assertEquals(Assert.java:647)
  at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForPendingTasks$5(ESRestTestCase.java:625)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1255)
  at org.elasticsearch.test.rest.ESRestTestCase.waitForPendingTasks(ESRestTestCase.java:596)
  at org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT.clearMlState(DatafeedJobsRestIT.java:1705)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

@alex-spies alex-spies added >test-failure Triaged test failures from CI Team:ML Meta label for the ML team :ml Machine learning labels Feb 7, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@danielmitterdorfer
Copy link
Member

Another instance of this test failure is here: https://gradle-enterprise.elastic.co/s/cohmjlmusxiwk

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Feb 12, 2024
@danielmitterdorfer
Copy link
Member

I've muted the entire suite in #105387.

elasticsearchmachine pushed a commit that referenced this issue Feb 12, 2024
martijnvg pushed a commit to martijnvg/elasticsearch that referenced this issue Feb 12, 2024
@droberts195
Copy link
Contributor

Another instance of this test failure is here: https://gradle-enterprise.elastic.co/s/cohmjlmusxiwk

For this particular failure, the first failed test is testInsufficientSearchPrivilegesOnPutWithJob. It's the tasks left over from this that cause all the subsequent failures.

The tasks that are left behind are "Close Job" tasks from the feature reset at the end of the test.

The logs from that test are:

[2024-02-12T06:05:18,268][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] before test	
[2024-02-12T06:05:18,271][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] initializing REST clients against [http://[::1]:33405, http://127.0.0.1:43365/, http://[::1]:44817, http://127.0.0.1:35533/, http://[::1]:33307, http://127.0.0.1:34623/]	
[2024-02-12T06:05:18,326][WARN ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] This test is running on the legacy test framework; historical features from production code will not be available. You need to port the test to the new test plugins in order to use historical features from production code. If this is a legacy feature used only in tests, you can add it to a test-only FeatureSpecification such as org.elasticsearch.test.rest.RestTestLegacyFeatures.	
[2024-02-12T06:05:53,485][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] There are still tasks running after this test that might break subsequent tests [cluster:admin/xpack/ml/job/close, cluster:admin/xpack/ml/job/close[n], health-node[c]].	
[2024-02-12T06:05:53,486][INFO ][o.e.x.m.i.DatafeedJobsRestIT] [testInsufficientSearchPrivilegesOnPutWithJob] after test

The corresponding server-side logs are:

[2024-02-12T10:05:20,144][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Starting machine learning feature reset
[2024-02-12T10:05:20,163][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.security-7/wN4X6LmhQ5aRUiTplHSv6Q] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-annotations-000001/TZRK7zHRTh29VCkufU6rkw] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-notifications-000002/tlCYHYjtRbuULe3IyJHeLw] deleting index
[2024-02-12T10:05:20,236][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-anomalies-shared/zEHUsV2VQPigXP1Bg6ePdA] deleting index
[2024-02-12T10:05:20,268][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-1] [.ml-config/jkJoq62lQB-igtiqWsGzIQ] deleting index
[2024-02-12T10:05:20,306][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Finished machine learning feature reset
[2024-02-12T10:05:53,292][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Starting machine learning feature reset
[2024-02-12T10:05:53,371][INFO ][o.e.x.m.MachineLearning  ] [javaRestTest-1] Finished machine learning feature reset

It's interesting that there are actually two feature resets there.

Of the 3 nodes in the native multi-node tests, the node with the outstanding tasks is iov4A-EPQ2-2Qwgemj_22w, which is javaRestTest-2. The coordinating node for the feature resets in testInsufficientSearchPrivilegesOnPutWithJob is javaRestTest-1.

One thing that's strange about testInsufficientSearchPrivilegesOnPutWithJob having a "Close Job" call hang is that the whole point of the test is that it doesn't successfully create a job. So there should be nothing to close. The likely explanation is that the job that hangs "Close Job" actually comes from the previous test and somehow wasn't cleaned up correctly after that test.

The job that's likely to blame is job-for-start-datafeed-timeout which comes from DatafeedJobsIT.testStartDatafeed_GivenTimeout_Returns408. That is a test from a different suite to the one that failed here. DatafeedJobsIT is using the transport Java classes and DatafeedJobsRestIT is using the REST API. The differing cleanup was the subject of #49582, and may be part of the problem here.

Another thing I've just noticed is that the suite that was muted was DatafeedJobsIT, and not DatafeedJobsRestIT which had the failures. Given that the suspicion is that a test from DatafeedJobsIT is the root cause of the failures here that might actually be fortuitous. It will be interesting to see if muting DatafeedJobsIT stops further failures in DatafeedJobsRestIT.

@elasticsearchmachine elasticsearchmachine added the needs:risk Requires assignment of a risk label (low, medium, blocker) label Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning needs:risk Requires assignment of a risk label (low, medium, blocker) Team:ML Meta label for the ML team >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

5 participants