[CI] XPackRestIT test {p0=ml/jobs_crud/Test reopen job resets the finished time} failing #86877

Closed
droberts195 opened this issue May 18, 2022 · 13 comments · Fixed by elastic/ml-cpp#2272
Assignees
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

The problem is not related to any particular test. The autodetect process couldn't run due to permission denied starting a thread. The same thing happened every time it was run in this test suite:

[2022-05-18T07:11:32,959][ERROR][o.e.x.m.p.AbstractNativeProcess] [yamlRestTest-0] [jobs-crud-reset-finished-time] autodetect/303389 process stopped unexpectedly: Cannot create thread: Permission denied
Error joining thread: No such process
Fatal error: 'terminate called after throwing an instance of 'std::system_error'', version: 8.2.1-SNAPSHOT (build f3adac2acbf65c)
Fatal error: '  what():  Permission denied', version: 8.2.1-SNAPSHOT (build f3adac2acbf65c)
Fatal error: 'si_signo 11, si_code: 128, si_errno: 0, address: 0x7fbe8b800898, library: /lib/x86_64-linux-gnu/libc.so.6, base: 0x7fbe8b7d8000, normalized address: 0x28898', version: 8.2.1-SNAPSHOT (build f3adac2acbf65c)

Build scan:
https://gradle-enterprise.elastic.co/s/36sbiahfwjjqc/tests/:x-pack:plugin:yamlRestTest/org.elasticsearch.xpack.test.rest.XPackRestIT/test%20%7Bp0=ml%2Fjobs_crud%2FTest%20reopen%20job%20resets%20the%20finished%20time%7D

Reproduction line:
./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/jobs_crud/Test reopen job resets the finished time}" -Dtests.seed=37660F157587F87B -Dtests.locale=da -Dtests.timezone=Asia/Thimphu -Druntime.java=18

Applicable branches:
8.2

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.test.rest.XPackRestIT&tests.test=test%20%7Bp0%3Dml/jobs_crud/Test%20reopen%20job%20resets%20the%20finished%20time%7D

Failure excerpt:

java.lang.AssertionError: Failure at [ml/jobs_crud:1632]: expected [2xx] status code but api [ml.close_job] returned [409 Conflict] [{"error":{"root_cause":[{"type":"status_exception","reason":"cannot close job [jobs-crud-reset-finished-time] because it failed, use force close","stack_trace":"org.elasticsearch.ElasticsearchStatusException: cannot close job [jobs-crud-reset-finished-time] because it failed, use force close\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:81)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.validate(TransportCloseJobAction.java:266)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.lambda$doExecute$6(TransportCloseJobAction.java:157)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.xpack.ml.job.persistence.JobConfigProvider.lambda$expandJobsIds$7(JobConfigProvider.java:523)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.client.internal.node.NodeClient$ActionResponseTaskListener.onResponse(NodeClient.java:175)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:165)\n\tat org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:245)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onResponse(ActionListener.java:367)\n\tat 
org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:151)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:105)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:275)\n\tat org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:109)\n\tat org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:118)\n\tat org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:93)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}],"type":"status_exception","reason":"cannot close job [jobs-crud-reset-finished-time] because it failed, use force close","stack_trace":"org.elasticsearch.ElasticsearchStatusException: cannot close job [jobs-crud-reset-finished-time] because it failed, use force close\n\tat 
org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:81)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.validate(TransportCloseJobAction.java:266)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.lambda$doExecute$6(TransportCloseJobAction.java:157)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.xpack.ml.job.persistence.JobConfigProvider.lambda$expandJobsIds$7(JobConfigProvider.java:523)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.client.internal.node.NodeClient$ActionResponseTaskListener.onResponse(NodeClient.java:175)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:165)\n\tat org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:245)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onResponse(ActionListener.java:367)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat 
org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:151)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:105)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:275)\n\tat org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:109)\n\tat org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:118)\n\tat org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:93)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"},"status":409}]

  at __randomizedtesting.SeedInfo.seed([37660F157587F87B:BF3230CFDB7B9583]:0)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:503)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:472)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
  at java.lang.reflect.Method.invoke(Method.java:577)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)

  Caused by: java.lang.AssertionError: expected [2xx] status code but api [ml.close_job] returned [409 Conflict] [{"error":{"root_cause":[{"type":"status_exception","reason":"cannot close job [jobs-crud-reset-finished-time] because it failed, use force close","stack_trace":"org.elasticsearch.ElasticsearchStatusException: cannot close job [jobs-crud-reset-finished-time] because it failed, use force close\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:81)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.validate(TransportCloseJobAction.java:266)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.lambda$doExecute$6(TransportCloseJobAction.java:157)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.xpack.ml.job.persistence.JobConfigProvider.lambda$expandJobsIds$7(JobConfigProvider.java:523)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.client.internal.node.NodeClient$ActionResponseTaskListener.onResponse(NodeClient.java:175)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:165)\n\tat org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:245)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onResponse(ActionListener.java:367)\n\tat 
org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:151)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:105)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:275)\n\tat org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:109)\n\tat org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:118)\n\tat org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:93)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}],"type":"status_exception","reason":"cannot close job [jobs-crud-reset-finished-time] because it failed, use force close","stack_trace":"org.elasticsearch.ElasticsearchStatusException: cannot close job [jobs-crud-reset-finished-time] because it failed, use force close\n\tat 
org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:81)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.validate(TransportCloseJobAction.java:266)\n\tat org.elasticsearch.xpack.ml.action.TransportCloseJobAction.lambda$doExecute$6(TransportCloseJobAction.java:157)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.xpack.ml.job.persistence.JobConfigProvider.lambda$expandJobsIds$7(JobConfigProvider.java:523)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.client.internal.node.NodeClient$ActionResponseTaskListener.onResponse(NodeClient.java:175)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)\n\tat org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:165)\n\tat org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:245)\n\tat org.elasticsearch.action.ActionListener$RunAfterActionListener.onResponse(ActionListener.java:367)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat 
org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:151)\n\tat org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:105)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:471)\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:465)\n\tat org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:275)\n\tat org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:109)\n\tat org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:118)\n\tat org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:93)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"},"status":409}]

    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:373)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:492)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:472)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.lang.reflect.Method.invoke(Method.java:577)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
    at java.lang.Thread.run(Thread.java:833)

@droberts195 droberts195 added :ml Machine learning >test-failure Triaged test failures from CI labels May 18, 2022
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label May 18, 2022
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor Author

This happened on an Ubuntu 22.04 worker. It almost certainly means the system call filter in the ML native processes needs adjusting for a new kernel version.

/cc @bytebilly please don't add Ubuntu 22.04 to the Elasticsearch support matrix until this issue is fixed. It seems that until this is fixed ML is completely broken on this distribution. I will aim for 7.17.5/8.2.2/8.3.0.

@droberts195 droberts195 self-assigned this May 18, 2022
droberts195 added a commit to droberts195/ml-cpp that referenced this issue May 18, 2022
Ubuntu 22.04 uses glibc 2.35 and the implementation of
pthread_create has been changed to use more system calls
than it used to.

In order for pthread_create to work on glibc 2.35 we need
to allow the rt_sigprocmask, rseq and clone3 system calls.

Fixes elastic/elasticsearch#86877
droberts195 added a commit to elastic/ml-cpp that referenced this issue May 18, 2022
@mark-vieira
Contributor

@droberts195 are we now good to add Ubuntu 22.04 to the general rotation as well as the testing matrix for 7.17?

@droberts195
Contributor Author

The ML native processes will now work on Ubuntu 22.04 starting with 8.3.0, 8.2.2 and 7.17.5. But they'll never work for older versions. This is going to be problematic with the BWC tests. Any build that runs the X-Pack BWC tests against versions older than 8.3.0/8.2.2/7.17.5 on Ubuntu 22.04 is going to fail, and since we can't re-release those old versions that's going to be a problem forever.

Therefore we should probably do two things:

  1. Disable all the ML BWC tests if we detect the old version is before 8.3.0/8.2.2/7.17.5 and the glibc version (which can be got from ldconfig --version) is 2.35 or above
  2. Don't use Ubuntu 22.04 for PR builds, because otherwise ML BWC breakages will creep through into the periodic builds
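The skip condition in item 1 can be sketched as a small predicate. This is only an illustration under stated assumptions: the class and method names (`MlBwcSkipRule`, `lacksFix`, `shouldSkip`) are hypothetical and do not come from the Elasticsearch code base, and the version cut-offs are the ones named in this thread (7.17.5, 8.2.2, 8.3.0 and glibc 2.35).

```java
// Hypothetical helper, not part of Elasticsearch: decides whether ML BWC tests
// should be skipped for a given old cluster version and host glibc version.
final class MlBwcSkipRule {

    /** Parses "major.minor.patch", ignoring a suffix such as "-SNAPSHOT". */
    static int[] parse(String version) {
        String[] parts = version.split("-")[0].split("\\.");
        return new int[] {
            Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2])
        };
    }

    /** Compares two three-part versions numerically, component by component. */
    static int compare(int[] a, int[] b) {
        for (int i = 0; i < 3; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }

    /** True if the old version predates the fix on its release branch. */
    public static boolean lacksFix(String oldVersion) {
        int[] v = parse(oldVersion);
        if (v[0] < 8) {
            // 7.x branch: fixed in 7.17.5
            return compare(v, new int[] {7, 17, 5}) < 0;
        }
        // 8.x branch: fixed in 8.2.2, so 8.3.0 and later are covered too
        return compare(v, new int[] {8, 2, 2}) < 0;
    }

    /** True if the host glibc is 2.35 or newer and the old version lacks the fix. */
    public static boolean shouldSkip(String oldVersion, int glibcMajor, int glibcMinor) {
        boolean newGlibc = glibcMajor > 2 || (glibcMajor == 2 && glibcMinor >= 35);
        return newGlibc && lacksFix(oldVersion);
    }
}
```

With this sketch, `shouldSkip("7.17.4", 2, 35)` is true, while `shouldSkip("8.2.2", 2, 35)` and `shouldSkip("7.17.4", 2, 31)` are false.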

It's interesting that this has come about because our system call filtering (which was added to improve security/reduce attack surface in the event of a breach) has also defeated the Linux developers' BWC efforts. You'd expect a recent version of Linux to run all the software that older versions from the previous few years could run, and usually this would be the case with Ubuntu 22.04 and Ubuntu 20.04, but our system call filter prevents it. If we keep the system call filter then this is going to happen again in the future.

@bytebilly
Contributor

@droberts195 are we now ok to add Ubuntu 22.04 to the support matrix of supported operating systems for 8.3/7.17?

@droberts195
Contributor Author

are we now ok to add Ubuntu 22.04 to the support matrix of supported operating systems for 8.3/7.17?

8.3 is fine. 7.17 needs to specifically say 7.17.5 and above. 7.17.0-7.17.4 will never work.

@bytebilly
Contributor

The matrix doesn't have this granularity, so I added a footnote to mention that.

@mark-vieira
Contributor

  • Disable all the ML BWC tests if we detect the old version is before 8.3.0/8.2.2/7.17.5 and the glibc version (which can be got from ldconfig --version) is 2.35 or above

What's the best way to do this? Can we do this in the tests themselves with assertions?

  • Don't use Ubuntu 22.04 for PR builds, because otherwise ML BWC breakages will creep through into the periodic
    builds

This actually isn't a problem for PR builds since we only test snapshot versions there and those will include the fix. Only the periodic BWC builds are an issue.

@droberts195
Contributor Author

What's the best way to do this? Can we do this in the tests themselves with assertions?

Most of the BWC tests are YAML tests.

I think the best way to skip those ones would be to conditionally add an entry to tests.rest.blacklist that is */*_ml_*/* if the glibc version is 2.34 or above and the old version being upgraded from is < 7.17.5 or >= 8.0.0 and < 8.2.2.

So to do that we'd somehow need to get Gradle to know the glibc version. It can be done by running ldd --version | grep '^ldd' | sed 's/.* \([1-9]\.[0-9]*\).*/\1/' on Linux. Or obviously if it's easier just the ldd --version can be run as an external command and the text processing can be done in the Gradle script.
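If the text processing is done in code rather than with grep and sed, it could look roughly like the following. This is a minimal sketch, not the actual build logic: `GlibcVersionParser` is a hypothetical name, and the regular expression simply mirrors the sed expression above by taking the last space-preceded `major.minor` number on the line that starts with `ldd`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the glibc version from the text printed by
// `ldd --version`. Running the command itself is left to the caller.
final class GlibcVersionParser {

    // Greedy ".* " makes the match land on the last " <digit>.<digits>" in the line,
    // like the sed expression 's/.* \([1-9]\.[0-9]*\).*/\1/'.
    private static final Pattern VERSION = Pattern.compile(".* ([1-9]\\.[0-9]+).*");

    /** Returns the version from the first line starting with "ldd", if any. */
    public static Optional<String> parse(String lddOutput) {
        for (String line : lddOutput.split("\\R")) {
            if (line.startsWith("ldd")) {
                Matcher m = VERSION.matcher(line);
                if (m.matches()) {
                    return Optional.of(m.group(1));
                }
            }
        }
        return Optional.empty();
    }
}
```

Returning an empty `Optional` when no `ldd` line is found lets the caller treat an unknown glibc version as "do not skip anything", which seems the safer default on non-glibc systems.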

Is it possible to make Gradle run an external command during the configuration phase rather than as a task?

Then there are also a few BWC tests that are written in Java rather than YAML. Like you say those can assumeFalse on the glibc version if it can be made available to them. So maybe we just have Gradle set a system property that contains it to pass it through.

I don't think it will be too hard if you could just recommend the best way to get Gradle to run ldd --version early enough that the configuration of the test tasks can know the answer.

@mark-vieira
Contributor

Is it possible to make Gradle run an external command during the configuration phase rather than as a task?

It is, but it's highly discouraged since it's expensive to do so and adds overhead to every build invocation. That was my thought behind doing this in the test itself, since we'd only do it when attempting to execute the test. I'm wondering if we could implement such a filter in JUnit, even for the YAML tests. I'll have a look at this.

Alternatively, since this only applies to the BWC jobs, maybe we could inject the glibc version as an environment variable or something so we don't have to shell out to ldd during build configuration.

@droberts195
Contributor Author

maybe we could inject the glibc version as an environment variable

Yes, that's a good idea. We could potentially add it to the per-worker Jenkins configuration for Linux workers. Then both the build.gradle for the YAML tests and the Java test classes would be able to access it.

Another thing we could potentially do is have the early bootstrap of the Java code (before installing system call filters) call this function using JNA and store the result in a variable that's available to other code later on. That would work nicely for the Java tests. But for the YAML tests we'd need to implement a new type of skip rule that could consider both glibc version and old cluster version. And that is problematic because all the client test harnesses have to understand the YAML syntax.

So, overall, adding a worker-specific environment variable is probably best.
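On the Java side, reading such a variable might look like the sketch below. The variable name `GLIBC_VERSION` and the class `GlibcEnv` are assumptions made for illustration; the thread does not settle on a name. The environment is passed in as a map so that the logic can be exercised without changing the real process environment.

```java
import java.util.Map;

// Hypothetical helper: checks a worker-provided environment variable
// holding the glibc version as "major.minor".
final class GlibcEnv {

    // Assumed variable name; not defined anywhere in this thread.
    static final String ENV_VAR = "GLIBC_VERSION";

    /** Returns true only if the variable is set and holds a version >= major.minor. */
    public static boolean glibcAtLeast(Map<String, String> env, int major, int minor) {
        String raw = env.get(ENV_VAR);
        if (raw == null || !raw.matches("[0-9]+\\.[0-9]+")) {
            // Unknown or malformed value: report false so that nothing is skipped.
            return false;
        }
        String[] parts = raw.split("\\.");
        int actualMajor = Integer.parseInt(parts[0]);
        int actualMinor = Integer.parseInt(parts[1]);
        // Compare numerically, so that 2.4 is older than 2.35.
        return actualMajor > major || (actualMajor == major && actualMinor >= minor);
    }
}
```

A Java BWC test could then pass `System.getenv()` to `glibcAtLeast` and combine the result with the old cluster version in an `assumeFalse` call, as suggested earlier in the thread.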

@mark-vieira
Contributor

@droberts195 Do we have an exhaustive list of all the tests we should mute in this scenario? I notice that not all ML tests fail: https://gradle-enterprise.elastic.co/s/36sbiahfwjjqc/tests/overview?class=org.elasticsearch.xpack.test.rest.XPackRestIT&test=test%20%7Bp0%3Dml/*

Should we skip all ML tests in BWC scenarios wholesale, or only individual ones? I'm leaning towards the former so we don't find ourselves in a whack-a-mole situation.

@droberts195
Contributor Author

@mark-vieira yes, I agree we should mute all the ML BWC YAML tests when we detect the OS is too new for the old version to work. Otherwise, like you say, almost every newly added test is likely to need another iteration of observing failures, opening issues and adding to the list of tests to mute.

The ones that work currently will be the ones that don't use any ML C++ functionality. But those ones are unlikely to fail in platform-specific ways, so there's not much point adding extra complexity to test them on a distribution where the rest of ML doesn't work.

4 participants