Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] org.elasticsearch.xpack.ml.integration.SetUpgradeModeIT SuitTimeOut Failure #49467

Closed
original-brownbear opened this issue Nov 21, 2019 · 5 comments
Assignees
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@original-brownbear
Copy link
Member

In 7.5 org.elasticsearch.xpack.ml.integration.SetUpgradeModeIT timed out on CI just now in run https://gradle-enterprise.elastic.co/s/jp5zidernhxdc/console-log#L3830.


org.elasticsearch.xpack.ml.integration.SetUpgradeModeIT > testEnableUpgradeMode FAILED
java.lang.Exception: Test abandoned because suite timeout was reached.
REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.SetUpgradeModeIT.testEnableUpgradeMode" -Dtests.seed=2E5D3B474E8B782F -Dtests.security.manager=true -Dtests.locale=ru-RU -Dtests.timezone=Etc/GMT-10 -Dcompiler.java=12 -Druntime.java=8

@original-brownbear original-brownbear added >test-failure Triaged test failures from CI :ml Machine learning labels Nov 21, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@davidkyle davidkyle self-assigned this Dec 16, 2019
@davidkyle
Copy link
Member

The timeout failure is not a one off I've found another recently: https://gradle-enterprise.elastic.co/s/xjcyfh4xigvvm

In both cases the suite times out and jstack shows this stack trace on the client:

"TEST-SetUpgradeModeIT.testEnableUpgradeMode-seed#[33082EFCEF9D5830]" ID=96 WAITING on org.elasticsearch.common.util.concurrent.BaseFuture$Sync@1d211a26
    at sun.misc.Unsafe.park(Native Method)
    - waiting on org.elasticsearch.common.util.concurrent.BaseFuture$Sync@1d211a26
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:57)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34)
    at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.stopDatafeed(MlNativeAutodetectIntegTestCase.java:177)
    at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.cleanUpDatafeeds(MlNativeAutodetectIntegTestCase.java:96)
    at org.elasticsearch.xpack.ml.integration.MlNativeAutodetectIntegTestCase.cleanUpResources(MlNativeAutodetectIntegTestCase.java:89)
    at org.elasticsearch.xpack.ml.integration.MlNativeIntegTestCase.cleanUp(MlNativeIntegTestCase.java:98)
    at org.elasticsearch.xpack.ml.integration.SetUpgradeModeIT.cleanup(SetUpgradeModeIT.java:44)

The test is passing but there is a problem stopping the datafeed in the cleanup. Later tests are failing because the persistent task associated with the datafeed has not stopped and the stop datafeed transport action is still running.

xpack/ml/datafeed[c]                 klrAQw8HQ1qTn7SZe4ZvEw:1740  cluster:7                    persistent 1575459442578 11:37:22 26.9m       127.0.0.1 integTest-1 datafeed-realtime-job-test-enable-upgrade-mode-datafeed
cluster:admin/xpack/ml/datafeed/stop 0_wtc71iQ4C5t6vbmt3oQA:1591  -                            transport  1575459442642 11:37:22 26.9m          

That is interesting because admin/xpack/ml/datafeed/stop has a timeout of 5 minutes but the test times out after 20 minutes. From the logs I surmise the stop request has been hanging for a good 20 minutes.

The test toggles ML upgrade mode which stops the job and datafeed. When upgrade mode is turned back on the datafeed automatically restarts and the test succeeds when the datafeed is assigned a node. Test teardown then stops the datafeed. I think the problem is a race condition between the datafeed re-starting and it being stopped in the teardown, i.e. when the datafeed says it has been assigned to a node it has not started then the stop request hangs on a datafeed that has not started. The logs support this:

// This is the stop request sent by test cleanup
[2019-12-04T11:37:22,720][INFO ][o.e.x.m.d.DatafeedManager] [integTest-1] [stop_datafeed (api)] attempt to stop datafeed [realtime-job-test-enable-upgrade-mode-datafeed] [7]
...
// This is the datafeed re-starting after upgrade mode is turned off but is happening chronologically after the stop request
[2019-12-04T11:37:22,752][INFO ][o.e.x.m.d.DatafeedJob    ] [integTest-1] [realtime-job-test-enable-upgrade-mode] Datafeed started (from: 2019-12-04T11:37:17.799Z to: real-time) with frequency [1000ms]

@davidkyle
Copy link
Member

Possibly fixed by #51302

@droberts195
Copy link
Contributor

Possibly fixed by #51302

Interestingly from #51285 (comment):

It looks like the resulting NPE is ignored in AllocatedPersistentTask.completeAndNotifyIfNeeded since the task is already completed.

This would explain why no NPE was observed in the triage of this issue, and makes it highly likely that #51302 does fix this.

Therefore I'll close this optimistically. Please reopen if it happens again.

@davidkyle
Copy link
Member

Copying from #51302 (comment) the following explains the reason for the timeout:

I would have spotted in NPE in the logs when triaging #49467. That test timed out waiting for the datafeed to stop, my theory is the NPE took down the worker thread that is vital in stopping the DF so a stopped status was never reached.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

4 participants