Rollup jobs should be cleaned up before indices are deleted #38930

polyfractal · 2019-02-14T22:45:49Z

Rollup jobs should be stopped and deleted before the indices are removed. An active rollup job can issue a bulk request just as the test ends and the cleanup code deletes all indices. The in-flight bulk request then stalls and errors because the index no longer exists, but this process might take longer than the StopRollup timeout.

As a result the test fails, and it often takes several other tests down with it because the job is still active (e.g. other tests cannot create a job with the same name, or fail to stop the job in their own cleanup because it is still stalled). This tends to knock over several tests before the bulk finally times out and the job shuts down.

Instead, we simply need to stop jobs first. In-flight bulks will resolve quickly, and we can carry on with deleting indices after the jobs are confirmed inactive.
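The cleanup order described above can be sketched as follows. This is a minimal illustration and not the actual ESRestTestCase change: `FakeCluster`, `wipe_cluster`, and every method name here are hypothetical, and the fake simply pretends that an in-flight bulk needs a couple of status polls before the job reports as stopped.

```python
import time


class FakeCluster:
    """Hypothetical in-memory stand-in for a test cluster (not an Elasticsearch client)."""

    def __init__(self, jobs, indices):
        self.jobs = dict(jobs)       # job id -> "started" | "stopping" | "stopped"
        self.indices = set(indices)
        self.calls = []              # records the order of cleanup operations
        self._polls_left = {}

    def stop_job(self, job_id):
        self.calls.append(("stop_job", job_id))
        if self.jobs[job_id] != "stopped":
            self.jobs[job_id] = "stopping"
            self._polls_left[job_id] = 2  # assumption: in-flight bulk resolves after two polls

    def job_state(self, job_id):
        # Each poll lets the simulated in-flight bulk make progress.
        if self.jobs[job_id] == "stopping":
            self._polls_left[job_id] -= 1
            if self._polls_left[job_id] <= 0:
                self.jobs[job_id] = "stopped"
        return self.jobs[job_id]

    def delete_job(self, job_id):
        if self.jobs[job_id] != "stopped":
            raise RuntimeError(f"job {job_id} is still active")
        self.calls.append(("delete_job", job_id))
        del self.jobs[job_id]

    def delete_all_indices(self):
        if self.jobs:
            raise RuntimeError("indices deleted while rollup jobs still exist")
        self.calls.append(("delete_all_indices", None))
        self.indices.clear()


def wipe_cluster(cluster, timeout_s=10.0, poll_s=0.01):
    """Stop jobs, wait until they are confirmed stopped, delete them, then delete indices."""
    job_ids = list(cluster.jobs)
    for job_id in job_ids:
        cluster.stop_job(job_id)
    deadline = time.monotonic() + timeout_s
    for job_id in job_ids:
        while cluster.job_state(job_id) != "stopped":
            if time.monotonic() > deadline:
                raise TimeoutError(f"job {job_id} did not stop in time")
            time.sleep(poll_s)
        cluster.delete_job(job_id)
    # Only now is it safe to remove the indices the jobs were writing to.
    cluster.delete_all_indices()
```

With a single started job, `wipe_cluster(FakeCluster({"sensor": "started"}, {"sensor-1"}))` records the operations in the order stop, delete job, delete indices. Reversing the order in this fake raises immediately, which stands in for the stalled bulk request in the real cluster.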

stop-job.asciidoc tended to trigger this issue because it executed an async stop API and then exited, which tended to set up the timing situation above. It can and did happen with other tests, though. As an extra precaution, the doc test was modified to substitute in wait_for_completion to help head off these issues too.
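To illustrate the difference between the async stop call and the blocking one, here is a small sketch that only builds the request line. It assumes the stop endpoint path `/_rollup/job/<job_id>/_stop` with the `wait_for_completion` and `timeout` query parameters as documented for Elasticsearch 7.x (some 6.x versions used an `_xpack/` prefix). The helper name `stop_rollup_job_request` and the `10s` timeout are hypothetical choices for this example.

```python
from urllib.parse import quote, urlencode


def stop_rollup_job_request(job_id, wait_for_completion=True, timeout="10s"):
    """Return (method, url) for stopping a rollup job; nothing is sent."""
    # Assumed 7.x-style path; the job id is URL-escaped.
    path = f"/_rollup/job/{quote(job_id, safe='')}/_stop"
    params = {"wait_for_completion": str(wait_for_completion).lower()}
    if wait_for_completion:
        # The timeout only matters when the call blocks until the job has stopped.
        params["timeout"] = timeout
    return "POST", f"{path}?{urlencode(params)}"
```

Calling `stop_rollup_job_request("sensor")` yields a blocking request, while passing `wait_for_completion=False` yields the async form that returns before the job has actually stopped.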

Should, theoretically, close #35295 and #38877, both of which I think are caused by the same issue. The timing issue reproduces so rarely that it's hard to say for certain, but I'll keep my machine crunching on it too.

elasticmachine · 2019-02-14T22:45:51Z

Pinging @elastic/es-analytics-geo

talevy

LGTM

jimczi

LGTM

test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java

polyfractal · 2019-02-15T14:46:02Z

@elasticmachine run elasticsearch-ci/1

* elastic/master:
  * Ensure global test seed is used for all random testing tasks (elastic#38991)
  * re-mutes SmokeTestWatcherWithSecurityIT (elastic#38995)
  * Rollup jobs should be cleaned up before indices are deleted (elastic#38930)
  * relax ML Info Docs expected response (elastic#38993)
  * Re-enable single node tests (elastic#38852)
  * ClusterClientIT refactor (elastic#38872)
  * Fix typo in Index API doc
  * Edits to text & formatting in Term Suggester doc (elastic#38963) (elastic#38989)
  * Migrate Streamable to Writeable for WatchStatus (elastic#37390)

polyfractal added the >test and :StorageEngine/Rollup labels Feb 14, 2019

talevy approved these changes Feb 14, 2019

Commit e6e05d7 (polyfractal): Merge remote-tracking branch 'origin/master' into rollup_fix_test_timeout

jimczi approved these changes Feb 15, 2019

test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java (outdated, resolved)

Commit 201e287 (polyfractal): typo

polyfractal merged commit e26e929 into elastic:master Feb 16, 2019

polyfractal added the v7.0.0, v6.7.0, v8.0.0, and v7.2.0 labels Feb 16, 2019

jkakavas mentioned this pull request Feb 18, 2019

[CI] wipeRollupJobs fails to cleanup job and causes multiple test failures #38877

Closed

This was referenced Feb 19, 2019

Rollup jobs should be cleaned up before indices are deleted (#38930) #39144

Merged

Rollup jobs should be cleaned up before indices are deleted (#38930) #39145

Merged

talevy added the backport pending label Feb 19, 2019

talevy mentioned this pull request Feb 19, 2019

Rollup jobs should be cleaned up before indices are deleted (#38930) #39152

Merged

talevy removed the backport pending label Feb 20, 2019

polyfractal added the v6.6.2 label Feb 26, 2019

polyfractal mentioned this pull request Feb 26, 2019

[CI] Deleting user fails after running tests in XDocsClientYamlTestSuiteIT #32000

Closed

jakelandis added v7.0.0-rc2 and removed v7.0.0 labels Apr 3, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
