
ML QA tasks are slow, take up 20 minutes of build time #37339

Closed · alpar-t opened this issue Jan 11, 2019 · 13 comments · Fixed by #74916

@alpar-t (Contributor) commented Jan 11, 2019

Looking at this build scan from a master intake CI job:
https://scans.gradle.com/s/wqhe4ax2bx7zc/performance/execution

It looks like two of the ML test tasks take up about 20 minutes of the non-parallelized section of the build:

:x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner    12m 53.964s
:x-pack:plugin:ml:qa:ml-with-security:integTestRunner            8m 41.424s

From the same logs:

:x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner
==> Test Info: seed=BAC54799DFF31721; jvm=1; suites=22

Slow Tests Summary:
146.90s | org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT
131.64s | org.elasticsearch.xpack.ml.integration.ForecastIT
 83.96s | org.elasticsearch.xpack.ml.integration.MlUpgradeIT
 71.78s | org.elasticsearch.xpack.ml.integration.DatafeedJobsIT
 63.10s | org.elasticsearch.xpack.ml.integration.DeleteExpiredDataIT

and

:x-pack:plugin:ml:qa:ml-with-security:integTestRunner
==> Test Info: seed=4D0F3BB1F5019D2C; jvm=1; suites=3
==> Test Summary: 3 suites, 798 tests, 225 ignored (225 assumptions)

We should look into why these tests are so slow and find ways to make them faster.
One thing that sticks out is that both ESIntegTestCase and ESRestTestCase suites are mixed in the same tasks. The former can be spread across multiple JVMs running in parallel if we were to have a dedicated task for those; a rough sketch follows.
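As a rough sketch of that split (illustrative only: it uses a plain Gradle Test task and made-up class-name patterns rather than the build's actual test runner):

```groovy
// Hypothetical dedicated task for the ESIntegTestCase-based suites so they can
// fork several JVMs, while the ESRestTestCase suites, which share one external
// cluster, stay on the serial integTestRunner. Names and patterns are illustrative.
task internalClusterTest(type: Test) {
    testClassesDirs = sourceSets.test.output.classesDirs
    classpath = sourceSets.test.runtimeClasspath
    include '**/*IT.class'               // internal-cluster suites
    exclude '**/*RestIT.class'           // REST suites remain on integTestRunner
    maxParallelForks = Runtime.runtime.availableProcessors().intdiv(2) ?: 1
}
check.dependsOn internalClusterTest
```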

alpar-t added >non-issue, v7.0.0, :ml Machine learning labels Jan 11, 2019
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@hendrikmuhs (Contributor)

I guess the slow part of ForecastIT is the one that creates a lot of data on the fly to trigger a big ML model. Optimization is always possible, but I doubt we can make it significantly faster. I do not want to change anything in the test itself; it's important that we test this code path, which triggers special handling (a huge model that requires overflow to disk).

That said, I can move the slow parts into a separate package if that helps; I guess that's what you mean by a "dedicated task".

A follow-up option would be to split ForecastIT into separate classes, where the heavy ones are only executed if a certain criterion is met ("something with ML"). The quick tests should still always execute, I think.

@alpar-t (Contributor, Author) commented Jan 11, 2019

We are not generally thinking about a slow vs. fast split for testing yet, and we shouldn't weaken the tests to make them faster. We could maybe structure them so we can run more of them in parallel? This is what I meant by a separate task: tests that inherit from ESIntegTestCase can be run in parallel. We run ESRestTestCase tests in sequence because they run against the same cluster, but depending on what the tests are doing, we might be able to run more of those in parallel too.

@droberts195 (Contributor)

:x-pack:plugin:ml:qa:ml-with-security is made up of hundreds of small tests. It re-runs all the ML REST tests with 3 different users: a machine learning admin, a machine learning user, and a user not privileged to use ML at all.

I'm pretty sure the reason it's so slow and has got slower over the years is the complex setup and teardown that is done for each of the hundreds of tests. We delete all the internal indices in between tests, including the .security-6 index, and that means the indices have to be recreated at the beginning of every test. In addition, before deleting the indices at the end of each test we wait for any in-flight indexing to finish.

These same problems also affect watcher and monitoring REST tests, but there are fewer of them so it's not as noticeable.

It's possible that this could be sped up by having a more streamlined setup and teardown process between the tests. But we need to be careful here: in 2017 we were plagued by X-Pack REST tests failing because of side effects of previous REST tests. We prevented these by adding the complex setup and teardown steps we have today. However, maybe we can cut back on the setup and teardown time by making it more specific to different test suites. For example, instead of deleting every index including .security-6 between tests, maybe the ML security tests could leave the same security index in place for the whole suite and just remove ML indices between tests (a sketch follows). I think the first step is a more detailed analysis of where the time is being spent, so that we don't waste effort optimizing something that makes little difference.
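As a minimal sketch of that narrower teardown, assuming a hypothetical base class and using only calls that exist on ESRestTestCase (the real cleanup would need to cover every ML internal index and any other suite-specific state):

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.test.rest.ESRestTestCase;
import org.junit.After;

// Illustrative only: opt out of the blanket per-test index wipe (which would
// also delete .security-6) and remove just the ML indices between tests.
public abstract class MlSecurityRestTestCase extends ESRestTestCase {

    @Override
    protected boolean preserveIndicesUponCompletion() {
        return true; // skip the delete-everything teardown
    }

    @After
    public void deleteMlIndicesOnly() throws Exception {
        // Wildcard delete of ML internal indices; the security index stays.
        adminClient().performRequest(new Request("DELETE", "/.ml-*"));
    }
}
```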

@davidkyle (Member)

@atorok

tests that inherit from ESIntegTestCase can be run in parallel

After removing the ESRestTestCase tests from this task, how do I run the remaining ESIntegTestCase tests in parallel?

These tests create a 3-node cluster, which is one reason why they are so slow. Is it possible to parallelise this? I assume it would involve multiple clusters, each with 3 JVMs.

@alpar-t (Contributor, Author) commented Jan 25, 2019

The test plugin configures parallelism to 1. The build plugin configures it to System.getProperty('tests.jvms', project.rootProject.ext.defaultParallel); that's what you'll need to set it to in order to have the tests run in parallel. We have plans to consolidate the way testing tasks are configured and to apply this split across all the projects, for the reasons outlined here; projects will then get properly configured tasks by default. In the interim, I think duplicating the line from the build plugin is fine.
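In build-script terms, that interim fix might look like the following sketch, which just duplicates the quoted line onto the ML runner (it assumes integTestRunner exposes the same parallelism property the build plugin sets):

```groovy
// Interim sketch: mirror the build plugin's parallelism setting on the ML
// integration test runner, overriding the test plugin's default of 1.
// 'integTestRunner' and 'tests.jvms' are the names quoted in this thread.
integTestRunner {
    parallelism = System.getProperty('tests.jvms', project.rootProject.ext.defaultParallel)
}
```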

jasontedor added v8.0.0 and removed v7.0.0 labels Feb 6, 2019
@droberts195 (Contributor)

The slowest test in a local run I just did was DeleteExpiredDataIT.testDeleteExpiredData:

HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:19, stalled for 11.4s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:29, stalled for 21.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:39, stalled for 31.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:49, stalled for 41.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:59, stalled for 51.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:09, stalled for 61.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:19, stalled for 71.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:30, stalled for 81.6s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:40, stalled for 91.6s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:50, stalled for  102s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:00, stalled for  112s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:10, stalled for  122s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:20, stalled for  132s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:30, stalled for  142s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:40, stalled for  152s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:50, stalled for  162s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:54:00, stalled for  172s at: DeleteExpiredDataIT.testDeleteExpiredData

Looking in the server log, the following gaps exist:

26 seconds:

[2019-02-15T16:51:38,349][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [results-retention] [autodetect/7687] [CForecastRunner.cc@213] Finished forecasting, wrote 3 records
[2019-02-15T16:52:04,151][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [results-and-snapshots-retention]

8 seconds:

[2019-02-15T16:52:14,821][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [results-and-snapshots-retention] [autodetect/7707] [CForecastRunner.cc@213] Finished forecasting, wrote 3 records
[2019-02-15T16:52:22,413][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [results-retention] Datafeed started (from: 2019-02-14T16:51:11.032Z to: 2019-02-15T16:51:11.680Z) with frequency [600000ms]

11 seconds:

[2019-02-15T16:52:25,817][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [results-retention] job closed
[2019-02-15T16:52:36,812][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [results-and-snapshots-retention] Datafeed started (from: 2019-02-14T16:51:11.032Z to: 2019-02-15T16:51:11.680Z) with frequency [600000ms]

23 seconds followed by 49(!) seconds:

[2019-02-15T16:52:40,186][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [results-and-snapshots-retention] job closed
[2019-02-15T16:53:03,304][WARN ][o.e.t.TransportService   ] [node-0] Received response for a request that has timed out, sent [20758ms] ago, timed out [10873ms] ago, action [internal:coordination/fault_detection/follower_check], node [{node-1}{yJVfvhFtSTKhJp7_UfC2MA}{cXgJEMLPQuWxyOVbjA0_4A}{127.0.0.1}{127.0.0.1:54779}{testattr=test, ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true}], id [55990]
[2019-02-15T16:53:52,882][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [node-0] Deleting expired data

@dimitris-athanasiou (Contributor)

This test was certainly not that slow when it was first written. Not sure if the test changed or it has just gotten slower to run.

@davidkyle (Member)

I had a PR failure due to the slow DeleteExpiredDataIT.testDeleteExpiredData test. Something has changed recently, as the test now consistently takes over 2 minutes to run, and that wasn't the case when this issue was opened. One potential gain: the test runs the jobs sequentially when they could run in parallel (see the sketch below).
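A generic sketch of that idea, running each job's phase concurrently instead of one after another; createAndRunJob below is a hypothetical stand-in for the existing per-job setup, not actual test code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class ParallelPhases {
    // Run the same phase for every job concurrently and wait for all of them,
    // so N jobs cost roughly one job's wall-clock time instead of N.
    static <T> void runForAll(List<T> jobIds, Consumer<T> phase) {
        CompletableFuture<?>[] futures = jobIds.stream()
                .map(id -> CompletableFuture.runAsync(() -> phase.accept(id)))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(futures).join(); // propagates the first failure
    }
}
```

The sequential loop in the test would then become something like ParallelPhases.runForAll(jobIds, id -> createAndRunJob(id)).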

@dimitris-athanasiou (Contributor)

We should certainly spend some time understanding why it's taking longer. I'll check it out.

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Mar 1, 2019

This commit parallelizes some parts of the test
and removes an unnecessary refresh call.
On my local machine it shaves off about 15 seconds
for a test execution time of ~64s (down from ~80s).
This test is still slow, but progress over perfection.

Relates elastic#37339

dimitris-athanasiou added a commit that referenced this issue Mar 1, 2019 (same commit message; relates #37339)
@droberts195 (Contributor)

I had a PR failure due to the slow DeleteExpiredDataIT.testDeleteExpiredData test. Something has changed recently

It's related to type removal. I started looking at the failure of DeleteExpiredDataIT in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu-16.04&&immutable/272/console. The logs from the cluster nodes show 10010 errors like this:

[2019-03-01T01:49:34,811][DEBUG][o.e.a.a.i.m.p.TransportPutMappingAction] [node-0] failed to put mappings on indices [[[.ml-state/KwwtpaltT8u3L4g_ge209Q]]], type [doc]
java.lang.IllegalArgumentException: Rejecting mapping update to [.ml-state] as the final mapping would have more than 1 type: [_doc, doc]
    at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:449) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:398) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:331) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:315) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:238) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:687) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:210) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

Simply logging all those errors must dramatically slow the test down.

The test is using a hardcoded type of doc in the master branch where it needs to be using _doc. I'll open a PR to change the multi-node integration tests to use typeless APIs in master.
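For illustration, the typed vs. typeless difference in client terms (index name, id, and source here are made up; the real fix goes through the ML internals rather than constructing requests like this):

```java
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

class TypelessExample {
    static void example() {
        // Typed request: the explicit "doc" type collides with the index's
        // "_doc" mapping type, triggering the rejection logged above.
        IndexRequest typed = new IndexRequest(".ml-state", "doc", "state-1")
                .source("{\"value\":1}", XContentType.JSON);

        // Typeless request: no type is specified, so the index's single
        // mapping type applies and no mapping update is attempted.
        IndexRequest typeless = new IndexRequest(".ml-state").id("state-1")
                .source("{\"value\":1}", XContentType.JSON);
    }
}
```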

dimitris-athanasiou added three more commits that referenced this issue Mar 1–2, 2019 (same commit message as above; relates #37339)
@droberts195 (Contributor)

The problems with type removal logging and audit logging slowing down the tests are now resolved. (These problems appeared after this issue was first raised, so they don't justify closing this issue.)

Returning to the fundamental question of what to do about ML integration tests that take a long time, there is more discussion in #39859 (comment).

@alpar-t (Contributor, Author) commented Mar 12, 2019

@droberts195 both propositions seem too drastic to me right now.
We want to keep the assumption that check runs all tests except extended bwc.
Packaging tests are only separate due to hardware requirements.
Nor should you feel obligated to move these elsewhere.
This isn't the only thing that's slow, and we're working on other ways to improve build speed.
These were some of the slowest tests I've seen, which made me think there might be low-hanging fruit in how the tests are implemented and run. If that's not the case and they can't simply be made faster, we'll need to look at other solutions for the build as a whole, such as delaying some tests, rather than anything ML-specific.
