
ML QA tasks are slow, take up 20 minutes of build time #37339

Closed · alpar-t opened this issue Jan 11, 2019 · 13 comments · Fixed by #74916

@alpar-t (Contributor) commented Jan 11, 2019

Looking at this build scan from a master intake CI job:
https://scans.gradle.com/s/wqhe4ax2bx7zc/performance/execution

It looks like two of the ML test tasks take up about 20 minutes of the non-parallelized section of the build:

:x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner    12m 53.964s
:x-pack:plugin:ml:qa:ml-with-security:integTestRunner            8m 41.424s

From the same logs:

:x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner
==> Test Info: seed=BAC54799DFF31721; jvm=1; suites=22

Slow Tests Summary:
146.90s | org.elasticsearch.xpack.ml.integration.DatafeedJobsRestIT
131.64s | org.elasticsearch.xpack.ml.integration.ForecastIT
 83.96s | org.elasticsearch.xpack.ml.integration.MlUpgradeIT
 71.78s | org.elasticsearch.xpack.ml.integration.DatafeedJobsIT
 63.10s | org.elasticsearch.xpack.ml.integration.DeleteExpiredDataIT

and

:x-pack:plugin:ml:qa:ml-with-security:integTestRunner
==> Test Info: seed=4D0F3BB1F5019D2C; jvm=1; suites=3
==> Test Summary: 3 suites, 798 tests, 225 ignored (225 assumptions)

We should look into why these tests are so slow and find ways to make them faster.
One thing that sticks out is that both ESIntegTestCase and ESRestTestCase suites are mixed in the same tasks. The former can be spread across multiple JVMs running in parallel if we were to have a dedicated task for those; a rough sketch follows.
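As a rough sketch of that split (illustrative only: it uses a plain Gradle Test task and made-up class-name patterns rather than the build's actual test runner):

```groovy
// Hypothetical dedicated task for the ESIntegTestCase-based suites so they can
// fork several JVMs, while the ESRestTestCase suites, which share one external
// cluster, stay on the serial integTestRunner. Names and patterns are illustrative.
task internalClusterTest(type: Test) {
    testClassesDirs = sourceSets.test.output.classesDirs
    classpath = sourceSets.test.runtimeClasspath
    include '**/*IT.class'               // internal-cluster suites
    exclude '**/*RestIT.class'           // REST suites remain on integTestRunner
    maxParallelForks = Runtime.runtime.availableProcessors().intdiv(2) ?: 1
}
check.dependsOn internalClusterTest
```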

alpar-t added >non-issue, v7.0.0, :ml Machine learning labels Jan 11, 2019
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@hendrikmuhs (Contributor)

I guess the slow part of ForecastIT is the one that creates a lot of data on the fly to trigger a big ML model. Optimization is always possible, but I doubt we can make it significantly faster. I do not want to change anything in the test itself; it's important that we test this code path, which triggers special handling (a huge model that requires overflow to disk).

That said, I can move the slow parts into a separate package if that helps; I guess that's what you mean by a "dedicated task".

A follow-up option would be to split ForecastIT into separate classes, where the heavy ones are only executed if a certain criterion is met ("something with ML"). The quick tests should still always execute, I think.

@alpar-t (Contributor, Author) commented Jan 11, 2019

We are not generally thinking about a slow vs. fast split for testing yet, and we shouldn't weaken the tests to make them faster. We could maybe structure them so we can run more of them in parallel? This is what I meant by a separate task: tests that inherit from ESIntegTestCase can be run in parallel. We run ESRestTestCase tests in sequence because they run against the same cluster, but depending on what the tests are doing, we might be able to run more of those in parallel too.

@droberts195 (Contributor)

:x-pack:plugin:ml:qa:ml-with-security is made up of hundreds of small tests. It re-runs all the ML REST tests with 3 different users: a machine learning admin, a machine learning user, and a user not privileged to use ML at all.

I'm pretty sure the reason it's so slow and has got slower over the years is the complex setup and teardown that is done for each of the hundreds of tests. We delete all the internal indices in between tests, including the .security-6 index, and that means the indices have to be recreated at the beginning of every test. In addition, before deleting the indices at the end of each test we wait for any in-flight indexing to finish.

These same problems also affect watcher and monitoring REST tests, but there are fewer of them so it's not as noticeable.

It's possible that this could be sped up by having a more streamlined setup and teardown process between the tests. But we need to be careful here: in 2017 we were plagued by X-Pack REST tests failing because of side effects of previous REST tests. We prevented these by adding the complex setup and teardown steps we have today. However, maybe we can cut back on the setup and teardown time by making it more specific to different test suites. For example, instead of deleting every index including .security-6 between tests, maybe the ML security tests could leave the same security index in place for the whole suite and just remove ML indices between tests (a sketch follows). I think the first step is a more detailed analysis of where the time is being spent, so that we don't waste effort optimizing something that makes little difference.
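As a minimal sketch of that narrower teardown, assuming a hypothetical base class and using only calls that exist on ESRestTestCase (the real cleanup would need to cover every ML internal index and any other suite-specific state):

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.test.rest.ESRestTestCase;
import org.junit.After;

// Illustrative only: opt out of the blanket per-test index wipe (which would
// also delete .security-6) and remove just the ML indices between tests.
public abstract class MlSecurityRestTestCase extends ESRestTestCase {

    @Override
    protected boolean preserveIndicesUponCompletion() {
        return true; // skip the delete-everything teardown
    }

    @After
    public void deleteMlIndicesOnly() throws Exception {
        // Wildcard delete of ML internal indices; the security index stays.
        adminClient().performRequest(new Request("DELETE", "/.ml-*"));
    }
}
```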

@davidkyle (Member)

@atorok

tests that inherit from ESIntegTestCase can be run in parallel

After removing the ESRestTestCase tests from this task, how do I run the remaining ESIntegTestCase tests in parallel?

These tests create a 3-node cluster, which is one reason why they are so slow. Is it possible to parallelise this? I assume it would involve multiple clusters, each with 3 JVMs.

@alpar-t (Contributor, Author) commented Jan 25, 2019

The test plugin configures parallelism to 1. The build plugin configures it to System.getProperty('tests.jvms', project.rootProject.ext.defaultParallel); that's what you'll need to set it to in order to have the tests run in parallel. We have plans to consolidate the way testing tasks are configured and to apply this split across all the projects, for the reasons outlined here; projects will then get properly configured tasks by default. In the interim, I think duplicating the line from the build plugin is fine.
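In build-script terms, that interim fix might look like the following sketch, which just duplicates the quoted line onto the ML runner (it assumes integTestRunner exposes the same parallelism property the build plugin sets):

```groovy
// Interim sketch: mirror the build plugin's parallelism setting on the ML
// integration test runner, overriding the test plugin's default of 1.
// 'integTestRunner' and 'tests.jvms' are the names quoted in this thread.
integTestRunner {
    parallelism = System.getProperty('tests.jvms', project.rootProject.ext.defaultParallel)
}
```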

jasontedor added v8.0.0 and removed v7.0.0 labels Feb 6, 2019
@droberts195 (Contributor)

The slowest test in a local run I just did was DeleteExpiredDataIT.testDeleteExpiredData:

HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:19, stalled for 11.4s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:29, stalled for 21.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:39, stalled for 31.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:49, stalled for 41.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:51:59, stalled for 51.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:09, stalled for 61.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:19, stalled for 71.5s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:30, stalled for 81.6s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:40, stalled for 91.6s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:52:50, stalled for  102s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:00, stalled for  112s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:10, stalled for  122s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:20, stalled for  132s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:30, stalled for  142s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:40, stalled for  152s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:53:50, stalled for  162s at: DeleteExpiredDataIT.testDeleteExpiredData
HEARTBEAT J0 PID(7293@Davids-MacBook-Pro-7.local): 2019-02-15T16:54:00, stalled for  172s at: DeleteExpiredDataIT.testDeleteExpiredData

Looking in the server log, the following gaps exist:

26 seconds:

[2019-02-15T16:51:38,349][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [results-retention] [autodetect/7687] [CForecastRunner.cc@213] Finished forecasting, wrote 3 records
[2019-02-15T16:52:04,151][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [results-and-snapshots-retention]

8 seconds:

[2019-02-15T16:52:14,821][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [results-and-snapshots-retention] [autodetect/7707] [CForecastRunner.cc@213] Finished forecasting, wrote 3 records
[2019-02-15T16:52:22,413][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [results-retention] Datafeed started (from: 2019-02-14T16:51:11.032Z to: 2019-02-15T16:51:11.680Z) with frequency [600000ms]

11 seconds:

[2019-02-15T16:52:25,817][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [results-retention] job closed
[2019-02-15T16:52:36,812][INFO ][o.e.x.m.d.DatafeedJob    ] [node-0] [results-and-snapshots-retention] Datafeed started (from: 2019-02-14T16:51:11.032Z to: 2019-02-15T16:51:11.680Z) with frequency [600000ms]

23 seconds followed by 49(!) seconds:

[2019-02-15T16:52:40,186][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-0] [results-and-snapshots-retention] job closed
[2019-02-15T16:53:03,304][WARN ][o.e.t.TransportService   ] [node-0] Received response for a request that has timed out, sent [20758ms] ago, timed out [10873ms] ago, action [internal:coordination/fault_detection/follower_check], node [{node-1}{yJVfvhFtSTKhJp7_UfC2MA}{cXgJEMLPQuWxyOVbjA0_4A}{127.0.0.1}{127.0.0.1:54779}{testattr=test, ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true}], id [55990]
[2019-02-15T16:53:52,882][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [node-0] Deleting expired data

@dimitris-athanasiou (Contributor)

This test was certainly not that slow when it was first written. Not sure if the test changed or it has just gotten slower to run.

@davidkyle (Member)

I had a PR failure due to the slow DeleteExpiredDataIT.testDeleteExpiredData test. Something has changed recently, as the test now consistently takes over 2 minutes to run, and that wasn't the case when this issue was opened. One potential gain: the test runs the jobs sequentially when they could run in parallel (see the sketch below).
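A generic sketch of that idea, running each job's phase concurrently instead of one after another; createAndRunJob below is a hypothetical stand-in for the existing per-job setup, not actual test code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class ParallelPhases {
    // Run the same phase for every job concurrently and wait for all of them,
    // so N jobs cost roughly one job's wall-clock time instead of N.
    static <T> void runForAll(List<T> jobIds, Consumer<T> phase) {
        CompletableFuture<?>[] futures = jobIds.stream()
                .map(id -> CompletableFuture.runAsync(() -> phase.accept(id)))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(futures).join(); // propagates the first failure
    }
}
```

The sequential loop in the test would then become something like ParallelPhases.runForAll(jobIds, id -> createAndRunJob(id)).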

@dimitris-athanasiou (Contributor)

We should certainly spend some time understanding why it's taking longer. I'll check it out.

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Mar 1, 2019

This commit parallelizes some parts of the test
and removes an unnecessary refresh call.
On my local machine it shaves off about 15 seconds
for a test execution time of ~64s (down from ~80s).
This test is still slow, but progress over perfection.

Relates elastic#37339

dimitris-athanasiou added a commit that referenced this issue Mar 1, 2019 (same commit message; relates #37339)
@droberts195 (Contributor)

I had a PR failure due to the slow DeleteExpiredDataIT.testDeleteExpiredData test. Something has changed recently

It's related to type removal. I started looking at the failure of DeleteExpiredDataIT in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu-16.04&&immutable/272/console. The logs from the cluster nodes show 10010 errors like this:

[2019-03-01T01:49:34,811][DEBUG][o.e.a.a.i.m.p.TransportPutMappingAction] [node-0] failed to put mappings on indices [[[.ml-state/KwwtpaltT8u3L4g_ge209Q]]], type [doc]
java.lang.IllegalArgumentException: Rejecting mapping update to [.ml-state] as the final mapping would have more than 1 type: [_doc, doc]
    at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:449) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:398) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:331) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:315) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:238) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:687) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:210) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

Simply logging all those errors must dramatically slow the test down.

The test is using a hardcoded type of doc in the master branch where it needs to be using _doc. I'll open a PR to change the multi-node integration tests to use typeless APIs in master.
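For illustration, the typed vs. typeless difference in client terms (index name, id, and source here are made up; the real fix goes through the ML internals rather than constructing requests like this):

```java
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

class TypelessExample {
    static void example() {
        // Typed request: the explicit "doc" type collides with the index's
        // "_doc" mapping type, triggering the rejection logged above.
        IndexRequest typed = new IndexRequest(".ml-state", "doc", "state-1")
                .source("{\"value\":1}", XContentType.JSON);

        // Typeless request: no type is specified, so the index's single
        // mapping type applies and no mapping update is attempted.
        IndexRequest typeless = new IndexRequest(".ml-state").id("state-1")
                .source("{\"value\":1}", XContentType.JSON);
    }
}
```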

dimitris-athanasiou added three more commits that referenced this issue Mar 1–2, 2019 (same commit message as above; relates #37339)
@droberts195 (Contributor)

The problems with type removal logging and audit logging slowing down the tests are now resolved. (These problems appeared after this issue was first raised, so they don't justify closing this issue.)

Returning to the fundamental question of what to do about ML integration tests that take a long time, there is more discussion in #39859 (comment).

@alpar-t (Contributor, Author) commented Mar 12, 2019

@droberts195 both propositions seem too drastic to me right now.
We want to keep the assumption that check runs all tests except extended bwc.
Packaging tests are only separate due to hardware requirements.
Nor should you feel obligated to move these elsewhere.
This isn't the only thing that's slow, and we're working on other ways to improve build speed.
These were some of the slowest tests I've seen, which made me think there might be low-hanging fruit in how the tests are implemented and run. If that's not the case and they can't simply be made faster, we'll need to look at other solutions for the build as a whole, such as delaying some tests, rather than anything ML-specific.
