
[ML] Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response #43045

Merged: 16 commits, merged on Jul 9, 2019

Conversation

@przemekwitek (Contributor) commented on Jun 10, 2019:

This PR introduces the DatafeedTimingStats class, which will hold various timing-related metrics useful for debugging.
Initially, the only metric available is totalSearchTimeMs, which accumulates the total time spent by the datafeed searching indices.

Related issue: #29857
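For orientation, here is a minimal sketch of what such a timing-stats holder looks like conceptually. Only the totalSearchTimeMs metric and the job ID are taken from this PR; everything else (class and method names) is illustrative, not the actual class:

// Illustrative sketch only; the real DatafeedTimingStats in this PR is also
// serializable so it can be persisted to an index and streamed in the stats response.
public class DatafeedTimingStatsSketch {

    private final String jobId;        // stats are stored and looked up by the job's ID
    private double totalSearchTimeMs;  // accumulated time the datafeed spent searching indices

    public DatafeedTimingStatsSketch(String jobId) {
        this.jobId = jobId;
    }

    /** Adds the duration of one datafeed search to the running total. */
    public void incrementTotalSearchTimeMs(double searchTimeMs) {
        totalSearchTimeMs += searchTimeMs;
    }

    public String getJobId() {
        return jobId;
    }

    public double getTotalSearchTimeMs() {
        return totalSearchTimeMs;
    }
}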

@elasticmachine (Collaborator):

Pinging @elastic/ml-core

@przemekwitek force-pushed the datafeed-timing-stats branch 4 times, most recently from bc22def to 57d3fbe on June 17, 2019 13:55
@przemekwitek marked this pull request as ready for review on June 17, 2019 14:04
@benwtrent self-requested a review on June 19, 2019 12:22
@benwtrent (Member) left a comment:

All datafeed BWC tests are still muted (I think) due to the date_histogram aggregation interval change. If they weren't, I think they would fail due to streaming issues, since the 7.3.0 code does not have this change yet.

@przemekwitek changed the title from "Add TimingStats to datafeed GetDatafeedStatsAction.Response" to "Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response" on Jun 24, 2019
@przemekwitek (Contributor Author) left a comment:

> All datafeed BWC tests are still muted (I think) due to the date_histogram aggregation interval change. If they weren't, I think they would fail due to streaming issues, since the 7.3.0 code does not have this change yet.

I'm not sure I understand. Which 'streaming issues' do you have in mind?

@benwtrent (Member):

@przemekwitek this is the streaming issue I was talking about: https://github.com/elastic/elasticsearch/pull/43045/files#diff-34cde95ae0dc7e6c0e10460382efb116R151

This code does not exist in 7.3.0 yet, but a stream coming from a 7.3.0 node would be expected to have this format.
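For context, the usual way such streaming breaks are avoided is to gate the new field on the wire version; here is a hedged sketch of that pattern (the exact version constant and whether the field is optional are assumptions, not necessarily what this PR does):

// Sketch of version-gated serialization for a new field in a transport response.
// Assumes the field only exists from 7.4.0 onwards; older nodes neither send nor expect it.
public Response(StreamInput in) throws IOException {
    datafeedId = in.readString();
    if (in.getVersion().onOrAfter(Version.V_7_4_0)) {
        timingStats = in.readOptionalWriteable(DatafeedTimingStats::new);
    } else {
        timingStats = null;
    }
}

@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeString(datafeedId);
    if (out.getVersion().onOrAfter(Version.V_7_4_0)) {
        out.writeOptionalWriteable(timingStats);
    }
}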

@droberts195 (Contributor) left a comment:

I left a few comments about not changing the rules on which indices each job's results could be in. I think this is important because it's complicated enough at the moment and we have a plan to make it easier for users to set up rolling results indices.

@przemekwitek (Contributor Author):

> @przemekwitek this is the streaming issue I was talking about: https://github.com/elastic/elasticsearch/pull/43045/files#diff-34cde95ae0dc7e6c0e10460382efb116R151
>
> This code does not exist in 7.3.0 yet, but a stream coming from a 7.3.0 node would be expected to have this format.

Do you think it will be OK if I just backport this PR to 7.3 shortly after it is merged to master?
Or do I need to take extra steps not to break CI?

@benwtrent (Member):

@przemekwitek all the datafeed BWC tests are currently muted due to another issue. I would just make sure they are still muted before backporting. If BWC passes here in the PR, then it should be fine to merge.

@benwtrent (Member) left a comment:

those doubles should not be optional in the stream in/out methods.
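In other words, if the value is always present it can be written with the plain primitive methods instead of the optional variants; a small sketch of the difference on Elasticsearch's StreamInput/StreamOutput:

// Optional variant: writes an extra presence byte and forces a nullable Double on read.
out.writeOptionalDouble(totalSearchTimeMs);
Double maybeTotal = in.readOptionalDouble();

// Non-optional variant: the field is always present, so a primitive double is enough.
out.writeDouble(totalSearchTimeMs);
double total = in.readDouble();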

@przemekwitek (Contributor Author):

run elasticsearch-ci/2

@benwtrent (Member) left a comment:

Two last questions around design:

  • Why aren't we using the search time returned in the search payload?
  • Should we restart the stats when a datafeed is deleted and recreated but the job remains the same?

public <T, R> R executeWithReporting(Function<T, R> function, T argument) {
    Instant before = clock.instant();
    R result = function.apply(argument);
    Instant after = clock.instant();
Review comment (Member):

So, this is an interesting predicament. I can understand the reason for wanting "wall clock time" for the entire function, but strictly speaking, this is NOT the search time. The search time is returned in the search payload and could be gathered from there.

Though I also see the benefit of having the entire method wrapped.

Unsure if the complexity is necessary though. Statistically, the real search time and the time spent in this method should be almost perfectly correlated and give the same indication of run time.
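For reference, the search time carried in the search payload is the response's took value; a minimal sketch of reading it (the reporter call is hypothetical, not the method this PR ended up with):

// The search time as reported by Elasticsearch itself is the response's "took" value.
SearchResponse searchResponse = client.search(searchRequest).actionGet();
long searchTimeMs = searchResponse.getTook().millis();
timingStatsReporter.reportSearchDuration(searchTimeMs);  // hypothetical reporter method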

Reply (Contributor Author):

Done.
Thanks for this suggestion Ben. The interface of DatafeedTimingStatsReporter is much cleaner now.

.setSize(1)
.setIndicesOptions(IndicesOptions.lenientExpandOpen())
.setQuery(QueryBuilders.idsQuery().addIds(DatafeedTimingStats.documentId(jobId)))
.addSort(SortBuilders.fieldSort(DatafeedTimingStats.TOTAL_SEARCH_TIME_MS.getPreferredName()).order(SortOrder.DESC));
Review comment (Member):

I don't understand how this is the latest. I thought TOTAL_SEARCH_TIME_MS was a summation of the individual search times.

Ah, I see, the assumption that it is always monotonically increasing.

Do we account for the datafeed being deleted, and a new one being created? Either we should be keeping the previous datafeed's search time total, or we should restart the summation back at zero.

If we want an actual summation of search time for the lifetime of the job, then continuing to increase the stats from a previously deleted datafeed makes sense.

If we want an accurate search time per datafeed, then this search would fail, as we could restart the search time statistics.

Review comment (Contributor):

This is an interesting point. The datafeed timing stats are persisted with the job's ID and are deleted when the job is deleted. So internally they are associated with the job rather than the datafeed. But then they're reported via the datafeed stats endpoint, so externally they appear to be associated with the datafeed.

As long as you always create and delete the job and datafeed together in quick succession it works. But if the datafeed is deleted and recreated independently then it doesn't. To make that work the lifecycle of the datafeed stats needs to be moved from the job to the datafeed. But that is also hard because the datafeed does not have an index - only the job has an index. Even this is complicated as the job results can be spread over multiple indices, so deletion of the timing stats is not deleting a single document in a single index but potentially multiple documents each with the same ID but in different indices.

@przemekwitek please can you have a look at the delete datafeed code and see how hard it would be to do a delete-by-query to delete the datafeed timing stats when a datafeed is deleted? I think if it's not too hard we should do this, because externally the stats are part of the datafeed. But if it's hard let me know and we can think again.
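A minimal sketch of what such a delete-by-query could look like (the index-name helper, the refresh choice, and the listener wiring are assumptions; the actual change may be structured differently):

// Delete the timing stats document(s) for this job across all of the job's result indices.
DeleteByQueryRequest request =
    new DeleteByQueryRequest(AnomalyDetectorsIndex.jobResultsAliasedName(jobId))  // assumed index/alias helper
        .setQuery(QueryBuilders.idsQuery().addIds(DatafeedTimingStats.documentId(jobId)))
        .setRefresh(true);
client.execute(DeleteByQueryAction.INSTANCE, request, ActionListener.wrap(
    response -> listener.onResponse(true),
    listener::onFailure));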

Reply (Contributor Author):

> I don't understand how this is the latest. I thought TOTAL_SEARCH_TIME_MS was a summation of the individual search times.
>
> Ah, I see, the assumption that it is always monotonically increasing.

That's correct.
To make it more explicit I could introduce another field, last_modification_timestamp. BTW, this sounds like something every document could benefit from. Is such functionality built in, or does such a field have to be added on a per-index basis?

> Do we account for the datafeed being deleted, and a new one being created? Either we should be keeping the previous datafeed's search time total, or we should restart the summation back at zero.
>
> If we want an actual summation of search time for the lifetime of the job, then continuing to increase the stats from a previously deleted datafeed makes sense.
>
> If we want an accurate search time per datafeed, then this search would fail, as we could restart the search time statistics.

Good point, so far I was treating the job-datafeed pair as a unit, so I have to revisit that.

Reply (Contributor Author):

> This is an interesting point. The datafeed timing stats are persisted with the job's ID and are deleted when the job is deleted. So internally they are associated with the job rather than the datafeed. But then they're reported via the datafeed stats endpoint, so externally they appear to be associated with the datafeed.
>
> As long as you always create and delete the job and datafeed together in quick succession it works. But if the datafeed is deleted and recreated independently then it doesn't. To make that work the lifecycle of the datafeed stats needs to be moved from the job to the datafeed. But that is also hard because the datafeed does not have an index - only the job has an index.

Correct, I did not want to introduce yet another index, especially since we do not plan to encourage job-datafeed divergence.

> Even this is complicated as the job results can be spread over multiple indices, so deletion of the timing stats is not deleting a single document in a single index but potentially multiple documents each with the same ID but in different indices.
>
> @przemekwitek please can you have a look at the delete datafeed code and see how hard it would be to do a delete-by-query to delete the datafeed timing stats when a datafeed is deleted? I think if it's not too hard we should do this, because externally the stats are part of the datafeed. But if it's hard let me know and we can think again.

I'll do that.

Reply (Contributor Author):

This is now done.

PTAL

@sophiec20 changed the title from "Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response" to "[ML] Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response" on Jun 28, 2019
@jpountz added the v7.4.0 label on Jul 3, 2019
* Use DeleteByQueryAction to delete DatafeedTimingStats
* Refine the new tests in DatafeedJobsIT test suite
- match: { datafeeds.0.datafeed_id: "datafeed-1"}
- match: { datafeeds.0.state: "stopped"}
- match: { datafeeds.0.timing_stats.job_id: "get-datafeed-stats-1"}
- match: { datafeeds.0.timing_stats.search_count: 1}
Review comment (Contributor):

Is there something that ensures that the timing stats document is searchable in the index after stopping a datafeed? I couldn't see this but maybe I'm just missing it (in which case the rest of this comment is irrelevant).

In other words, what stops the following sequence of events?

  1. Datafeed is started
  2. Datafeed does a search
  3. Datafeed indexes a timing stats document, which sits in the translog and is not yet committed to Lucene
  4. Datafeed is stopped
  5. Get datafeed stats searches the index, and does not find the most recent timing stats document containing details of the search
  6. Translog is flushed and the latest timing stats becomes searchable

The test might be asserting in between steps 5 and 6. Of course, it's possible that the document becomes searchable before the get datafeed stats request runs, so then the test will pass.

There are two approaches for solving this:

  1. On stopping a datafeed persist the most recent timing stats again with refresh=immediate.
  2. Use refresh=immediate every time datafeed timing stats are persisted.

I prefer option 1 but if it's really hard then option 2 is probably OK as the volume is not that high.
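For option 2, the persisting code just asks for an immediate refresh on each index request; a hedged sketch (the index name and source builder variables are placeholders, not names from this PR):

// Make the timing stats document searchable as soon as it is indexed, at the cost of
// one refresh per persist; acceptable here because the stats are written infrequently.
IndexRequest indexRequest = new IndexRequest(resultsIndexName)
    .id(DatafeedTimingStats.documentId(jobId))
    .source(timingStatsSourceBuilder)
    .setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);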

Reply (Contributor Author):

I didn't want to make this (already big) PR any bigger, so I implemented approach 2 and added a TODO to revisit it in the future.
Is that OK?

BTW, I was not aware of the translog.

datafeedConfigBuilder -> {
    String jobId = datafeedConfigBuilder.build().getJobId();
    JobDataDeleter jobDataDeleter = new JobDataDeleter(client, jobId);
    jobDataDeleter.deleteDatafeedTimingStats(
Review comment (Contributor):

I think the timing stats should only be deleted if the job ID is changed. As it stands, a small change like slightly increasing the query delay will reset the timing stats.
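That is, the delete would be guarded by a comparison of the old and new job IDs; a minimal sketch of that guard (variable names and the listener are assumptions):

// Only wipe the timing stats when the datafeed is pointed at a different job;
// a cosmetic update such as a slightly larger query_delay keeps the stats intact.
String previousJobId = previousDatafeedConfig.getJobId();
String updatedJobId = datafeedConfigBuilder.build().getJobId();
if (updatedJobId.equals(previousJobId) == false) {
    new JobDataDeleter(client, previousJobId).deleteDatafeedTimingStats(deleteListener);
}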

Reply (Contributor Author):

Good catch.
Done + added test

@droberts195 (Contributor) left a comment:

LGTM

Thanks for all the iterations. Let's merge this and do any further tweaks in the follow-up PR that adds an average search time per bucket.
