
[ML] Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response #43045

Merged: 16 commits, merged on Jul 9, 2019

Conversation

@przemekwitek (Contributor) commented on Jun 10, 2019:

This PR introduces the DatafeedTimingStats class, which will hold various timing-related metrics useful for debugging.
Initially, the only metric available is totalSearchTimeMs, which accumulates the total time spent by the datafeed searching indices.

Related issue: #29857
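For orientation, here is a minimal sketch of what such a timing-stats holder looks like conceptually. Only the totalSearchTimeMs metric and the job ID are taken from this PR; everything else (class and method names) is illustrative, not the actual class:

// Illustrative sketch only; the real DatafeedTimingStats in this PR is also
// serializable so it can be persisted to an index and streamed in the stats response.
public class DatafeedTimingStatsSketch {

    private final String jobId;        // stats are stored and looked up by the job's ID
    private double totalSearchTimeMs;  // accumulated time the datafeed spent searching indices

    public DatafeedTimingStatsSketch(String jobId) {
        this.jobId = jobId;
    }

    /** Adds the duration of one datafeed search to the running total. */
    public void incrementTotalSearchTimeMs(double searchTimeMs) {
        totalSearchTimeMs += searchTimeMs;
    }

    public String getJobId() {
        return jobId;
    }

    public double getTotalSearchTimeMs() {
        return totalSearchTimeMs;
    }
}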

@elasticmachine (Collaborator):

Pinging @elastic/ml-core

@przemekwitek force-pushed the datafeed-timing-stats branch 4 times, most recently from bc22def to 57d3fbe on June 17, 2019 13:55
@przemekwitek marked this pull request as ready for review on June 17, 2019 14:04
@benwtrent self-requested a review on June 19, 2019 12:22
@benwtrent (Member) left a comment:

All datafeed BWC tests are still muted (I think) due to the date_histogram aggregation interval change. If they weren't, I think they would fail due to streaming issues, since the 7.3.0 code does not have this change yet.

@przemekwitek changed the title from "Add TimingStats to datafeed GetDatafeedStatsAction.Response" to "Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response" on Jun 24, 2019
@przemekwitek (Contributor Author) left a comment:

> All datafeed BWC tests are still muted (I think) due to the date_histogram aggregation interval change. If they weren't, I think they would fail due to streaming issues, since the 7.3.0 code does not have this change yet.

I'm not sure I understand. Which 'streaming issues' do you have in mind?

@benwtrent (Member):

@przemekwitek this is the streaming issue I was talking about: https://github.com/elastic/elasticsearch/pull/43045/files#diff-34cde95ae0dc7e6c0e10460382efb116R151

This code does not exist in 7.3.0 yet, but a stream coming from a 7.3.0 node would be expected to have this format.
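For context, the usual way such streaming breaks are avoided is to gate the new field on the wire version; here is a hedged sketch of that pattern (the exact version constant and whether the field is optional are assumptions, not necessarily what this PR does):

// Sketch of version-gated serialization for a new field in a transport response.
// Assumes the field only exists from 7.4.0 onwards; older nodes neither send nor expect it.
public Response(StreamInput in) throws IOException {
    datafeedId = in.readString();
    if (in.getVersion().onOrAfter(Version.V_7_4_0)) {
        timingStats = in.readOptionalWriteable(DatafeedTimingStats::new);
    } else {
        timingStats = null;
    }
}

@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeString(datafeedId);
    if (out.getVersion().onOrAfter(Version.V_7_4_0)) {
        out.writeOptionalWriteable(timingStats);
    }
}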

@droberts195 (Contributor) left a comment:

I left a few comments about not changing the rules on which indices each job's results could be in. I think this is important because it's complicated enough at the moment and we have a plan to make it easier for users to set up rolling results indices.

@przemekwitek (Contributor Author):

> @przemekwitek this is the streaming issue I was talking about: https://github.com/elastic/elasticsearch/pull/43045/files#diff-34cde95ae0dc7e6c0e10460382efb116R151
>
> This code does not exist in 7.3.0 yet, but a stream coming from a 7.3.0 node would be expected to have this format.

Do you think it will be OK if I just backport this PR to 7.3 shortly after it is merged to master?
Or do I need to take extra steps not to break CI?

@benwtrent (Member):

@przemekwitek all the datafeed BWC tests are currently muted due to another issue. I would just make sure they are still muted before backporting. If BWC passes here in the PR, then it should be fine to merge.

@benwtrent (Member) left a comment:

those doubles should not be optional in the stream in/out methods.
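In other words, if the value is always present it can be written with the plain primitive methods instead of the optional variants; a small sketch of the difference on Elasticsearch's StreamInput/StreamOutput:

// Optional variant: writes an extra presence byte and forces a nullable Double on read.
out.writeOptionalDouble(totalSearchTimeMs);
Double maybeTotal = in.readOptionalDouble();

// Non-optional variant: the field is always present, so a primitive double is enough.
out.writeDouble(totalSearchTimeMs);
double total = in.readDouble();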

@przemekwitek (Contributor Author):

run elasticsearch-ci/2

@benwtrent (Member) left a comment:

Two last questions around design:

  • Why aren't we using the search time returned in the search payload?
  • Should we restart the stats when a datafeed is deleted and recreated but the job remains the same?

public <T, R> R executeWithReporting(Function<T, R> function, T argument) {
    Instant before = clock.instant();
    R result = function.apply(argument);
    Instant after = clock.instant();
Review comment (Member):

So, this is an interesting predicament. I can understand the reason for wanting "wall clock time" for the entire function, but strictly speaking, this is NOT the search time. The search time is returned in the search payload and could be gathered from there.

Though I also see the benefit of having the entire method wrapped.

Unsure if the complexity is necessary though. Statistically, the real search time and the time spent in this method should be almost perfectly correlated and give the same indication of run time.
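For reference, the search time carried in the search payload is the response's took value; a minimal sketch of reading it (the reporter call is hypothetical, not the method this PR ended up with):

// The search time as reported by Elasticsearch itself is the response's "took" value.
SearchResponse searchResponse = client.search(searchRequest).actionGet();
long searchTimeMs = searchResponse.getTook().millis();
timingStatsReporter.reportSearchDuration(searchTimeMs);  // hypothetical reporter method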

Reply (Contributor Author):

Done.
Thanks for this suggestion Ben. The interface of DatafeedTimingStatsReporter is much cleaner now.

.setSize(1)
.setIndicesOptions(IndicesOptions.lenientExpandOpen())
.setQuery(QueryBuilders.idsQuery().addIds(DatafeedTimingStats.documentId(jobId)))
.addSort(SortBuilders.fieldSort(DatafeedTimingStats.TOTAL_SEARCH_TIME_MS.getPreferredName()).order(SortOrder.DESC));
Review comment (Member):

I don't understand how this is the latest. I thought TOTAL_SEARCH_TIME_MS was a summation of the individual search times.

Ah, I see, the assumption that it is always monotonically increasing.

Do we account for the datafeed being deleted, and a new one being created? Either we should be keeping the previous datafeed's search time total, or we should restart the summation back at zero.

If we want an actual summation of search time for the lifetime of the job, then continuing to increase the stats from a previously deleted datafeed makes sense.

If we want an accurate search time per datafeed, then this search would fail, as we could restart the search time statistics.

Review comment (Contributor):

This is an interesting point. The datafeed timing stats are persisted with the job's ID and are deleted when the job is deleted. So internally they are associated with the job rather than the datafeed. But then they're reported via the datafeed stats endpoint, so externally they appear to be associated with the datafeed.

As long as you always create and delete the job and datafeed together in quick succession it works. But if the datafeed is deleted and recreated independently then it doesn't. To make that work the lifecycle of the datafeed stats needs to be moved from the job to the datafeed. But that is also hard because the datafeed does not have an index - only the job has an index. Even this is complicated as the job results can be spread over multiple indices, so deletion of the timing stats is not deleting a single document in a single index but potentially multiple documents each with the same ID but in different indices.

@przemekwitek please can you have a look at the delete datafeed code and see how hard it would be to do a delete-by-query to delete the datafeed timing stats when a datafeed is deleted? I think if it's not too hard we should do this, because externally the stats are part of the datafeed. But if it's hard let me know and we can think again.
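A minimal sketch of what such a delete-by-query could look like (the index-name helper, the refresh choice, and the listener wiring are assumptions; the actual change may be structured differently):

// Delete the timing stats document(s) for this job across all of the job's result indices.
DeleteByQueryRequest request =
    new DeleteByQueryRequest(AnomalyDetectorsIndex.jobResultsAliasedName(jobId))  // assumed index/alias helper
        .setQuery(QueryBuilders.idsQuery().addIds(DatafeedTimingStats.documentId(jobId)))
        .setRefresh(true);
client.execute(DeleteByQueryAction.INSTANCE, request, ActionListener.wrap(
    response -> listener.onResponse(true),
    listener::onFailure));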

Reply (Contributor Author):

> I don't understand how this is the latest. I thought TOTAL_SEARCH_TIME_MS was a summation of the individual search times.
>
> Ah, I see, the assumption that it is always monotonically increasing.

That's correct.
To make it more explicit I could introduce another field, last_modification_timestamp. BTW, this sounds like something every document could benefit from. Is such functionality built in, or does such a field have to be added on a per-index basis?

> Do we account for the datafeed being deleted, and a new one being created? Either we should be keeping the previous datafeed's search time total, or we should restart the summation back at zero.
>
> If we want an actual summation of search time for the lifetime of the job, then continuing to increase the stats from a previously deleted datafeed makes sense.
>
> If we want an accurate search time per datafeed, then this search would fail, as we could restart the search time statistics.

Good point, so far I was treating the job-datafeed pair as a unit, so I have to revisit that.

Reply (Contributor Author):

> This is an interesting point. The datafeed timing stats are persisted with the job's ID and are deleted when the job is deleted. So internally they are associated with the job rather than the datafeed. But then they're reported via the datafeed stats endpoint, so externally they appear to be associated with the datafeed.
>
> As long as you always create and delete the job and datafeed together in quick succession it works. But if the datafeed is deleted and recreated independently then it doesn't. To make that work the lifecycle of the datafeed stats needs to be moved from the job to the datafeed. But that is also hard because the datafeed does not have an index - only the job has an index.

Correct, I did not want to introduce yet another index, especially since we do not plan to encourage job-datafeed divergence.

> Even this is complicated as the job results can be spread over multiple indices, so deletion of the timing stats is not deleting a single document in a single index but potentially multiple documents each with the same ID but in different indices.
>
> @przemekwitek please can you have a look at the delete datafeed code and see how hard it would be to do a delete-by-query to delete the datafeed timing stats when a datafeed is deleted? I think if it's not too hard we should do this, because externally the stats are part of the datafeed. But if it's hard let me know and we can think again.

I'll do that.

Reply (Contributor Author):

This is now done.

PTAL

@sophiec20 changed the title from "Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response" to "[ML] Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response" on Jun 28, 2019
@jpountz added the v7.4.0 label on Jul 3, 2019
* Use DeleteByQueryAction to delete DatafeedTimingStats
* Refine the new tests in DatafeedJobsIT test suite
- match: { datafeeds.0.datafeed_id: "datafeed-1"}
- match: { datafeeds.0.state: "stopped"}
- match: { datafeeds.0.timing_stats.job_id: "get-datafeed-stats-1"}
- match: { datafeeds.0.timing_stats.search_count: 1}
Review comment (Contributor):

Is there something that ensures that the timing stats document is searchable in the index after stopping a datafeed? I couldn't see this but maybe I'm just missing it (in which case the rest of this comment is irrelevant).

In other words, what stops the following sequence of events?

  1. Datafeed is started
  2. Datafeed does a search
  3. Datafeed indexes a timing stats document, which sits in the translog and is not yet committed to Lucene
  4. Datafeed is stopped
  5. Get datafeed stats searches the index, and does not find the most recent timing stats document containing details of the search
  6. Translog is flushed and the latest timing stats becomes searchable

The test might be asserting in between steps 5 and 6. Of course, it's possible that the document becomes searchable before the get datafeed stats request runs, so then the test will pass.

There are two approaches for solving this:

  1. On stopping a datafeed persist the most recent timing stats again with refresh=immediate.
  2. Use refresh=immediate every time datafeed timing stats are persisted.

I prefer option 1 but if it's really hard then option 2 is probably OK as the volume is not that high.
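For option 2, the persisting code just asks for an immediate refresh on each index request; a hedged sketch (the index name and source builder variables are placeholders, not names from this PR):

// Make the timing stats document searchable as soon as it is indexed, at the cost of
// one refresh per persist; acceptable here because the stats are written infrequently.
IndexRequest indexRequest = new IndexRequest(resultsIndexName)
    .id(DatafeedTimingStats.documentId(jobId))
    .source(timingStatsSourceBuilder)
    .setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);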

Reply (Contributor Author):

I didn't want to make this (already big) PR any bigger, so I implemented approach 2 and added a TODO to revisit it in the future.
Is that OK?

BTW, I was not aware of the translog.

datafeedConfigBuilder -> {
    String jobId = datafeedConfigBuilder.build().getJobId();
    JobDataDeleter jobDataDeleter = new JobDataDeleter(client, jobId);
    jobDataDeleter.deleteDatafeedTimingStats(
Review comment (Contributor):

I think the timing stats should only be deleted if the job ID is changed. As it stands, a small change like slightly increasing the query delay will reset the timing stats.
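That is, the delete would be guarded by a comparison of the old and new job IDs; a minimal sketch of that guard (variable names and the listener are assumptions):

// Only wipe the timing stats when the datafeed is pointed at a different job;
// a cosmetic update such as a slightly larger query_delay keeps the stats intact.
String previousJobId = previousDatafeedConfig.getJobId();
String updatedJobId = datafeedConfigBuilder.build().getJobId();
if (updatedJobId.equals(previousJobId) == false) {
    new JobDataDeleter(client, previousJobId).deleteDatafeedTimingStats(deleteListener);
}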

Reply (Contributor Author):

Good catch.
Done + added test

@droberts195 (Contributor) left a comment:

LGTM

Thanks for all the iterations. Let's merge this and do any further tweaks in the follow-up PR that adds an average search time per bucket.
