Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add delayed datacheck to the datafeed job runner #35387

Merged

Conversation

benwtrent
Copy link
Member

@benwtrent benwtrent commented Nov 8, 2018

This adds a latent data check to the DatafeedJobwhen it runs real-time jobs.

It can optionally be enabled via the DatafeedConfig.

Two new options added to the DatafeedConfig object, client side, and docs have been updated accordingly.

Couple of decision points I could not really settle on:

  • Should we set a default delayed_data_check_window if should_run_delayed_data_check is true?
  • Should we put sane limits on delayed_data_check_window outside of the current bucket length checks? If so, what are those sane limits?
  • Should we attempt to put the time of the missing data in the Audit? Maybe put the timestamp of the latest bucket missing data?
  • How often should we check for this missing data?

The integration tests are done within DelayedDataDetectorIT previously included, that covered the actual looking up of missing data and doing real queries, etc. against a cluster.

Consequently, I just mocked out DelayedDataDetector in the DatafeedJobTests class and thought that coverage adequate.

Screenshot of how it looks:

screen shot 2018-11-09 at 9 00 22 am

Relates to #35131

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM I have some general comments about the behaviour

@davidkyle
Copy link
Member

davidkyle commented Nov 8, 2018

Should we set a default delayed_data_check_window if should_run_delayed_data_check is true?

I say no. If we are using the model where the user has to set both params then it's a problem is both aren't set, otherwise the behaviour may be unexpected. I do think should_run_delayed_data_check should be true by default with a sensible value of delayed_data_check_window

Should we put sane limits on delayed_data_check_window outside of the current bucket length checks? If so, what are those sane limits?

With a 10 minute bucket the current value of 10,000 buckets is over 64 days. Did you consider making the limit an absolute time value rather than a function of the bucket length.

Should we attempt to put the time of the missing data in the Audit? Maybe put the timestamp of the latest bucket missing data?

Yes please.
Is there anything that stops the same message being audited again and again each time the check is made?

How often should we check for this missing data?

Less frequently 🤷‍♂️

@benwtrent
Copy link
Member Author

@davidkyle when it comes to a sane upper limit I can see benefits to either using a true TimeValue or a multiple of the bucket span.

I do realize that it is way too large right now, the current limit is dictated by the number of documents that are searchable at a time and not by a realistic use case.

Yes please.
Is there anything that stops the same message being audited again and again each time the check is made?

There is nothing preventing this right now...but if we store the latest timestamp of the bucket with missing docs && how many docs are missing overall, we could simply check if the values have changed and skip writing the message so that the audit log does not get spammed.

I do think should_run_delayed_data_check should be true by default with a sensible value of delayed_data_check_window

Any suggestions on a sensible value?

@benwtrent
Copy link
Member Author

@davidkyle Updated. Key updates

  • Interval check is now 15 minutes, or the data check window length, which ever is smaller. If the querydelay+frequency is longer than both, we will be doing a check every time.
  • Default values are 2 hours for window length, and true
  • Fixed some of the initialization of the DelayedDataChecker, I did not like how I was validating the window, so adjusted it so that it is never null, and is initialized in the constructor. This leads to clearer behavior.
  • Added the timestamp to the audit message and prevented audit spamming.

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* The window must be larger than the {@link org.elasticsearch.client.ml.job.config.AnalysisConfig#bucketSpan}, less than
* 24 hours, and span less than 10,000x buckets.
*
* @param delayedDataCheckWindow The time length in the past from the latest finalized bucket to look for latent data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit missing full stop at the end of the sentence

@benwtrent
Copy link
Member Author

Jenkins retest this please

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a few recommendations.

@benwtrent
Copy link
Member Author

Jenkins retest this please

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM I think this looks wonderful now, great work Ben!

@benwtrent
Copy link
Member Author

Jenkins retest this please

@benwtrent benwtrent merged commit f7ada9b into elastic:master Nov 15, 2018
@benwtrent benwtrent deleted the feature/ml-datafeed-job-check-missing-data branch November 15, 2018 19:32
benwtrent added a commit that referenced this pull request Nov 15, 2018
* ML: Adding missing datacheck to datafeedjob

* Adding client side and docs

* Making adjustments to validations

* Making values default to on, having more sensible limits

* Intermittent commit, still need to figure out interval

* Adjusting delayed data check interval

* updating docs

* Making parameter Boolean, so it is nullable

* bumping bwc to 7 before backport

* changing to version current

* moving delayed data check config its own object

* Separation of duties for delayed data detection

* fixing checkstyles

* fixing checkstyles

* Adjusting default behavior so that null windows are allowed

* Mentioning the default value

* Fixing comments, syncing up validations
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants