Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Transform] Align transform checkpoint range with date_histogram interval for better performance #74004

Merged

Conversation

przemekwitek
Copy link
Contributor

@przemekwitek przemekwitek commented Jun 10, 2021

This PR makes the checkpoint boundaries aligned with top-level date_histogram bucket boundaries.
The optimization is applied only when the transform config has date_histogram as the first group in the group_by list.

Relates #62746

@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch 3 times, most recently from 48634ef to 944fea1 Compare June 15, 2021 06:28
@przemekwitek przemekwitek changed the title Transform align timestamps for perf [Transform] Align transform checkpoint range with date_histogram interval for better performance Jun 15, 2021
@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch 2 times, most recently from 6a65d22 to dc3a876 Compare June 18, 2021 08:27
@przemekwitek przemekwitek marked this pull request as ready for review June 18, 2021 09:18
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 18, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Copy link
Contributor

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make this an optional feature and add a setting (hello, naming discussion ;-) ). The default should be to write incomplete/interim buckets as we do today. We might identify some cases where we always want to align to bucket boundaries, e.g. if frequency and bucket boundaries are small (e.g for fixed intervals with less than 60s it might make sense to align per default). To be discussed, we should give users some guidance about when this is useful.

IMO batch transforms should stay as they are, the assumption here is static data and if the user wants it bounded, he can specify a query.

Copy link
Contributor Author

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be discussed, we should give users some guidance about when this is useful.

I had an impression that we want to free the user from the responsibility of enabling this optimization themselves.
Do you envision situations where this optimization is harmful for the user to the point they want to disable it?
Or is it more of a safety measure so that the user has a way out if the optimization doesn't work for them for any unforseen reason?

IMO batch transforms should stay as they are, the assumption here is static data and if the user wants it bounded, he can specify a query.

Ok, makes sense.

@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch from dc3a876 to dda0983 Compare June 21, 2021 09:15
@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch 3 times, most recently from 5bd2af5 to 6491904 Compare July 5, 2021 14:18
@przemekwitek
Copy link
Contributor Author

We should make this an optional feature and add a setting

Done.

@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch from 4846ae9 to 785a706 Compare July 12, 2021 11:26
Copy link
Contributor

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added 2 comments / questions,

I have a suspicion:

The checkpoint provider is passed in as argument to the indexer and it's also final. That means changing the setting via _update won't be applied, but you have to stop and start the transform to force a reload of the checkpoint provider.

If I am right, this is an existing bug. However we could leave it aside for this PR, some re-factoring is probably required. Nevertheless I think we should fix it for this release cycle.

Map.Entry<String, SingleGroupSource> topLevelGroupEntry = groups.entrySet().iterator().next();
if ((topLevelGroupEntry.getValue() instanceof DateHistogramGroupSource) == false) {
return timestamp;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the reason for requiring the top grouping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under impression that the date histogram grouping must be the first (top) specified in order to split all data into time buckets properly.
If it's not first then some other buckets (e.g.: terms) that are first will make the date histogram buckets mixed.
Does it make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, nevermind.
I've changed it so that it searches for the first date histogram source that matches on field name.

@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch from 785a706 to c8c34d7 Compare July 14, 2021 14:02
Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I read the PR description Align transform checkpoint range with date_histogram interval for better performance I understand what that means but I'm not sure what interim_results means in this context. Perhaps write_partial | incomplete | interim_buckets or align_checkpoints or just make apply the optimisation automatically and remove the config option.

@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch 2 times, most recently from 5a605ea to 721d6cf Compare July 21, 2021 08:15
Copy link
Contributor Author

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I read the PR description Align transform checkpoint range with date_histogram interval for better performance I understand what that means but I'm not sure what interim_results means in this context. Perhaps write_partial | incomplete | interim_buckets or align_checkpoints

interim_results is the setting name we came up with during the discussion although I agree it is not perfect. We can of course revisit that.

or just make apply the optimisation automatically and remove the config option.

This is the option we discarded with the assumption there will be users that don't want the optimization for any reason.

Map.Entry<String, SingleGroupSource> topLevelGroupEntry = groups.entrySet().iterator().next();
if ((topLevelGroupEntry.getValue() instanceof DateHistogramGroupSource) == false) {
return timestamp;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, nevermind.
I've changed it so that it searches for the first date histogram source that matches on field name.

Copy link
Contributor

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch from 4cc0747 to 2d788ae Compare August 16, 2021 08:19
@przemekwitek
Copy link
Contributor Author

run elasticsearch-ci/bwc

1 similar comment
@przemekwitek
Copy link
Contributor Author

run elasticsearch-ci/bwc

@przemekwitek przemekwitek force-pushed the transform_align_timestamps_for_perf branch from 4cca62f to a575b77 Compare August 16, 2021 14:59
@przemekwitek
Copy link
Contributor Author

interim_results is the setting name we came up with during the discussion although I agree it is not perfect. We can of course revisit that.

I'm merging this with interim_results as the PR is functionally complete. I'll do the remaining in the follow-up PR.

@przemekwitek przemekwitek merged commit f9d30ad into elastic:master Aug 16, 2021
@przemekwitek przemekwitek deleted the transform_align_timestamps_for_perf branch August 16, 2021 15:50
przemekwitek added a commit that referenced this pull request Aug 16, 2021
wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Aug 17, 2021
* master: (868 commits)
  Query API key - Rest spec and yaml tests (elastic#76238)
  Delay shard reassignment from nodes which are known to be restarting (elastic#75606)
  Reenable bwc tests for elastic#76475 (elastic#76576)
  Set version to 7.15 in BWC code (elastic#76577)
  Don't remove warning headers on all failure (elastic#76434)
  Disable bwc tests for elastic#76475 (elastic#76541)
  Re-enable bwc tests (elastic#76567)
  Keep track of data recovered from snapshots in RecoveryState (elastic#76499)
  [Transform] Align transform checkpoint range with date_histogram interval for better performance (elastic#74004)
  EQL: Remove "wildcard" function (elastic#76099)
  Fix 'accept' and 'content_type' fields for search_mvt API
  Add persistent licensed feature tracking (elastic#76476)
  Add system data streams to feature state snapshots (elastic#75902)
  fix the error message for instance methods that don't exist (elastic#76512)
  ILM: Add validation of the number_of_shards parameter in Shrink Action of ILM (elastic#74219)
  remove dashboard only reserved role (elastic#76507)
  Fix Stack Overflow in UnassignedInfo in Corner Case (elastic#76480)
  Add (Extended)KeyUsage KeyUsage, CipherSuite & Protocol to SSL diagnostics (elastic#65634)
  Add recovery from snapshot to tests (elastic#76535)
  Reenable BwC Tests after elastic#76532 (elastic#76534)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants