[Transform] continuous transform date_histogram group_by performance #54254

Closed
hendrikmuhs opened this issue Mar 26, 2020 · 2 comments · Fixed by #54068

hendrikmuhs commented Mar 26, 2020

Affected versions: < 7.7

Problem

Continuous transforms are optimized for use cases where sessions are grouped using terms. Grouping on date_histogram (e.g. per-hour metrics) with large datasets suffers because all buckets are rewritten at every checkpoint. This puts a lot of load on the cluster and might result in service degradation.

Mitigation

Rollup is optimized for this use case and provides, via rollup search, aggregations on aggregations. Please consider using rollup instead.

Transform will provide an optimization for grouping on date_histogram in version 7.7. Please consider upgrading to 7.7. (Note that you can use a separate cluster for transform, as transform supports cross-cluster search, CCS.)

For this optimization to kick in, the field you configure for sync must be the same field you configure for the date_histogram group_by. Using multiple group_by entries is still possible; the transform is optimized for the group_by whose field matches the field used for time-based sync.
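
As an illustration of the matching rule above, here is a minimal Python sketch that inspects a transform config (written as a plain dict) and reports which group_by, if any, uses the same field as the time-based sync. The helper name `optimized_group_by`, the index names, and the field names are hypothetical; only the general shape of the config (`sync.time.field`, `pivot.group_by`) follows the transform API.

```python
from typing import Any, Dict, Optional


def optimized_group_by(config: Dict[str, Any]) -> Optional[str]:
    """Return the name of the date_histogram group_by whose field matches
    the sync time field, or None if there is no such group_by."""
    sync_field = config.get("sync", {}).get("time", {}).get("field")
    if sync_field is None:
        return None
    for name, source in config.get("pivot", {}).get("group_by", {}).items():
        histogram = source.get("date_histogram")
        if histogram is not None and histogram.get("field") == sync_field:
            return name
    return None


# Hypothetical config: index and field names are made up for illustration.
example_config = {
    "source": {"index": "reviews"},
    "dest": {"index": "reviews_per_hour"},
    "frequency": "1m",
    "sync": {"time": {"field": "timestamp", "delay": "60s"}},
    "pivot": {
        "group_by": {
            "hour": {"date_histogram": {"field": "timestamp", "calendar_interval": "1h"}},
            "user": {"terms": {"field": "user_id"}},
        },
        "aggregations": {"avg_rating": {"avg": {"field": "rating"}}},
    },
}

print(optimized_group_by(example_config))  # prints: hour
```

If the sync field were a different field (say an ingest timestamp), the helper would return None, which corresponds to the case where the optimization does not apply.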

If you cannot switch to rollup and upgrading to 7.7 is not possible, you can work around the problem by adding a query filter that excludes data you know is not required for updating the transform:

"range" : {
    "TIMESTAMP_FIELD" : {
        "gte" : "FILTER_VALUE"
    }
}

TIMESTAMP_FIELD should be the same field that you use for the date_histogram as well as for sync.
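
One way to apply the filter is to place it in the `source.query` of the transform config. The following Python sketch shows the idea on a config held as a dict; the helper name `with_range_filter` is hypothetical, and combining the filter with an existing query through a `bool` query is an assumption about how you might want to merge them, not something prescribed by this issue.

```python
import copy
from typing import Any, Dict


def with_range_filter(config: Dict[str, Any], timestamp_field: str,
                      filter_value: str) -> Dict[str, Any]:
    """Return a copy of the transform config whose source query only
    matches documents with timestamp_field >= filter_value."""
    new_config = copy.deepcopy(config)
    source = new_config.setdefault("source", {})
    # Keep whatever query was configured before; default to match_all.
    existing = source.get("query", {"match_all": {}})
    range_filter = {"range": {timestamp_field: {"gte": filter_value}}}
    source["query"] = {"bool": {"filter": [existing, range_filter]}}
    return new_config


# Hypothetical usage: index and field names are made up.
config = {"source": {"index": "reviews"}}
filtered = with_range_filter(config, "timestamp", "now-1h/h")
print(filtered["source"]["query"])
```

The original config is left untouched, so the unfiltered version can be restored once the cluster runs a version with the built-in optimization.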

FILTER_VALUE must be no later than now minus (delay + interval), so that only data older than that window is excluded. Also take bucket rounding into account.

For example, if you group every 5 minutes and your ingest delay is 1 minute, the query can filter out everything older than 6 minutes. You can use date math based on now to express the value.

Examples:

  • now-1h/h excludes everything older than 1 hour, rounded down to the hour.
  • now-1d/d excludes everything older than 1 day, rounded down to the day.

Note: This does not have to be exact; you can filter less. However, it is important to round down to the start of a bucket. Without rounding down, the transform will overwrite older buckets with wrong or incomplete data.
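
If you prefer to compute an absolute FILTER_VALUE on the client side, the rounding rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `bucket_aligned_cutoff` is hypothetical, and it assumes fixed-length buckets aligned to the Unix epoch in UTC with no offset or time zone configured on the date_histogram.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_aligned_cutoff(now: datetime, interval: timedelta,
                          delay: timedelta) -> datetime:
    """Return (now - delay - interval), rounded down to the start of a bucket.

    Assumes buckets of fixed length `interval` aligned to the Unix epoch (UTC).
    """
    target = now - delay - interval
    # timedelta // timedelta gives the number of whole buckets since the epoch.
    whole_buckets = (target - EPOCH) // interval
    return EPOCH + whole_buckets * interval


# The example from above: 5 minute buckets, 1 minute ingest delay.
now = datetime(2020, 3, 26, 12, 7, 30, tzinfo=timezone.utc)
cutoff = bucket_aligned_cutoff(now, timedelta(minutes=5), timedelta(minutes=1))
print(cutoff.isoformat())  # prints: 2020-03-26T12:00:00+00:00

# The cutoff as epoch milliseconds, usable as "gte" in the range filter.
range_filter = {
    "range": {
        "timestamp": {  # hypothetical field name
            "gte": int(cutoff.timestamp() * 1000),
            "format": "epoch_millis",
        }
    }
}
```

Here now minus 6 minutes is 12:01:30, which falls into the bucket starting at 12:00:00, so the filter keeps that whole bucket rather than cutting it in half.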

@elasticmachine

Pinging @elastic/ml-core (:ml/Transform)

@hendrikmuhs

Runtime statistics example for the optimization:

Dataset: user reviews, 5.3 million reviews
feed: 20 events/s
transform config:
frequency 10s
date_histogram 1m

| Run | input documents | documents written | index time in ms | search time in ms |
|---|---|---|---|---|
| base (batch) | 5261600 | 452 | 11 | 407 |
| continuous before | 7120905395 | 611556 | 22647 | 367183 |
| continuous after | 20843400 | 3157 | 11421 | 5345 |

Note that because the frequency of 10s was smaller than the bucket interval of 60s, the most recent bucket had to be rewritten multiple times. With a frequency closer to the bucket interval, the continuous transform would come closer to the batch transform.
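
To put the numbers above in relation, the following Python sketch derives the reduction factors between the two continuous runs and the number of checkpoints that fall into one bucket. The figures are copied from the statistics above; the variable and function names are made up for illustration.

```python
# Figures copied from the runtime statistics above.
before = {
    "input_documents": 7120905395,
    "documents_written": 611556,
    "index_time_ms": 22647,
    "search_time_ms": 367183,
}
after = {
    "input_documents": 20843400,
    "documents_written": 3157,
    "index_time_ms": 11421,
    "search_time_ms": 5345,
}


def reduction_factors(before: dict, after: dict) -> dict:
    """Return how many times smaller each metric is after the optimization."""
    return {name: before[name] / after[name] for name in before}


for name, factor in reduction_factors(before, after).items():
    print(f"{name}: {factor:.1f}x")

# With a 10s frequency and 60s buckets, about six checkpoints fall into
# each bucket, so the most recent bucket is rewritten several times.
frequency_s = 10
bucket_interval_s = 60
checkpoints_per_bucket = bucket_interval_s // frequency_s
print(checkpoints_per_bucket)  # prints: 6
```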

hendrikmuhs pushed a commit that referenced this issue Mar 26, 2020
optimize transform for group_by on date_histogram by injecting an additional range query. This limits the number of search and index requests and avoids unnecessary updates. Only recent buckets get re-written.

fixes #54254
yyff pushed a commit to yyff/elasticsearch that referenced this issue Apr 17, 2020
optimize transform for group_by on date_histogram by injecting an additional range query. This limits the number of search and index requests and avoids unnecessary updates. Only recent buckets get re-written.

fixes elastic#54254