Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Transform] add support for missing bucket #59591

Merged
merged 12 commits into from Jul 29, 2020

Conversation

hendrikmuhs
Copy link
Contributor

@hendrikmuhs hendrikmuhs commented Jul 15, 2020

add support for "missing_bucket" in group_by

fixes #42941
fixes #55102

@hendrikmuhs
Copy link
Contributor Author

hendrikmuhs commented Jul 20, 2020

Discussion

missing_bucket and missing

With missing_bucket this PR exposes the composite aggregation syntax. However in aggregations it's common to have a configuration option called missing.

missing_bucket

  • maps non-existing fields to a bucket with key null
    • the query for retrieving these buckets is the same for source and dest (!_exists_:field)
  • if in a transform you want to map null buckets to a certain key instead, you need an extra ingest pipeline set processor
    • its not possible to collapse the bucket, e.g. if you want a default
      • even worse, wrong collapsing can lead to overwriting and therefore create broken results

missing

  • maps non-existing fields to a bucket with a given key, key must not be null
  • can clash (on purpose) with with existing fields, so you can collapse buckets, e.g. "missing":0 in a histogram
  • allowing "missing": null would be possible as it is valid json, however for high-level clients this is problematic as some default to remove the whole entry if the value is null, therefore it's better to keep disallow null and handle this specific case with "missing_bucket":true

Adding missing to composite aggs

In order to support missing (#48243) we 1st require support in composite aggs, see #60043.

@hendrikmuhs hendrikmuhs marked this pull request as ready for review July 22, 2020 09:58
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically this looks good although it's hard to see the important bits because they're swamped by the simple changes to pass around the extra setting.

I left one question about how it is enforced that missing bucket cannot be used with the time sync field.

@hendrikmuhs
Copy link
Contributor Author

Basically this looks good although it's hard to see the important bits because they're swamped by the simple changes to pass around the extra setting.

true, sorry. During implementation I spotted missing code (client builders for geo) and bugs (missing/wrong equals and hash functions).

The important bits are in CompositeBucketsChangeCollector.java, especially the complex logic to find new documents that miss certain fields (see the query trees in the code comments).

@hendrikmuhs
Copy link
Contributor Author

I re-worked the missing_bucket handling and introduced a audit/log if the continuous transform uses a configuration that does not allow at least 1 optimization.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor logging comment.

The change overall looks good.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hendrikmuhs hendrikmuhs merged commit 004388f into elastic:master Jul 29, 2020
@hendrikmuhs hendrikmuhs deleted the transform-missing-bucket-2 branch July 29, 2020 12:00
hendrikmuhs pushed a commit that referenced this pull request Jul 30, 2020
add support for "missing_bucket" in group_by

fixes #42941
fixes #55102
backport #59591
@nicpenning
Copy link

This issue possibly mentioned in the community Slack (by me). Figured I would drop it here for others who come across this if I am following correctly.

If the field you are grouping by does not exist in the document, is it expected that the document will not be in the transformed index?
For example. I have a transform that I expect to have 5 documents but only has 4 after the transform because 1 document does not contain the file.name field at all.
If I clone that transform and remove the file.name field from the group by, I get all 5 documents in the transform.

Demo:
magic (1)

@benwtrent
Copy link
Member

If the field you are grouping by does not exist in the document, is it expected that the document will not be in the transformed index?

Yes it is expected behavior. We only grab docs that have ALL the group_by fields.

With the additional support of missing_bucket, we should be able to grab documents that are missing the group_by fields.

@nicpenning
Copy link

Hey @hendrikmuhs and team!

Adding "missing_bucket": true to the "terms": { } field in the Edit JSON Config (Kibana) during the creation of a new transform does not get saved. The setting seems to get removed by the time it gets created. So basically this does not work in the Edit JSON Config section when creating a new transform.

And on a side note which will likely be worth of it's own issue is that editing the JSON Config in Kibana is very slow. Just clicking the Edit JSON takes about 20 seconds for the JSON to load. Then on top of that when you click inside of the loaded JSON and hit a key (such as enter/new line) it takes another 20 seconds. So the editing of this JSON Config is painful to use.

Here is a snippet of me trying to add "","" to the Edit JSON Config to demonstrate slowness. (Is it querying every time I make a change?)
magic2

On a positive note, the API for creating this works flawlessly. 👌

@hendrikmuhs
Copy link
Contributor Author

@nicpenning

The edit issue is tracked in elastic/kibana#83266 and should be fixed in the next version.

I don't know the reason of the perceived slowness. @walterra, do you know?

@walterra
Copy link

walterra commented Feb 1, 2021

@nicpenning Hey Nic, how many fields does the index have you're using in the transforms wizard? There's a known issue with indices with 1000+ fields resulting in a slow wizard: elastic/kibana#78590

@nicpenning
Copy link

Ahh, that could be it.

I am using filebeat-*.

Which I am sure you are aware contains a lot of fields if a person is using many integrations.

I imagine its well over 3K fields for that index pattern.

I suppose the work around is having a more narrow index pattern?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

_transform API does not support missing_bucket [Transform] Add support for missing_bucket in group_by
7 participants