other and missing bucket support for terms agg #15525

Merged
merged 25 commits into from Jan 5, 2018

Conversation

@ppisljar
Member

ppisljar commented Dec 11, 2017

release note: 'other' and 'missing' bucket for the terms aggregation
resolves #1961

  • the other bucket will show up if it's enabled, and will include everything that is not in the top N (the size of the terms agg)
  • if missing is enabled (a label is set), it might show up as a separate bucket (in case it's actually one of the top N buckets); if it's not, it will be included in the other bucket
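
A minimal sketch of the bucket semantics described above, working on an in-memory count map rather than a real Elasticsearch response. The function name `groupTopTerms`, its option names, and the use of a `null` key for documents without a value are assumptions made for illustration and are not taken from this PR.

```javascript
// Hypothetical illustration only: `counts` maps a term to its doc count,
// and the key `null` stands for documents where the field is missing.
function groupTopTerms(counts, { size, otherLabel, missingLabel }) {
  const entries = [...counts.entries()]
    // Missing values compete for a top-N slot only when a label is set.
    .filter(([key]) => key !== null || missingLabel)
    .map(([key, count]) => ({ key: key === null ? missingLabel : key, count }))
    .sort((a, b) => b.count - a.count);

  const top = entries.slice(0, size);
  if (!otherLabel) return top;

  // Everything outside the top N is summed into the other bucket...
  let otherCount = entries.slice(size).reduce((sum, b) => sum + b.count, 0);
  // ...including missing values when they have no bucket of their own.
  if (!missingLabel && counts.has(null)) otherCount += counts.get(null);

  return otherCount > 0 ? [...top, { key: otherLabel, count: otherCount }] : top;
}

const counts = new Map([['article', 50], ['video', 30], ['image', 5], [null, 40]]);
const withMissing = groupTopTerms(counts, { size: 2, otherLabel: 'Other', missingLabel: 'Missing' });
const withoutMissing = groupTopTerms(counts, { size: 2, otherLabel: 'Other', missingLabel: '' });
```

In this sample, the missing values are large enough to take a top-2 slot when a label is set; with no label they are folded into the other bucket.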
@alexfrancoeur

Contributor

alexfrancoeur commented Dec 12, 2017

@ppisljar looking good! A few comments, observations and open questions below.

I realize this is early, but want to state we probably need a better value than _other_ here. Other is pretty common
screen shot 2017-12-11 at 6 56 16 pm

Rather than Show other maybe we can provide some additional context here. Group other values or Group values into 'Other' slice. We can always pull @gchaps in for an improved description as well.
screen shot 2017-12-11 at 7 10 48 pm

Seems like some sort of bug. I chart the top 25 values and only produce one. However, if I check other we see additional values.
screen shot 2017-12-11 at 7 03 24 pm
screen shot 2017-12-11 at 7 03 31 pm

I'm not sure if this is a side effect of other or a pie chart's normal approach for visualizing too many slices in a small area. It looks like one single white value; maybe there's a better way of handling this.
screen shot 2017-12-11 at 7 03 49 pm

The other bucket does not seem to update with time
dec-11-2017 19-14-43

Filtering on a dashboard seems to work well. We'll want to make sure the label here is updated as well.
screen shot 2017-12-11 at 7 16 27 pm

Open questions:

  • Is there any need to have this option for splits that are not terms (significant terms, filters, ranges, etc.)?
  • Should this configuration be on by default for new visualizations?
@ppisljar

Member

ppisljar commented Dec 12, 2017

thanks @alexfrancoeur

  1. the value ... i used _other_ as Other might actually exist in your dataset (as a real value), i added an option to provide custom label for other bucket
  2. the top 25 produce only one value ('article') but other shows up, because there are probably documents where article is not set (at this point other is 'all the other values + missing values') ... there is another PR up for the missing bucket, which will then split the other bucket into 'all the other values' and 'missing' (two buckets)
  3. yeah that's a pie issue, and i would leave it out of this PR, when we have really small buckets you won't actually see them, it becomes just a white space
  4. time issue ... definitely a bug, looking into it, it should be fixed now
  5. the label on the filter will be the same as in the legend, as mentioned above i added an option so the user can supply a custom label, and we should come up with a good default

my answers to open questions:

  • for the first stage (this PR) i would limit this to terms only and wait for users to request it on other aggregations. personally i don't see much value in adding it to filters or ranges, as you could always define a filter on your own that includes everything your first N filters don't
  • i would not turn it on by default as 1. it would break backward compatibility, 2. this might be intensive on es cluster.

@ppisljar ppisljar requested a review from nreese Dec 12, 2017

@gchaps

Contributor

gchaps commented Dec 12, 2017

@alexfrancoeur @ppisljar I agree with Alex that the UI text "Show Other" should be more specific. Ping me when you're ready to discuss the text. I can also help with the wording for default values.

@ppisljar ppisljar requested a review from Bargs Dec 12, 2017

@ppisljar ppisljar changed the title from [WIP] other bucket support for terms agg to [WIP] other and missing bucket support for terms agg Dec 14, 2017

@thomasneirynck

sweeto! this is going to be a yuge addition.

This is an early PR, so some of my comments might be stale (if so, then ignore).

There's two things to evaluate I think:

  • I would decouple others and missing into two separate checkbox UIs.
  • I'd do determination of what the actual query has to become up front, in a pre-flight request instead of a postFlight request. It'd preserve the contract of Terms better imho.
type="number"
min="1"
>
<div class="vis-editor-agg-form-row">

@thomasneirynck

thomasneirynck Dec 14, 2017

Contributor

After using these for a while, I'd separate out both options. By combining them it does reduce the footprint a little, but it also makes it less clear what exactly is going to happen.

So I'd just have two parallel options:

  • A 'show other' checkbox and the other bucket label as one thing we can toggle.
  • A 'show missing' checkbox and the missing bucket label as one thing we can toggle.

@timroes

timroes Dec 15, 2017

Contributor

I also think two separate settings would be better. I am not sure if we would still need a checkbox for the "show missing" or if we could just use a text field for that, labeled something like "Replace missing values with": if the field is empty, just don't use it, and if it's set to something, replace missing values with it. I think we don't need a "user wants to show missing, but doesn't want to specify a custom label" option for that.
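
A small sketch of the "empty field means off" idea. `missing` is a documented parameter of the Elasticsearch terms aggregation; the helper name `buildTermsAgg` and its arguments are hypothetical and not part of this PR.

```javascript
// Hypothetical helper: builds the body of an Elasticsearch terms aggregation.
// The text field value doubles as the on/off switch for missing values.
function buildTermsAgg(field, size, missingLabel) {
  const terms = { field, size };
  const label = (missingLabel || '').trim();
  // An empty text field means "do not replace missing values".
  if (label !== '') terms.missing = label;
  return { terms };
}

const plain = buildTermsAgg('type.keyword', 5, '');
const labelled = buildTermsAgg('type.keyword', 5, 'Missing');
```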

@@ -60,11 +62,33 @@ export function AggTypesBucketsTermsProvider(Private) {
return agg.getFieldDisplayName() + ': ' + params.order.display;
},
createFilter: createFilter,
postFlightRequest: async (aggConfigs, aggConfig, searchSourceAggs, resp, nestedSearchSource) => {
const filterAgg = buildOtherBucketAgg(aggConfigs, searchSourceAggs, aggConfig, resp);

@thomasneirynck

thomasneirynck Dec 14, 2017

Contributor

this code-path should not be hit when the other functionality is turned off
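
One way to read this suggestion is an early return before any extra request is built. The sketch below assumes two boolean params named `otherBucket` and `missingBucket` and a stand-in `fetchOther` callback; the signature is simplified and is not the one in the diff.

```javascript
// Hypothetical sketch: skip the extra work entirely when both options are off.
function postFlightRequest(aggConfig, resp, fetchOther) {
  const { otherBucket, missingBucket } = aggConfig.params;
  // Neither option is enabled: hand back the response untouched.
  if (!otherBucket && !missingBucket) return resp;
  return fetchOther(aggConfig, resp);
}

// Stand-in for the second request; it counts how often it is called.
let extraRequests = 0;
const fetchOther = (aggConfig, resp) => {
  extraRequests += 1;
  return { ...resp, touched: true };
};

const original = { aggregations: {} };
const off = postFlightRequest({ params: { otherBucket: false, missingBucket: false } }, original, fetchOther);
const on = postFlightRequest({ params: { otherBucket: true, missingBucket: false } }, original, fetchOther);
```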

}).then(async resp => {
for (let i = 0; i < vis.aggs.length; i++) {
const agg = vis.aggs[i];
if (!agg.type || !agg.type.postFlightRequest) continue;

@thomasneirynck

thomasneirynck Dec 14, 2017

Contributor

I would add postFlightRequest to every type but just have it be the identity-function. Then we don't have to do any typechecking here.

@timroes

timroes Dec 15, 2017

Contributor

I think we should rather be as failsafe as possible and check if the function exists here before calling it, instead of relying on some completely different place always attaching the identity function for us.
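
For comparison, the two approaches discussed here could look roughly like this. The helper names `withDefaultHook` and `runHookIfPresent` are made up for the sketch, and the hook takes a single argument to keep it short.

```javascript
const identity = resp => resp;

// Option 1: every agg type gets a hook, defaulting to the identity function.
function withDefaultHook(type) {
  return { ...type, postFlightRequest: type.postFlightRequest || identity };
}

// Option 2: the caller checks that the hook exists before calling it.
function runHookIfPresent(agg, resp) {
  if (!agg.type || typeof agg.type.postFlightRequest !== 'function') return resp;
  return agg.type.postFlightRequest(resp);
}

const termsType = { name: 'terms', postFlightRequest: resp => ({ ...resp, other: true }) };
const countType = { name: 'count' };
const resp = { hits: 3 };

const viaDefault = withDefaultHook(countType).postFlightRequest(resp);
const viaCheck = runHookIfPresent({ type: countType }, resp);
const viaTerms = runHookIfPresent({ type: termsType }, resp);
```

Both options leave the response unchanged for a type without a hook; the difference is only where the responsibility sits.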

@@ -0,0 +1,143 @@
import _ from 'lodash';

@thomasneirynck

thomasneirynck Dec 14, 2017

Contributor

this module seems like an unnecessary abstraction imho, it just scatters the code. I'd add these functions as methods on the Terms bucket agg.

@timroes

timroes Dec 22, 2017

Contributor

I think since this file already contains 179 lines, we shouldn't merge it into the terms class, for better readability, and there is not much harm in making it its own module. Maybe just rename the file to begin with an underscore to make it clearer that these are private helper functions.

return;
}
filterAggDsl.filters[key] = {

@thomasneirynck

thomasneirynck Dec 14, 2017

Contributor

imho it'd help readability when not creating temp vars. Just access this as resultAgg.filters

return [];
};
const getAggConfigResult = (responseAggs, aggId, bucketKey) => {

@thomasneirynck

thomasneirynck Dec 14, 2017

Contributor

I think we have a bug when splitting charts/slices. we get something like:

    at getAggConfigResult (http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246452:19)
    at http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246460:85
    at http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:21259:15
    at baseForOwn (http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:20232:14)
    at http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:21229:18
    at Function.<anonymous> (http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:21532:13)
    at getAggConfigResult (http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246459:20)
    at updateMissingBucket (http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246555:28)
    at Object._callee$ (http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246151:17)
    at tryCatch (http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:81181:40)

_.each(responseAggs, agg => {
resultBuckets = [
...resultBuckets,
...getAggConfigResult(agg.aggs, aggId, bucketKey)

@thomasneirynck

thomasneirynck Dec 14, 2017

Contributor

we probably don't want to recurse if we don't have agg.aggs. Or put a guard on top. in any case, we probably want a base-case in this recursive function somewhere.
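
A sketch of what a guarded version might look like. The response shape here (a map of agg id to an object with optional `buckets` and nested `aggs`) is an assumption for illustration and may not match what `getAggConfigResult` actually receives.

```javascript
// Hypothetical shape: each entry may carry `buckets` and nested `aggs`.
function collectBuckets(responseAggs, aggId) {
  // Base case: nothing to walk, so stop recursing.
  if (!responseAggs || typeof responseAggs !== 'object') return [];

  let result = [];
  for (const [id, agg] of Object.entries(responseAggs)) {
    if (!agg) continue;
    if (id === aggId && Array.isArray(agg.buckets)) {
      result = [...result, ...agg.buckets];
    }
    // When `agg.aggs` is undefined, the call returns [] via the base case.
    result = [...result, ...collectBuckets(agg.aggs, aggId)];
  }
  return result;
}

const response = {
  2: { buckets: [{ key: 'a' }], aggs: { 3: { buckets: [{ key: 'x' }] } } },
  4: { aggs: { 3: { buckets: [{ key: 'y' }] } } },
};
const found = collectBuckets(response, '3');
const none = collectBuckets(undefined, '3');
```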

const agg = vis.aggs[i];
if (!agg.type || !agg.type.postFlightRequest) continue;
const nestedSearchSource = new SearchSource().inherits(searchSource);
resp = await agg.type.postFlightRequest(vis.aggs, agg, searchSource.get('aggs')(), resp, nestedSearchSource);

@thomasneirynck

thomasneirynck Dec 14, 2017

Contributor

I'm not 100% convinced this postFlightRequest is the correct abstraction.

This only works for terms, where we can conceive of something like "other", where this configuration requires more calls. But this wouldn't transfer to other aggregations, if a postflight makes sense. This also bleeds into courier in kind of a weird way. Also, the toDSL functionality is now broken for Terms. The configuration of terms and all its outputs (dsl, responses, ...) no longer matches the actual result.

Could we flip this? Instead of introducing a postFlightRequest, do preliminary requests at the beginning.

In either case, I think it would preserve the contract of the "Terms" configuration better.

@ppisljar

ppisljar Dec 15, 2017

Member

i don't think that is possible, we discussed this with @nreese and found a few reasons to do the postFlight instead of preFlight ... i think the main one was that i need the response for this to work

  • i can not construct correct filters i need for the 'other' bucket construction if i don't have the response from original query back as this could be nested
  • we are updating the actual response with fake buckets, which can't really happen in pre-flight
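
The dependency on the first response can be sketched as a two-step flow. Everything here is a simplified stand-in: `search` plays the role of a request to Elasticsearch, `fakeSearch` answers from an in-memory list, and the request and response shapes are invented for the example.

```javascript
// Hypothetical two-step flow showing why the second request needs the first response.
async function fetchWithOtherBucket(search, field, size, otherLabel) {
  // 1. Main request: the top N terms.
  const main = await search({ kind: 'terms', field, size });
  // 2. The keys to exclude are only known once that response is back.
  const exclude = main.buckets.map(bucket => bucket.key);
  const other = await search({ kind: 'other', field, exclude });
  // 3. Append a synthetic bucket to a copy of the response.
  return {
    ...main,
    buckets: [...main.buckets, { key: otherLabel, doc_count: other.doc_count }],
  };
}

// In-memory stand-in for the search backend.
const docs = ['a', 'a', 'a', 'b', 'b', 'c', 'd'];
const fakeSearch = async request => {
  if (request.kind === 'terms') {
    const tally = new Map();
    for (const value of docs) tally.set(value, (tally.get(value) || 0) + 1);
    const buckets = [...tally.entries()]
      .map(([key, doc_count]) => ({ key, doc_count }))
      .sort((x, y) => y.doc_count - x.doc_count)
      .slice(0, request.size);
    return { buckets };
  }
  return { doc_count: docs.filter(value => !request.exclude.includes(value)).length };
};

const result = fetchWithOtherBucket(fakeSearch, 'type', 2, 'Other');
```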

@thomasneirynck

thomasneirynck Dec 15, 2017

Contributor

The idea would be that you make the original request in the preflight, and collect the other responses in a similar fashion as here, except for the last one. Your last DSL will then be the correct one, matching the terms-config. Right now, we do the first request (not matching the terms config) and string together all the subsequent ones in the postflight.

const filter = _.cloneDeep(bucket.filter) || currentAgg.createFilter(bucketKey);
delete filter.meta;
const migratedFilter = migrateFilter(filter.query || filter);
const newFilters = [...filters, migratedFilter];

@Bargs

Bargs Dec 15, 2017

Contributor

You might want to use the buildQueryFromFilters helper here. It'll migrate and clean (remove meta prop) the filters exactly how we do it in SearchSource and return a valid bool query body that you can use.

// create not filters for all the buckets
const notKeys = agg.buckets.map(bucket => bucket.key);
filterAggDsl.filters[key].bool.must_not[0].terms[aggWithOtherBucket.params.field.name] = notKeys;

@Bargs

Bargs Dec 15, 2017

Contributor

I don't think you want to use a terms query. If a user is aggregating on an analyzed field (with field data turned on) these term queries won't match properly. Scripted fields also won't work with the current implementation. I think a negated phrases filter would work. It'll create match_phrase queries, so my only question would be whether the terms agg supports any data types that won't work with match_phrase.

It could also make more sense to use the agg's createFilter method, that way you know you're getting a valid filter for this particular agg and field type. It already uses a filter constructor under the hood. Negate them all and then combine them with a bool using the buildQueryFromFilters helper I mentioned above.

(as a general rule, if you ever find yourself manually building query DSL in your code, there's a good chance you should be using one of our filter/query abstractions instead)
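
For reference, the query shape that a set of negated phrase filters boils down to is roughly the following. The sketch builds the DSL by hand only to show the shape; per the advice above, real code would go through the agg's createFilter and buildQueryFromFilters instead. The helper name `buildOtherFilter` is hypothetical.

```javascript
// Illustration of the resulting shape only: one match_phrase clause per
// bucket key, all placed under bool.must_not.
function buildOtherFilter(fieldName, bucketKeys) {
  return {
    bool: {
      must_not: bucketKeys.map(key => ({
        match_phrase: { [fieldName]: { query: key } },
      })),
    },
  };
}

const otherFilter = buildOtherFilter('type', ['article', 'video']);
```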

@alexfrancoeur

Contributor

alexfrancoeur commented Dec 16, 2017

@ppisljar some quick weekend feedback.

I noticed that the behavior of an empty pie chart is a bit odd. Not sure if it's related to the PR. At first I didn't realize I was missing data, just thought that there was an issue loading the pie chart. However, if I resize the window the error message occurs. Check out this gif

dec-16-2017 10-02-35

A couple of additional comments below.

screen shot 2017-12-16 at 10 03 26 am

As a Kibana end user - I don't know the difference between other and missing buckets. If it's necessary to have both I think we need to explicitly state what they are. In other UIs this is normally represented as "Others".

Should a user have a choice to choose a missing and/or other bucket? I understand that elasticsearch uses _other_ as the label, but from a UI perspective it's not that appealing. Using Other by default seems to be a bit more user friendly.

I feel we're running into issues here within the UI that we have in the past, using elasticsearch terminology vs. what would be intuitive for building the visualization. I don't have any good recommendations at the moment but it's something to think about. Maybe we can discuss some options early next week?

@ppisljar

Member

ppisljar commented Dec 18, 2017

OK so here is what we have at the moment (not part of this PR yet):

screenshot-localhost-5601 2017-12-18 16-40-45-632

we need to come up with:

  • text next to the checkboxes
  • text for the input boxes (label for other/missing bucket)
  • text for the info icons (not present yet) which can give a longer description of what exactly is happening
@gchaps

Contributor

gchaps commented Dec 18, 2017

@ppisljar Let's try this:

Group other values into separate bucket

Label for other bucket

Show missing values

Label for missing values

Question: are the missing & null values in their own bucket or in the "other" bucket

@ppisljar ppisljar force-pushed the ppisljar:enh/otherBucket branch 2 times, most recently from aba94e8 to ae0e053 Dec 20, 2017

@timroes

I tried to break it and haven't found a way so far. Everything seems to work as expected, even if you try to nest different aggregations in weird ways.

I have some more code suggestions and questions, but since I am now away for 2 weeks, these shouldn't block this PR if it would get ready otherwise.

Also I would like to see some more tests and documentation (especially in the terms_other_bucket_helper file), but that's anyway on the todo list.

* A function that will be called after the main request has been made
* and should return an updated response
*/
this.postFlightRequest = config.postFlightRequest || null;

@timroes

timroes Dec 22, 2017

Contributor

Add the parameters, that will be passed to the postFlightRequest function to the documentation.

@timroes

timroes Dec 22, 2017

Contributor

And that the response type is allowed to be async (or a promise).

@timroes

timroes Dec 22, 2017

Contributor

Is the AggConfig class tied to the courier request handler, or could this possibly be used with other request handlers? As far as I understand, the AggConfig will always be used when using the default editor with schemas. That doesn't necessarily mean you need the courier request handler though?

If that's the case, I think we should rather implement the postFlightRequest outside of the courier request handler and inside the generic calling the request handler (in visualize.js?). If there are reasons against it, I would at least document here, that the postFlightRequest only works when using the courier request handler.

@ppisljar

ppisljar Jan 3, 2018

Member

currently the default editor's Data tab is tied to AggConfig and to Courier. I think AggConfigs are only used with courier.

@@ -49,6 +49,7 @@ export function FilterBarClickHandlerProvider(Notifier, Private) {
}
}
})
.flatten()

@timroes

timroes Dec 22, 2017

Contributor

So simple and yet so powerful!

@@ -26,7 +28,7 @@ const CourierRequestHandlerProvider = function (Private, courier, timefilter) {
return new Promise((resolve, reject) => {
if (shouldQuery()) {
delete vis.reload;
-searchSource.onResults().then(resp => {
+searchSource.onResults().then(async resp => {

@timroes

timroes Dec 22, 2017

Contributor

I think the async here isn't needed?

@@ -35,15 +37,22 @@ const CourierRequestHandlerProvider = function (Private, courier, timefilter) {
};
searchSource.rawResponse = resp;
-resolve(resp);
+resolve(_.cloneDeep(resp));

@timroes

timroes Dec 22, 2017

Contributor

The reason for cloning this, was the spy panel? So this would be redundant with the refactored requests panel? (Doesn't mean that we should remove it here, I just want to make sure I understand it.)

}).then(async resp => {
for (let i = 0; i < vis.aggs.length; i++) {

@timroes

timroes Dec 22, 2017

Contributor

We could use for (const agg of vis.aggs) here, since we don't need the index i anywhere inside the loop.
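
A sketch of the loop in that form, assuming `aggs` is a plain iterable of agg configs and simplifying the hook to a single argument. Because each step is awaited, the hooks run in order and each one sees the response produced by the previous one, whether it returns a plain value or a promise.

```javascript
// Simplified stand-in for the loop in the request handler.
async function applyPostFlightRequests(aggs, initialResp) {
  let resp = initialResp;
  for (const agg of aggs) {
    if (!agg.type || !agg.type.postFlightRequest) continue;
    // `await` accepts plain values and promises alike.
    resp = await agg.type.postFlightRequest(resp);
  }
  return resp;
}

const aggs = [
  { type: { postFlightRequest: resp => [...resp, 'sync'] } },
  { type: {} }, // no hook: skipped
  { type: { postFlightRequest: async resp => [...resp, 'async'] } },
];
const done = applyPostFlightRequests(aggs, ['main']);
```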

const filterAgg = buildOtherBucketAgg(aggConfigs, searchSourceAggs, aggConfig, resp);
nestedSearchSource.set('aggs', filterAgg);
const response = await nestedSearchSource.fetchAsRejectablePromise();
// todo: refactor to not have side effects

@timroes

timroes Dec 22, 2017

Contributor

I think that comment can go away; it seems the function doesn't have any side effects anymore.

* @param otherAgg: AggConfig of the aggregation with other bucket
*/
const getOtherAggTerms = (requestAgg, key, otherAgg) => {
return requestAgg['other-filter'].filters.filters[key].bool.must_not.map(filter => {

@timroes

timroes Dec 22, 2017

Contributor

Maybe this version looks a bit nicer:

return requestAgg['other-filter'].filters.filters[key].bool.must_not.filter(filter =>
  filter.match_phrase && filter.match_phrase[otherAgg.params.field.name]
).map(filter =>
  filter.match_phrase[otherAgg.params.field.name].query
);
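
To make the suggestion concrete, here is the same filter/map chain run against a made-up request. The sample data, the bucket key `'all'`, and the extra `exists` clause are assumptions; the `exists` clause is only there to show what the filter step skips.

```javascript
// Same chain as suggested above, wrapped so it can be run on sample data.
const getOtherAggTerms = (requestAgg, key, otherAgg) =>
  requestAgg['other-filter'].filters.filters[key].bool.must_not
    .filter(filter => filter.match_phrase && filter.match_phrase[otherAgg.params.field.name])
    .map(filter => filter.match_phrase[otherAgg.params.field.name].query);

// Hypothetical request agg with two phrase clauses and one unrelated clause.
const requestAgg = {
  'other-filter': {
    filters: {
      filters: {
        all: {
          bool: {
            must_not: [
              { match_phrase: { type: { query: 'article' } } },
              { match_phrase: { type: { query: 'video' } } },
              { exists: { field: 'type' } },
            ],
          },
        },
      },
    },
  },
};
const otherAgg = { params: { field: { name: 'type' } } };

const terms = getOtherAggTerms(requestAgg, 'all', otherAgg);
```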

@thomasneirynck thomasneirynck self-requested a review Dec 26, 2017

@alexfrancoeur

LGTM with one final comment. While these new options are keyboard accessible, I don't see a bounding box or indicator that I'm in the checkbox. Would you mind adding the appropriate styling here? We should visually indicate that the box is in focus.

@ppisljar ppisljar force-pushed the ppisljar:enh/otherBucket branch from 9a11a10 to 70b6a57 Jan 3, 2018

@ppisljar ppisljar changed the title from [WIP] other and missing bucket support for terms agg to other and missing bucket support for terms agg Jan 3, 2018

@ppisljar ppisljar force-pushed the ppisljar:enh/otherBucket branch from 1a98010 to f4ce5e3 Jan 4, 2018

ppisljar added some commits Dec 12, 2017

@ppisljar ppisljar force-pushed the ppisljar:enh/otherBucket branch from 84c55f0 to faf0eca Jan 5, 2018

@nreese

Contributor

nreese commented Jan 5, 2018

@ppisljar This is great. Nice job.

There is one point of UI confusion. When adding a filter on the Other bucket, the UI does not clearly reflect that the Other filter is a not filter.

screen shot 2018-01-05 at 6 22 35 am

The created filters work great and make a lot of sense

screen shot 2018-01-05 at 6 24 13 am

@ppisljar

Member

ppisljar commented Jan 5, 2018

thanks @nreese, above is not relevant to this PR, as the same happens with any NOT filter. please create an issue for it.

@ppisljar ppisljar merged commit 2fd41d5 into elastic:master Jan 5, 2018

2 checks passed

CLA Commit author is a member of Elasticsearch
Details
kibana-ci Build finished.
Details

ppisljar added a commit to ppisljar/kibana that referenced this pull request Jan 5, 2018

ppisljar added a commit that referenced this pull request Jan 8, 2018

@aphelionz aphelionz referenced this pull request May 7, 2018

Closed

Accessibility: Kibana 6.3 Meta Issue #18866

34 of 48 tasks complete