Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added syntactic sugar over agg definition #5721

Closed
wants to merge 1 commit into from
Closed

Added syntactic sugar over agg definition #5721

wants to merge 1 commit into from

Conversation

uboness
Copy link
Contributor

@uboness uboness commented Apr 7, 2014

  • some aggregations typically tend to be requested together. But using the normal (formal) dsl can make the request too verbose. The agg factory/parsing infrastructure was changed to enable factories to automatically add more factories, and by that enables "syntatctic sugar" to save verbosity
  • one common use case for this is the missing aggregation. Often, when running values source aggs (e.g. terms, avg, etc...) one also wants to get the count of those document that are not associated with values (therefore were not aggregated). Now, using the new infrastructure, one can define a track_missing setting on any values source based aggregation, and a missing aggregation will automatically be injected into the request - based on the same value source.

Closes #5324

- some aggregations typically tend to be requested together. But using the normal (formal) dsl can make the request too verbose. The agg factory/parsing infrastructure was changed to enable factories to automatically add more factories, and by that enables "syntatctic sugar" to save verbosity

- one common use case for this is the `missing` aggregation. Often, when running values source aggs (e.g. terms, avg, etc...) one also wants to get the count of those document that are not associated with values (therefore were not aggregated). Now, using the new infrastructure, one can define a `track_missing` setting on any values source based aggregation, and a `missing` aggregation will automatically be injected into the request - based on the same value source.

Closes #5324
@roytmana
Copy link

roytmana commented Apr 8, 2014

@uboness I could not play with it and docs do not make iy clear if it replicate sub aggs defined within an aggs that are marked to track missing. In other words if I agg by state and then city and mark state agg to track missing there will be city buckets within missing bucket along with any accumulator aggs such as sum avg etc defined within city?

Also I feel it would be much more convenient to have missing as part of normal bucket array with key specified when enabling missing tracking for the agg rather than as sibling node. After all missing is just another value (granted somewhat special) of that field. It is just like the databases let you handle null values almost like the normal ones and with great ease and convenience

@uboness
Copy link
Contributor Author

uboness commented Apr 8, 2014

@roytmana adding track_missing in the states terms agg will add a sibling states_missing agg next to it. it will be an empty missing agg, with only the doc_count that will indicate the number of docs missing the field. you'll need to specify track_missing on the cities aggs to have cities_missing agg injected next to that one

Also I feel it would be much more convenient to have missing as part of normal bucket array with key specified when enabling missing tracking for the agg rather than as sibling node. After all missing is just another value (granted somewhat special) of that field. It is just like the databases let you handle null values almost like the normal ones and with great ease and convenience

so, for the missing, we decided this is the best we'll go with that regard, as adding the missing count to the aggs themselves also adds complexity we'd like to avoid. That said, we do plan very soon (this week perhaps) to let you define default values for missing docs, which will enable you to treat those like you'd treat null values in the db.

@roytmana
Copy link

roytmana commented Apr 8, 2014

Thanks for the info @uboness. Then as you said it is truly is a syntactic sugar adding little value if it does not replicate sub aggs defined within the agg with tracked null values. I would like to re-open my request since it solves none of the problems I outlined in it if you do not mind?

@uboness
Copy link
Contributor Author

uboness commented Apr 8, 2014

@roytmana well, it's there to solve the verbosity of the request, which seemed to be your main concern outlined above. As for the response, the difference in how you'd fetch the data is negligible I'd say:

aggs.states._missing

vs

aggs.states_missing.value

In any case, you can also wait for the default values feature I mentioned above and use that if it better fits your needs

@roytmana
Copy link

roytmana commented Apr 8, 2014

Thank you @uboness I will wait for default values.
The major complexity that is not solved is in having to replicate sub aggs within each missing definition. In my example I still want my city breakdown within missing for states and all the sub aggs I defined within the city.

It would have been very nice to have an optional "other" bucket in aggs as well :-)

public static ValueCountBuilder count(String name) {
return valueCount(name);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do it in a separate change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm all for it, but would just like to make sure we have a changes entry for it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@jpountz
Copy link
Contributor

jpountz commented Apr 8, 2014

It doesn't feel right to me to make process return sibling aggregators: it makes this method perform two very different things. Maybe we could instead make either the parser return several factories or the factories return several aggregators?

Although I think it is nice to users to have documentation of track_missing for every aggregation, I think the duplication will make it a pain to update. Maybe it should be written only once and then included or referenced in every aggregation that exposes it (so that there is a single place to update)?

Regarding the syntax, maybe we should make it future-proof in case we would also like to be able to replicate the sub-tree of aggregations under the missing agg? Something like:

"track_missing": {
    "term_missing": "counts"
}

where term_missing is the key under which the missing aggregation should be and counts tells that it should just track missing counts.

@uboness
Copy link
Contributor Author

uboness commented Apr 8, 2014

It doesn't feel right to me to make process return sibling aggregators: it makes this method perform two very different things. Maybe we could instead make either the parser return several factories or the factories return several aggregators?

I'm not sure I agree with that, the whole idea of syntactic sugar is that it's just applied on the syntax, which means this syntax represents a more elaborated syntax. So this behaviour should not (almost by definition) propagate to the underlying implementation of the factories or the aggregators... it should be done on the highest level possible.

Although I think it is nice to users to have documentation of track_missing for every aggregation, I think the duplication will make it a pain to update. Maybe it should be written only once and then included or referenced in every aggregation that exposes it (so that there is a single place to update)?

I tend to agree, there's just no place for common settings in the docs now. I did try to put it on the "main" aggs and share it between related ones (e.g. range, date_range, ip_range). Maybe we should consider adding a section for common configurations (also put the values source configuration there?), but I'd do that on a different PR

Regarding the syntax, maybe we should make it future-proof in case we would also like to be able to replicate the sub-tree of aggregations under the missing agg?

I don't think it's not future proof right now... if needed, in the future we could always introduce support for objects as well (nothing prevents us from doing that)

@jpountz
Copy link
Contributor

jpountz commented Apr 9, 2014

it should be done on the highest level possible.

I'm good with that, but can we keep concerns separated and not do both validation and return the siblings in the same method?

Maybe we should consider adding a section for common configurations (also put the values source configuration there?), but I'd do that on a different PR

OK, can you open an issue already and mark it as a blocker for 1.2 so that we don't forget?

I don't think it's not future proof right now... if needed, in the future we could always introduce support for objects as well (nothing prevents us from doing that)

OK.

@s1monw
Copy link
Contributor

s1monw commented Mar 20, 2015

this is stale for alomst 1 year. I am closing it for now - we can still revisit

@s1monw s1monw closed this Mar 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for "missing" to all bucket aggregations
6 participants