Is it correct for facets to assume positive aggregation values? #12585

stefanvodita · 2023-09-23T12:51:25Z

Description

In IntTaxonomyFacets and FloatTaxonomyFacets, when we call getTopChildrenForPath we maintain a priority queue of aggregation values and corresponding ordinals.
If an aggregation value is not positive, we discard it. If it is not larger than the lowest value in the queue (initially set to 0), we also discard it.

It looks like the facets assume aggregation results should be positive, but I think we could have negatives or zeroes. I wrote a unit test to demo this.
The current behavior makes sense for counts. Counts are always non-negative and we probably don't care about counts of 0. But does it make sense for other aggregations?
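The behaviour described above can be illustrated with a small standalone model. This is a simplified sketch, not Lucene's actual code: the class `TopChildrenSketch`, the method `topOrdinals`, and the `requirePositive` flag are hypothetical names, and the input array is assumed to hold one aggregated value per ordinal that was actually encountered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical, simplified model of top-children selection; not Lucene's actual code. */
final class TopChildrenSketch {

  /**
   * Returns up to topN ordinals, highest aggregated value first.
   * values[ord] is assumed to be the aggregated value for ordinal ord.
   */
  static List<Integer> topOrdinals(int[] values, int topN, boolean requirePositive) {
    List<Integer> result = new ArrayList<>();
    if (topN <= 0) {
      return result;
    }
    // Min-heap on value, so the weakest retained entry sits at the head.
    PriorityQueue<int[]> queue =
        new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[1]));
    for (int ord = 0; ord < values.length; ord++) {
      int value = values[ord];
      if (requirePositive && value <= 0) {
        continue; // models the assumption in question: non-positive values are dropped
      }
      if (queue.size() < topN) {
        queue.add(new int[] {ord, value});
      } else if (value > queue.peek()[1]) {
        queue.poll(); // evict the current lowest value
        queue.add(new int[] {ord, value});
      }
    }
    // Polling yields ascending values; prepending gives descending order.
    while (!queue.isEmpty()) {
      result.add(0, queue.poll()[0]);
    }
    return result;
  }
}
```

With values `{5, -3, 0, 2}` and `topN = 3`, the positive-only variant keeps only the ordinals holding 5 and 2, while the unrestricted variant also returns the ordinal holding 0.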


gsmiller · 2023-09-29T17:47:30Z

Yeah, this is a good callout. I ran into this when adding more flexibility to association faceting a while back (making note that supporting, e.g., "min" would require rethinking these assumptions).

My opinion is that the current assumption makes sense for the faceting support currently available, but I know there's conversation going on more generally about improving (rethinking?) aggregation capabilities in Lucene. My preference on this would be to, 1) leave this current behavior in place unless there is a use-case that's immediately blocked by it, but 2) include it in some broader rethinking of Lucene aggregation capabilities. As a side-note on that, I wonder if a successful approach to moving forward with some new aggregation thinking would be to not try to modify the faceting module as-is, but rather spin up a new "aggregations" module under "sandbox" to start sketching out ideas. I think it will be difficult to retrofit more flexible aggregation capabilities into the faceting API that exists today, but maybe I'm wrong? OK, I'm off in the weeds now...

stefanvodita · 2023-09-29T20:47:25Z

> not try to modify the faceting module as-is, but rather spin up a new "aggregations" module

I'm definitely leaning that way too right now. @Shradha26 and I were considering this (#12553). I initially thought faceting can fill most of the gaps we were identifying. What got me to change my mind was seeing how much extra work there is to bring in a new feature when it has to be reimplemented for multiple faceting classes. If we made facets generic, that would already be a significant rework, so maybe we might as well start fresh? There's other things that are difficult to do with facets - arbitrary aggregation groups come to mind, but genericity is the one that convinced me that it's a better strategy to have a dedicated aggregation engine. I'm sure there are counter-arguments as well or maybe I'm overstating the impact of genericity.

stefanvodita · 2023-11-21T23:59:46Z

I thought some more about this issue and it really seems like a bug that I can have a non-positive aggregation value, but I can't return it in top children.
The fundamental problem is that we don't know if we computed a value of 0 for an ordinal or if we never encountered that ordinal. I can think of three approaches:

1. Choose a magic value that communicates to us that an ordinal was not encountered. This magic value is 0 right now. Maybe Integer.MIN_VALUE and Float.MAX_VALUE might be better choices, allowing for more valid aggregation values.
2. Store a boolean for each ordinal. This would produce correct results, but cost more memory.
3. Use a map to store the arrays. IntTaxonomyFacets already does this in some cases and I think this might be appropriate in enough cases that we can always use a map.
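To make the trade-offs between the three approaches concrete, here is a rough side-by-side sketch for an integer sum aggregation. Everything in it is hypothetical (`SeenOrdinalSketches`, `SumStore`, and the three store classes do not exist in Lucene); it only shows how each approach would tell "summed to 0" apart from "never encountered".

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketches of the three approaches; none of these classes exist in Lucene. */
final class SeenOrdinalSketches {

  /** Common shape so the three variants can be compared side by side. */
  interface SumStore {
    void add(int ord, int value);
    boolean seen(int ord);
    int get(int ord); // only meaningful when seen(ord) is true
  }

  /** Approach 1: a magic value marks ordinals that were never encountered. */
  static final class SentinelStore implements SumStore {
    static final int UNSEEN = Integer.MIN_VALUE;
    private final int[] sums;

    SentinelStore(int numOrdinals) {
      sums = new int[numOrdinals];
      Arrays.fill(sums, UNSEEN);
    }

    public void add(int ord, int value) {
      sums[ord] = sums[ord] == UNSEEN ? value : sums[ord] + value;
    }

    public boolean seen(int ord) {
      return sums[ord] != UNSEEN; // wrong if a real sum ever equals UNSEEN
    }

    public int get(int ord) {
      return sums[ord];
    }
  }

  /** Approach 2: one extra flag per ordinal records whether it was encountered. */
  static final class FlagStore implements SumStore {
    private final int[] sums;
    private final BitSet seen;

    FlagStore(int numOrdinals) {
      sums = new int[numOrdinals];
      seen = new BitSet(numOrdinals);
    }

    public void add(int ord, int value) {
      sums[ord] += value;
      seen.set(ord);
    }

    public boolean seen(int ord) {
      return seen.get(ord);
    }

    public int get(int ord) {
      return sums[ord];
    }
  }

  /** Approach 3: a map only holds entries for ordinals that were encountered. */
  static final class MapStore implements SumStore {
    private final Map<Integer, Integer> sums = new HashMap<>();

    public void add(int ord, int value) {
      sums.merge(ord, value, Integer::sum);
    }

    public boolean seen(int ord) {
      return sums.containsKey(ord);
    }

    public int get(int ord) {
      return sums.getOrDefault(ord, 0);
    }
  }
}
```

In all three variants, adding 3 and then -3 to the same ordinal leaves it marked as seen with a sum of 0. The sentinel variant still gives up one representable value, the flag variant costs extra memory per ordinal, and the map variant pays boxing and hashing overhead in exchange for sparse storage.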

gsmiller · 2023-11-30T00:18:40Z

I think I'd be curious if there are many real-world use-cases out there that have a non-positive association between each document and facet label? I would guess it's pretty uncommon. I actually can't really contrive an example either. @stefanvodita have you encountered real-world use-cases that would benefit from this? Apologies if you describe it somewhere and I'm missing it.

stefanvodita · 2023-11-30T14:20:31Z

I have written a unit test to showcase the problem, but I haven't seen it in the wild.

Non-positive associations make sense when we're dealing with data that can naturally be non-positive. Let's say we're managing a school's report cards. Each student has a corresponding document. They take multiple-choice tests, which penalize wrong answers to discourage randomly filling in the test. Each test the students take during the year is represented by a facet label and each student has their score associated with the label. Not all scores would be positive and aggregations over those scores may or may not be positive. Maybe that's too much speculation on my part though.

This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
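The idea in item 4, using a count to decide whether a label was encountered instead of relying on the sign of the aggregated value, might look roughly like the following. This is a minimal sketch under assumptions, not the code from the change: `CountedSumSketch` and its methods are hypothetical names, and a float sum stands in for whatever aggregation is configured.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a count is kept next to each aggregated value. */
final class CountedSumSketch {
  private final int[] counts;
  private final float[] sums;

  CountedSumSketch(int numOrdinals) {
    counts = new int[numOrdinals];
    sums = new float[numOrdinals];
  }

  /** Records one association value for the given ordinal. */
  void collect(int ord, float value) {
    counts[ord]++;
    sums[ord] += value;
  }

  /** An ordinal was encountered if anything was collected for it, whatever its sum. */
  boolean encountered(int ord) {
    return counts[ord] > 0;
  }

  float sum(int ord) {
    return sums[ord];
  }

  /** Ordinals eligible for top-children selection, in ascending ordinal order. */
  List<Integer> encounteredOrdinals() {
    List<Integer> result = new ArrayList<>();
    for (int ord = 0; ord < counts.length; ord++) {
      if (counts[ord] > 0) {
        result.add(ord);
      }
    }
    return result;
  }
}
```

An ordinal whose values sum to zero or to a negative number still has a positive count, so it stays a candidate for the top children, while an ordinal that was never collected is skipped.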


#12966 (#13358) Reduce duplication in taxonomy facets; always do counts (#12966)

stefanvodita added the type:bug label Sep 23, 2023

stefanvodita mentioned this issue Dec 22, 2023

Reduce duplication in taxonomy facets; always do counts #12966

Merged

stefanvodita mentioned this issue Mar 22, 2024

Fix TestTaxonomyFacetValueSource.testRandom #13198

Open

stefanvodita closed this as completed in #12966 Apr 5, 2024
