Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it correct for facets to assume positive aggregation values? #12585

Closed
stefanvodita opened this issue Sep 23, 2023 · 5 comments · Fixed by #12966
Closed

Is it correct for facets to assume positive aggregation values? #12585

stefanvodita opened this issue Sep 23, 2023 · 5 comments · Fixed by #12966
Labels

Comments

@stefanvodita
Copy link
Contributor

stefanvodita commented Sep 23, 2023

Description

In IntTaxonomyFacets and FloatTaxonomyFacets when we getTopChildrenForPath we maintain a priority queue of aggregation values and corresponding ordinals.
If an aggregation value is not positive, we discard it (code). If it is not larger than the lowest value in the queue (initially set to 0), we also discard it (code).

It looks like the facets assume aggregation results should be positive, but I think we could have negatives or zeroes. I wrote a unit test to demo this.
The current behavior makes sense for counts. Counts are always non-negative and we probably don't care about counts of 0. But does it make sense for other aggregations?

@gsmiller
Copy link
Contributor

Yeah, this is a good callout. I ran into this when adding more flexibility to association faceting a while back (making note that supporting, e.g., "min" would require rethinking these assumptions).

My opinion is that the current assumption makes sense for the faceting support currently available, but I know there's conversation going on more generally about improving (rethinking?) aggregation capabilities in Lucene. My preference on this would be to, 1) leave this current behavior in place unless there is a use-case that's immediately blocked by it, but 2) include it in some broader rethinking of Lucene aggregation capabilities. As a side-note on that, I wonder if a successful approach to moving forward with some new aggregation thinking would be to not try to modify the faceting module as-is, but rather spin up a new "aggregations" module under "sandbox" to start sketching out ideas. I think it will be difficult to retrofit more flexible aggregation capabilities into the faceting API that exists today, but maybe I'm wrong? OK, I'm off in the weeds now...

@stefanvodita
Copy link
Contributor Author

not try to modify the faceting module as-is, but rather spin up a new "aggregations" module

I'm definitely leaning that way too right now. @Shradha26 and I were considering this (#12553). I initially thought faceting can fill most of the gaps we were identifying. What got me to change my mind was seeing how much extra work there is to bring in a new feature when it has to be reimplemented for multiple faceting classes. If we made facets generic, that would already be a significant rework, so maybe we might as well start fresh? There's other things that are difficult to do with facets - arbitrary aggregation groups come to mind, but genericity is the one that convinced me that it's a better strategy to have a dedicated aggregation engine. I'm sure there are counter-arguments as well or maybe I'm overstating the impact of genericity.

@stefanvodita
Copy link
Contributor Author

stefanvodita commented Nov 21, 2023

I thought some more about this issue and it really seems like a bug that I can have a non-positive aggregation value, but I can't return it in top children.
The fundamental problem is that we don't know if we computed a value of 0 for an ordinal or if we never encountered that ordinal. I can think of three approaches:

  1. Choose a magic value that communicates to us that an ordinal was not encountered. This magic value is 0 right now. Maybe Integer.MIN_VALUE and Float.MAX_VALUE might be better choices, allowing for more valid aggregation values.
  2. Store a boolean for each ordinal. This would produce correct results, but cost more memory.
  3. Use a map to store the arrays. IntTaxonomyFacets already does this in some cases and I think this might be appropriate in enough cases that we can always use a map.

@gsmiller
Copy link
Contributor

I think I'd be curious if there are many real-world use-cases out there that have a non-positive association between each document and facet label? I would guess it's pretty uncommon. I actually can't really contrive an example either. @stefanvodita have you encountered real-world use-cases that would benefit from this? Apologies if you describe it somewhere and I'm missing it.

@stefanvodita
Copy link
Contributor Author

I have written a unit test to showcase the problem, but I haven't seen it in the wild.

Non-positive associations make sense when we're dealing with data that can naturally be non-positive. Let's say we're managing a school's report cards. Each student has a corresponding document. They take multiple-choice tests, which penalize wrong answers to discourage randomly filling in the test. Each test the students take during the year is represented by a facet label and each student has their score associated with the label. Not all scores would be positive and aggregations over those scores may or may not be positive. Maybe that's too much speculation on my part though.

stefanvodita added a commit that referenced this issue Apr 5, 2024
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
stefanvodita added a commit to stefanvodita/lucene that referenced this issue May 10, 2024
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in apache#12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of apache#12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses apache#11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes apache#12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in apache#12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
stefanvodita added a commit that referenced this issue May 14, 2024
#12966 (#13358)

Reduce duplication in taxonomy facets; always do counts (#12966)

This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants