Backport to 9x: Reduce duplication in taxonomy facets; always do counts #12966 #13358
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.
1. Move most of the responsibility from `TaxonomyFacets` implementations to `TaxonomyFacets` itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in apache#12553.
2. As a consequence, introduce sparse values to `FloatTaxonomyFacets`, which previously used dense values always. This issue is part of apache#12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses apache#11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes apache#12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in apache#12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
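To illustrate points 4 and 6, here is a minimal sketch of why checking the count instead of the aggregated value keeps labels with non-positive aggregations. It is not the Lucene implementation: `CountCheckSketch`, `seenByValue`, `seenByCount` and the plain arrays indexed by ordinal are hypothetical names and structures chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: per-ordinal aggregated values and counts
// are modelled as plain arrays indexed by label ordinal.
public class CountCheckSketch {

    // Old-style check: a label is treated as "seen" only if its aggregated value is positive.
    static List<Integer> seenByValue(float[] values) {
        List<Integer> seen = new ArrayList<>();
        for (int ord = 0; ord < values.length; ord++) {
            if (values[ord] > 0) {
                seen.add(ord);
            }
        }
        return seen;
    }

    // New-style check: a label is "seen" if at least one document contributed to it.
    static List<Integer> seenByCount(int[] counts) {
        List<Integer> seen = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                seen.add(ord);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Ordinals 1 and 2 were encountered, but their aggregated values are 0 and negative.
        float[] values = {3.5f, 0f, -2.0f, 0f};
        int[] counts = {2, 1, 1, 0};

        System.out.println(seenByValue(values)); // [0]
        System.out.println(seenByCount(counts)); // [0, 1, 2]
    }
}
```

With the value-based check, ordinals 1 and 2 are dropped even though documents matched them; with the count-based check only ordinal 3, which was never encountered, is skipped.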
Despite the "annoying" bits in the description, I don't expect this backport to be controversial, but reviews are welcome! I plan to wait over the weekend and then merge.
…#13284) Our per-field vector and doc-values readers use `TreeMap`s but don't rely on the iteration order, so these `TreeMap`s can be replaced with more CPU/RAM-efficient `HashMap`s. The per-field postings reader stays on a `TreeMap` since it relies on the iteration order.
Thanks @stefanvodita!
Make sure to also update the main `lucene/CHANGES.txt` to move the entry down to the 9.11.0 section.
Thank you for the review, Mike! I'd already put the CHANGES entries in 9.11 tentatively, now they're correct 😄
Annoying things I had to do to backport #12966:
- `TaxonomyFacetCounts` stores the counts twice because it extends `IntTaxonomyFacets`. The correct way would be to extend `TaxonomyFacets`, but I couldn't make that change, like I could for `FastTaxonomyFacets`, which was marked `@lucene.experimental`.
- Change the visibility of `useHashTable` and `rollup` in `TaxonomyFacets` compared to `main`, from default to protected.
- Keep `TopOrdAndInt/FloatQueue` as they were, since they are public API, and create new queues for our purposes, `TopOrdAndInt/FloatNumberQueue`, marked deprecated because they will go away in 10.x.