
Always collect sparsely in TaxonomyFacets & switch to dense if there are enough unique labels #12576

Open
Shradha26 opened this issue Sep 20, 2023 · 6 comments

Comments

@Shradha26

Shradha26 commented Sep 20, 2023

Description

Right now (at least in IntTaxonomyFacets), we choose a dense integer array for counting if the number of hits is greater than 10% of the index size; otherwise we count sparsely. Counting densely may still be a poor choice if we have a huge number of hits but they belong to only a few labels, in effect making the values sparse.

We could do something like DocIdSetBuilder does: at first it uses a sparse structure to gather documents, then upgrades to a dense bit set once enough hits match.
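To make the DocIdSetBuilder-style upgrade concrete for counting, here is a minimal sketch. This is not Lucene code: the class name, the HashMap backing, and the 1/8 upgrade threshold are all illustrative placeholders.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch (not Lucene code): counts ordinals sparsely in a
 *  hash map and upgrades to a dense int[] once enough unique ordinals
 *  have been seen. The 1/8 threshold is an arbitrary placeholder. */
class UpgradingCounter {
  private final int maxOrdinal;                // taxonomy size
  private Map<Integer, Integer> sparse = new HashMap<>();
  private int[] dense;                         // null until we upgrade

  UpgradingCounter(int maxOrdinal) {
    this.maxOrdinal = maxOrdinal;
  }

  void increment(int ordinal) {
    if (dense != null) {
      dense[ordinal]++;
      return;
    }
    sparse.merge(ordinal, 1, Integer::sum);
    // Upgrade once the sparse map would plausibly use more memory than
    // the dense array (placeholder heuristic: 1/8 of all ordinals seen).
    if (sparse.size() > maxOrdinal / 8) {
      dense = new int[maxOrdinal];
      for (Map.Entry<Integer, Integer> e : sparse.entrySet()) {
        dense[e.getKey()] = e.getValue();
      }
      sparse = null;                           // drop the sparse structure
    }
  }

  int count(int ordinal) {
    return dense != null ? dense[ordinal] : sparse.getOrDefault(ordinal, 0);
  }

  boolean isDense() {
    return dense != null;
  }
}
```

The one-time upgrade cost is linear in the number of entries collected so far, and it happens at most once per query.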

FloatTaxonomyFacets seems to always collect densely. Maybe it's also worth moving this decision-making into the parent class instead?

@gautamworah96
Contributor

@Shradha26 I would like to pick this task up in case you are not already working on it.

In your benchmarks/efforts, have you tried to test whether these sparse vs. dense arrays provide any benefit at all? Basically, could it be the case that the performance of both is essentially the same? In that case, we could just stick to one and simplify the logic.

@stefanvodita
Contributor

The way I see it, the sparse/dense decision for those faceting data structures tries to balance two optimizations.

  1. Reduce the amount of memory in use.

The dense solution has a fixed memory cost. The sparse solution has a variable memory cost, proportional to the number of ordinals present in the match-set. Depending on when the map gets resized, the sparse solution ends up overtaking the dense one in terms of memory used when it records 1/4 to 1/2 of all the existing ordinals. If we are to dynamically switch from sparse to dense, the question we're answering is whether we expect to see more than 1/4 of all ordinals in the match-set. A good heuristic might be that the larger a taxonomy is, the less likely it is that we will encounter a large portion of it in a query. At the same time, the larger a taxonomy is, the more important memory use becomes, to avoid running out of memory altogether. This makes me wonder if we can make the sparse/dense decision purely based on the size of the taxonomy.

  2. Be fast.

The sparse solution has to do some extra work hashing the ordinals, and maybe it also has worse memory locality. It's a fair question whether that difference is measurable.
If we go along with the idea above that small taxonomies should use dense values, then switching from dense to sparse values is even less likely to produce a measurable change. Maybe we could always use the sparse data structure?

In any case, this is mostly speculation. We would have to test it out.
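The 1/4-to-1/2 crossover mentioned above can be illustrated with back-of-envelope numbers. These are assumptions for the sketch, not measurements: a dense int[] costs 4 bytes per ordinal, and an open-addressed int-to-int map at a 50% load factor costs roughly two int slots each for keys and values per stored entry, i.e. about 16 bytes per entry.

```java
/** Back-of-envelope memory comparison (illustrative assumptions only):
 *  dense int[] = 4 bytes per ordinal; open-addressed int->int map at
 *  <=50% load factor ~= 16 bytes per stored entry. */
class FacetMemoryEstimate {
  static long denseBytes(long numOrdinals) {
    return 4 * numOrdinals;            // one int counter per ordinal
  }

  static long sparseBytes(long uniqueOrdinalsSeen) {
    return 16 * uniqueOrdinalsSeen;    // ~4 int slots per entry at 50% load
  }

  /** Fraction of all ordinals at which the sparse structure starts
   *  using more memory than the dense one, under these assumptions. */
  static double crossoverFraction() {
    return 4.0 / 16.0;                 // = 1/4, the lower bound above
  }
}
```

With a bigger per-entry overhead (e.g. boxed entries in a java.util.HashMap) the crossover moves lower still, which is why the exact fraction depends on the map implementation and when it resizes.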

@mikemccand
Member

I think we also must factor in the size (cardinality) of the taxonomy ordinals for that dimension. E.g. a color field that has at most a few hundred unique values really should use dense collection? The memory is bounded and small, and CPU cost is lower for dense collection.
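A cardinality-based cutoff like the one described could be sketched as below. This is not Lucene's actual logic; the 1024 cutoff and the class name are arbitrary placeholders, and the fallback reproduces the existing 10%-of-index heuristic from the issue description.

```java
/** Illustrative heuristic (not Lucene's actual logic): pick dense
 *  counting up front when a dimension's cardinality is small, since the
 *  memory is bounded and dense increments are cheaper; otherwise fall
 *  back to the existing hit-ratio heuristic. The 1024 cutoff is an
 *  arbitrary placeholder. */
class CollectionModeHeuristic {
  static final int SMALL_CARDINALITY = 1024;

  static boolean useDense(int dimensionCardinality, int numHits, int numDocs) {
    if (dimensionCardinality <= SMALL_CARDINALITY) {
      return true;  // e.g. a "color" field with a few hundred values
    }
    // Existing heuristic: dense if the match-set covers >10% of the index.
    return numHits > numDocs / 10;
  }
}
```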

@mikemccand
Member

Could we make the collection dynamic? Collect into a sparse structure at first, and if it gets too big, switch to dense.

@gautamworah96
Contributor

gautamworah96 commented Nov 13, 2023

I am not actively working on this problem as of now.
I am still in the process of figuring out what would be the correct thing to test/do here as a first step. Jotting down some thoughts that I had.

  1. We could try the "collect into a sparse structure at first, and if it gets too big, switch to dense" experiment. It adds a cost to the setValue call when it suddenly has to switch to a dense structure. But that should be fine? It all happens within the same query, so there would be no cases of a single query in a sea of queries suddenly being too slow. What heuristic should we use for when to switch?
  2. Do we need to make the setValue operation thread-safe? How would that work with the dynamic switching?

luceneutil needs to be hacked to record JFR stats (for vanilla mainline vs. candidate runs) to determine whether this change improves heap allocation.

@gsmiller
Contributor

> Could we make the collection dynamic? Collect into a sparse structure at first, and if it gets too big, switch to dense.

+1 to exploring this approach. Right now it's all based on an up-front heuristic. If switching proves cheap enough, changing direction as we get more information from visiting the collected hits could be a good compromise.

stefanvodita added a commit that referenced this issue Apr 5, 2024
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
stefanvodita added a commit to stefanvodita/lucene that referenced this issue May 10, 2024
stefanvodita added a commit that referenced this issue May 14, 2024
#12966 (#13358)

Reduce duplication in taxonomy facets; always do counts (#12966)
