
Always collect sparsely in TaxonomyFacets & switch to dense if there are enough unique labels #12576

Open
Shradha26 opened this issue Sep 20, 2023 · 6 comments

Comments

@Shradha26

Shradha26 commented Sep 20, 2023

Description

Right now (at least in IntTaxonomyFacets), we choose a dense integer array for counting if the number of hits is greater than 10% of the index size; otherwise we count sparsely. Counting densely may still be a poor choice if we have a huge number of hits but they belong to only a few labels, in effect making the values sparse.

We could do something like DocIdSetBuilder does: at first it uses a sparse structure to gather documents, then upgrades to a dense bit set once enough hits match.
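To make the DocIdSetBuilder-style upgrade concrete for counting, here is a minimal sketch. This is not Lucene code: the class name, the HashMap backing, and the 1/8 upgrade threshold are all illustrative placeholders.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch (not Lucene code): counts ordinals sparsely in a
 *  hash map and upgrades to a dense int[] once enough unique ordinals
 *  have been seen. The 1/8 threshold is an arbitrary placeholder. */
class UpgradingCounter {
  private final int maxOrdinal;                // taxonomy size
  private Map<Integer, Integer> sparse = new HashMap<>();
  private int[] dense;                         // null until we upgrade

  UpgradingCounter(int maxOrdinal) {
    this.maxOrdinal = maxOrdinal;
  }

  void increment(int ordinal) {
    if (dense != null) {
      dense[ordinal]++;
      return;
    }
    sparse.merge(ordinal, 1, Integer::sum);
    // Upgrade once the sparse map would plausibly use more memory than
    // the dense array (placeholder heuristic: 1/8 of all ordinals seen).
    if (sparse.size() > maxOrdinal / 8) {
      dense = new int[maxOrdinal];
      for (Map.Entry<Integer, Integer> e : sparse.entrySet()) {
        dense[e.getKey()] = e.getValue();
      }
      sparse = null;                           // drop the sparse structure
    }
  }

  int count(int ordinal) {
    return dense != null ? dense[ordinal] : sparse.getOrDefault(ordinal, 0);
  }

  boolean isDense() {
    return dense != null;
  }
}
```

The one-time upgrade cost is linear in the number of entries collected so far, and it happens at most once per query.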

FloatTaxonomyFacets seems to always collect densely. Maybe it's also worth moving this decision-making into the parent class instead?

@gautamworah96
Contributor

@Shradha26 I would like to pick this task up in case you are not already working on it.

In your benchmarks/efforts, have you tried to test whether these sparse vs. dense arrays provide any benefit at all? Basically, could it be the case that the performance of both is essentially the same? In that case, we could just stick to one and simplify the logic.

@stefanvodita
Contributor

The way I see it, the sparse/dense decision for those faceting data structures tries to balance two optimizations.

  1. Reduce the amount of memory in use.

The dense solution has a fixed memory cost. The sparse solution has a variable memory cost, proportional to the number of ordinals present in the match-set. Depending on when the map gets resized, the sparse solution ends up overtaking the dense one in terms of memory used when it records 1/4 to 1/2 of all the existing ordinals. If we are to dynamically switch from sparse to dense, the question we're answering is whether we expect to see more than 1/4 of all ordinals in the match-set. A good heuristic might be that the larger a taxonomy is, the less likely it is that we will encounter a large portion of it in a query. At the same time, the larger a taxonomy is, the more important memory use becomes, to avoid running out of memory altogether. This makes me wonder if we can make the sparse/dense decision purely based on the size of the taxonomy.

  2. Be fast.

The sparse solution has to do some extra work hashing the ordinals, and maybe it also has worse memory locality. It's a fair question whether that difference is measurable.
If we go along with the idea above that small taxonomies should use dense values, then switching from dense to sparse values is even less likely to produce a measurable change. Maybe we could always use the sparse data structure?

In any case, this is mostly speculation. We would have to test it out.
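The 1/4-to-1/2 crossover mentioned above can be illustrated with back-of-envelope numbers. These are assumptions for the sketch, not measurements: a dense int[] costs 4 bytes per ordinal, and an open-addressed int-to-int map at a 50% load factor costs roughly two int slots each for keys and values per stored entry, i.e. about 16 bytes per entry.

```java
/** Back-of-envelope memory comparison (illustrative assumptions only):
 *  dense int[] = 4 bytes per ordinal; open-addressed int->int map at
 *  <=50% load factor ~= 16 bytes per stored entry. */
class FacetMemoryEstimate {
  static long denseBytes(long numOrdinals) {
    return 4 * numOrdinals;            // one int counter per ordinal
  }

  static long sparseBytes(long uniqueOrdinalsSeen) {
    return 16 * uniqueOrdinalsSeen;    // ~4 int slots per entry at 50% load
  }

  /** Fraction of all ordinals at which the sparse structure starts
   *  using more memory than the dense one, under these assumptions. */
  static double crossoverFraction() {
    return 4.0 / 16.0;                 // = 1/4, the lower bound above
  }
}
```

With a bigger per-entry overhead (e.g. boxed entries in a java.util.HashMap) the crossover moves lower still, which is why the exact fraction depends on the map implementation and when it resizes.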

@mikemccand
Member

I think we also must factor in the size (cardinality) of the taxonomy ordinals for that dimension. E.g. a color field that has at most a few hundred unique values really should use dense collection? The memory is bounded and small, and CPU cost is lower for dense collection.
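A cardinality-based cutoff like the one described could be sketched as below. This is not Lucene's actual logic; the 1024 cutoff and the class name are arbitrary placeholders, and the fallback reproduces the existing 10%-of-index heuristic from the issue description.

```java
/** Illustrative heuristic (not Lucene's actual logic): pick dense
 *  counting up front when a dimension's cardinality is small, since the
 *  memory is bounded and dense increments are cheaper; otherwise fall
 *  back to the existing hit-ratio heuristic. The 1024 cutoff is an
 *  arbitrary placeholder. */
class CollectionModeHeuristic {
  static final int SMALL_CARDINALITY = 1024;

  static boolean useDense(int dimensionCardinality, int numHits, int numDocs) {
    if (dimensionCardinality <= SMALL_CARDINALITY) {
      return true;  // e.g. a "color" field with a few hundred values
    }
    // Existing heuristic: dense if the match-set covers >10% of the index.
    return numHits > numDocs / 10;
  }
}
```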

@mikemccand
Member

Could we make the collection dynamic? Collect into a sparse structure at first, and if it gets too big, switch to dense.

@gautamworah96
Contributor

gautamworah96 commented Nov 13, 2023

I am not actively working on this problem as of now.
I am still in the process of figuring out what would be the correct thing to test/do here as a first step. Jotting down some thoughts that I had.

  1. We could try the "collect into a sparse structure at first, and if it gets too big, switch to dense" experiment. It adds a cost to the setValue call when it suddenly has to switch to a dense structure. But that should be fine? It all happens within the same query, so there would be no cases of a single query in a sea of queries suddenly being too slow. What heuristic should we use for when to switch?
  2. Do we need to make the setValue operation thread-safe? How would that work with the dynamic switching?

luceneutil needs to be hacked to record JFR stats (for vanilla mainline vs. candidate runs) to determine whether this change improves heap allocation.

@gsmiller
Contributor

> Could we make the collection dynamic? Collect into a sparse structure at first, and if it gets too big, switch to dense.

+1 to exploring this approach. Right now it's all based on an up-front heuristic. If switching proves cheap enough, changing direction as we get more information from visiting the collected hits could be a good compromise.

stefanvodita added a commit that referenced this issue Apr 5, 2024
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
stefanvodita added a commit to stefanvodita/lucene that referenced this issue May 10, 2024
stefanvodita added a commit that referenced this issue May 14, 2024
#12966 (#13358)

Reduce duplication in taxonomy facets; always do counts (#12966)
