[DISCUSS] Identifying Gaps in Lucene’s Faceting #12553

Shradha26 · 2023-09-13T15:58:22Z

I’d like to gather a list of areas where Lucene’s support for aggregations can be improved and discuss if faceting can be augmented to offer that support or if it would need to be separate functionality. Please suggest more ideas or challenge the ones listed!

Description

Information Retrieval platforms built on top of Lucene like Solr, Elastic Search, and OpenSearch have rich aggregation engines that are different from what Lucene natively supports. Lucene has some unique ideas to make aggregation computation efficient. Some examples are -

side car taxonomy index that does some processing at index time to make search time computation of aggregations faster, but adds the hassle of maintaining another index, risk of it being corrupt, etc.
rolling up values from children to parents (post aggregation) to compute aggregations in some cases - reducing the overall number of values computed per aggregation.

Here are some ideas @stefanvodita and I encountered in our work and through exploration of what the other platforms support -

New features

Support dynamic group definitions: In Solr, defining an aggregation group is as simple as providing an iterator over documents. In Lucene, we can’t make arbitrary group definitions to that extent.
Support adding data for groups (ordinals). Association Facet Fields can do this; but one problem is that there’s no single authority for the data. For example, if we have an index of books where the Author is a Facet Field on the book document and we want to store the Author’s popularity, with Association Facet Fields, we’ll need to denormalize this value once per each book document. This introduces the possibility of some document changing/overwriting the intended value, inconsistently. For Taxonomy Facets, we could use the side car taxonomy index to efficiently add data about aggregation groups in normalized form: Index arbitrary fields in taxonomy docs #12337
Aggregation Expression Facets: Users may want to define expressions that reference other aggregation functions: [nocommit] Introduce ExpressionFacets along with a demo. #12184. Note that this is different from the “Expression Facets” in the current Lucene demos - those are “document expression facets” and are expressions defined using fields on the document and do not use aggregations in the definition.
Nested aggregations: This is similar to the idea of Expression Facets in that they are an aggregation over other aggregations, but this time the parent aggregation references aggregations from lower levels in a hierarchy. For example: if we have an index of books with a hierarchical Facet Field for Author like /, we want to be able to answer queries like “What is the nationality of the author with the most books”? To do this, we’ll need to compute an aggregation A1 = count(books) per Author (level 1 in the hierarchy) and then do a max(A1) per Nationality (level 2 in the hierarchy).
Cascading aggregation groups: I think this feature corresponds to nested facets in Solr. A clothing store could have products categorized by size and color, as different dimensions. They might want to give customers a navigation experience that breaks down sizes by color. The customer might like blue, but see that the store is in short supply of blue items of the right size, and select a different color instead.
API to associate aggregation groups with the aggregation functions: For single valued hierarchical facet fields, Lucene uses roll-ups to compute aggregation across the different levels. However, for all other cases, it uses an approach I’ll refer to as jump-ups - for each document, we iterate through all possible unique groups (ordinals) relevant to the document (irrespective of their relationship via a hierarchy), and update an aggregation accumulator corresponding to the group (ordinals). (This is the values array indexed by ordinal in TaxonomyFacet implementations). Right now, even the jump-up approach ends up computing the aggregation function across all groups. However, it has the potential to compute exactly the number of aggregations needed. By associating aggregation groups with the aggregation functions users are be interested in, Lucene can compute exact number of aggregations. It could also be beneficial in cases where users are interested in a variety of aggregations for different groups.
Enable deep hierarchies: Very deep ordinal hierarchies can have trouble scaling. Storing every ordinal in the path up to a label on a doc may not be feasible. Computing aggregations would be slower than it has to be if they are only needed for portions of the hierarchy. Short-circuiting roll-ups or aggregating directly into targeted ordinals would improve performance (see idea above).
Make facets generic: Some implementations depend on underlying primitives, which makes them efficient, but it also makes it so any improvement to IntTaxonomyFacets has to be reimplemented for FloatTaxonomyFacets. Also, if a user wants to define a new aggregation (e.g. AVG), they have to write a new faceting implementation too, since the existing facets don’t have generic accumulators. An AVG accumulator would need to maintain two values. The existing accumulators can only store one value per ordinal.

msokolov · 2023-09-19T13:16:56Z

So many ideas here! It's clear we have some room to grow this API. I wonder if we could organize them into a plan with dependencies and priorities. Also some of the ideas I'm not sure I understand. EG What do you mean by an aggregation group? Is this like counting documents that are either red or blue?

It seems like the first few things you called out (dynamic groupings, associating data with ordinals, facet aggregations that depend on each other, including nesting) are "user-facing" features and then the last two things are more like implementation or low-level API changes. Do we need to do the low-level changes to support these new user features?

Shradha26 · 2023-09-29T15:30:25Z

Thanks, Mike!

What do you mean by an aggregation group? Is this like counting documents that are either red or blue?

Yes, exactly.

Do we need to do the low-level changes to support these new user features?

For some of the ideas in here, yes. For example, the idea about support adding data for groups (ordinals) would require us to add some new behaviours for the taxonomy index - especially around updates to the index. I've also mentioned how Faceting implementations currently are not generic in that any improvement to IntTaxonomyFacets has to be reimplemented for FloatTaxonomyFacets.

As for the new features, it may be tricky to implement them over the existing Faceting implementation because it would require changing some assumptions. For an aggregation, I think we'll want to implement two concepts - an accumulator - where we keep a record of the running value for an aggregation & a reducer - which modifies the accumulator. In Lucene's faceting, an accumulator doesn't exist as a first class object; rather it is a feature of a class that performs the aggregation. This makes it difficult to define new aggregations - for example, Lucene's accumulators are defined in each aggregator class (this is any concrete child of Facets class) and assume there's only one value per-label, per-aggregation - like sum/count. But if were to implement other aggregations like average or median, we'd want to maintain multiple values per accumulator and then finally collapse them in some manner. Right now, we'd need to implement a new child of the Facets class, which doesn't isolate the reduction behaviour or allow generic accumulators to be passed in.

I wonder if we could organize them into a plan with dependencies and priorities.

Yes, I think that's a great idea. I'll try to organise the ideas in a dependency graph.

stefanvodita · 2023-10-06T08:06:17Z

One dependency I can point out is between the idea of nested aggregations and that of specific aggregation targets. With nested aggregations, we want to target some aggregation groups and exclude others. In the example above, we exclude nationalitites from the count aggregation and we exclude authors from the max aggregation.

This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets. 1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553. 2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576. 3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282. 4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585. 5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546. 6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded. 7. Add tests which were missing for sparse/dense values and non-positive aggregations.

sandeshkr419 · 2024-04-22T02:56:03Z

Thanks @stefanvodita, @Shradha26 for doing a thorough review of gaps with Lucene's faceting. I think that in summary, this is leaning in towards support for aggregation capabilities in Lucene out of the box given the present structure with Faceting is a stepping stone to aggregations but requires a lot of overhead by its consumers to actually implement aggregations.

Some of the points form my discussion with @msfroh:

As you already mentioned, that Lucene has some unique ideas to make aggregation computation efficient., but the caveat is that the gaps makes it very unlikely to actually use them in platforms like Elasticsearch, OpenSearch and any consumers who demand rich aggregations which is usually the case - isn't it that you need full aggregation capabilities?. What we think is that its high time that Lucene should have support for aggregations. To kickstart, the implementations can inspire from some of the already existing open-source projects like OpenSearch, Solr (I haven't checked those myself) but this can be open for discussion.

I was looking into OpenSearch's code base for aggregations and it looks like decoupling aggregations for Lucene might not be a very straightforward thing but its worth putting efforts into if the Lucene community thinks that rich aggregations will be a good inclusion in Lucene.

stefanvodita · 2024-04-23T15:47:05Z

Hi @sandeshkr419! I think it's a good idea to support richer aggregations at a lower level, in Lucene. If the OpenSearch community wants to migrate some of the aggregation functionality to Lucene and make it available for more people, that's great! Even just the cross-pollination of ideas between the projects should be useful.

mikemccand · 2024-05-10T16:44:03Z

+1 to cross-fertilize between OpenSearch's strong aggregations and Lucene's mostly-limited-to-counting (?) facets. If we cross-fertilize carefully, Lucene could provide the strong low level base API / building blocks for doing aggregations efficiently, working properly with modern Lucene features like intra-query concurrency (soon to also be decoupled from the index's segment geometry, I hope), first class query timeouts (IndexSearcher#setTimeout), optimizing when possible (maybe pulling from the Star Tree Index :) ), etc. OpenSearch then builds on this by parsing the incoming JSON aggregation request into the low level Lucene building blocks ...

This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets. 1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in apache#12553. 2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of apache#12576. 3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses apache#11282. 4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes apache#12585. 5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in apache#12546. 6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded. 7. Add tests which were missing for sparse/dense values and non-positive aggregations.

#12966 (#13358) Reduce duplication in taxonomy facets; always do counts (#12966) This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets. 1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553. 2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576. 3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282. 4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585. 5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546. 6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded. 7. Add tests which were missing for sparse/dense values and non-positive aggregations.

mikemccand · 2024-05-17T18:40:32Z

#13335 is an interesting example where Lucene could more efficiently implement faceting for numeric ranges using points, directly, instead of the two-step process facets uses today (collect into a bitset, then count from there based on doc values, I think?).

This was referenced Sep 29, 2023

Is it correct for facets to assume positive aggregation values? #12585

Closed

Compute multiple aggregations in one iteration of the match-set #12546

Open

stefanvodita mentioned this issue Dec 22, 2023

Reduce duplication in taxonomy facets; always do counts #12966

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSS] Identifying Gaps in Lucene’s Faceting #12553

[DISCUSS] Identifying Gaps in Lucene’s Faceting #12553

Shradha26 commented Sep 13, 2023

msokolov commented Sep 19, 2023

Shradha26 commented Sep 29, 2023

stefanvodita commented Oct 6, 2023

sandeshkr419 commented Apr 22, 2024

stefanvodita commented Apr 23, 2024

mikemccand commented May 10, 2024

mikemccand commented May 17, 2024

[DISCUSS] Identifying Gaps in Lucene’s Faceting #12553

[DISCUSS] Identifying Gaps in Lucene’s Faceting #12553

Comments

Shradha26 commented Sep 13, 2023

Description

New features

msokolov commented Sep 19, 2023

Shradha26 commented Sep 29, 2023

stefanvodita commented Oct 6, 2023

sandeshkr419 commented Apr 22, 2024

stefanvodita commented Apr 23, 2024

mikemccand commented May 10, 2024

mikemccand commented May 17, 2024