Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support getting counts from "association" facets [LUCENE-10246] #11282

Open
asfimport opened this issue Nov 19, 2021 · 5 comments
Open

Support getting counts from "association" facets [LUCENE-10246] #11282

asfimport opened this issue Nov 19, 2021 · 5 comments

Comments

@asfimport
Copy link

We have these nice "association" facet implementations today that aggregate "weights" from the docs that facet over, but they don't keep track of counts. So the user can get "top-n" values for a dim by aggregated weight (great!), but can't know how many docs matched each value. It would be nice to support this so users could show the top-n values but also show counts associated with each.


Migrated from LUCENE-10246 by Greg Miller (@gsmiller), updated Jul 01 2022

@asfimport
Copy link
Author

Greg Miller (@gsmiller) (migrated from JIRA)

FYI @msokolov, opened this issue based on your comments in PR #264

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

Thanks, Greg! I think we might also want an option to report the aggregated weights, rather than the counts. Consider merging multiple such facet responses across a sharded  index; you'd need to preserve the aggregated weights in order to properly rank the facets across the entire index.

@asfimport
Copy link
Author

Greg Miller (@gsmiller) (migrated from JIRA)

@msokolov yeah, that's a good point. This is something users can do today (get back aggregated weights) and it's definitely something we need to preserve. Sounds like there could be good use-cases though for wanting both counts and weights.

@asfimport
Copy link
Author

Rushabh Shah (@shahrs87) (migrated from JIRA)

@gsmiller @msokolov
I am pretty new to LUCENE project and want to contribute to small jiras to improve my lucene knowledge. Can you please help me scope the work required for this patch and point me to some relevant classes in the source code. Thank you.

@asfimport
Copy link
Author

Greg Miller (@gsmiller) (migrated from JIRA)

@shahrs87 I'd start by becoming familiar with the existing "association facet" implementations (TaxonomyFacetIntAssociations and TaxonomyFacetFloatAssociations as well as looking at some demo code like AssociationsFacetsExample). The API contract they implement represent results with FacetResult, which contains a list of LabelAndValue instances. LabelAndValue only models a single label along with a single numeric value. The value "usually" represents a total faceting count for a label in "non-association" facets, but with association faceting, value takes on an aggregated weight "associated" with the label.

The idea with this Jira is to be able to convey both an aggregated weight and the count associated with a label. The best way to do that without creating a weird API for non-association cases is something that will probably take a little thought. Should we just put another "count" field in LabelAndValue and have both value and count be populated with a count for non-association cases? That sounds weird.

So beyond understanding what's currently there, I think the next step is to think about the right way to evolve the API that doesn't create a weird interaction for non-association faceting, especially since those are more commonly used.

Please reach out here as you have questions and I'll do my best to answer in a timely fashion. Thanks for having a look at this!

stefanvodita added a commit that referenced this issue Apr 5, 2024
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
stefanvodita added a commit to stefanvodita/lucene that referenced this issue May 10, 2024
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in apache#12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of apache#12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses apache#11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes apache#12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in apache#12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
stefanvodita added a commit that referenced this issue May 14, 2024
#12966 (#13358)

Reduce duplication in taxonomy facets; always do counts (#12966)

This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.

1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant