-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Post grouping faceting [LUCENE-3097] #4170
Comments
Michael McCandless (@mikemccand) (migrated from JIRA)
+1, I think that semantics is intuitive. It treats each group as a doc w/ multi-valued field whose values are unioned from all docs within it. So group "Hotel a" has values AMS, DUS for the departure_airport field. |
Bill Bell (migrated from JIRA) Here is another example... Doctors have multiple offices. I want to store doctorid, doctor's name, gender (male/female), and office address as separate rows. Then I want to group by doctorid. I only want the one doctor. I then want to facet by gender and see the numbers after it is grouped. I also want the total rows to be after grouping. doctorid, doctor's name, gender, address I would expect the grouping to return: total rows = 7 I would expect if I say, rows=2, start=0, order by doctorid, I would get: 1, If I say, facet.field=gender I would expect: male: 2 (Bill Bell, Toby Williams) If we had Spatial, and I had lat long for each address, I would expect if I say sort=geodist() asc that it would group and then find the closest group_by If I only need the 1st point in the grouping I would expect the other points to be omitted. group_by Thanks. |
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks for the example Bill – that makes sense! I think, in general, the post-group faceting should act "as if" you had indexed a single document per group, with multi-valued fields containing the union of all field values within that group, and then done "normal" faceting. I believe this defines the semantics we are after for post-grouping faceting. |
Martijn van Groningen (@martijnvg) (migrated from JIRA)
I think this can be achieved by basing the facet counts on the normal documents. Ungrouped counts.
This just depends on the sort / group sort you provide. I think this should already work in the Solr trunk.
This depends on the group limit you provide in the request. |
Michael McCandless (@mikemccand) (migrated from JIRA) Right, gender in this example was single-valued per group. Another way to visualize / define how post-group faceting should behave is: imagine for ever facet value (ie field + value) you could define an aggregator. Today, that aggregator is just the count of how many docs had that value from the full result set. But you could, instead define it to be "count(distinct(doctor_id))", and then you'll get the group counts you want. (Other aggregators are conceivable – max(relevance), min+max(prices), etc.). Conceptually I think this also defines the post-group faceting functionality, even if we would never implement it this way (ie count(distinct(doctor_id)) would be way too costly to do naively). |
Michael McCandless (@mikemccand) (migrated from JIRA) In fact, I think a very efficient way to implement post-group faceting is something like #3528. Ie, we just have to insure, at indexing time, that docs within the same "group" are adjacent, if you want to be able to count by unique group values. Hmm... but I think this (what your "identifier" field is, for facet counting purposes) should be decoupled from how you group. I may group by State, for presentation purposes, but count facets by doctor_id. |
Martijn van Groningen (@martijnvg) (migrated from JIRA)
This means that in the same group also need to be in the same segment, right? Or if we use this mechanism for faceting documents with the same facet need to be in the same segment??? If that is true, it would make the collectors easier. The SentinelIntSet we use in the collectors is not necessary, because we can lookup the norm from the DocIndexTerms. We won't find the same group in a different segment. On the other hand with scalability in mind would make it complex. Since documents with the in the same group need to be in the same segment. Which makes indexing complex. |
Michael McCandless (@mikemccand) (migrated from JIRA) Right, this'd mean all docs sharing a given group value are contiguous and in the same segment. The app would have to ensure this, in order to use a collector that takes advantage of it. |
Simon Willnauer (@s1monw) (migrated from JIRA) Martjin, you should assigne this issue to you to make sure its not moved to version 3.3 |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Attached an initial patch with a collector that collects the most relevant documents for each group that match the query. This collector can be used to create facets based on grouped counts. Actually the collector has many implementations for different situations. For example when the group sort within the groups is only score or fields. There is a general implementation that works for all sorts (e.g. a function). Just as in the caching collector there is a factory method that selects the most efficient collector based on the group sort. TODO:
Feedback welcome! |
Michael McCandless (@mikemccand) (migrated from JIRA) Patch looks good Martijn! A few small things:
This would really benefit from the random test in TestGrouping :) This can indeed help with post-facet counting, but I think only on Once docs within one can have different values for field X then we |
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Oops I forgot. We need to use the general impl for that.
I think I'm doing that. If you look at the updateHead() methods. You see that I rebasing the ids.
But that would only happen when if an update happen during a search? Then all collectors can have this problem, right? |
Michael McCandless (@mikemccand) (migrated from JIRA)
Ahh excellent, I missed that. Looks good!
This is independent of updating during search I think. I don't think the existing collectors have a problem here? Ie the It's only because we intend for these new group collectors to make Say this is my original content:
But, I'm not using nested docs (#3528), so I had to fully
Now, if user does a search for "color=red"... without post-group With post-group faceting, you should see count=2 for color=red (which |
Bill Bell (migrated from JIRA) One way to do this would be to treat each grouping as unique fields. That would solve both use cases: My use case would work for top doc per group, but I can see that the counting looks for "unique values in the field per group". So your example would "look like for counting" for color: name=3-wolf shirt
color=red
color=blue
name=frog shirt
color=white
color=red color For size the counting looks like: name=3-wolf shirt
size=M, color=red
size=S, color=red
size=L, color=blue
name=frog shirt
size=M, color=white
size=S, color=red size And the facets for size would not change for: name=3-wolf shirt
size=M, color=red
size=S, color=red
size=L, color=blue
size=S, color=blue
size=S, color=blue
size=L, color=blue
name=frog shirt
size=M, color=white
size=S, color=red Thanks. |
Michael McCandless (@mikemccand) (migrated from JIRA) Right, I think for post-grouping facet counts, the facet counting |
Bill Bell (migrated from JIRA) OK... This issue seems stalled? Are we waiting on something else? |
Martijn van Groningen (@martijnvg) (migrated from JIRA)
For the current attached patch I think that we first need to have the same abstraction as the collectors in #4172 have. I think that it can be committed. After that we only need to wire it up in Solr (I'll open a new issue for that). Then we have the same behavior as in SOLR-236 patch with the facet.after option. Don't worry we'll get this soon! This patch only support computing the grouped counts. Not the other the other count variant. I think for that we also depend on the faceting module. |
Robert Muir (@rmuir) (migrated from JIRA) bulk move 3.2 -> 3.3 |
Michael McCandless (@mikemccand) (migrated from JIRA) Also, this patch won't properly count facets if the field ever has multiple values within one group. But maybe that's fine for the first go.... progress not perfection. |
Martijn van Groningen (@martijnvg) (migrated from JIRA)
That is true. If facet values are different within a group the current collectors in the patch won't notice that.
Definitely! But to continue I think we need the facet module. |
Martijn van Groningen (@martijnvg) (migrated from JIRA) An updated version of the patch. This is still work in progress. I basically rewrote the code in the same way as the other collectors were rewritten for #4172. Things todo are creating tests and add some more documentation. This patch only covers the second facet / grouping method. |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Attached updated patch.
Things to be done:
|
Martijn van Groningen (@martijnvg) (migrated from JIRA) Updated the patch.
I think this collector is ready to be committed. This collector implements the second grouping / faceting case that I've described in the issue description. |
Michael McCandless (@mikemccand) (migrated from JIRA) Hmm I hit a test failure w/ this patch:
Also: can this collector use the new FixedBitSet instead of OpenBitSet...? |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Updated the patch.
I think the patch is now ready to be committed! |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Updated patch again. I forgot to update some jdocs. |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Committed to trunk (rev. 1150470) and 3x branch (rev. 1150472). I'll keep this issue open for future developments. |
Michael McCandless (@mikemccand) (migrated from JIRA) The package.html still references OpenBitSet here? |
Martijn van Groningen (@martijnvg) (migrated from JIRA) I've fixed that. It now references FixedBitSet. |
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks/ Woops – also the comment in the java code in the package.html still says OpenBitSet! |
Bill Bell (migrated from JIRA) Set this to resolved? |
Simon Willnauer (@s1monw) (migrated from JIRA) martjin, is this done? seems like you committed to 3.x and trunk. if so can you close/resolve this issue? |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Well the code that got committed only creates facets for the most relevant document per group. This isn't really grouped facets. To implement this we need to modify Solr's faceting code / facet module code. So I think we can close this one and open a Solr issue to implement grouped facets in Solr (I do have some code for this, but it isn't perfect...) and maybe also an issue to add this to the faceting module |
Martijn van Groningen (@martijnvg) (migrated from JIRA) The support for real grouped faceting (matrix counts) needs to be added to Solr or faceting module. |
Ian Grainger (migrated from JIRA) Hi - is the matrix count feature available in Solr 3.5? Seeing as this is marked as closed I assume it is? If so do I need to do anything to use this feature? |
Ian Grainger (migrated from JIRA) Oh, sorry- I just read the previous comment properly - So the case I need fixing is SOLR-2898? |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Yes, if you're using Solr. You can try to apply the patch it should work for field facets. |
This issues focuses on implementing post grouping faceting.
And properly more implementation options.
The first two methods are implemented in the SOLR-236 patch. For the first option it calculates a DocSet based on the individual documents from the query result. For the second option it calculates a DocSet for all the most relevant documents of a group. Once the DocSet is computed the FacetComponent and StatsComponent use one the DocSet to create facets and statistics.
This last one is a bit more complex. I think it is best explained with an example. Lets say we search on travel offers:
If we group by hotel and have a facet for airport. Most end users expect (according to my experience off course) the following airport facet:
AMS: 2
DUS: 1
The above result can't be achieved by the first two methods. You either get counts AMS:3 and DUS:1 or 1 for both airports.
Migrated from LUCENE-3097 by Martijn van Groningen (@martijnvg), 5 votes, resolved Nov 14 2011
Attachments: LUCENE-3097.patch (versions: 5), LUCENE-30971.patch
Linked issues:
The text was updated successfully, but these errors were encountered: