Field Collapsing/Combining #256

Closed
ppearcy opened this Issue Jul 13, 2010 · 244 comments

Contributor

ppearcy commented Jul 13, 2010

Ability to collapse on a field. For example, I want the most relevant result from all different report types. Or similarly, the most recent result of each report type. Or maybe, I want to de-dup on headline.

So, the sort order would dictate which one from the group is returned. Similar to what is discussed here:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

From my understanding, it seems that in order for field collapsing to be efficient, the result set must be relatively small.

This is also referred to as "Combine" on some other search products.

Count this comment as a vote to have this feature added.

I could make good use of this feature. Go for it!

Fiedzia commented Sep 30, 2010

+1 vote for that

Yes, it's a really cool feature.

In SOLR, grouping is not supported for distributed search. If it's implemented, it could be a big plus for ElasticSearch.

The only workaround is to "group" the results on the client side, is that correct?
+1 For this. To have the logic on the server is what we need!

jeroenr commented Nov 2, 2010

+1 This sounds really useful

Contributor

apatrida commented Nov 9, 2010

This is probably a broader topic of collapsing (dropping dupes based on sort order although many times one field isn't enough to decide a good dedupe), or full rollups where you retain the individual documents within an aggregate replacement document ("5 books by this author").

There are fun issues with each, such as do you try to satisfy the requested window results? How does paging work when things are missing? Does the total document count get adjusted (but is still wrong as you don't know what other pages hold)? ...

Fiedzia commented Nov 9, 2010

For me this should work like "select distinct" in SQL, so I expect duplicates to be removed everywhere, including the total document count, pagination and the result window.

Contributor

apatrida commented Nov 9, 2010

At that point, it's a full group-by, and in SQL you are getting aggregate values back in functions, and sometimes undefined values if you ask for non-aggregate fields ... In the search engine, how are the other fields besides the rollup key being treated? Is it a grouping into a master aggregate document listing all the children, or at least the fact that there are children, such as what Endeca does? Or is it a deduping where the first one at highest relevancy wins even if many of the other fields differ outside of the key (you need compound keys then, as deduping on a single field isn't enough to make that desirable)?

Contributor

ppearcy commented Dec 14, 2010

Hey,
Just wanted to say that we are using our own poor man's version of this to satisfy some requirements by just requesting 10x the amount requested and collapsing down client side. Complete hack, but works 99% of the time.
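The over-fetch-and-collapse hack described above can be sketched roughly as follows. This is a minimal illustration, not the actual code used by the commenter: the `collapse_hits` helper, the hit dictionaries and the `report_type` field are all hypothetical, and the hits are assumed to arrive already sorted by the desired order.

```python
from typing import Any, Callable, Dict, Iterable, List


def collapse_hits(
    hits: Iterable[Dict[str, Any]],
    key: Callable[[Dict[str, Any]], Any],
    page_size: int,
) -> List[Dict[str, Any]]:
    """Keep the first hit seen for each key, preserving the incoming sort order."""
    seen = set()
    collapsed = []
    for hit in hits:
        k = key(hit)
        if k in seen:
            continue  # a better-ranked hit with this key was already kept
        seen.add(k)
        collapsed.append(hit)
        if len(collapsed) == page_size:
            break  # the requested page is full
    return collapsed


# Hypothetical hits, sorted by relevance and over-fetched for a page of 2.
hits = [
    {"id": 1, "report_type": "annual", "score": 3.2},
    {"id": 2, "report_type": "annual", "score": 2.9},
    {"id": 3, "report_type": "quarterly", "score": 2.5},
    {"id": 4, "report_type": "annual", "score": 2.1},
    {"id": 5, "report_type": "monthly", "score": 1.7},
]
page = collapse_hits(hits, key=lambda h: h["report_type"], page_size=2)
print([h["id"] for h in page])
```

If the over-fetched window contains fewer distinct keys than `page_size`, the page comes back short, which is presumably the case that fails the remaining 1% of the time.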

We're now applying this and adding facets to it with a two-phase approach. We first get the list of doc ids, then pass them in as a term list and facet on that query.

Was curious if there was any more efficient method of doing this?

Thanks,
Paul

+1 vote for this issue too.
This is a really useful feature. Think about an e-commerce shop that indexes every SKU. When looking for a product, a customer should see the products in the results list (and not the SKUs).

till commented May 10, 2011

subscribe

+1

plz don't make us switch to SOLR just for this feature
+1

Owner

kimchy commented May 13, 2011

Note that Solr does not implement it for distributed search (as far as I know), and the implementation is problematic (in my view).

till commented May 13, 2011

Are you referring to the "field collapse patch" floating around in their Jira? I haven't checked whether it made it into a recent release, so I don't know how up to date my info is. I just noticed that queries using the "field collapse patch" are an order of magnitude slower than queries without it.

Contributor

mikemccand commented May 18, 2011

Note that there is now (finally!) a new grouping module in Lucene -- see https://issues.apache.org/jira/browse/LUCENE-1421

It's been back-ported to 3.x, under lucene/contrib/grouping.

So in theory exposing this in ElasticSearch should be straightforward? (And, if it's not, I'd really like to know about that so we can fix it!).

There is some performance hit but not as bad as I had expected. See the 3 TermGroupXXX charts here: http://people.apache.org/~mikemccand/lucenebench -- it's ~ 2.3x-2.5X slower than the straight TermQuery, when grouping by a field with 100, 10K, 1M unique values (though, the sort and groupSort are relevance; maybe when sorting by other fields this is slower). This should also be the worst-case slowdown since TermQuery is such an "easy" query; queries which are "hard" and don't produce many results should see less net impact from the grouping overhead, I expect.

Owner

kimchy commented May 19, 2011

Cool!, saw that a few days ago, will definitely have a look.

tfreitas commented Jun 3, 2011

Hi, with the release of Lucene 3.2, one of its features is:
"A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field"
http://wiki.apache.org/lucene-java/ReleaseNote32

+1

0xPIT commented Jun 13, 2011

++1

bbock commented Jun 14, 2011

+1

selaux commented Jun 14, 2011

+1

jmayr commented Jun 14, 2011

+1

Contributor

mikemccand commented Jun 14, 2011

I'm also working on making it easier to distribute grouping, by adding static merge methods to TopDocs/TopGroups. I.e., each shard can run the 1st pass collector and send its top groups back to the front end; the front end merges the top groups (SearchGroup.merge) and issues a request to all shards to run the 2nd pass collector, gets the results back, and merges them with TopGroups.merge. This is all under https://issues.apache.org/jira/browse/LUCENE-3191
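The two-pass flow described above can be simulated in plain Python to make the data movement concrete. This is only a sketch of the idea, not Lucene's Java API: the function names, the `(doc_id, group_value, score)` tuples and the sample shards are all assumptions made for illustration, and groups are ranked by the score of their best document.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

Doc = Tuple[str, str, float]  # (doc_id, group_value, score), hypothetical shape


def first_pass(shard: List[Doc], top_n: int) -> Dict[str, float]:
    """Per shard: best score of each group, limited to the shard's top_n groups."""
    best: Dict[str, float] = {}
    for _, group, score in shard:
        best[group] = max(score, best.get(group, float("-inf")))
    return dict(heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1]))


def merge_groups(per_shard: List[Dict[str, float]], top_n: int) -> List[str]:
    """Front end: merge the shards' top groups into the global top_n groups."""
    merged: Dict[str, float] = {}
    for groups in per_shard:
        for group, score in groups.items():
            merged[group] = max(score, merged.get(group, float("-inf")))
    ranked = heapq.nlargest(top_n, merged.items(), key=lambda kv: kv[1])
    return [group for group, _ in ranked]


def second_pass(shard: List[Doc], groups: List[str], per_group: int) -> Dict[str, List[Doc]]:
    """Per shard: top per_group docs for each of the globally selected groups."""
    wanted = set(groups)
    by_group: Dict[str, List[Doc]] = defaultdict(list)
    for doc in shard:
        if doc[1] in wanted:
            by_group[doc[1]].append(doc)
    return {g: heapq.nlargest(per_group, docs, key=lambda d: d[2])
            for g, docs in by_group.items()}


def merge_docs(per_shard: List[Dict[str, List[Doc]]], groups: List[str],
               per_group: int) -> Dict[str, List[Doc]]:
    """Front end: merge the shards' per-group docs, keeping the best per_group."""
    result = {}
    for g in groups:
        docs = [d for shard_result in per_shard for d in shard_result.get(g, [])]
        result[g] = heapq.nlargest(per_group, docs, key=lambda d: d[2])
    return result


shards = [
    [("a1", "x", 0.9), ("a2", "y", 0.4), ("a3", "x", 0.3)],
    [("b1", "y", 0.8), ("b2", "z", 0.7), ("b3", "x", 0.6)],
]
top_groups = merge_groups([first_pass(s, 2) for s in shards], 2)
result = merge_docs([second_pass(s, top_groups, 2) for s in shards], top_groups, 2)
print(top_groups, result)
```

Each shard only needs to report its own top N groups in the first pass: a group that belongs in the global top N must also rank in the top N on the shard that holds its best document.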

Member

spinscale commented Jun 15, 2011

+1

+1

any news on whether https://issues.apache.org/jira/browse/LUCENE-1421 as mentioned by mikemccand will work in elasticsearch?

shtejv commented Jun 17, 2011

+1

+1

letier commented Jun 17, 2011

+1

aparo commented Jun 17, 2011

+1

+1

wolfs commented Jun 17, 2011

+1

wuan commented Jun 17, 2011

+1

dachev commented Jun 20, 2011

+1

mbj commented Jun 24, 2011

+1

+1

Contributor

ofavre commented Jun 28, 2011

+∞

Is this being worked on?
This is the only thing that keeps the company I am working for from using it at the moment.
We need it to get "unique" headers from news articles.
We could make our own frontend that does this, but we would rather have all search, sort and folding in the same software.
I can understand that this can be problematic in a cluster, when not all results are known.

How about this for a solution:
"Field Collapsing" the results in the nodes using Lucene functionality, to reduce the amount of data to be transported.
Then, on the node that received the request from the client, you do your own "Field Collapsing" when combining the results.

Hope it helps.

Contributor

ofavre commented Jul 4, 2011

Lucene 3.3 has improved its grouping (mainly more abstraction and multiple responses per group).
A few commits ago, ES has switched to Lucene 3.3 for upcoming version 0.17.
This is good news!

Any idea how long this might take to implement? Any status update on what still needs to be solved?
Thanks

Owner

kimchy commented Jul 4, 2011

Heya, an update on this: I plan to try and tackle this in the next version and see how it goes. The new Lucene version does come with grouping support (though it's not going to be tremendously fast, and it requires more memory). The change requires some internal changes in elasticsearch to represent the fact that grouping is being performed, to decide how to represent it, and to get it hooked into the internal single shard search and the distributed search.

+1

+1. Our use case is property search results, which might contain properties for a new Development (a large piece of land being built on by a Developer), which in turn might have properties (Plots) of more than one Style. Properties with the same Style might have different prices because they might have a slightly bigger garden, etc. We would want to offer the user the ability to collapse results on Development and Style. So if a Development had 100 properties in 5 styles, each style with 20 properties, we would expect to see 5 items in the results, which we would render differently to indicate the number of properties and the price range.

nahap commented Jul 25, 2011

+1

+1

Contributor

mattweber commented Sep 1, 2011

+1

kcheang commented Sep 18, 2011

+1

Member

medcl commented Sep 19, 2011

+1

Contributor

jprante commented Nov 4, 2011

+1

@Shay do you have any updates on this?

I noticed https://issues.apache.org/jira/browse/SOLR-2066

Contributor

karussell commented Nov 5, 2011

hey all, also have a look at child/parent feature ... http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html

electic commented Nov 20, 2011

+1 de-duping would be nice.

saxxi commented Sep 11, 2013

@phungleson - I've updated my stackoverflow answer and gist. I'm thinking on joining soon the IRC community as well, I hope to see you there too.

@wojons Collapsing is implemented that way in several enterprise search engines. One of the main problems is that no matter how many documents are returned by the shards, it might not be sufficient to fill your page of hits after postprocessing. When collapsing for diversity (as a search engine like Google does), it is not such a big problem to collapse the first result and then fall back to uncollapsed search. If your use case is, for instance, to group results per category in an e-commerce search engine, this may not be an option for you.

If some of you are interested in the way it is done in Solr, I described it in a blog post : http://fulmicoton.com/posts/grouping-in-solr/

mishu- commented Sep 17, 2013

+1 :)

+1 This just what I need +1000

edigu commented Oct 16, 2013

+1

Contributor

brusic commented Oct 16, 2013

The issue of field collapsing was addressed briefly in this blog post: http://www.elasticsearch.com/blog/from-amsterdam-with-love-elasticsearchs-second-company-all-hands/

"We again fleshed out what is needed in order to properly support field collapsing in a distributed environment execution, as well as the ability to get inner hits (for nested / parent child cases). We have a good idea on the type of refactoring we need in our search execution infrastructure, and hope to tackle it post 1.0."

The elasticsearch team is working on it and the timetable is somewhere post 1.0. You can now stop with the +1s. :)

scharf commented Oct 17, 2013

+1
to make "somewhere post 1.0" not too long after 1.0 ;-)

+1

mlpinit commented Oct 18, 2013

+1 :)

+1:)

ymost commented Nov 21, 2013

+1

I've been reading http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-aggregations.html,
it says
"Bucketing
A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, the bucket criteria are evaluated on every document in the context and, when a criterion matches, the document is considered to "fall in" the relevant bucket. By the end of the aggregation process, we’ll end up with a list of buckets - each one with a set of documents that "belong" to it."

Each document returned would have a score. Is it possible to select only the top 5 results from each bucket and then order them all together by the scores of these documents?
(The real-world scenario is a search engine that wants to select only the 5 highest-scoring documents belonging to each owner, but all the selected documents need to be put together to determine which document is displayed first.)
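The behaviour being asked for can be pinned down with a small client-side sketch. It only illustrates the desired result, not how Elasticsearch computes it; the `top_per_owner` helper and the hit dictionaries with `owner` and `score` keys are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def top_per_owner(hits: List[Dict[str, Any]], per_owner: int) -> List[Dict[str, Any]]:
    """Keep at most per_owner hits for each owner, ordered globally by score."""
    kept = []
    counts: Dict[Any, int] = defaultdict(int)
    # Walking the hits in descending score order means each owner keeps its
    # best hits, and the kept list is already in global score order.
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if counts[hit["owner"]] < per_owner:
            counts[hit["owner"]] += 1
            kept.append(hit)
    return kept


hits = [
    {"id": 1, "owner": "a", "score": 5.0},
    {"id": 2, "owner": "a", "score": 4.0},
    {"id": 3, "owner": "b", "score": 4.5},
    {"id": 4, "owner": "a", "score": 3.0},
    {"id": 5, "owner": "b", "score": 1.0},
]
selected = top_per_owner(hits, per_owner=2)
print([h["id"] for h in selected])
```

With `per_owner=5` this matches the scenario in the question; the sample uses 2 to keep the data small.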

thanks

dmr commented Jan 15, 2014

+1

Contributor

s1monw commented Jan 29, 2014

+1

Contributor

brusic commented Jan 29, 2014

Wait, did Simon just +1 this issue? :)

@s1monw Could you communicate on whether there is an ongoing project at elasticsearch on that? I was just thinking about resuming a grouping plugin project next weekend. I'd better drop it if the functionality will be shipped in next release.

Contributor

s1monw commented Jan 31, 2014

@poulejapon Obviously this feature is in great demand, so I am pretty sure that this is high priority, as it always was. I can't make any promises about when this will be implemented, but I can promise it's not shipping in the next release and is very unlikely in 1.1. To be done right, this feature needs a reasonable refactoring of the search execution layer, which is why we didn't crank it out already. The demand is, I think, obvious, and there is no need for further +1s on it unless you really need to express yourself, as I did, since I think it's important. Stay tuned, there is hope! :)

SaSa1983 commented Mar 4, 2014

+1

Kumen commented Mar 4, 2014

+1

adri commented Mar 10, 2014

+1

+1

+1

bompi88 commented Mar 26, 2014

+1

g00fy- commented Mar 27, 2014

+1

brupm commented Apr 1, 2014

👍

Firfi commented Apr 3, 2014

+1

imarsman commented Apr 4, 2014

This would be incredibly useful for the application I am writing for my company. I am, however, so amazed at how capable Elasticsearch already is that I feel it would be rude not to say thank you before adding my YES to this feature request.

petard commented Apr 4, 2014

+1

+1 this is a tie breaker for us right now when evaluating ES vs Solr

+1

zeelax commented May 12, 2014

+1

Owner

clintongormley commented May 12, 2014

See #6124, which looks like it will handle all field-collapsing requirements, in a distributed manner.

While neat, is it possible to perform aggregations against all collapsed documents? For example, collapse a set of books on the author field, then aggregate terms in the publisher field, to find the most common publishers by number of distinct authors?

Contributor

mattweber commented May 13, 2014

@thejohnfreeman I imagine #6124 is just the first step, but considering this is a bucket aggregator, what you describe should be possible. Keep an eye on the PR.

Member

martijnvg commented May 23, 2014

Let me +1 this issue for the last time :)

The top_hits aggregation will handle the field collapse requirements and #6124 is the first step.

@thejohnfreeman Right now top_hits can only be used as a leaf aggregation. Can your example also be implemented via two nested terms aggregations (first on the author field and then on publisher) and a top_hits aggregation as the leaf?
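The shape of such a request might look like the sketch below, which builds the search body as a Python dict. The `terms` and `top_hits` aggregation types come from the aggregations DSL discussed here; the field names `author` and `publisher`, the aggregation names `by_author`, `by_publisher` and `top_docs`, and the `collapse_request` helper are assumptions for illustration.

```python
import json
from typing import Any, Dict


def collapse_request(author_field: str = "author",
                     publisher_field: str = "publisher",
                     hits_per_bucket: int = 1) -> Dict[str, Any]:
    """Build a search body: terms on author, then terms on publisher, top_hits as leaf."""
    return {
        "size": 0,  # only the aggregation buckets are wanted, not the plain hits
        "aggs": {
            "by_author": {
                "terms": {"field": author_field},
                "aggs": {
                    "by_publisher": {
                        "terms": {"field": publisher_field},
                        "aggs": {
                            # Leaf aggregation: the best documents of each bucket.
                            "top_docs": {"top_hits": {"size": hits_per_bucket}}
                        },
                    }
                },
            }
        },
    }


body = collapse_request()
print(json.dumps(body, indent=2))
```

Note that each `by_publisher` bucket here counts documents for one author, so counting distinct authors per publisher would still need extra work on the client or a different nesting order.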

@martijnvg martijnvg closed this May 23, 2014

What about paging? As far as I can tell, there is no way to page agg results.

Member

martijnvg commented May 23, 2014

@artemredkin Pagination isn't supported yet, but it shouldn't be too difficult to add that.

Contributor

brusic commented May 23, 2014

+1

:)

Cool!
You are awesome :)

should I add an issue for pagination?

Member

javanna commented May 26, 2014

Hi @artemredkin we already have issue #6299 for it ;)

Got it, thanks!

Is there a master-snapshot version available through Maven? That way I could start on my development before 1.3.0 gets officially released.

Also, what would be a likely release date of 1.3.0?

You can build the 1.3.0 branch

It contains the aggregations feature

Released in http://www.elasticsearch.org/downloads/1-3-0/ - elasticsearch#6124 is referenced in release notes.

No traffic on this in almost a year. Should it be presumed that this issue is closed by #6124 ?

Contributor

brusic commented Jun 14, 2015

Correct.
