Field Collapsing/Combining #256

Closed
ppearcy opened this Issue Jul 13, 2010 · 244 comments

ppearcy commented Jul 13, 2010

Ability to collapse on a field. For example, I want the most relevant result from all different report types. Or similarly, the most recent result of each report type. Or maybe, I want to de-dup on headline.

So, the sort order would dictate which one from the group is returned. Similar to what is discussed here:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

From my understanding, it seems that in order for field collapsing to be efficient, the result set must be relatively small.

This is also referred to as "Combine" on some other search products.

Omega359 Aug 13, 2010

Count this comment as a vote to have this feature added.

kwloafman Sep 5, 2010

I could make good use of this feature. Go for it!

Fiedzia Sep 30, 2010

+1 vote for that

ekalyoncu Oct 29, 2010

Yes, it's a really cool feature.

ekalyoncu Oct 29, 2010

In SOLR, grouping is not supported for distributed search. If it's implemented, it could be a big plus for ElasticSearch.

giorgiovinci Oct 29, 2010

The only workaround is to "group" the results on the client side, is that correct?
+1 for this. Having the logic on the server is what we need!

jeroenr Nov 2, 2010

+1 This sounds really useful

apatrida Nov 9, 2010

Contributor

This is probably a broader topic of collapsing (dropping dupes based on sort order although many times one field isn't enough to decide a good dedupe), or full rollups where you retain the individual documents within an aggregate replacement document ("5 books by this author").

There are fun issues with each, such as do you try to satisfy the requested window results? How does paging work when things are missing? Does the total document count get adjusted (but is still wrong as you don't know what other pages hold)? ...
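
To make the paging question concrete, here is a minimal Python sketch (not Elasticsearch code; the documents and the `collapse` helper are made up for illustration). It collapses each page separately and compares that with collapsing the whole result set first.

```python
from typing import Dict, List


def collapse(hits: List[Dict], field: str) -> List[Dict]:
    """Keep the first (best-ranked) hit for each distinct value of `field`."""
    seen = set()
    kept = []
    for hit in hits:
        key = hit[field]
        if key not in seen:
            seen.add(key)
            kept.append(hit)
    return kept


# Hypothetical ranked result list: six hits, three distinct authors.
ranked = [
    {"id": 1, "author": "a"},
    {"id": 2, "author": "a"},
    {"id": 3, "author": "b"},
    {"id": 4, "author": "a"},
    {"id": 5, "author": "c"},
    {"id": 6, "author": "b"},
]

page_size = 3

# Naive approach: cut pages first, then collapse each page on its own.
page1 = collapse(ranked[0:page_size], "author")
page2 = collapse(ranked[page_size:2 * page_size], "author")

# Alternative: collapse the whole result set, then page over what is left.
collapsed_all = collapse(ranked, "author")

print([h["id"] for h in page1])         # page 1 holds fewer hits than page_size
print([h["id"] for h in page2])         # page 2 repeats authors seen on page 1
print(len(ranked), len(collapsed_all))  # raw total vs. number of distinct groups
```

With per-page collapsing the first page comes back short and the second page repeats authors "a" and "b", while the raw total (6) overstates the number of groups (3).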

Fiedzia Nov 9, 2010

For me this should work like "select distinct" in SQL, so I expect duplicates to be removed everywhere, including the total document count, pagination and the result window.

apatrida Nov 9, 2010

Contributor

At that point, it's a full group-by, and in SQL you are getting aggregate values back in functions, and sometimes undefined values if you ask for non-aggregate fields ... In the search engine, how are the other fields besides the rollup key being treated? Is it a grouping into a master aggregate document listing all the children, or at least the fact that there are children, such as what Endeca does? Or is it a deduping where the first one at highest relevancy wins, even if many of the other fields differ outside of the key (you need compound keys then, as deduping on a single field isn't enough to make that desirable)?

ppearcy Dec 14, 2010

Contributor

Hey,
Just wanted to say that we are using our own poor man's version of this to satisfy some requirements by just requesting 10x the amount requested and collapsing down client side. Complete hack, but works 99% of the time.

We're now applying this and adding facets to it with a two-phase approach. We first get the list of doc ids, then pass them in as a term list and facet on that query.

Was curious if there was any more efficient method of doing this?

Thanks,
Paul
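
A minimal Python sketch of the over-fetch-and-collapse hack described above. The `search` callable, `fake_search` and the `report_type` field are stand-ins invented for illustration, not an Elasticsearch client API.

```python
from typing import Callable, Dict, List


def collapsed_search(search: Callable[[int], List[Dict]], field: str,
                     wanted: int, overfetch: int = 10) -> List[Dict]:
    """Request wanted * overfetch hits, then keep the first hit per field value."""
    hits = search(wanted * overfetch)
    seen = set()
    result = []
    for hit in hits:
        key = hit.get(field)
        if key in seen:
            continue  # a better-ranked hit with this key was already kept
        seen.add(key)
        result.append(hit)
        if len(result) == wanted:
            break
    return result


def fake_search(size: int) -> List[Dict]:
    """Hypothetical backend: 100 ranked docs spread over 4 report types."""
    corpus = [{"id": i, "report_type": "type%d" % (i % 4)} for i in range(100)]
    return corpus[:size]


top = collapsed_search(fake_search, "report_type", wanted=3)
print([(h["id"], h["report_type"]) for h in top])

# The hack breaks down when there are fewer distinct values than requested:
short = collapsed_search(fake_search, "report_type", wanted=10)
print(len(short))
```

The second call shows the failure mode: the caller asked for 10 collapsed results but only 4 distinct values exist in the fetched window, so the list comes back short.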

dmartinpro Apr 4, 2011

+1 vote for this issue too.
This is a really useful feature. Think about an e-commerce shop that indexes every SKU. When searching for a product, a customer should see products in the results list, not SKUs.

till May 10, 2011

subscribe

@tfreitas

+1

vincenttheeten May 13, 2011

plz don't make us switch to SOLR just for this feature
+1

kimchy May 13, 2011

Member

Note that Solr does not implement it for distributed search (as far as I know), and the implementation is problematic (my view).

till May 13, 2011

Are you referring to the "field collapse patch" floating around in their Jira? I haven't checked whether it made it into a recent release, so I don't know how up to date my info is. I just noticed that queries using the "field collapse patch" are an order of magnitude slower than queries without it.

mikemccand May 18, 2011

Contributor

Note that there is now (finally!) a new grouping module in Lucene -- see https://issues.apache.org/jira/browse/LUCENE-1421

It's been back-ported to 3.x, under lucene/contrib/grouping.

So in theory exposing this in ElasticSearch should be straightforward? (And, if it's not, I'd really like to know about that so we can fix it!).

There is some performance hit but not as bad as I had expected. See the 3 TermGroupXXX charts here: http://people.apache.org/~mikemccand/lucenebench -- it's ~ 2.3x-2.5X slower than the straight TermQuery, when grouping by a field with 100, 10K, 1M unique values (though, the sort and groupSort are relevance; maybe when sorting by other fields this is slower). This should also be the worst-case slowdown since TermQuery is such an "easy" query; queries which are "hard" and don't produce many results should see less net impact from the grouping overhead, I expect.

kimchy May 19, 2011

Member

Cool! Saw that a few days ago, will definitely have a look.

tfreitas Jun 3, 2011

Hi, with the release of Lucene 3.2, one of its features is:
"A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field"
http://wiki.apache.org/lucene-java/ReleaseNote32

@darxriggs

+1

@aaronbinns

+1

0xPIT commented Jun 13, 2011

++1

@mkreidenweis

+1

bbock commented Jun 14, 2011

+1

selaux commented Jun 14, 2011

+1

jmayr commented Jun 14, 2011

+1

mikemccand Jun 14, 2011

Contributor

I'm also working on making it easier to distribute grouping, by adding static merge methods to TopDocs/TopGroups. I.e., each shard can run the 1st pass collector and send its top groups back to the front end; the front end merges the top groups (SearchGroup.merge) and issues a request to all shards to run the 2nd pass collector, gets the results back, and merges them with TopGroups.merge. This is all under https://issues.apache.org/jira/browse/LUCENE-3191
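
The two-pass flow described above can be simulated in plain Python. This is only a sketch of the idea under simplifying assumptions (groups ranked by their best score, documents reduced to `(group, score)` pairs); the function names are hypothetical stand-ins and not the Lucene API.

```python
from typing import Dict, List, Tuple

Doc = Tuple[str, float]  # (group value, score)


def first_pass(shard: List[Doc], top_n: int) -> List[Tuple[str, float]]:
    """Per shard: find each group's best score and return the top_n groups."""
    best: Dict[str, float] = {}
    for group, score in shard:
        if score > best.get(group, float("-inf")):
            best[group] = score
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def merge_groups(per_shard: List[List[Tuple[str, float]]], top_n: int) -> List[str]:
    """Front end: merge the shards' candidate groups into the global top_n."""
    best: Dict[str, float] = {}
    for groups in per_shard:
        for group, score in groups:
            if score > best.get(group, float("-inf")):
                best[group] = score
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [group for group, _ in ranked[:top_n]]


def second_pass(shard: List[Doc], groups: List[str], per_group: int) -> Dict[str, List[float]]:
    """Per shard: collect the best per_group scores for each chosen group."""
    found: Dict[str, List[float]] = {g: [] for g in groups}
    for group, score in shard:
        if group in found:
            found[group].append(score)
    return {g: sorted(s, reverse=True)[:per_group] for g, s in found.items()}


def merge_top_groups(per_shard: List[Dict[str, List[float]]],
                     groups: List[str], per_group: int) -> Dict[str, List[float]]:
    """Front end: merge the per-shard documents of every chosen group."""
    merged = {}
    for g in groups:
        scores = [s for shard_result in per_shard for s in shard_result[g]]
        merged[g] = sorted(scores, reverse=True)[:per_group]
    return merged


# Two hypothetical shards.
shards = [
    [("a", 0.9), ("b", 0.5), ("a", 0.4)],
    [("b", 0.8), ("c", 0.7), ("a", 0.6)],
]
top_n, per_group = 2, 2

candidates = [first_pass(s, top_n) for s in shards]          # round trip 1
groups = merge_groups(candidates, top_n)
partial = [second_pass(s, groups, per_group) for s in shards]  # round trip 2
result = merge_top_groups(partial, groups, per_group)

print(groups)
print(result)
```

Note that the second round trip is what lets shard 2 contribute its "a" document even though "a" was not among that shard's own top groups.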

Member

spinscale commented Jun 15, 2011

+1

stevencasey Jun 17, 2011

+1

any news on whether https://issues.apache.org/jira/browse/LUCENE-1421 as mentioned by mikemccand will work in elasticsearch?

@marvinthepa

+1

shtejv commented Jun 17, 2011

+1

@theone1984

+1

@liebharc

+1

letier commented Jun 17, 2011

+1

aparo commented Jun 17, 2011

+1

@Karthago

+1

wolfs commented Jun 17, 2011

+1

wuan commented Jun 17, 2011

+1

dachev commented Jun 20, 2011

+1

brusic Jan 29, 2014

Contributor

Wait, did Simon just +1 this issue? :)

fulmicoton Jan 31, 2014

@s1monw Could you let us know whether there is an ongoing project at elasticsearch on that? I was just thinking about resuming a grouping plugin project next weekend. I'd better drop it if the functionality will be shipped in the next release.

s1monw Jan 31, 2014

Contributor

@poulejapon Obviously this feature is in great demand, so I am pretty sure that this is high priority, as it always was. I can't make any promises about when this will be implemented, but I can promise it's not shipping in the next release and very unlikely in 1.1. To be done right, this feature needs a reasonable refactoring of the search execution layer, which is why we didn't crank it out already. The demand is, I think, obvious, and there is no need for further +1s on it unless you really need to express yourself, as I did, since I think it's important. Stay tuned, there is hope! :)

@maddin4code

+1

SaSa1983 commented Mar 4, 2014

+1

Kumen commented Mar 4, 2014

+1

adri commented Mar 10, 2014

+1

@recurrence

+1

@daniilyar

+1

@davengeo

+1

bompi88 commented Mar 26, 2014

+1

g00fy- commented Mar 27, 2014

+1

brupm commented Apr 1, 2014

👍

Firfi commented Apr 3, 2014

+1

imarsman Apr 4, 2014

This would be incredibly useful for the application I am writing for my company. I am, however, so amazed at how capable Elasticsearch already is that I feel it would be rude not to say thank you before adding my YES to this feature request.

petard commented Apr 4, 2014

+1

grishick Apr 12, 2014

+1 this is a tie breaker for us right now when evaluating ES vs Solr

@Limfocit

+1

zeelax commented May 12, 2014

+1

clintongormley May 12, 2014

Member

See #6124, which looks like it will handle all field-collapsing requirements, in a distributed manner.

thejohnfreeman May 13, 2014

While neat, is it possible to perform aggregations against all collapsed documents? For example, collapse a set of books on the author field, then aggregate terms in the publisher field, to find the most common publishers by number of distinct authors?

mattweber May 13, 2014

Contributor

@thejohnfreeman I imagine #6124 is just the first step, but considering this is a bucket aggregator, what you describe should be possible. Keep an eye on the PR.

martijnvg May 23, 2014

Member

Let me +1 this issue for the last time :)

The top_hits aggregation will handle the field collapse requirements and #6124 is the first step.

@thejohnfreeman Right now top_hits can only be used as a leaf aggregation. Can your example also be implemented via two nested terms aggregations (first on the author field and then on publisher) and a top_hits aggregation as the leaf?
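
For readers who want to see the shape of such a request, here is a sketch that builds the search body as a Python dict: two nested `terms` aggregations with a `top_hits` leaf. The field names `author` and `publisher` come from the question above; the aggregation names (`by_author`, `by_publisher`, `top`) and the `collapse_request` helper are made up for illustration, and whether this fully answers the distinct-authors-per-publisher question is left open in the thread.

```python
import json
from typing import Any, Dict


def collapse_request(query: Dict[str, Any], outer_field: str, inner_field: str,
                     hits_per_bucket: int = 1) -> Dict[str, Any]:
    """Build a search body: terms on outer_field, terms on inner_field, top_hits leaf."""
    return {
        "query": query,
        "size": 0,  # only the aggregation buckets are of interest here
        "aggs": {
            "by_" + outer_field: {
                "terms": {"field": outer_field},
                "aggs": {
                    "by_" + inner_field: {
                        "terms": {"field": inner_field},
                        "aggs": {
                            # top_hits must be the leaf, as noted above
                            "top": {"top_hits": {"size": hits_per_bucket}}
                        },
                    }
                },
            }
        },
    }


body = collapse_request({"match_all": {}}, "author", "publisher")
print(json.dumps(body, indent=2))
```

Plain field collapsing (one best hit per group) is the same pattern with a single `terms` level and `top_hits` directly beneath it.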

@martijnvg martijnvg closed this May 23, 2014

artemredkin May 23, 2014

What about paging? As far as I can tell, there is no way to page agg results.

martijnvg May 23, 2014

Member

@artemredkin Pagination isn't supported yet, but it shouldn't be too difficult to add that.

brusic May 23, 2014

Contributor

+1

:)

artemredkin May 23, 2014

Cool!
You are awesome :)

artemredkin May 25, 2014

Should I add an issue for pagination?

javanna May 26, 2014

Member

Hi @artemredkin we already have issue #6299 for it ;)

artemredkin May 26, 2014

Got it, thanks!

vvaradhan Jun 26, 2014

Is there a master-snapshot version available through Maven? I could start my development against it until 1.3.0 is officially released.

Also, what would be a likely release date of 1.3.0?

SaSa1983 Jun 26, 2014

You can build the 1.3.0 branch

It contains the aggregations feature

@dadoonet

@mikemccabe

Released in http://www.elasticsearch.org/downloads/1-3-0/ - elasticsearch#6124 is referenced in release notes.

JnBrymn-EB Jun 14, 2015

No traffic on this in almost a year. Should it be presumed that this issue is closed by #6124 ?

brusic Jun 14, 2015

Contributor

Correct.