Field Collapsing/Combining #256

Closed
ppearcy opened this Issue Jul 13, 2010 · 244 comments

Contributor

ppearcy commented Jul 13, 2010

Ability to collapse on a field. For example, I want the most relevant result from all different report types. Or similarly, the most recent result of each report type. Or maybe, I want to de-dup on headline.

So, the sort order would dictate which one from the group is returned. Similar to what is discussed here:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

From my understanding, it seems that in order for field collapsing to be efficient, the result set must be relatively small.

This is also referred to as "Combine" on some other search products.

Omega359 commented Aug 13, 2010

Count this comment as a vote to have this feature added.

kwloafman commented Sep 5, 2010

I could make good use of this feature. Go for it!

Fiedzia commented Sep 30, 2010

+1 vote for that

ekalyoncu commented Oct 29, 2010

Yes, it's a really cool feature.

ekalyoncu commented Oct 29, 2010

In Solr, grouping is not supported for distributed search. If it's implemented, it could be a big plus for ElasticSearch.

giorgiovinci commented Oct 29, 2010

Is it correct that the only workaround is to "group" the results on the client side?
+1 for this. Having the logic on the server is what we need!

jeroenr commented Nov 2, 2010

+1 This sounds really useful

Contributor

apatrida commented Nov 9, 2010

This is probably a broader topic of collapsing (dropping dupes based on sort order although many times one field isn't enough to decide a good dedupe), or full rollups where you retain the individual documents within an aggregate replacement document ("5 books by this author").

There are fun issues with each, such as do you try to satisfy the requested window results? How does paging work when things are missing? Does the total document count get adjusted (but is still wrong as you don't know what other pages hold)? ...

Fiedzia commented Nov 9, 2010

For me this should work like "select distinct" in SQL, so I expect duplicates to be removed everywhere, including the total document count, pagination, and the result window.

Contributor

apatrida commented Nov 9, 2010

At that point, it's a full group-by, and in SQL you get aggregate values back from functions, and sometimes undefined values if you ask for non-aggregate fields. In the search engine, how are the other fields besides the rollup key being treated? Is it a grouping into a master aggregate document listing all the children, or at least the fact that there are children, such as what Endeca does? Or is it a deduping where the first one at highest relevancy wins, even if many of the other fields differ outside of the key (you need compound keys then, as deduping on a single field isn't enough to make that desirable)?

Contributor

ppearcy commented Dec 14, 2010

Hey,
Just wanted to say that we are using our own poor man's version of this to satisfy some requirements by just requesting 10x the amount requested and collapsing down client side. Complete hack, but works 99% of the time.

We're now applying this and adding facets to it with a two-phase approach. We first get the list of doc IDs, then pass them in as a term list and facet on that query.

Was curious if there was any more efficient method of doing this?

Thanks,
Paul
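
For readers who want to picture the over-fetch-and-collapse workaround described above, here is a minimal sketch. It is not ppearcy's actual code: the function name `collapse_hits`, the hit layout, and the `report_type` field are assumptions made for illustration. The caller is assumed to have already fetched roughly 10x the requested page size, sorted by the desired order.

```python
from typing import Any, Callable, Dict, Hashable, List

Hit = Dict[str, Any]


def collapse_hits(hits: List[Hit], key: Callable[[Hit], Hashable], limit: int) -> List[Hit]:
    """Keep only the first hit per key, preserving the incoming sort order.

    `hits` is assumed to be the over-fetched result list (for example 10x the
    page size), already sorted the way the caller wants groups ranked.
    """
    seen = set()
    collapsed: List[Hit] = []
    for hit in hits:
        group = key(hit)
        if group in seen:
            continue  # a better-ranked hit from this group was already kept
        seen.add(group)
        collapsed.append(hit)
        if len(collapsed) == limit:
            break
    return collapsed


# Hypothetical over-fetched hits, sorted by descending score.
hits = [
    {"id": 1, "report_type": "earnings", "score": 9.1},
    {"id": 2, "report_type": "earnings", "score": 8.7},
    {"id": 3, "report_type": "forecast", "score": 8.2},
    {"id": 4, "report_type": "earnings", "score": 7.9},
    {"id": 5, "report_type": "summary", "score": 7.5},
]

page = collapse_hits(hits, key=lambda h: h["report_type"], limit=2)
short_page = collapse_hits(hits, key=lambda h: h["report_type"], limit=10)
```

Here `page` holds hits 1 and 3, while `short_page` holds only three hits even though ten were requested. That is the failure mode behind "works 99% of the time": if the over-fetched window does not contain enough distinct groups, the page comes up short.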

dmartinpro commented Apr 4, 2011

+1 vote for this issue too.
This is a really useful feature. Think about an e-commerce shop that indexes every SKU. When looking for a product, a customer should see products (not SKUs) in the results list.

till commented May 10, 2011

subscribe

tfreitas commented May 10, 2011

+1

vincenttheeten commented May 13, 2011

plz don't make us switch to SOLR just for this feature
+1

Member

kimchy commented May 13, 2011

Note that Solr does not implement it for distributed search (as far as I know), and the implementation is problematic (in my view).

till commented May 13, 2011

Are you referring to the "field collapse patch" floating around in their Jira? I haven't checked whether it made it into a recent release, so I don't know how up to date my info is; I just noticed that queries using the "field collapse patch" are an order of magnitude slower than queries without it.

Contributor

mikemccand commented May 18, 2011

Note that there is now (finally!) a new grouping module in Lucene -- see https://issues.apache.org/jira/browse/LUCENE-1421

It's been back-ported to 3.x, under lucene/contrib/grouping.

So in theory exposing this in ElasticSearch should be straightforward? (And, if it's not, I'd really like to know about that so we can fix it!).

There is some performance hit, but it's not as bad as I had expected. See the 3 TermGroupXXX charts here: http://people.apache.org/~mikemccand/lucenebench -- it's ~2.3x-2.5x slower than the straight TermQuery when grouping by a field with 100, 10K, or 1M unique values (though the sort and groupSort are both by relevance; sorting by other fields may be slower). This should also be the worst-case slowdown, since TermQuery is such an "easy" query; queries that are "hard" and don't produce many results should see less net impact from the grouping overhead, I expect.

Member

kimchy commented May 19, 2011

Cool! I saw that a few days ago and will definitely have a look.

tfreitas commented Jun 3, 2011

Hi, with the release of Lucene 3.2, one of its features is:
"A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field"
http://wiki.apache.org/lucene-java/ReleaseNote32

darxriggs commented Jun 11, 2011

+1

aaronbinns commented Jun 13, 2011

+1

0xPIT commented Jun 13, 2011

++1

mkreidenweis commented Jun 14, 2011

+1

bbock commented Jun 14, 2011

+1

selaux commented Jun 14, 2011

+1

jmayr commented Jun 14, 2011

+1

Contributor

mikemccand commented Jun 14, 2011

I'm also working on making it easier to distribute grouping by adding static merge methods to TopDocs/TopGroups. I.e., each shard runs the first-pass collector and sends its top groups back to the front end; the front end merges the top groups (SearchGroup.merge) and asks all shards to run the second-pass collector; it then gets the results back and merges them with TopGroups.merge. This is all under https://issues.apache.org/jira/browse/LUCENE-3191
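
The two-pass flow described in this comment can be sketched in plain Python. This is only an illustration of the idea, not Lucene's API: the functions below are hypothetical stand-ins for the first-pass collector, SearchGroup.merge, the second-pass collector, and TopGroups.merge, and the sketch assumes groups are ranked by the score of their best document.

```python
import heapq
from collections import defaultdict


def first_pass(shard_docs, top_n):
    """Shard side: find each group's best score, return the shard's top_n groups."""
    best = {}
    for doc in shard_docs:
        group = doc["group"]
        if group not in best or doc["score"] > best[group]:
            best[group] = doc["score"]
    return heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1])


def merge_groups(per_shard_groups, top_n):
    """Front end: merge per-shard top groups into the global top_n group keys."""
    best = {}
    for groups in per_shard_groups:
        for group, score in groups:
            if group not in best or score > best[group]:
                best[group] = score
    ranked = heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1])
    return [group for group, _ in ranked]


def second_pass(shard_docs, groups, docs_per_group):
    """Shard side: collect the best docs for each of the chosen groups."""
    wanted = set(groups)
    by_group = defaultdict(list)
    for doc in shard_docs:
        if doc["group"] in wanted:
            by_group[doc["group"]].append(doc)
    return {
        group: heapq.nlargest(docs_per_group, docs, key=lambda d: d["score"])
        for group, docs in by_group.items()
    }


def merge_top_groups(per_shard_results, groups, docs_per_group):
    """Front end: merge per-shard docs for each group, keeping group order."""
    merged = []
    for group in groups:
        docs = [d for result in per_shard_results for d in result.get(group, [])]
        merged.append((group, heapq.nlargest(docs_per_group, docs, key=lambda d: d["score"])))
    return merged


# Two hypothetical shards with made-up documents.
shard_a = [
    {"id": "a1", "group": "x", "score": 5.0},
    {"id": "a2", "group": "y", "score": 3.0},
    {"id": "a3", "group": "x", "score": 1.0},
]
shard_b = [
    {"id": "b1", "group": "y", "score": 4.0},
    {"id": "b2", "group": "z", "score": 6.0},
    {"id": "b3", "group": "x", "score": 2.0},
]
shards = [shard_a, shard_b]

groups = merge_groups([first_pass(s, top_n=2) for s in shards], top_n=2)
result = merge_top_groups(
    [second_pass(s, groups, docs_per_group=2) for s in shards],
    groups,
    docs_per_group=2,
)
```

The second round trip matters because a shard may hold good documents for a group that did not make its own local top N. In this data, shard_b's `b3` belongs to group `x`, which only ranks globally thanks to shard_a.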

Member

spinscale commented Jun 15, 2011

+1

stevencasey commented Jun 17, 2011

+1

Any news on whether https://issues.apache.org/jira/browse/LUCENE-1421, as mentioned by mikemccand, will work in Elasticsearch?

bompi88 commented Mar 26, 2014

+1

g00fy- commented Mar 27, 2014

+1

brupm commented Apr 1, 2014

👍

Firfi commented Apr 3, 2014

+1

imarsman commented Apr 4, 2014

This would be incredibly useful for the application I am writing for my company. I am, however, so amazed at how capable Elasticsearch already is that I feel it would be rude not to say thank you before adding my YES to this feature request.

petard commented Apr 4, 2014

+1

grishick commented Apr 12, 2014

+1 this is a tie breaker for us right now when evaluating ES vs Solr

Limfocit commented Apr 29, 2014

+1

zeelax commented May 12, 2014

+1

Member

clintongormley commented May 12, 2014

See #6124, which looks like it will handle all field-collapsing requirements, in a distributed manner.

thejohnfreeman commented May 13, 2014

That's neat, but is it possible to perform aggregations against all collapsed documents? For example, collapse a set of books on the author field, then aggregate terms in the publisher field to find the most common publishers by number of distinct authors?

Contributor

mattweber commented May 13, 2014

@thejohnfreeman I imagine #6124 is just the first step, but considering this is a bucket aggregator, what you describe should be possible. Keep an eye on the PR.

Member

martijnvg commented May 23, 2014

Let me +1 this issue for the last time :)

The top_hits aggregation will handle the field collapse requirements and #6124 is the first step.

@thejohnfreeman Right now top_hits can only be used as a leaf aggregation. Can your example also be implemented via two nested terms aggregations (first on the author field and then on publisher) and a top_hits aggregation as the leaf?
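
As a rough sketch of what such requests could look like, the snippet below builds two search bodies as Python dicts: a terms aggregation with a top_hits leaf (the field-collapsing case), and the nested author/publisher variant suggested above. The aggregation names (`collapsed`, `top`, `by_author`, `by_publisher`) and the helper functions are made up for illustration, and the exact options supported depend on the Elasticsearch version, so treat this as an assumption to check against the aggregation documentation.

```python
import json


def collapse_request(field, hits_per_group=1, max_groups=10):
    """Build a search body that buckets on `field` and keeps the top hits per bucket."""
    return {
        "size": 0,  # skip the regular hit list; results come from the aggregation
        "aggs": {
            "collapsed": {
                "terms": {"field": field, "size": max_groups},
                "aggs": {"top": {"top_hits": {"size": hits_per_group}}},
            }
        },
    }


def nested_collapse_request(outer_field, inner_field, hits_per_group=1):
    """Two nested terms aggregations with top_hits as the leaf."""
    return {
        "size": 0,
        "aggs": {
            "by_" + outer_field: {
                "terms": {"field": outer_field},
                "aggs": {
                    "by_" + inner_field: {
                        "terms": {"field": inner_field},
                        "aggs": {"top": {"top_hits": {"size": hits_per_group}}},
                    }
                },
            }
        },
    }


body = collapse_request("author")
nested_body = nested_collapse_request("author", "publisher")
print(json.dumps(body, indent=2))
```

Note that the nested form buckets publishers within each author; it does not by itself count distinct authors per publisher, which is what the original question asked for.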

@martijnvg martijnvg closed this May 23, 2014

artemredkin commented May 23, 2014

What about paging? As far as I can tell, there is no way to page agg results.

Member

martijnvg commented May 23, 2014

@artemredkin Pagination isn't supported yet, but it shouldn't be too difficult to add.

Contributor

brusic commented May 23, 2014

+1

:)

artemredkin commented May 23, 2014

Cool!
You are awesome :)

artemredkin commented May 25, 2014

Should I add an issue for pagination?

Member

javanna commented May 26, 2014

Hi @artemredkin we already have issue #6299 for it ;)

artemredkin commented May 26, 2014

Got it, thanks!

vvaradhan commented Jun 26, 2014

Is there a master snapshot version available through Maven? I could start my development on it until 1.3.0 is officially released.

Also, what is the likely release date of 1.3.0?

SaSa1983 commented Jun 26, 2014

You can build the 1.3.0 branch

It contains the aggregations feature

Member

dadoonet commented Jun 26, 2014

mikemccabe commented Aug 13, 2014

Released in http://www.elasticsearch.org/downloads/1-3-0/ - #6124 is referenced in release notes.

JnBrymn-EB commented Jun 14, 2015

No traffic on this in almost a year. Should it be presumed that this issue is closed by #6124 ?

Contributor

brusic commented Jun 14, 2015

Correct.
