Field Collapsing/Combining #256

Closed
ppearcy opened this issue Jul 13, 2010 · 244 comments

Comments

@ppearcy
Contributor

@ppearcy ppearcy commented Jul 13, 2010

Ability to collapse on a field. For example, I want the most relevant result from all different report types. Or similarly, the most recent result of each report type. Or maybe, I want to de-dup on headline.

So, the sort order would dictate which one from the group is returned. Similar to what is discussed here:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

From my understanding, it seems that in order for field collapsing to be efficient, the result set must be relatively small.

This is also referred to as "Combine" on some other search products.

@Omega359

@Omega359 Omega359 commented Aug 13, 2010

Count this comment as a vote to have this feature added.

@kwloafman

@kwloafman kwloafman commented Sep 5, 2010

I could make good use of this feature. Go for it!

@Fiedzia

@Fiedzia Fiedzia commented Sep 30, 2010

+1 vote for that

@ekalyoncu

@ekalyoncu ekalyoncu commented Oct 29, 2010

Yes, it's a really cool feature.

@ekalyoncu

@ekalyoncu ekalyoncu commented Oct 29, 2010

In SOLR, grouping is not supported for distributed search. If it's implemented, it could be a big plus for ElasticSearch.

@giorgiovinci

@giorgiovinci giorgiovinci commented Oct 29, 2010

So the only workaround is to "group" the results on the client side, is that correct?
+1 for this. Having the logic on the server is what we need!

@jeroenr

@jeroenr jeroenr commented Nov 2, 2010

+1 This sounds really useful

@apatrida
Contributor

@apatrida apatrida commented Nov 9, 2010

This is probably a broader topic of collapsing (dropping dupes based on sort order although many times one field isn't enough to decide a good dedupe), or full rollups where you retain the individual documents within an aggregate replacement document ("5 books by this author").

There are fun issues with each, such as do you try to satisfy the requested window results? How does paging work when things are missing? Does the total document count get adjusted (but is still wrong as you don't know what other pages hold)? ...

@Fiedzia

@Fiedzia Fiedzia commented Nov 9, 2010

For me this should work like "select distinct" in SQL, so I expect duplicates to be removed everywhere, including the total document count, pagination and the result window.

@apatrida
Contributor

@apatrida apatrida commented Nov 9, 2010

At that point, it's a full group-by, and in SQL you are getting aggregate values back from functions, and sometimes undefined values if you ask for non-aggregate fields. In the search engine, how are the other fields besides the rollup key treated? Is it a grouping into a master aggregate document listing all the children, or at least noting that there are children, as Endeca does? Or is it a deduping where the first one at highest relevancy wins, even if many of the other fields outside the key differ (you need compound keys then, as deduping on a single field isn't enough to make that desirable)?
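To make the two behaviours discussed above concrete, here is a minimal sketch that contrasts a plain dedupe (first hit per key wins) with a rollup that retains the children. The sample hits, the field names and both helper functions are hypothetical and only illustrate the distinction; they are not part of any ElasticSearch or Endeca API.

```python
from collections import OrderedDict

# Hypothetical hits, already sorted by descending relevance.
hits = [
    {"id": 1, "author": "author-a", "score": 9.1},
    {"id": 2, "author": "author-b", "score": 8.7},
    {"id": 3, "author": "author-a", "score": 7.4},
    {"id": 4, "author": "author-b", "score": 6.9},
    {"id": 5, "author": "author-a", "score": 5.2},
]


def dedupe(sorted_hits, key):
    """Collapse: keep only the first (best-sorted) hit for each key value."""
    seen = set()
    kept = []
    for hit in sorted_hits:
        if hit[key] not in seen:
            seen.add(hit[key])
            kept.append(hit)
    return kept


def rollup(sorted_hits, key):
    """Rollup: one aggregate entry per key value that retains every child hit."""
    groups = OrderedDict()
    for hit in sorted_hits:
        groups.setdefault(hit[key], []).append(hit)
    return [
        {"key": value, "count": len(children), "top": children[0], "children": children}
        for value, children in groups.items()
    ]


collapsed = dedupe(hits, "author")
rolled = rollup(hits, "author")
print([h["id"] for h in collapsed])
print([(g["key"], g["count"]) for g in rolled])
```

In the dedupe case the other documents are simply gone, which is why the total count and paging become ambiguous. The rollup keeps enough information to say "3 results by this author".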

@ppearcy
Contributor Author

@ppearcy ppearcy commented Dec 14, 2010

Hey,
Just wanted to say that we are using our own poor man's version of this to satisfy some requirements by just requesting 10x the amount requested and collapsing down client side. Complete hack, but works 99% of the time.

We're now applying this and adding facets to it with a two-phase approach. We first get the list of doc ids, then pass them in as a term list and facet on that query.

Was curious if there was any more efficient method of doing this?

Thanks,
Paul
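The over-fetch-and-collapse workaround described above might look roughly like the following sketch. The `search` callable, the `report_type` field and the `fake_search` stand-in are assumptions for illustration, not a real client API. The 10x factor comes from the comment.

```python
def collapse_page(search, field, page_size, overfetch=10):
    """Fetch overfetch * page_size hits, then keep the first hit per field value."""
    raw = search(size=page_size * overfetch)  # hits are assumed to arrive in sort order
    seen = set()
    page = []
    for hit in raw:
        value = hit[field]
        if value in seen:
            continue  # a better-sorted hit with this value was already kept
        seen.add(value)
        page.append(hit)
        if len(page) == page_size:
            break
    return page


def fake_search(size):
    """Stand-in for a real search call: 40 hits cycling through 4 report types."""
    corpus = [{"id": i, "report_type": "type-%d" % (i % 4)} for i in range(40)]
    return corpus[:size]


page = collapse_page(fake_search, "report_type", page_size=3)
short_page = collapse_page(fake_search, "report_type", page_size=5)
print([hit["id"] for hit in page])
print(len(short_page))
```

The second call shows the limitation: when the fetched window holds fewer distinct values than the page size, the page comes back short. That is presumably the remaining 1% of cases.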

@dmartinpro

@dmartinpro dmartinpro commented Apr 4, 2011

+1 vote for this issue too.
This is a really useful feature. Think about an e-commerce shop indexing all of its SKUs. When searching for a product, a customer should see products in the results list, not individual SKUs.

@till

@till till commented May 10, 2011

subscribe

@tfreitas

@tfreitas tfreitas commented May 10, 2011

+1

@vincenttheeten

@vincenttheeten vincenttheeten commented May 13, 2011

plz don't make us switch to SOLR just for this feature
+1

@kimchy
Member

@kimchy kimchy commented May 13, 2011

Note that Solr does not implement it for distributed search (as far as I know), and the implementation is problematic (in my view).

@till

@till till commented May 13, 2011

Are you referring to the "field collapse patch" floating around in their Jira? I haven't checked whether it made it into a recent release, so I don't know how up to date my info is. I just noticed that queries using the "field collapse patch" are an order of magnitude slower than queries without it.

@mikemccand
Contributor

@mikemccand mikemccand commented May 18, 2011

Note that there is now (finally!) a new grouping module in Lucene -- see https://issues.apache.org/jira/browse/LUCENE-1421

It's been back-ported to 3.x, under lucene/contrib/grouping.

So in theory exposing this in ElasticSearch should be straightforward? (And, if it's not, I'd really like to know about that so we can fix it!).

There is some performance hit, but not as bad as I had expected. See the 3 TermGroupXXX charts here: http://people.apache.org/~mikemccand/lucenebench -- it's ~2.3x-2.5x slower than the straight TermQuery when grouping by a field with 100, 10K, or 1M unique values (though the sort and groupSort are by relevance; sorting by other fields may be slower). This should also be the worst-case slowdown, since TermQuery is such an "easy" query; I expect queries which are "hard" and don't produce many results to see less net impact from the grouping overhead.

@kimchy
Member

@kimchy kimchy commented May 19, 2011

Cool! Saw that a few days ago, will definitely have a look.

@tfreitas

@tfreitas tfreitas commented Jun 3, 2011

Hi, with the release of Lucene 3.2, one of its features is:
"A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field."
http://wiki.apache.org/lucene-java/ReleaseNote32

@darxriggs

@darxriggs darxriggs commented Jun 11, 2011

+1

@aaronbinns

@aaronbinns aaronbinns commented Jun 13, 2011

+1

@0xPIT

@0xPIT 0xPIT commented Jun 13, 2011

++1

@mkreidenweis

@mkreidenweis mkreidenweis commented Jun 14, 2011

+1

@bbock

@bbock bbock commented Jun 14, 2011

+1

@selaux

@selaux selaux commented Jun 14, 2011

+1

@jmayr

@jmayr jmayr commented Jun 14, 2011

+1

@mikemccand
Contributor

@mikemccand mikemccand commented Jun 14, 2011

I'm also working on making it easy(ier) to distribute grouping, by adding static merge methods to TopDocs/TopGroups. I.e., each shard runs the 1st pass collector and sends its top groups back to the front end; the front end merges the top groups (SearchGroup.merge), issues a request to all shards to run the 2nd pass collector, gets the results back, and merges them with TopGroups.merge. This is all under https://issues.apache.org/jira/browse/LUCENE-3191
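The two-pass flow described above can be simulated in a few lines. This is only a sketch of the idea under simplifying assumptions (in-memory shards, groups ranked by their best score). The function names are hypothetical and do not mirror the signatures of Lucene's SearchGroup.merge or TopGroups.merge.

```python
import heapq
from collections import defaultdict

# Each hypothetical shard holds (group_value, score, doc_id) tuples.
shards = [
    [("pdf", 0.9, "a1"), ("html", 0.4, "a2"), ("pdf", 0.3, "a3")],
    [("html", 0.8, "b1"), ("csv", 0.7, "b2"), ("pdf", 0.6, "b3")],
]


def first_pass(shard, top_n):
    """Per shard: best score for each group, keeping only the top_n groups."""
    best = {}
    for group, score, _ in shard:
        best[group] = max(score, best.get(group, float("-inf")))
    return heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1])


def merge_groups(per_shard, top_n):
    """Front end: merge the per-shard group lists into the global top_n groups."""
    best = {}
    for groups in per_shard:
        for group, score in groups:
            best[group] = max(score, best.get(group, float("-inf")))
    ranked = heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1])
    return [group for group, _ in ranked]


def second_pass(shard, groups, docs_per_group):
    """Per shard: top docs inside each of the globally selected groups."""
    wanted = set(groups)
    by_group = defaultdict(list)
    for group, score, doc_id in shard:
        if group in wanted:
            by_group[group].append((score, doc_id))
    return {g: heapq.nlargest(docs_per_group, docs) for g, docs in by_group.items()}


def merge_docs(per_shard, groups, docs_per_group):
    """Front end: merge per-shard docs for each group, preserving group order."""
    merged = {g: [] for g in groups}
    for result in per_shard:
        for group, docs in result.items():
            merged[group].extend(docs)
    return [(g, heapq.nlargest(docs_per_group, merged[g])) for g in groups]


top_groups = merge_groups([first_pass(s, 2) for s in shards], 2)
results = merge_docs([second_pass(s, top_groups, 2) for s in shards], top_groups, 2)
print(top_groups)
print(results)
```

The second round trip is needed because a shard cannot know which groups are globally best until the front end has merged the first-pass results.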

@spinscale
Member

@spinscale spinscale commented Jun 15, 2011

+1

@stevencasey

@stevencasey stevencasey commented Jun 17, 2011

+1

Any news on whether https://issues.apache.org/jira/browse/LUCENE-1421, as mentioned by mikemccand, will work in elasticsearch?

@g00fy-

@g00fy- g00fy- commented Mar 27, 2014

+1

@brupm

@brupm brupm commented Apr 1, 2014

👍

@Firfi

@Firfi Firfi commented Apr 3, 2014

+1

@imarsman

@imarsman imarsman commented Apr 4, 2014

This would be incredibly useful for the application I am writing for my company. I am, however, so amazed at how capable Elasticsearch already is that I feel it would be rude not to say thank you before adding my YES to this feature request.

@petard

@petard petard commented Apr 4, 2014

+1

@grishick

@grishick grishick commented Apr 12, 2014

+1 this is a tie breaker for us right now when evaluating ES vs Solr

@Limfocit

@Limfocit Limfocit commented Apr 29, 2014

+1

@zeelax

@zeelax zeelax commented May 12, 2014

+1

@clintongormley
Contributor

@clintongormley clintongormley commented May 12, 2014

See #6124, which looks like it will handle all field-collapsing requirements, in a distributed manner.

@thejohnfreeman

@thejohnfreeman thejohnfreeman commented May 13, 2014

While neat, is it possible to perform aggregations against all collapsed documents? For example, collapse a set of books on the author field, then aggregate terms in the publisher field, to find the most common publishers by number of distinct authors?

@mattweber
Contributor

@mattweber mattweber commented May 13, 2014

@thejohnfreeman I imagine #6124 is just the first step, but considering this is a bucket aggregator, what you describe should be possible. Keep an eye on the PR.

@martijnvg
Member

@martijnvg martijnvg commented May 23, 2014

Let me +1 this issue for the last time :)

The top_hits aggregation will handle the field collapse requirements and #6124 is the first step.

@thejohnfreeman Right now top_hits can only be used as a leaf aggregation. Can your example also be implemented via two nested terms aggregations (first on the author field and then on publisher) with a top_hits aggregation as the leaf?
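For illustration, a request body of the shape suggested above (two nested terms aggregations with top_hits as the leaf) could be built like this. The aggregation names `by_outer`, `by_inner` and `top` and the helper function are hypothetical; `author` and `publisher` are the field names from the example in this thread. This is a sketch of the request shape only, not a tested query against a real index.

```python
import json


def collapse_request(outer_field, inner_field, hits_per_bucket=1):
    """Build a search body: terms on outer_field, then inner_field, top_hits as leaf."""
    return {
        "size": 0,  # only the aggregation buckets are of interest, not the regular hits
        "aggs": {
            "by_outer": {
                "terms": {"field": outer_field},
                "aggs": {
                    "by_inner": {
                        "terms": {"field": inner_field},
                        "aggs": {
                            # leaf aggregation: best-sorted documents in each bucket
                            "top": {"top_hits": {"size": hits_per_bucket}}
                        },
                    }
                },
            }
        },
    }


body = collapse_request("author", "publisher")
print(json.dumps(body, indent=2))
```

With `hits_per_bucket=1`, each (author, publisher) bucket would carry its single best-sorted document, which is the collapsing behaviour requested at the top of this issue.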

@martijnvg martijnvg closed this May 23, 2014
@artemredkin

@artemredkin artemredkin commented May 23, 2014

What about paging? As far as I can tell, there is no way to page agg results.

@martijnvg
Member

@martijnvg martijnvg commented May 23, 2014

@artemredkin Pagination isn't supported yet, but it shouldn't be too difficult to add.

@brusic
Contributor

@brusic brusic commented May 23, 2014

+1

:)

@artemredkin

@artemredkin artemredkin commented May 23, 2014

Cool!
You are awesome :)

@artemredkin

@artemredkin artemredkin commented May 25, 2014

Should I open an issue for pagination?

@javanna
Member

@javanna javanna commented May 26, 2014

Hi @artemredkin we already have issue #6299 for it ;)

@artemredkin

@artemredkin artemredkin commented May 26, 2014

Got it, thanks!

@vvaradhan

@vvaradhan vvaradhan commented Jun 26, 2014

Is there a master-snapshot version available through Maven? That way I could start development before 1.3.0 is officially released.

Also, what would be a likely release date of 1.3.0?

@SaSa1983

@SaSa1983 SaSa1983 commented Jun 26, 2014

You can build the 1.3.0 branch.

It contains the aggregations feature.

@dadoonet
Member

@dadoonet dadoonet commented Jun 26, 2014

@mikemccabe

@mikemccabe mikemccabe commented Aug 13, 2014

Released in http://www.elasticsearch.org/downloads/1-3-0/ - #6124 is referenced in release notes.

@JnBrymn-EB

@JnBrymn-EB JnBrymn-EB commented Jun 14, 2015

No traffic on this in almost a year. Should it be presumed that this issue is closed by #6124 ?

@brusic
Contributor

@brusic brusic commented Jun 14, 2015

Correct.

williamrandolph pushed a commit to williamrandolph/elasticsearch that referenced this issue Jun 4, 2020
Due to the way the settings are used, the pinned node for a certain
task is being ignored and the list of discovered nodes is used
instead. This commit addresses that and separates the internal
properties used by the library to avoid overloading them (and thus
leading to errors).

Fix elastic#256