Field Collapsing/Combining #256

Closed
ppearcy opened this issue Jul 13, 2010 · 244 comments

Comments

@ppearcy
Contributor

ppearcy commented Jul 13, 2010

Ability to collapse on a field. For example, I want the most relevant result from each distinct report type. Or similarly, the most recent result of each report type. Or maybe I want to de-dup on headline.

So, the sort order would dictate which one from the group is returned. Similar to what is discussed here:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

From my understanding, it seems that in order for field collapsing to be efficient, the result set must be relatively small.

This is also referred to as "Combine" on some other search products.

@Omega359

Count this comment as a vote to have this feature added.

@kwloafman

I could make good use of this feature. Go for it!

@Fiedzia

Fiedzia commented Sep 30, 2010

+1 vote for that

@ekalyoncu

Yes, it's a really cool feature.

@ekalyoncu

In SOLR, grouping is not supported for distributed search. If it's implemented, it could be a big plus for ElasticSearch.

@giorgiovinci

The only workaround is to "group" the results on the client side, is that correct?
+1 For this. To have the logic on the server is what we need!

@jeroenr

jeroenr commented Nov 2, 2010

+1 This sounds really useful

@apatrida
Contributor

apatrida commented Nov 9, 2010

This is probably a broader topic of collapsing (dropping dupes based on sort order although many times one field isn't enough to decide a good dedupe), or full rollups where you retain the individual documents within an aggregate replacement document ("5 books by this author").

There are fun issues with each, such as do you try to satisfy the requested window results? How does paging work when things are missing? Does the total document count get adjusted (but is still wrong as you don't know what other pages hold)? ...

@Fiedzia

Fiedzia commented Nov 9, 2010

For me this should work like "select distinct" in SQL, so I expect duplicates to be removed everywhere, including the total document count, pagination and the result window.

@apatrida
Contributor

apatrida commented Nov 9, 2010

At that point, it's a full group-by, and in SQL you are getting aggregate values back from functions, and sometimes undefined values if you ask for non-aggregate fields ... In the search engine, how are the other fields besides the rollup key being treated? Is it a grouping into a master aggregate document listing all the children, or at least the fact that there are children, such as what Endeca does? Or is it a deduping where the first one at highest relevancy wins, even if many of the other fields differ outside of the key (you need compound keys then, as deduping on a single field isn't enough to make that desirable)?

@ppearcy
Contributor Author

ppearcy commented Dec 14, 2010

Hey,
Just wanted to say that we are using our own poor man's version of this to satisfy some requirements by just requesting 10x the amount requested and collapsing down client side. Complete hack, but works 99% of the time.
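A minimal sketch of this over-fetch-and-collapse workaround, assuming the hits come back as a list of dicts already sorted in the desired order. The helper name `collapse_hits` and the sample fields are hypothetical and not part of any Elasticsearch client:

```python
from typing import Any, Dict, Iterable, List


def collapse_hits(hits: Iterable[Dict[str, Any]], field: str, size: int) -> List[Dict[str, Any]]:
    """Keep the first hit for each distinct value of `field`, up to `size` hits.

    Because `hits` is assumed to be sorted already, the first hit seen for a
    value is the one the sort order prefers. Values must be hashable.
    """
    if size <= 0:
        return []
    seen = set()
    collapsed = []
    for hit in hits:
        key = hit.get(field)
        if key in seen:
            continue  # a better-ranked hit with the same value was already kept
        seen.add(key)
        collapsed.append(hit)
        if len(collapsed) == size:
            break
    return collapsed


# Hypothetical over-fetched result window, sorted by relevance.
hits = [
    {"id": 1, "report_type": "earnings", "score": 9.1},
    {"id": 2, "report_type": "earnings", "score": 8.7},
    {"id": 3, "report_type": "outlook", "score": 8.2},
    {"id": 4, "report_type": "outlook", "score": 7.9},
    {"id": 5, "report_type": "rating", "score": 7.5},
]
page = collapse_hits(hits, "report_type", size=2)
```

If the over-fetched window holds fewer distinct values than `size`, the page comes back short, which is presumably the remaining 1% of cases.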

We're now applying this and adding facets to it with a two-phase approach. We first get the list of doc ids, then pass them in as a term list and facet on that query.

Was curious if there was any more efficient method of doing this?

Thanks,
Paul

@dmartinpro

+1 vote for this issue too.
This is a really useful feature. Think about an e-commerce shop that indexes every SKU. When searching for a product, a customer should see products in the results list, not individual SKUs.

@till

till commented May 10, 2011

subscribe

@tfreitas

+1

@vincenttheeten

plz don't make us switch to SOLR just for this feature
+1

@kimchy
Member

kimchy commented May 13, 2011

Note that Solr does not implement it for distributed search (as far as I know), and the implementation is problematic (in my view).

@till

till commented May 13, 2011

Are you referring to the "field collapse patch" floating around in their Jira? I haven't checked whether it made it into a recent release, so I don't know how up to date my info is. I just noticed that queries using the "field collapse patch" are orders of magnitude slower than queries without it.

@mikemccand
Contributor

Note that there is now (finally!) a new grouping module in Lucene -- see https://issues.apache.org/jira/browse/LUCENE-1421

It's been back-ported to 3.x, under lucene/contrib/grouping.

So in theory exposing this in ElasticSearch should be straightforward? (And, if it's not, I'd really like to know about that so we can fix it!).

There is some performance hit but not as bad as I had expected. See the 3 TermGroupXXX charts here: http://people.apache.org/~mikemccand/lucenebench -- it's ~2.3x-2.5x slower than the straight TermQuery, when grouping by a field with 100, 10K, 1M unique values (though, the sort and groupSort are relevance; maybe when sorting by other fields this is slower). This should also be the worst-case slowdown since TermQuery is such an "easy" query; queries which are "hard" and don't produce many results should see less net impact from the grouping overhead, I expect.

@kimchy
Member

kimchy commented May 19, 2011

Cool!, saw that a few days ago, will definitely have a look.

@tfreitas

tfreitas commented Jun 3, 2011

Hi, with the release of Lucene 3.2, one of its features is:
"A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field"
http://wiki.apache.org/lucene-java/ReleaseNote32

@darxriggs
Contributor

+1

@aaronbinns

+1

@0xPIT

0xPIT commented Jun 13, 2011

++1

@mkreidenweis

+1

@bbock

bbock commented Jun 14, 2011

+1

@selaux

selaux commented Jun 14, 2011

+1

@jmayr

jmayr commented Jun 14, 2011

+1

@mikemccand
Contributor

I'm also working on making it easy(ier) to distribute grouping, by adding static merge methods to TopDocs/TopGroups. I.e., each shard runs the 1st pass collector and sends its top groups back to the front end; the front end merges the top groups (SearchGroup.merge) and issues a request to all shards to run the 2nd pass collector, gets the results back, and merges them with TopGroups.merge. This is all under https://issues.apache.org/jira/browse/LUCENE-3191
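To make the two-pass flow concrete, here is a rough pure-Python simulation, assuming groups are ranked by their best document score. The function names are hypothetical stand-ins for the steps described above and are not Lucene's API; only the overall shape (per-shard first pass, merge, per-shard second pass, merge) comes from the comment:

```python
import heapq
from collections import defaultdict


def first_pass(docs, field, top_n):
    """Per shard: best score per group, truncated to this shard's top_n groups."""
    best = {}
    for doc in docs:
        group = doc[field]
        if group not in best or doc["score"] > best[group]:
            best[group] = doc["score"]
    return heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1])


def merge_search_groups(per_shard_groups, top_n):
    """Front end: merge each shard's top groups into the global top_n group keys."""
    best = {}
    for groups in per_shard_groups:
        for group, score in groups:
            if group not in best or score > best[group]:
                best[group] = score
    ranked = heapq.nlargest(top_n, best.items(), key=lambda kv: kv[1])
    return [group for group, _ in ranked]


def second_pass(docs, field, groups, docs_per_group):
    """Per shard: top documents for each of the globally selected groups."""
    wanted = set(groups)
    by_group = defaultdict(list)
    for doc in docs:
        if doc[field] in wanted:
            by_group[doc[field]].append(doc)
    return {
        group: heapq.nlargest(docs_per_group, members, key=lambda d: d["score"])
        for group, members in by_group.items()
    }


def merge_top_groups(per_shard_results, groups, docs_per_group):
    """Front end: merge per-shard group contents, keeping the best docs per group."""
    merged = {}
    for group in groups:
        candidates = [d for shard in per_shard_results for d in shard.get(group, [])]
        merged[group] = heapq.nlargest(docs_per_group, candidates, key=lambda d: d["score"])
    return merged


# Two hypothetical shards holding scored documents.
shards = [
    [{"id": 1, "author": "a", "score": 5.0}, {"id": 2, "author": "b", "score": 3.0},
     {"id": 3, "author": "a", "score": 1.0}],
    [{"id": 4, "author": "b", "score": 4.0}, {"id": 5, "author": "c", "score": 2.0},
     {"id": 6, "author": "a", "score": 4.5}],
]
top_groups = merge_search_groups([first_pass(s, "author", 2) for s in shards], 2)
result = merge_top_groups(
    [second_pass(s, "author", top_groups, 2) for s in shards], top_groups, 2
)
```

The second round trip is what keeps each shard from shipping every document of every group to the front end: only documents belonging to the globally selected groups are collected.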

@spinscale
Contributor

+1

@stevencasey

+1

any news on whether https://issues.apache.org/jira/browse/LUCENE-1421 as mentioned by mikemccand will work in elasticsearch?

@dearlordylord

+1

@imarsman

imarsman commented Apr 4, 2014

This would be incredibly useful for the application I am writing for my company. I am, however, so amazed at how capable Elasticsearch already is that I feel it would be rude not to say thank you before adding my YES to this feature request.

@petard

petard commented Apr 4, 2014

+1

@grishick

+1 this is a tie breaker for us right now when evaluating ES vs Solr

@Limfocit

+1

@zeelax

zeelax commented May 12, 2014

+1

@clintongormley

See #6124, which looks like it will handle all field-collapsing requirements, in a distributed manner.

@thejohnfreeman

While neat, is it possible to perform aggregations against all collapsed documents? For example, collapse a set of books on the author field, then aggregate terms in the publisher field, to find the most common publishers by number of distinct authors?

@mattweber
Contributor

@thejohnfreeman I imagine #6124 is just the first step, but considering this is a bucket aggregator, what you describe should be possible. Keep an eye on the PR.

@martijnvg
Member

Let me +1 this issue for the last time :)

The top_hits aggregation will handle the field collapse requirements and #6124 is the first step.

@thejohnfreeman Right now top_hits can only be used as a leaf aggregation. Can your example also be implemented via two nested terms aggregations (first on the author field and then on publisher) with a top_hits aggregation as the leaf?
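As a rough sketch, the request body for such a nesting might be built like this. The helper `build_request` and the aggregation names `by_author`, `by_publisher` and `top_docs` are made up for illustration; `author` and `publisher` are the fields from the example above, and `terms` and `top_hits` are the aggregation types being discussed:

```python
import json


def build_request(outer_field, inner_field, hits_per_bucket=1):
    """Build a search body: terms on outer_field, terms on inner_field, top_hits leaf."""
    return {
        "size": 0,  # skip the regular hit list; only the aggregation buckets matter here
        "aggs": {
            "by_author": {
                "terms": {"field": outer_field},
                "aggs": {
                    "by_publisher": {
                        "terms": {"field": inner_field},
                        "aggs": {
                            # leaf aggregation: best-matching documents per bucket
                            "top_docs": {"top_hits": {"size": hits_per_bucket}}
                        },
                    }
                },
            }
        },
    }


body = build_request("author", "publisher")
print(json.dumps(body, indent=2))
```

Note that this buckets by author first and then by publisher inside each author, so counting distinct authors per publisher would still need a pass over the returned buckets on the client side.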

@artemredkin

What about paging? As far as I can tell, there is no way to page agg results.

@martijnvg
Member

@artemredkin Pagination isn't supported yet, but it shouldn't be too difficult to add that.

@brusic
Contributor

brusic commented May 23, 2014

+1

:)

@artemredkin

Cool!
You are awesome :)

@artemredkin

should I add an issue for pagination?

@javanna
Member

javanna commented May 26, 2014

Hi @artemredkin we already have issue #6299 for it ;)

@artemredkin

Got it, thanks!

@vvaradhan

Is there a master-snapshot version available through Maven? I could start my development against it until 1.3.0 is officially released.

Also, what would be a likely release date of 1.3.0?

@SaSa1983

You can build the 1.3.0 branch

It contains the aggregations feature

@dadoonet
Member

@vvaradhan 1.3.0-SNAPSHOT is available on Sonatype repo: https://oss.sonatype.org/#nexus-search;gav~org.elasticsearch~elasticsearch~1.3.0-SNAPSHOT~~

HTH

@mikemccabe

Released in http://www.elasticsearch.org/downloads/1-3-0/ - #6124 is referenced in release notes.

@JnBrymn-EB

No traffic on this in almost a year. Should it be presumed that this issue is closed by #6124 ?

@brusic
Contributor

brusic commented Jun 14, 2015

Correct.

williamrandolph pushed a commit to williamrandolph/elasticsearch that referenced this issue Jun 4, 2020
Due to the way the settings are used, the pinned node for a certain
task is being ignored and the list of discovered nodes is used
instead. This commit addresses that and separates the internal
properties used by the library to avoid overloading them (and thus
leading to errors).

Fix elastic#256
costin pushed a commit that referenced this issue Dec 6, 2022
🤖 ESQL: Merge upstream
emilykmarx pushed a commit to emilykmarx/elasticsearch that referenced this issue Dec 26, 2023