Enhance aggregation support #276

Open
costin opened this Issue Sep 21, 2014 · 29 comments

Member

costin commented Sep 21, 2014

Currently, because the connector reads data through the scan/scroll API, aggregation results are not returned properly. To cope with this, a mixture of count and scroll should be used.

devoncrouse commented Aug 27, 2015

+1; was hoping to pull an aggregation result back to Spark.

alexbudniy commented Sep 25, 2015

Hello Costin!

We started using ES and Spark at my company, and we are really struggling because we can't use aggregations to shrink the data that gets loaded into Spark.
I am wondering whether this feature will be implemented in the near future. Also, do you need help with it? I'd be interested to know how much time you think it will take to implement.

Kind regards,
Alex

Member

costin commented Sep 25, 2015

This feature is the main one in 2.2. In terms of ETA, I'm hesitant to give an estimate since aggregation works differently than scan and scroll. The current plan is to release m2 in the next two weeks and focus entirely on aggregations after that.
A prototype should be available in the following weeks; when that happens, this ticket will be updated.

Help is always appreciated; however, do note that complex or critical pieces of functionality, like this one, are carefully examined due to their impact. In other words, any PR raised might or might not be accepted for technical reasons.

andrassy commented Oct 2, 2015

+1 (and good luck @costin)

larghir commented Oct 7, 2015

+1

prashanttct07 commented Oct 8, 2015

Hi, I am also stuck on this. Could you let me know the expected release date for this feature?

@costin costin added v2.2.0-rc1 and removed v2.2.0-beta1 labels Oct 29, 2015

DanteLore commented Dec 21, 2015

+1
Any clues when we can expect this feature?

rbraley commented Jan 7, 2016

+1 Would love to know this as well

@costin costin added v2.2.0 and removed v2.2.0-rc1 labels Jan 8, 2016

@costin costin added v2.3.0 and removed v2.2.0 labels Jan 31, 2016

@jbaiera jbaiera added v5.0.1 and removed v2.4.1 labels Oct 26, 2016

@jbaiera jbaiera added v5.0.2 and removed v5.0.1 labels Nov 17, 2016

@jbaiera jbaiera added v5.2.0 and removed v5.0.2 labels Dec 15, 2016

@jbaiera jbaiera added v5.3.0 and removed v5.2.0 labels Jan 31, 2017

anroyus commented Feb 10, 2017

Folks, is there a released version that supports an Elasticsearch "query" with aggregation and can be called using the "newAPIHadoopRDD" API?
If not, are there any workarounds?

@jbaiera jbaiera added v5.4.0 and removed v5.3.0 labels Mar 28, 2017

dickeyre commented Mar 28, 2017

This 2.5-year-old bug keeps getting moved back. The lack of this functionality is a huge gap for ES/Spark. Please prioritize this!

Contributor

jbaiera commented Mar 28, 2017

@dickeyre It seems this was stuck in one of the release version tags and kept getting pushed back. For the time being, the major limitation is how different the information retrieval mechanisms are for scroll versus aggregations. The other problem we run into is the lack of suitable API hooks, combined with conflicting requirements from API standards (e.g. the rdd.count() requirement to materialize the data). I'll remove the version tag to avoid any confusion about the issue's prioritization.
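To illustrate the difference between the two retrieval mechanisms mentioned above, here is a rough sketch of how the requests and responses differ in shape. It only builds plain dictionaries and parses hand-written sample responses; it does not reflect how the connector is implemented. The index name `my-index`, the field `user`, the aggregation name `by_field`, and all helper function names are assumptions made up for illustration.

```python
def scroll_request(query, page_size):
    """First request of a scroll read: documents come back page by page."""
    return {
        "path": "/my-index/_search?scroll=5m",  # hypothetical index name
        "body": {"size": page_size, "query": query},
    }


def aggregation_request(query, field):
    """Terms aggregation: a single response with buckets, no documents (size 0)."""
    return {
        "path": "/my-index/_search",  # hypothetical index name
        "body": {
            "size": 0,
            "query": query,
            "aggs": {"by_field": {"terms": {"field": field}}},
        },
    }


def rows_from_scroll_page(response):
    """A scroll page carries documents under hits.hits, plus a scroll id for the next page."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


def rows_from_aggregation(response):
    """An aggregation response carries (key, doc_count) buckets instead of documents."""
    buckets = response["aggregations"]["by_field"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


# Hand-written sample responses, shaped like Elasticsearch search responses.
sample_scroll_page = {
    "_scroll_id": "abc",
    "hits": {"hits": [{"_source": {"user": "a"}}, {"_source": {"user": "b"}}]},
}
sample_agg_response = {
    "hits": {"hits": []},
    "aggregations": {
        "by_field": {"buckets": [{"key": "a", "doc_count": 2}, {"key": "b", "doc_count": 1}]}
    },
}

match_all = {"match_all": {}}
scroll_req = scroll_request(match_all, 50)
agg_req = aggregation_request(match_all, "user")
scroll_rows = rows_from_scroll_page(sample_scroll_page)
agg_rows = rows_from_aggregation(sample_agg_response)
```

A scroll read can be split into pages and consumed incrementally, whereas the aggregation result arrives as one already-reduced response, which is part of why the two do not map onto the same reading path.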

@jbaiera jbaiera added stalled and removed bug v5.4.0 labels Mar 28, 2017

vijaykramesh commented May 10, 2017

Any word? 2.5 years is a long time for core functionality like this to not exist...

Contributor

jbaiera commented May 11, 2017

@vijaykramesh As mentioned above when this ticket was set to stalled, there are currently significant roadblocks to this. One is the lack of API hooks, and another is the complexity of mixing scan and scroll with aggregation-based requests.

Contributor

dmarkhas commented Jul 2, 2017

+1

afpardillo commented Nov 15, 2017

+1

rroopreddy commented Nov 21, 2017

+1

avichaym commented Dec 24, 2017

+1

sujun891020 commented Jan 11, 2018

+1

fernandocamargoti commented Mar 29, 2018

I guess it'll never be done :(

Contributor

dmarkhas commented Mar 29, 2018

I suggest looking into an intermediate layer between Spark and Elasticsearch, such as Dremio; it can push the aggregation down to ES and return the results to Spark over JDBC.

stupidsky commented Aug 27, 2018

It doesn't seem to be finished yet, but we really need it. XD

stupidsky commented Aug 29, 2018

I found another bottleneck that affected transmission speed. Setting "es.scroll.size" to 10000 (its default value is 50) can increase the transmission speed from ES to Spark.

Contributor

maziyarpanahi commented Aug 29, 2018

@stupidsky You have to be careful about that number (scroll size). If the documents are large, it becomes an I/O issue, where reading them from disk and sending them over the network becomes a bottleneck itself.
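For reference, a minimal sketch of how the scroll size discussed above might be passed to the connector from PySpark. The `es.nodes` and `es.scroll.size` keys are connector settings; the helper names, the node address, the index name, and the value of 1000 (a middle ground between the default of 50 mentioned above and 10000) are assumptions for illustration, not recommendations. The right value depends on document size, as noted above.

```python
def es_read_options(nodes, scroll_size=1000):
    """Build connector read options; the values must be strings for Spark."""
    if scroll_size < 1:
        raise ValueError("scroll_size must be a positive integer")
    return {
        "es.nodes": nodes,
        # Number of documents fetched per scroll request. Larger values mean
        # fewer round trips but bigger responses for large documents.
        "es.scroll.size": str(scroll_size),
    }


def read_index(spark, index, options):
    """Load an index as a DataFrame. `spark` is an existing SparkSession
    with the elasticsearch-hadoop connector on its classpath (assumed)."""
    reader = spark.read.format("org.elasticsearch.spark.sql")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load(index)


# Hypothetical node address; adjust to your cluster.
options = es_read_options("localhost:9200", scroll_size=1000)
```

With a live session this would be used as `df = read_index(spark, "my-index", options)`, where `my-index` is a placeholder index name.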
