Paging support for aggregations #4915

Closed
aaneja opened this Issue Jan 27, 2014 · 207 comments

@aaneja
aaneja commented Jan 27, 2014

Terms aggregation does not support a way to page through the buckets returned.
To work around this, I've been trying to set 'min_doc_count' to limit the buckets returned and using an 'exclude' filter to exclude already 'seen' pages.

Will this give better runtime performance on the ES cluster compared to getting all the buckets and doing my own paging logic client-side?

@jpountz
Contributor
jpountz commented Feb 3, 2014

Paging is tricky to implement because document counts for terms aggregations are not exact when shard_size is less than the field cardinality and sorting on count desc. So weird things may happen like the first term of the 2nd page having a higher count than the last element of the first page, etc.

Regarding your question, terms aggregations run in two phases at the shard level: first they compute counts for every possible term, and then they pick the top shard_size ones. Increasing size (or shard_size) only makes the 2nd step more costly. Given that the runtime of the first step is linear in the number of matched documents and the runtime of the 2nd step is O(#unique_values * log(shard_size)), if you only have a limited number of unique values compared to the number of matched documents, doing the paging client-side would be more efficient. On the other hand, on high-cardinality fields, your first approach based on an exclude would probably be better.

As a side-note, min_doc_count has no effect on runtime performance when it is greater than or equal to 1. Only min_doc_count=0 is more costly given that it requires Elasticsearch to also fetch terms that are not contained in any match.
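A minimal sketch of the client-side option described above, using hypothetical bucket data (no live cluster): over-request the terms aggregation once with a large size, cache the buckets, and slice pages locally.

```python
# Client-side paging over an already-fetched bucket list. This is the
# low-cardinality strategy: one big aggregation request, local slicing.
def page_of_buckets(buckets, page, per_page):
    """Return one 1-indexed page from a cached bucket list."""
    start = (page - 1) * per_page
    return buckets[start:start + per_page]

# Fake buckets standing in for a terms-aggregation response.
buckets = [{"key": f"term{i}", "doc_count": 100 - i} for i in range(35)]
second_page = page_of_buckets(buckets, 2, 10)  # buckets 10-19
```

The trade-off is exactly the one described above: the whole bucket list crosses the wire once, which is cheap when the field has few unique values and prohibitive when it has many.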

@haschrisg

@jpountz would storing the results of an aggregation in a new index be feasible? In general, it'd be great to have a way of dealing both with aggregations on high-cardinality fields and with nested aggregations that produce a large number (millions) of results -- even if the cost is that they're not sorted properly when paging.

@jpountz
Contributor
jpountz commented Mar 12, 2014

If it makes sense for your use-case, this is something you could consider implementing client-side: run these costly aggregations hourly/daily, store the result in an index, and use that index between two runs to explore the results of the aggregation.

@apatrida
Contributor
apatrida commented Sep 2, 2014

When sorting by term instead of count, why would paging not be possible? For example, a terms aggregation with a top_hits sub-aggregation can produce an overly large result set when there is no paging on the terms aggregation. Not all aggregations want to sort by count.

@loren loren added a commit to GSA/asis that referenced this issue Sep 15, 2014
@loren loren [#78785104] Fix image search pagination
- This is a stopgap fix until we figure out what changes we want to make to the API w/r/t the "total" field, as we don't have an inexpensive way of determining the total number of buckets. elastic/elasticsearch#4915
b12eb19
@tugberkugurlu

I can see that this may not be possible but for a top_hits aggregation, I really need this functionality. I have the below aggregation query:

POST sport/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "defense_strength": {
                  "lte": 83.43
                }
              }
            },
            {
              "range": {
                "forward_strength": {
                  "gte": 91
                }
              }
            }
          ]
        }
      }
    }
  }, 
  "aggs": {
    "top_teams": {
        "terms": {
          "field": "primaryId"
        },
        "aggs": {
          "top_team_hits": {
            "top_hits": {
              "sort": [
                {
                    "forward_strength": {
                        "order": "desc"
                    }
                }
              ],
              "_source": {
                  "include": [
                      "name"
                  ]
              },
              "from": 0,
              "size" : 1
            }
          }
        }
      }
    }
  }
}

This produces the below result on an insanely cheap index (with a low number of docs):

    {
         "took": 2,
         "timed_out": false,
         "_shards": {
                "total": 5,
                "successful": 5,
                "failed": 0
         },
         "hits": {
                "total": 5,
                "max_score": 0,
                "hits": []
         },
         "aggregations": {
                "top_teams": {
                     "buckets": [
                            {
                                 "key": "541afdfc532aec0f305c2c48",
                                 "doc_count": 2,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 2,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "y6jZ31xoQMCXaK23rPQgjA",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Barcelona"
                                                         },
                                                         "sort": [
                                                                98.32
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            },
                            {
                                 "key": "541afe08532aec0f305c5f28",
                                 "doc_count": 2,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 2,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "hewWI0ZpTki4OgOeneLn1Q",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Arsenal"
                                                         },
                                                         "sort": [
                                                                94.3
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            },
                            {
                                 "key": "541afe09532aec0f305c5f2b",
                                 "doc_count": 1,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 1,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "x-_YBX5jSba8qsEuB8guTQ",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Real Madrid"
                                                         },
                                                         "sort": [
                                                                91.34
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            }
                     ]
                }
         }
    }

What I need here is the ability to get the first 2 aggregation buckets in one request and the other 2 (in this case, only 1) in another request.

@missingpixel

If paging aggregations is not possible, how do we use ES for online stores where products of different colours are grouped together? Or, what if there are five million authors in the example at: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/top-hits.html ? Aggregate them and perform pagination in-memory?

If that's not possible, what else can be done in place of grouping in Solr?

Thank you

@adrienbrault

A parameter that hides the first X buckets from the response would be nice.

@adrienbrault

@clintongormley Why was this issue closed ?

@bobbyhubbard

Reopen please?

@mikelrob
mikelrob commented Jan 2, 2015

+1 for pagination while sorted on term not doc count

@daniel1028

Can you re-open this please?

I understand that aggregation pagination could create performance issues with larger numbers of records. But it will not affect smaller numbers of records, right?

The performance issue only happens when there are more records. Why don't we have this support at least for smaller sets of records?

Why hesitate to add this support because of the large-data case? If we had it, it would be very helpful for paginating smaller amounts of data.

Maybe we can inform users that this support is only efficient for smaller amounts of data; as the amount of data increases, performance will suffer.

@clintongormley
Member

We have been against adding pagination support to the terms (and related) aggregations because it hides the cost of generating the aggregations from the user. Not only that, it can produce incorrect ordering, because term-based aggregations are approximate.

That said, we support pagination on search requests, which are similarly costly (although accurate).

While some users will definitely shoot themselves in the foot with pagination (eg #4915 (comment)), not supporting pagination does limit some legitimate use cases.

I'll reopen this ticket for further discussion.

@byronvoorbach

I would love to see this feature being added to ES, but I understand the cost at which it would come.
I'm currently working for a client that needed such a feature, but since it didn't exist yet we solved it by doing 2 queries:

The first query has a terms aggregation on the field we want to group on and orders the aggregation by the doc score. We set the size of the aggregation to 0 so that we get all buckets for that query.
We then parse the result and get the keys from the buckets corresponding to the given size and offset (eg buckets 30-40 for page 3).

We then perform a new query, filtering all results based on the keys from the first query. Alongside the query is a terms aggregation (on the same field as before), and we add a top_hits aggregation to get the results for those (10) buckets.

This way we don't have to load all 40 buckets and get the top_hits for those buckets, which increases performance.

Loading all buckets and top 10 hits per bucket took around 20 seconds for a certain query. With the above change we managed to bring it back to 100ms.

Info:

  • +-60 million records
  • around 1500 buckets for average query
  • around 300 documents per bucket

This might help someone out as a workaround till such a feature exists within Elasticsearch
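A rough sketch of the two-query workaround above. It only builds the request bodies (the actual search call depends on your client), and the field name group_field is an illustrative assumption.

```python
# Query 1: terms aggregation only; size 0 on the agg means "all buckets".
def first_query():
    return {"size": 0,
            "aggs": {"groups": {"terms": {"field": "group_field", "size": 0}}}}

# Query 2: restrict to the bucket keys for the requested page, then ask
# for top_hits inside just those buckets.
def second_query(bucket_keys, page, per_page=10):
    keys = bucket_keys[(page - 1) * per_page : page * per_page]
    return {
        "size": 0,
        "query": {"terms": {"group_field": keys}},
        "aggs": {"groups": {"terms": {"field": "group_field", "size": per_page},
                            "aggs": {"top": {"top_hits": {"size": 10}}}}},
    }

# Example: page 3 of cached bucket keys covers buckets 20-29.
body = second_query([f"k{i}" for i in range(40)], page=3)
```

The speedup reported above comes from the second query: top_hits is only computed for the ten buckets on the current page instead of for every bucket.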

@davidvgalbraith
Contributor

Hey! I too would like paging for aggregations. That's all.

@bauneroni

I'd also love to see this someday but I do understand the burden (haven't used that word in a long time) and costs to implement this. This feature would be quite handy for my client's application which is operating on ~250GB+ of data.

Well, yeah.. what he^ said 👍

@vinusebastian

@aaneja with respect to "Terms aggregation does not support a way to page through the buckets returned.
To work around this, I've been trying to set 'min_doc_count' to limit the buckets returned and using a 'exclude' filter, to exclude already 'seen' pages.

Will this result in better running time performance on the ES cluster as compared to getting all the buckets and then doing my own paging logic client side ?"

How did you exclude already seen pages? Or how did you keep track of seen pages? Also what did you learn about performance issues with such an approach?

@dakrone
Member
dakrone commented Apr 10, 2015

We discussed this and one potential idea is to add the ability to specify a start_term for aggregations, that would allow the aggregation to skip all of the preceding terms, then the client could implement the paging by retrieving the first page of aggregations, then sending the same request with the start_term being the last term of the previous results. Otherwise the aggregation will still incur the overhead of computing the results and sub-aggregations for each of the "skipped" buckets.

To better understand this, it would be extremely useful to get more use-cases out of why people need this and how they would use it, so please do add those to this ticket.
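A sketch of how a client might chain requests if the proposed start_term option existed (it does not; terms_page below simulates a server that supports it for term-ordered aggregations).

```python
# Simulated server side: return `size` terms in term order, skipping
# everything up to and including `start_term`.
def terms_page(all_terms, size, start_term=None):
    terms = sorted(set(all_terms))
    if start_term is not None:
        terms = [t for t in terms if t > start_term]  # skip preceding terms
    return terms[:size]

# Client side: drive paging by feeding the last term of each page back
# in as the next request's start_term.
def all_pages(all_terms, size):
    pages, last = [], None
    while True:
        page = terms_page(all_terms, size, start_term=last)
        if not page:
            return pages
        pages.append(page)
        last = page[-1]  # next request starts after the last term seen
```

As the comment above notes, this only avoids transfer of the skipped buckets; the server would still pay the cost of computing results for them unless it can seek directly to the ordinal of start_term.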

@2e3s
2e3s commented Apr 10, 2015

+1 for that. There may be tens of thousands of unique terms by which we group and gather statistics via sub-aggregations. The result can be sorted by any of these sub-aggregations, so it's going to be very costly anyway, but its speed with ES is currently more than bearable, as is its precision; if we didn't have to send such big JSON payloads between servers and hold them in PHP (which isn't good at all for now), it would be fine. I even think of writing some plugin that would do this simple job. But that would still require computing a sorting sub-aggregation if one is used.

@a0s
a0s commented Apr 29, 2015

+1

@benneq
benneq commented May 6, 2015

+1

@pauleil
pauleil commented May 6, 2015

+1

@jaynblue
jaynblue commented May 6, 2015

+1

@dragonkid

+1

@genme
genme commented May 7, 2015

+1

@aznamier
aznamier commented May 7, 2015

+1

@bobbyhubbard

+1

@robinmitra

+1

@DeLaMuerte

+1

@mfischbo
mfischbo commented May 8, 2015

+1

@khaines
khaines commented May 8, 2015

+1

@srijan55
srijan55 commented May 8, 2015

+1

@nathraQ
nathraQ commented May 9, 2015

+1

@v4run
v4run commented May 11, 2015

+1

@rmuhzin
rmuhzin commented May 11, 2015

+1

@dubadub
dubadub commented May 13, 2015

+1

@paxnoop
paxnoop commented May 14, 2015

+1

@leedohyun

+1

@caJaeger

+1

@a-dorosh

+1

@jpountz
Contributor
jpountz commented May 21, 2015

I've been thinking more about this recently: I don't think we can reasonably add pagination options to the terms aggregation. However, maybe we can make it easier to implement client-side. Here is a proposed plan:

Sorting by term

This case could be handled efficiently if we had options to only run the terms aggregation on a range of terms, similarly to the include option. For instance if you got aaa, aab and abc (size=3) on the first page, you could get the next page by running a terms aggregation on terms that are greater than abc. On server-side, this could be dealt with efficiently by finding the ordinals that match the range boundaries and only emitting ordinals between these boundaries at the fielddata level.

Sorting by count/sub-aggregation

In that case, the easiest/most efficient solution would be to provide the terms aggregation with a list of exclude terms that contains the terms that occurred on the previous pages. So if your page size is 10 and you want page 4, you would need to exclude the 30 terms that appeared on the first 3 pages. The functionality already exists, so this would mostly be a matter of documentation.
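A sketch of the exclude-list idea for count-ordered paging. The field name "f" is a placeholder, and the pipe-joined string follows the regex-style exclude syntax the terms aggregation supported at the time.

```python
# To fetch page N, exclude every term already seen on pages 1..N-1.
def request_for_page(seen_terms, per_page=10):
    terms_agg = {"field": "f", "size": per_page,
                 "order": {"_count": "desc"}}
    if seen_terms:
        # "a|b|c" is a regex alternation matching all previously seen terms.
        terms_agg["exclude"] = "|".join(seen_terms)
    return {"size": 0, "aggs": {"t": {"terms": terms_agg}}}

seen = [f"term{i}" for i in range(30)]  # terms from pages 1-3 (size 10)
body = request_for_page(seen)           # request body for page 4
```

Note the obvious limitation raised later in the thread: the exclude list grows linearly with the page number, and jumping straight to a deep page requires having seen every page before it.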

@nekulin
nekulin commented May 23, 2015

Yes, for now you can create an auto-increment id and filter with id > 19, size=10, then id > 29, size=10. But grouping all the data will still be necessary.

@Castlejoe

Paging should work efficiently for many terms (I have 50,000+) and I don't see how excluding previous terms ("you would need to exclude the 30 terms that appeared on the first 3 pages") would work here. To exclude terms, I would need to have all the terms in case the user wants to jump to the last page...

For me, and probably many others, the simplest solution would be a perfect start: do exactly what you do today but return only the requested page's data.

@abeninskibede

+1

@emilgerebo

+1

@nile1801

+1

@Jan539
Jan539 commented Jun 19, 2015

I'm relatively new to this topic so maybe I don't see the problem here but for my use case it is as following:

I send a query with my aggregation (using size 0) and get back a bucket array containing all the results. Now I could store those buckets and pick the first X, then the next X entries. So why isn't it possible to just let ES execute it with size 0, store all results, and give back the first X bucket entries together with the total number of bucket entries? With the next query I could give the start index of the next X entries.

From the comments above I think I misunderstood something, but this would be my idea.

@nvh0412
nvh0412 commented Jun 24, 2015

+1

@rajeshetty

+1

@sylvinus
Contributor
sylvinus commented Jul 6, 2015

+1, judging from the number of comments here this seems like something users really want!

@sciphilo

+1

@Mobyman
Mobyman commented Jul 13, 2015

👍

@rbart
rbart commented Jul 17, 2015

+1

@shakura
shakura commented Jul 18, 2015

I think it would be really useful if pagination were possible in the terms aggregation with routing.

@npinon
npinon commented Jul 21, 2015

+1

@gitcarbs

+1

@pseidemann

+1

@dorr-fg
dorr-fg commented Aug 6, 2015

+1

@jordanfowler

+1

@jordanfowler

To do this on the client side, as others have mentioned, you'll need to keep track of the previously matched terms and exclude them. This will always require N queries (where N is the page number) in order to ask for an arbitrary page of buckets. Here's a working example implementation (client-side) in Ruby:

  def default_search_aggs
    {
      aggs: {
        profiles: {
          terms: {
            field: 'profile_id',
            order: {
              _count: 'desc'
            }
          }
        }
      }
    }
  end

  def search_aggregation(options = {})
    min_doc_count = options[:min_doc_count].to_i > 0 ? options.delete(:min_doc_count).to_i : 1
    page = options[:page].to_i > 0 ? options.delete(:page).to_i : 1
    buckets, exclude = [], options.delete(:exclude).to_a

    if page > 1
      buckets = search_aggregation(min_doc_count: min_doc_count, page: page - 1, exclude: exclude) do |_exclude|
        exclude = _exclude
      end
    end

    exclude = [exclude, buckets.collect { |term| term[:key] }].flatten.compact.uniq
    yield exclude if block_given?
    opts = default_search_aggs.dup
    opts[:aggs][:profiles][:terms][:min_doc_count] = min_doc_count
    opts[:aggs][:profiles][:terms][:exclude] = exclude.join('|')

    # search contains the query alongside this aggregation
    search(opts.merge(options)).response.aggregations.profiles.buckets
  end

[Update] Forgot to yield previous exclude (this is super slow for large pages)

@juntezhang

+1

@ddombrowskii

+1

@juntezhang

I think paging should be supported when count is not used in the aggregation. For example, I am using the terms aggregation only to get a distinct set of values of a field; I do not need the counts. However, since my distinct set of values is huge, paging is needed for performance. I am trying to solve this with script-based filtering, creating groups of values to speed up the aggregation, but this is a sub-optimal workaround.

@mikelrob

^^^^^ +1

@rubensegovia

+1

@yangjm
yangjm commented Aug 26, 2015

@jpountz The proposed solution works only when we already know the terms before the page being visited. Please consider the case where there are thousands of terms and in the UI a user jumps directly from page 1 to 500.

+1

@aviadlich

+1

@jpountz
Contributor
jpountz commented Sep 10, 2015

@yangjm If this is the use-case then there is nothing we can do that would be better than what we have today.

@2e3s
2e3s commented Sep 10, 2015

@jpountz I think you're generally being asked here to do just what SQL servers do (an OFFSET+LIMIT after the result set is formed), which is slow but does its job. Correct me if I'm wrong or if it's not possible.

@jpountz
Contributor
jpountz commented Sep 10, 2015

@2e3s It is possible but compared to what SQL servers can do there are two additional problems:

  • elasticsearch is distributed over several shards, so when working at the shard level you can't know how a term will rank globally. So if you want to get terms between positions 10000 and 10010 you essentially need to ask every shard for all terms between positions 0 and 10010 (we have the same issue for distributed search, unless you use scrolling)
  • terms aggregations are not accurate when sorting by count (the default), so some terms may appear both at the end of one page and at the beginning of the next, while other terms could easily be omitted for the opposite reason. Accuracy is generally good if you have a zipfian distribution of term frequencies, but that typically stops being true as you paginate deeper and deeper, since term frequencies become closer and closer together.
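Both points can be illustrated with a toy merge in plain Python (not ES internals): to serve global positions offset..offset+size, every shard must report its top offset+size terms, and truncating that per-shard depth distorts the global ranking.

```python
import heapq

# Two "shards" with per-shard term counts. True global counts:
# c=75, a=55, b=40, e=35, f=25, d=20.
shard1 = {"a": 50, "b": 40, "c": 30, "d": 20}
shard2 = {"c": 45, "e": 35, "a": 5, "f": 25}

def global_page(shards, offset, size):
    # Each shard contributes its top (offset + size) terms; a term ranked
    # low on one shard may still rank high globally (e.g. "c").
    merged = {}
    for shard in shards:
        top = heapq.nlargest(offset + size, shard.items(),
                             key=lambda kv: kv[1])
        for term, count in top:
            merged[term] = merged.get(term, 0) + count
    ranked = sorted(merged, key=lambda t: (-merged[t], t))
    return ranked[offset:offset + size]

# Deep enough per-shard depth: exact. Positions 2-3 are ['b', 'e'].
page = global_page([shard1, shard2], offset=2, size=2)

# Truncated depth shows the approximation problem: with offset=0, size=2
# each shard reports only 2 terms, shard1's count for "c" (30) is missed,
# and "a" (50 seen) wrongly outranks "c" (75 true total).
first = global_page([shard1, shard2], offset=0, size=2)
```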
@yangjm
yangjm commented Sep 10, 2015

@jpountz To handle this requirement in my project, I added a parent entity when indexing the data, let it hold the term value, and made the entity being searched its child entity. Then I use has_child instead of a terms aggregation to search, which allows me to do paging on terms (the parent entity).

The inconvenience is that when more than one field in the entity needs to be aggregated and paged, I have to replicate the entity for each field, because one entity can only have one parent.

@2e3s
2e3s commented Sep 10, 2015

@jpountz

you essentially need to ask every shard for all terms

Yes, that's slow and AFAIK that's approximately how SQL servers often work (if not possible to use indexes for everything).

some terms to appear both at the end of a page and at the beginning of the next one, while other terms could easily get omitted for the opposite reason

If the result isn't guaranteed and can differ from request to request, then it doesn't matter anyway. We will slice the results and get this discrepancy regardless. So this becomes just another warning in the docs, like the inconsistency in count-desc-sorted results (which is missing from the docs now, as far as I can see), or many others, in my opinion.
Basically it could just be an option to hide the first X terms, as someone said above.

@jodok
jodok commented Sep 10, 2015

fyi, we implemented a query / SQL layer on top of elasticsearch that solves exact counts & fast aggregations with Crate (https://github.com/crate/crate): https://crate.io/docs/stable/sql/aggregation.html

@jpountz
Contributor
jpountz commented Sep 10, 2015

that inconsistency in sorted by count desc results (which is missing in docs now as I see)

See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts.

I don't like adding pagination options with the proposed implementation because then it would feel like it is the right way to paginate with aggregations while in practice it would probably still be a better idea to over-request and paginate on client-side and/or use an exclusion list that contains all terms from the previous pages.

@jborden13

+1

@idserge7

+1

@blotgg
blotgg commented Sep 29, 2015

+1

@jorjsmile

Recently I hit the same problem, and here is my solution. The general idea is the same as was proposed:

  1. Get n results for page N. On the next page N+1, skip the previous n*(N-1) docs.
  2. For random access (page k), proceed as with ordinary stream reading:
  • iterate over n*(N+k) items, and
  • start reading (requesting all required fields) from there.
    Two requests, as you can see. Better than nothing.

(1)

To save some network traffic, I use a custom _score. The general idea is to give each document a unique number without losing search relevance. For simplicity I used the linear packing function _score * %LARGEST_TABLE_ID% * 100 + doc['id']; documents with the same _score will then differ/sort by id as well. Documents whose _score is larger than %EDGE CRITERIA% will be skipped. Here is the JSON for this part:

"min_score" : 0.0,
  "query": {    
  "function_score": {
       "query": {
            ....
       },
       "functions": [
         {
           "script_score": {
             "script": "_score*%LARGEST_TABLE_ID%*100+doc['id']>%EDGE CRITERIA%? -1 : score*%LARGEST_ID%*100"
    ....
  },
  "aggs" : { 
   //terms aggregation here....
        "size": 20, //- 20 documents per page
        "order": [
          {
            "score_doc": "desc"
          }
        ]
      },
      "aggs": {
        "top" : {
         "top_hits"{
             "size" : 1 //get _source of coolest document in group
        ...
        "score_doc": {  //aggregation inside terms. Extracting the better result from group
          "max": {
            "script": "_score"
        ...
    },

(2) ====== Custom Access =======

a) First I need to iterate over 500 documents (page 25). Using the previous body, we change:

"size": 500, //page 25
"order": [
          {
            "score_doc": "asc" //in code I'll take the first array element
          }
]
//I remove the top_hits aggregation. Only the _score of the first document matters.

b) %EDGE CRITERIA% = %FIRST_DOCUMENT_FROM_a)%['_score']. Bind that param in (1), et voila.
Ugly, but it works.

@mmahalwy
mmahalwy commented Oct 6, 2015

+1

@tadyjp
tadyjp commented Mar 4, 2016

+1

@madapaja
madapaja commented Mar 4, 2016

+1

@raphaelMalie

+1

My use case is simple: I have a list of cities, and several classifieds in these cities. I need to display a maximum of 5 classifieds per city, and I want only 30 results per page.

@lababidi
Contributor
lababidi commented Mar 8, 2016

+1

@meet2kautuk

+1 +1 +1
This is one of the basic requirements of any search engine, and ES should have it too.

@yinrong
yinrong commented Mar 17, 2016

+1

@dhaval1986

+1

@git-ashish

+1

@pkhlop
pkhlop commented Mar 25, 2016

+1

@rotorsolutions

+1 We have a large index and a lot of fields we'd like to group on. We need to be able to filter on these columns too (for instance on a date) and are hardly interested in the separate documents, merely in the bucket totals. Therefore aggregation paging would be really useful!

@buildAI
buildAI commented Mar 25, 2016 edited

+1

@Chadwiki

+1

@vmlellis
vmlellis commented Apr 4, 2016

+1

@jzbahrai
jzbahrai commented Apr 6, 2016

+1

@greg-symphony

+1

@felixbarny

If you want this to be implemented, please don't respond with +1 but add a 👍 reaction to the first comment like this:

reaction

123 others and I will thank you for not receiving a notification every time. Thx :)

@aditiamahdar

+1

@makeyang
Contributor

+1

@lust4gaming

+1

@vitaliikapliuk
vitaliikapliuk commented May 17, 2016 edited

you can approximate pagination using [top hits](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html)

you always have size and from

{
  "size": 0,
  "aggregations": {
    "country_codes": {
      "terms": {
        "field": "countryCodes",
        "size": 10
      },
      "aggs": {
        "top_offers": {
          "top_hits": {
            "from": 0,
            "size": 10
          }
        }
      }
    }
  }
}     
@paullovessearch

@tankomaz That is for paging through the documents returned within a bucket. What we're looking for is paging through the buckets themselves.

@juaniiton1

+1

@MadeInChina

+1

@yossicahen

+1

@ZeynepP
ZeynepP commented Jun 9, 2016

+1

@crynut84

+1

@ivanyyong

+1

@cerisier

+1

@jetaix
jetaix commented Jun 29, 2016

+1

@orrchen
orrchen commented Jul 3, 2016

+1

@zhanggdi
zhanggdi commented Jul 5, 2016

+1

@fraank
fraank commented Jul 11, 2016

+1

@echozhjun

+1

@tugberkugurlu

So, it turns out that this is the most requested feature of all 😄 https://github.com/elastic/elasticsearch/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc

Is it possible to get an update on whether this will land as a feature at some point?

@heiwen
heiwen commented Jul 24, 2016

+1

@jpountz
Contributor
jpountz commented Jul 25, 2016

@tugberkugurlu This feature cannot be supported in the general case. However we would like to collect more use-cases for paging aggregations in order to understand whether some use-cases can be solved in different ways, so if you have a use-case that requires paging aggregations, we'd be interested to know more about what the use-case is and what the aggregation that you are currently running looks like.

@jento-tm

+1

@tugberkugurlu

@jpountz Thanks, that would be really great. I am happy to chat with you and explain my needs in detail within the actual domain context. How should I contact you privately?

@orrchen
orrchen commented Jul 25, 2016

@jpountz Hi, we use Elastic for analytics purposes - we store events in Elastic, aggregate them, and show them to users - page views, clicks, durations, etc.
When we sort and show them in a table, for example, it's impossible to display the whole dataset, so we need pagination. Since that is currently not possible, we need another index or a different database where we store the aggregated results, which we compute in a scheduled task. The results are not real-time and it's much more work.

@jpountz
Contributor
jpountz commented Jul 26, 2016

@tugberkugurlu Can you share it here publicly? We don't need low-level details, mainly information about the aggregation that is being run (eg. a monthly date histogram with an inner terms aggregation on an ip field) and the use-case for it (eg. ingesting aggregated data into another system that other teams within the company are used to working with). Something that also interests us is whether you would always consume the whole result through pagination or whether there are reasons that would make you stop half way.

@orrchen Thanks!

@tugberkugurlu

@jpountz see this then: #4915 (comment)

Does that help?

@amontalenti
amontalenti commented Aug 4, 2016 edited

@jpountz Since you're asking for real-world use cases, here is a simple one.

We offer an API to customers which is powered by Elasticsearch aggregation queries, as I described in this blog post on Elastic.co, Pythonic Analytics with Elasticsearch. You can read the documentation for the API here.

Basically, under the hood, it runs a terms aggregation on a url field and orders results by a sum aggregation over a pageviews field (simplifying some details). Here's a gist with roughly the agg being run in production.

The API supports page and limit parameters. Right now, when a user issues a query for page=10, limit=1000, the API needs to issue a query for page*limit results, which is 10,000 results, and then do a client-side slice of the last 1,000 results before returning them to the customer. The performance of this is atrocious: it usually takes 5-6 seconds or so, whereas queries for the first page of results return in milliseconds. Because the performance is so terrible, I had to introduce limitations on the allowed page/limit combinations. For example, we don't allow paging to page 25 of results when asking for 1,000 items, since that would result in a single agg query for 25,000 results -- and merely parsing the JSON for all those items is way too much.

So, yes, very much +1 for paging support for aggregations!

@2e3s
2e3s commented Aug 4, 2016 edited

@amontalenti Imagine how you would do that on a distributed database. It seems improbable. The best option they could implement is "hide the first X terms", which is basically just not sending the first X results while still fetching and computing them (but unfortunately the ES team doesn't want to do even that). If your bottleneck is parsing JSON (as it is for me) or network load, then it can work; if your ES itself is slow, then no.
You can fetch the terms buckets, apply the offset and limit, and query top_hits separately for them.
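The two-step approach in the last sentence could be sketched roughly like this (all field and agg names are illustrative): first fetch only the term keys and slice them client-side, then issue a second query restricted to the selected keys to retrieve their hits.

```python
# Two-phase sketch: (1) fetch term keys only and slice them client-side,
# (2) query documents for just the selected keys with top_hits.
# `group_field` and agg names are illustrative.

def build_keys_request(group_field, fetch_size):
    """Phase 1: terms agg returning only the bucket keys/counts."""
    return {
        "size": 0,
        "aggs": {"groups": {"terms": {"field": group_field, "size": fetch_size}}},
    }

def build_hits_request(group_field, selected_keys, hits_per_group=3):
    """Phase 2: restrict the search to the selected keys; a terms agg
    plus top_hits then returns a few documents per group."""
    return {
        "size": 0,
        "query": {"terms": {group_field: selected_keys}},
        "aggs": {
            "groups": {
                "terms": {"field": group_field, "size": len(selected_keys)},
                "aggs": {"docs": {"top_hits": {"size": hits_per_group}}},
            }
        },
    }

keys = ["a", "b", "c", "d", "e"]     # bucket keys from phase 1
offset, limit = 2, 2
page_keys = keys[offset : offset + limit]
```

The second request is cheap because the `terms` query narrows the document set before any aggregation runs; the expensive part remains phase 1 on high-cardinality fields.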

@felixbarny

My use case is a product search which should be grouped by product groups. A product is for example a T-Shirt "Style" in black or in blue. The product group would then be "T-Shirt Style" which consists of the products "T-Shirt Style black" and "T-Shirt Style blue". Let's say there is also a product group "T-Shirt Swag" which also has T-Shirts in black and in blue.

When searching for "t-shirt" I want to get two results, one for each product group. The displayed product in the product list should be the so-called "favorite product" of the product group, which is just a flag on the product. If there is no product in the result with the favorite flag, just show the first product in the product group. In our example, the favorite product is always the black T-Shirt.

When searching for blue, I also want to get two results. This time the black T-Shirts don't match the query, so the product list shows the blue shirts.

Currently I'm solving this by loading all results (max 50,000) and group by product group id on the client side.
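The grouping itself can be expressed server-side with a terms agg plus a top_hits sub-agg, sorting so the favorite product wins; what remains impossible, and is the subject of this issue, is paging through the product-group buckets. A sketch with illustrative field names:

```python
# Sketch: group matching products by product_group_id and show one
# product per group, preferring the one flagged `favorite`. All field
# names are illustrative. The bucket list itself still cannot be
# paginated, which is why the comment above groups client-side instead.

grouped_search = {
    "size": 0,
    "query": {"match": {"title": "t-shirt"}},
    "aggs": {
        "product_groups": {
            "terms": {"field": "product_group_id", "size": 50},
            "aggs": {
                "display_product": {
                    "top_hits": {
                        "size": 1,
                        # favorite=true sorts first; ties fall back to score
                        "sort": [{"favorite": {"order": "desc"}}, "_score"],
                    }
                }
            },
        }
    },
}
```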

@felixbarny

To solve this properly, I'd have to be able to scroll through the results of a top hits aggregation.

A similar use case is a Google-like website search: show the top 5 results for each website, with the ability to paginate through the websites.

@seanyoo
seanyoo commented Aug 19, 2016

+1

@ycit
ycit commented Aug 30, 2016

+1

@s1monw
Contributor
s1monw commented Aug 30, 2016

+1 why is nobody fixing this?

@s12v
s12v commented Sep 1, 2016

+1

@tugberkugurlu

+1 why is nobody fixing this?

@s1monw see #4915 (comment)

@tugberkugurlu

@jpountz have you seen #4915 (comment), do you need more info?

I can understand that this cannot be supported in the generic case, but it would be good to have some docs which highlight the workarounds. For example, this makes sense: https://discuss.elastic.co/t/large-results-sets-and-paging-for-aggregations/20653/2

"aggs are executed on the entire result set. therefore if it managed to fit
into the memory you should just get it. paging will mean that you throw
away a lot of results that were already calculated. the only way to "page"
is by limiting the results that you are running aggs on. for example if
your data is sorted by date and you want to build histogram for the results
one date range at a time."

However, my data is still large and can give > 1000 results per range, which is still not ideal.
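The "one date range at a time" workaround quoted above could be sketched like this (field names `timestamp` and `category` are illustrative): restrict the query to a window, aggregate, then move the window and merge results client-side.

```python
# Sketch of the quoted workaround: run the agg over one date window at a
# time instead of the full result set. Field names are illustrative.

def build_windowed_agg(gte, lt):
    """Terms agg restricted to documents inside [gte, lt)."""
    return {
        "size": 0,
        "query": {"range": {"timestamp": {"gte": gte, "lt": lt}}},
        "aggs": {"by_category": {"terms": {"field": "category", "size": 1000}}},
    }

windows = [("2016-01-01", "2016-02-01"), ("2016-02-01", "2016-03-01")]
window_requests = [build_windowed_agg(g, l) for g, l in windows]
```

Caveat: a term that appears in several windows gets a partial count in each, so the client must merge per-window counts if totals matter.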

@tugberkugurlu

Currently I'm solving this by loading all results (max 50,000) and group by product group id on the client side.

@felixbarny how long does loading that data into your client take? Sorry, I haven't done the calculation for my case, but I have a similar issue here.

@felixbarny

I haven't measured to be honest, but it's still moderately fast.

@baiyunping333

mark! I need this feature.

@sadokmtir

I want to get the second 10 buckets from an aggregation. How can I do that?
Is pagination supported yet?

@felixbarny

Nope. You have to request 20 buckets and discard the first 10 :(

@sadokmtir
sadokmtir commented Sep 5, 2016 edited

That was only a very humble example to keep things simple; in reality it would be more like 1000 buckets per page. So, as I see it, it is not possible to do that through aggregations in Elasticsearch?

@felixbarny

That's right. It's currently not possible. That's what this issue is about. But it seems to be very hard to implement because the operation may be distributed across many nodes/shards.

@felixbarny

Because this feature is by far the most requested one, I feel like Elastic should care more about it.

Maybe they could write a blog post which explains why it is difficult to implement and what the alternatives are. Additionally, it would be great if they posted one update per month here about the state of the internal discussions and the progress made towards a solution of this issue.

Ignoring this issue or closing as "won't fix" would upset a lot of folks.

(no offense towards elastic here)


P.S.: please use the 👍 feature of GitHub instead of commenting with +1 (otherwise every participant receives a useless notification).

@bpolaszek

Hi everyone,

I'm trying to switch a BI application from Solr 5.4 to ElasticSearch 2.3.5, and this was a feature I was using in Solr - https://cwiki.apache.org/confluence/display/solr/Result+Grouping

It would be great if Elastic implements it :)

@azngeek
azngeek commented Oct 26, 2016

+1

@bioform
bioform commented Oct 27, 2016

I have a simple use case.
We have:

  1. a list of Models
  2. each Model has a "version" (integer)
  3. each Model has a GUID
  4. Models with the same GUID are different versions of the same model, so Models sharing a GUID have different "version" field values
  5. we use ES to search Models by text query (searching across all Model versions), and we need to show only the highest Model version in the search results
  6. we need to paginate the above results

I don't see any way to solve this task without aggregation pagination.
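The per-search grouping part of this can be expressed today with a terms agg on the GUID plus a top_hits sorted by version descending (sketch below, field names illustrative); the missing piece, exactly as this issue says, is paginating the resulting buckets.

```python
# Sketch: one bucket per Model GUID, keeping only the highest-version
# document in each bucket. Field names are illustrative; paginating the
# buckets themselves is what this issue is asking for.

latest_versions = {
    "size": 0,
    "query": {"match": {"description": "some text query"}},
    "aggs": {
        "by_guid": {
            "terms": {"field": "guid", "size": 100},
            "aggs": {
                "latest": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{"version": {"order": "desc"}}],
                    }
                }
            },
        }
    },
}
```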

@cqeescalona

+1

@sourcec0de

I am performing a pipeline aggregation over a large number of buckets. All I need is the bucket average, not the buckets themselves. It would save significant bandwidth if there were at the very least an option to return just the result of the pipeline agg and not the buckets.
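For the bandwidth part of this, response filtering via the `filter_path` request parameter can already strip the buckets from the payload (the buckets are still computed server-side, only the response shrinks). A sketch, with illustrative index, field, and agg names:

```python
# Sketch: use the `filter_path` response-filtering parameter so the
# response carries only the pipeline result, not the histogram buckets.
# Index name, field names, and agg names are illustrative.

from urllib.parse import urlencode

params = {
    "size": 0,
    # keep only the avg_bucket result in the response
    "filter_path": "aggregations.avg_monthly_sales",
}
url = "/sales/_search?" + urlencode(params)

body = {
    "aggs": {
        "sales_per_month": {
            "date_histogram": {"field": "date", "interval": "month"},
            "aggs": {"sales": {"sum": {"field": "price"}}},
        },
        "avg_monthly_sales": {
            "avg_bucket": {"buckets_path": "sales_per_month>sales"}
        },
    }
}
```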

@josemen
josemen commented Nov 2, 2016

+1

@azgard121

+1

@dmitriytitov

+1

@makeyang
Contributor
makeyang commented Nov 4, 2016

+1

@itachi04199

+1

@jvkumar
jvkumar commented Nov 16, 2016

+1

@jvkumar
jvkumar commented Nov 16, 2016 edited

@jpountz any idea if this feature is in the future roadmap or this will never happen? What is the exact status of this issue?

@markharwood
Contributor

There isn't a single "this feature" being discussed here so it's worth grouping the different problems that I think have been articulated here:

Paging terms sorted by derived values

Several requests ( #4915 (comment), #4915 (comment) and #4915 (comment) ) seem to require doc_count or a sub agg to drive the sort order and, as has been pointed out, this is hard/impossible to implement efficiently in a distributed system (but see the "exhaustive analysis" section below).

Paging terms sorted by term value

This is potentially achievable with an "after term X" parameter. We don't have any work in progress on this at present. Requested in #4915 (comment) and #4915 (comment)
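Until such an "after term X" parameter exists, one workaround, already mentioned at the top of this thread, is to sort buckets by term value and exclude already-seen terms on each subsequent request. A sketch (field name illustrative; `_term` is the bucket-order key in the Elasticsearch versions current at the time of this thread):

```python
# Sketch of paging by term value with the exclude workaround from the
# top of this thread: order buckets by term, then on each following
# request exclude the terms already seen. Field name is illustrative.

def build_next_page(field, seen_terms, page_size):
    terms = {
        "field": field,
        "size": page_size,
        "order": {"_term": "asc"},  # sort buckets by term value
    }
    if seen_terms:
        terms["exclude"] = sorted(seen_terms)  # skip earlier pages
    return {"size": 0, "aggs": {"by_term": {"terms": terms}}}
```

The obvious drawback is that the exclude list grows with every page, so request size and filtering cost increase the deeper you page.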

Search result collapsing

Many requests are not about paging aggregations per se but about using the terms agg and top_hits to group documents in search results in a way that limits the number of docs returned under any one group. Requested in #4915 (comment) , #4915 (comment) #4915 (comment) , #4915 (comment) , #4915 (comment) , #4915 (comment) and #4915 (comment)

Before we even consider pagination or distributed systems, this is a tricky requirement even on a single shard, and one which recently saw work in Lucene. The approach that change uses is best explained with an analogy: if I was making a compilation album of 1967's top hit records:

  1. A vanilla query's results might look like a "Best of the Beatles" album - no diversity
  2. A "grouping query" would produce "The 10 top-selling artists of 1967 - some killer and quite a lot of filler"
  3. A "diversified" query would be the top 20 hit records of that year - with a max of 3 Beatles hits to maintain diversity

When people use a terms agg and top_hits together they are trying to implement option 2) which is not great due to the "filler". Option 3) is implemented using the diversified sampler agg with a top_hits agg.

The bad news with diversified queries is that even on a single shard system we cannot implement paging sensibly (if the top-20 hits are page one of search results on what page would it make sense to introduce some more of the high-quality Beatles hits and relax the diversity constraint?). There's no good single answer to that question. There's also the issue of "backfilling" too when there's no diversity to be had in the data. When we also throw in the distributed aspects to this problem too there's little hope for a good solution.

Exhaustive Analysis

Some comments suggest some users simply want to analyse a high-cardinality field and doing so in one agg request today requires too much memory. In this case pagination isn't necessarily a requirement to serve pages to end users in the right order - it's merely a way of breaking up a big piece of analysis into manageable bite-sized requests. Imagine trying to run through all user-accounts sorting their logged events by last-access date to see which accounts should be marked as dormant. The top-level term (account_id) is not important to the overall order but we do want to sort the child aggs by a term (access_date).
I certainly ran into this issue and the good news is I think we have an approach to tackle it and a PR is in progress.
Basically terms are divided into unsorted partitions but within each partition the client can run an agg to sort top-level terms by any child-aggs etc in a way that filters most of the garbage out. The client would have to do the work of glueing top results from each partitioned request to get a final result together but this is perhaps the most achievable step forward in the short term.
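The partitioning scheme described above could be sketched as follows, assuming the `include: {partition, num_partitions}` syntax from the in-progress PR and using the dormant-accounts example (field and agg names illustrative): each request aggregates one partition, and the client glues the per-partition results together.

```python
# Sketch of terms partitioning: terms are hashed into N unsorted
# partitions; each request aggregates one partition with full sub-agg
# sorting, and the client merges the per-partition top results.
# Field names follow the dormant-accounts example above.

NUM_PARTITIONS = 20

def build_partition_request(partition):
    return {
        "size": 0,
        "aggs": {
            "accounts": {
                "terms": {
                    "field": "account_id",
                    "include": {
                        "partition": partition,
                        "num_partitions": NUM_PARTITIONS,
                    },
                    "size": 100,
                    # oldest last-access first, within this partition
                    "order": {"last_access": "asc"},
                },
                "aggs": {"last_access": {"max": {"field": "access_date"}}},
            }
        },
    }

partition_requests = [build_partition_request(p) for p in range(NUM_PARTITIONS)]
```

Each request is bounded in memory because only one partition's terms are held at once; the trade-off is that ordering is only meaningful within a partition, so a globally sorted view requires a client-side merge.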

@hoffin688

+1

@markharwood
Contributor

Closing in favour of #21487 which provides a way to break terms aggregations into an arbitrary number of partitions which clients can page through.

I recognise this will not solve all of the different use cases raised on this ticket, which I tried to summarise in my last comment. Because of the diverse concerns listed here, any remaining concerns should ideally be broken into separate new issues where #21487 is not a solution.

For those looking for the "Paging terms sorted by term value" use case I would ask is the sort order important or was it just a way of subdividing a big request into manageable chunks? If the latter then #21487 should work fine for you. If there is a genuine need for the former then please open another issue where we can focus in on solving that particular problem.

For the "Search result collapsing" use case I outlined then the conclusion is that there is no single diversity policy that meets all needs and that even if there was one, implementing it would be prohibitively complex/slow/resource intensive. Again, we can open another issue to debate that specific requirement but I'm pessimistic about the chances of reaching a satisfactory conclusion.

Thanks for all your comments :)

@markharwood markharwood removed the discuss label Nov 24, 2016
@paullovessearch
paullovessearch commented Nov 24, 2016 edited

Added #21785 to capture the search results collapsing use case.

@stopsopa

+1

@clintongormley clintongormley locked and limited conversation to collaborators Dec 16, 2016