Question about scroll API and find_each / find_in_batches #651

Closed
howtwizer opened this issue Dec 23, 2016 · 2 comments

@howtwizer

Elasticsearch 5.1

I'm following the docs:
http://www.rubydoc.info/gems/elasticsearch-api/Elasticsearch/API/Actions#scroll-instance_method

When I try to reproduce the example 'Call the scroll API until all the documents are returned', I notice that this loop

# Call the `scroll` API until empty results are returned
while r = client.scroll(scroll_id: r['_scroll_id'], scroll: '5m') and not r['hits']['hits'].empty? do
  puts "--- BATCH #{defined?($i) ? $i += 1 : $i = 1} -------------------------------------------------"
  puts r['hits']['hits'].map { |d| d['_source']['title'] }.inspect
  puts
end

does not include the results of the initial call:

# Open the "view" of the index with the `scan` search_type
r = client.search index: 'test', search_type: 'scan', scroll: '5m', size: 10

So in the end we are missing the first batch of documents, as many as the size of the initial call.

Example:
If the index contains ['test1', 'test2', 'test3', ..., 'test100'],
scrolling with an initial size of 10 returns ['test11', 'test12', ..., 'test100']; the first 10 results are missing.

I get the same results in the Elasticsearch console: the first scroll call does not include the results of the initial call, so the scroll method itself seems to work as intended.
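
To make the skip concrete, here is a minimal Ruby sketch. `FakeScrollClient` is a hypothetical in-memory stand-in, not part of the elasticsearch gem; it assumes the behaviour observed above, where the initial search already returns the first page of hits. (As far as I know, the `scan` search type, under which the initial call returned no hits, was removed in Elasticsearch 5.0, which would explain why the documented loop drops the first page on 5.1.)

```ruby
# Hypothetical in-memory stand-in for an Elasticsearch client (NOT the real gem).
# Assumption: the initial search returns the first page of hits, and every
# scroll call returns the following page.
class FakeScrollClient
  def initialize(titles)
    @titles = titles
    @cursors = {}
  end

  def search(index:, scroll:, size:)
    @cursors['cursor-1'] = { offset: 0, size: size }
    page('cursor-1')
  end

  def scroll(scroll_id:, scroll:)
    page(scroll_id)
  end

  private

  # Build a response hash shaped like the ones used in this thread.
  def page(id)
    cursor = @cursors.fetch(id)
    batch = @titles[cursor[:offset], cursor[:size]] || []
    cursor[:offset] += batch.size
    { '_scroll_id' => id,
      'hits' => { 'hits' => batch.map { |t| { '_source' => { 'title' => t } } } } }
  end
end

client = FakeScrollClient.new((1..25).map { |i| "test#{i}" })

# Loop in the style of the documentation example: `r` is overwritten by the
# first `scroll` call before the hits of the initial search are read.
r = client.search(index: 'test', scroll: '5m', size: 10)
seen = []
while (r = client.scroll(scroll_id: r['_scroll_id'], scroll: '5m')) && !r['hits']['hits'].empty?
  seen.concat(r['hits']['hits'].map { |d| d['_source']['title'] })
end

puts seen.first # first title the loop actually processed
puts seen.size  # number of titles processed out of 25
```

With 25 documents and a size of 10, the loop only ever sees the 15 documents that come after the first page.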

But my question is about find_each.
According to the docs:

 Iterate effectively over models using the `find_in_batches` method.
          #
          # All the options are passed to `find_in_batches` and each result is yielded to the passed block.
          #
          # @example Print out the people's names by scrolling through the index
          #
          #     Person.find_each { |person| puts person.name }
          #
          #     # # GET http://localhost:9200/people/person/_search?scroll=5m&search_type=scan&size=20
          #     # # GET http://localhost:9200/_search/scroll?scroll=5m&scroll_id=c2Nhbj...
          #     # Test 0
          #     # Test 1
          #     # Test 2
          #     # ...
          #     # # GET http://localhost:9200/_search/scroll?scroll=5m&scroll_id=c2Nhbj...
          #     # Test 20
          #     # Test 21
          #     # Test 22
          #

But in fact the output starts at the second batch:

          #     # Test 20
          #     # Test 21
          #     # Test 22

I think the problem is that the 'response' variable is overwritten in find_in_batches before the hits of the initial search are yielded.
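
A batch iterator that avoids this could yield the hits of the initial search before asking for the next page. The sketch below is only an illustration, not the gem's actual `find_in_batches`; `each_scrolled_batch` and `StubClient` are hypothetical names, and the client is assumed to accept `search` and `scroll` the way the calls in this thread do.

```ruby
# Sketch: yield each page of hits, starting with the page returned by the
# initial search, and only then request the next page via `scroll`.
def each_scrolled_batch(client, index:, size: 20, scroll: '5m')
  response = client.search(index: index, scroll: scroll, size: size)
  loop do
    hits = response['hits']['hits']
    break if hits.empty?

    yield hits
    # Reassign only after the current page has been yielded.
    response = client.scroll(scroll_id: response['_scroll_id'], scroll: scroll)
  end
end

# Hypothetical stub standing in for a real client, used only to exercise
# the helper. Each call to `search` or `scroll` returns the next canned page.
class StubClient
  def initialize(pages)
    @pages = pages.dup
  end

  def search(**_opts)
    next_page
  end

  def scroll(**_opts)
    next_page
  end

  private

  def next_page
    titles = @pages.shift || []
    { '_scroll_id' => 'stub',
      'hits' => { 'hits' => titles.map { |t| { '_source' => { 'title' => t } } } } }
  end
end

client = StubClient.new([%w[Test0 Test1], %w[Test2 Test3], %w[Test4]])
batches = []
each_scrolled_batch(client, index: 'people', size: 2) do |hits|
  batches << hits.map { |h| h['_source']['title'] }
end
p batches
```

Because the reassignment happens after the `yield`, the first page (Test0, Test1) reaches the block along with the later ones.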

@stale

stale bot commented Aug 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 31, 2020
@stale stale bot closed this as completed Sep 7, 2020
@viktorianer

viktorianer commented Feb 2, 2021

For anyone looking for this, Molly Struve wrote a great guide with a solution:

After the initial search call, use scroll as follows:

loop do
  hits = response.dig('hits', 'hits')
  break if hits.empty?

  hits.each do |hit|
    # Process/do something with the hit or hits
  end

  response = Search::Client.scroll(
    :body => { :scroll_id => response['_scroll_id'] }, 
    :scroll => '1m'
  )
end
