
Reindexing losing data? #11435

Closed
ryanbaldwin opened this issue May 30, 2015 · 13 comments
Labels
:Distributed/CRUD, feedback_needed, :Search/Search

Comments

@ryanbaldwin

Hi,

I've been doing some playing around with ES for the purposes of introducing it into our organization for searching audit data. Part of my adventure is data modelling and playing with mappings.

I had a small index, called "audit_v1", which has 43,754 documents. I created a second index called "audit_v2" and did a scan & scroll, bulk-creating the 43,754 documents 500 at a time into the new index. I've done this 5 times now, and every single time I'm seeing fewer records showing up in the audit_v2 index, and it's the same (lower) number every single time. This is despite the _bulk API not reporting any errors in the response. From what I can tell all the documents should be there, but apparently they aren't.

Is this a legit bug, or is it possible I'm misunderstanding exactly what [index]/_count returns?

Sorry for the trivial question, but I can't find a clear answer online. I'm using v1.5.0.

Thanks

@clintongormley

Hi @ryanbaldwin

Could you tell us more about your mappings, and exactly how you do the bulk and scan/scroll? Also, in the scan/scroll responses, could you check for shard failures, and check your logs to see if any exceptions are reported?

See #11419 (comment) for a similar issue.

Also, could you give us the output of $JAVA_HOME/bin/java -version

@clintongormley clintongormley added :Search/Search, :Bulk, and feedback_needed labels May 31, 2015
@s1monw
Contributor

s1monw commented May 31, 2015

Are you calling refresh before you do the count call? How many docs are you missing, and how is the index created? Can you provide more info?
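
(For reference, an explicit refresh before counting looks roughly like this. A minimal sketch in Python with the requests library; the host and index name are placeholders, not the reporter's actual setup.)

import requests

base = "http://localhost:9200"   # placeholder node address

# Force a refresh so recently indexed documents become visible to search/_count.
requests.post(f"{base}/audit_v2/_refresh").raise_for_status()

# Now the count should reflect everything that was bulk-indexed.
count = requests.get(f"{base}/audit_v2/_count").json()["count"]
print(count)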

@ryanbaldwin
Author

Hey all. I'm not at home right now but will provide a detailed explanation when I'm able to. Maybe later tonight, or tomorrow at the latest. For now I'll go by memory while fat-thumbing on my phone.

High level answers: 

  • don't know what version of Java I'm using off the top of my head. I'm using the Elasticsearch Docker repo, which I think uses Java 7.
  • don't believe there were any shard failures. The bulk API didn't report any errors on any of the create responses, though I haven't checked the ES logs. Marvel is reporting all shards as good.
  • didn't call refresh (though that's a great suggestion) but I wasn't querying _count immediately afterwards either. My understanding is that by default everything is refreshed about every second. Even if I let the index sit for a bit (say, 10 minutes), count would still report fewer docs than the original index.
  • The number forever rested at the lower number. The original index had 44,754. When I migrated it with a scan size of 200 (x 5 shards = 1,000 docs per bulk post), the migrated index was 44,709. The odd thing is that if I did this multiple times, using that same scan size, it was ALWAYS 44,709. I decided to drop the scan size to 100, thinking perhaps I was overloading it or something, and migrated again. This time the count dropped to 43,665 (I think). Now that I'm looking at it, this seems like a pattern. It appears as though I'm missing about 1 document for every scan/bulk cycle.
  • the scan/bulk is done using a simple app I wrote in Clojure. In a nutshell (a rough sketch of this loop follows below):
  1. Scan with a scan size of x on the old index, using match_all.
  2. Using the scroll_id returned by 1, hit the scroll API (/_search/scroll?). I then build a "create" request for each document returned by the scroll, interleaving the "create" action with each doc I'm migrating. The "create" contains the target index, type, and existing ID for the respective doc, and the batch is posted to _bulk.
  3. After the bulk call I repeat step 2 using the scroll_id returned by the previous scroll call. Wash, rinse, repeat until the scroll call returns 0 hits.

Like I said, it seems as though I'm missing 1 doc for each scan/bulk cycle. Perhaps it's something in my script, but after specifying the scan size in the original scan call, I don't rely on the number ever again. I simply iterate over every document.
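
Roughly, the loop described above looks like this (a hedged Python sketch with the requests library, not the actual Clojure app; the host, index names, type name, and scan size are placeholders, and it assumes the 1.x scan/scroll and bulk APIs used in this thread):

import json
import requests

base = "http://localhost:9200"
src, dest, doc_type = "audit_v1", "audit_v2", "event"   # "event" is a made-up type name

# 1. Open a scan cursor on the old index, using match_all.
r = requests.post(
    f"{base}/{src}/_search?search_type=scan&scroll=1m",
    json={"query": {"match_all": {}}, "size": 200},
)
scroll_id = r.json()["_scroll_id"]

while True:
    # 2. Fetch the next tranche of hits for this scroll_id.
    r = requests.post(f"{base}/_search/scroll?scroll=1m", data=scroll_id)
    page = r.json()
    hits = page["hits"]["hits"]
    if not hits:
        break  # 3. scroll returned 0 hits: done
    scroll_id = page["_scroll_id"]

    # Interleave a "create" action line with each document's source,
    # then post the batch to _bulk.
    lines = []
    for hit in hits:
        lines.append(json.dumps({"create": {"_index": dest, "_type": doc_type, "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    body = "\n".join(lines) + "\n"   # the bulk body must end with a newline
    requests.post(f"{base}/_bulk", data=body)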

I can provide more details later, such as an excerpt of the bulk calls, etc. 

For now: thoughts?

  • ryan.


@bleskes
Contributor

bleskes commented May 31, 2015

Thx Ryan for the details. Quick question - do you use parent & child documents or custom routing?

@ryanbaldwin
Author

Negative. Setup is pretty stock: default 5 shards + 1 replica, and however they get routed is how they get routed. That said, I AM using a dynamic mapping template on the target index, but it's mostly just setting 99% of the incoming string values to not_analyzed, since this is explicit audit data and not something that really requires full-text search.
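
(For context, a dynamic template of that kind typically looks something like the following. A generic sketch via Python/requests, not the actual mapping from this issue; the index name, template name, and host are placeholders, using the 1.x string/not_analyzed syntax.)

import requests

# Create the target index with a dynamic template that maps every incoming
# string field as not_analyzed (exact values) instead of analyzed full text.
requests.put("http://localhost:9200/audit_v2", json={
    "mappings": {
        "_default_": {
            "dynamic_templates": [{
                "strings_not_analyzed": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "string", "index": "not_analyzed"},
                }
            }]
        }
    },
})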

As a side note, here's some possible useful information about the topology:

I have 2 ES servers, each in their own Docker container, each configured identically, and each running on the same host. Each ES server has its own persistent logs/data volumes on the host (i.e. they are not sharing the same logs/data directories). Sitting in front is an nginx instance doing simple round-robin balancing between the two. The Clojure app does everything through nginx, same with the manual queries I run via Sense.

As far as I'm aware that topology with docker should roughly approximate (at a minimum) what 2 separate instances on two separate hosts in a network should look like. 

  • ryan.


@ryanbaldwin
Author

Also - no parent/child docs. Just 45k documents, each one an audit event, and each one completely independent.

  • ryan.


@clintongormley

Hi @ryanbaldwin

  • The number forever rested at the lower number. The original index had 44,754. When I migrated it with a scan size of 200 (x 5 shards = 1,000 docs per bulk post), the migrated index was 44,709. The odd thing is that if I did this multiple times, using that same scan size, it was ALWAYS 44,709. I decided to drop the scan size to 100, thinking perhaps I was overloading it or something, and migrated again. This time the count dropped to 43,665 (I think). Now that I'm looking at it, this seems like a pattern. It appears as though I'm missing about 1 document for every scan/bulk cycle.

This sounds a lot like a bug in your code, perhaps:

  • not calling refresh (or waiting) on the destination index before retrieving the count
  • an off-by-one error on every scroll request
  • not collecting the last tranche of hits from scroll
  • not performing the final bulk write

An easy way to test this would be to use a module known to work. If you're familiar with Perl, you could install the Search::Elasticsearch module (see https://metacpan.org/pod/Search::Elasticsearch) and run the following script (updating the index names for your local setup):

#!/usr/bin/env perl

use strict;
use warnings;
use Search::Elasticsearch;

my $src  = 'source_index';
my $dest = 'dest_index';
my $node = 'localhost:9200';

my $e = Search::Elasticsearch->new( nodes => $node );

# Start from a clean destination index (ignore the 404 if it doesn't exist yet).
$e->indices->delete( index => $dest, ignore => 404 );

# Scan/scroll through the source index and bulk-index every document into the destination.
$e->bulk_helper( index => $dest, verbose => 1 )
  ->reindex( source => { index => $src } );

# Refresh so the newly indexed documents are visible to _count.
$e->indices->refresh( index => $dest );

print "\n\nNew index count: " . $e->count( index => $dest )->{count} . "\n";

@ryanbaldwin
Author

Ugh. Clinton. You are indeed correct. I made the classic _bulk error: I did not append a "\n" to the final document body. Hence the one missing document per scan/bulk cycle.
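
(In other words, the bulk body is newline-delimited JSON and the final line must end with "\n" as well; without it, the last document in each request can be silently dropped, which matches the one-missing-per-cycle pattern above. A minimal illustration with placeholder values:)

action = '{"create": {"_index": "audit_v2", "_type": "event", "_id": "1"}}'
source = '{"field": "value"}'

# Wrong: no trailing newline after the last line, so the final document may be ignored.
bad_body = action + "\n" + source

# Right: every line, including the last, ends with a newline.
good_body = action + "\n" + source + "\n"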

Very sorry, but thank you for your help.

Also - Clinton - I must congratulate you on the Elasticsearch: The Definitive Guide book. This is, by far, the best tech book I've read in over a decade. Extremely easy to understand, an excellent voice, and absolute gold on every page (including the "don't forget to put a \n after the last document when using the bulk API!", which I obviously, promptly, forgot). Huge kudos to you and Zach.

Thanks for your help, and again, my apologies for the false alarm.

@clintongormley

Kind words, @ryanbaldwin - thank you :) /cc @polyfractal

@lcawl lcawl added :Distributed/CRUD and removed :Bulk labels Feb 13, 2018
@sebastialonso

I'm experiencing the same issue, but I'm using the Reindex API. And there's an additional caveat: testing the reindex call in my local Docker environment never misses a document, but doing this in the Docker Swarm architecture for our development and beta environments loses all the data.

I'm using Elixir and its Tirexs client to communicate with Elasticsearch.

Let me show the very basic tests I'm trying:

  • Fill the index with a few documents (really few, like 4)
  • Create a new "tmp" index with the modified mappings
  • Reindex from the original index to the "tmp" index
  • Delete the original index
  • Create a 'new' original index (same name as the original one) with the same mappings as the "tmp" index
  • Reindex from the "tmp" index to the 'new' original index
  • Delete the temporary index

Pretty naive and straightforward. This is the feedback I get when running this procedure:

iex(2)> Document.Mappings.apply_mappings_changes()
"1) Building tmp index"
{:ok, 200, %{acknowledged: true, index: "tmp", shards_acknowledged: true}}
"2) Reindexing to tmp"
{:ok, 200,
 %{batches: 1, created: 4, deleted: 0, failures: [], noops: 0,
   requests_per_second: -1.0, retries: %{bulk: 0, search: 0},
   throttled_millis: 0, throttled_until_millis: 0, timed_out: false, took: 149,
   total: 4, updated: 0, version_conflicts: 0}}
"3) Deleting original index"
{:ok, 200, %{acknowledged: true}}
"4) Building new version of orignal elastic_index"
{:ok, 200, %{acknowledged: true, index: "patients", shards_acknowledged: true}}
"5) Reindexing to original elastic_index"
{:ok, 200,
 %{batches: 0, created: 0, deleted: 0, failures: [], noops: 0,
   requests_per_second: -1.0, retries: %{bulk: 0, search: 0},
   throttled_millis: 0, throttled_until_millis: 0, timed_out: false, took: 1,
   total: 0, updated: 0, version_conflicts: 0}}
"6) Deleting temporal index"
{:ok, 200, %{acknowledged: true}}

Look at the difference in the output for points 2) and 5). Something prevented my 4 documents from being reindexed.
If anybody would like to take a look at the code, let me know.

@clintongormley

@sebastialonso Please ask questions like these in the forum. The most likely thing is that your temp index hadn't refreshed before you started the second reindex, so no documents were visible to search.
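
(Roughly, that fix looks like the following. A minimal Python sketch using the index names from the output above; the host is a placeholder, and it assumes the _refresh and _reindex APIs of the version in use.)

import requests

base = "http://localhost:9200"

# ... after reindexing into "tmp" and recreating the "patients" index ...

# Refresh "tmp" so its documents are visible to the search that reindex runs.
requests.post(f"{base}/tmp/_refresh").raise_for_status()

# Now the second reindex will actually see the documents in "tmp".
requests.post(f"{base}/_reindex", json={
    "source": {"index": "tmp"},
    "dest": {"index": "patients"},
}).raise_for_status()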

@AleksandarTokarev

AleksandarTokarev commented Aug 26, 2020

We have around 700k records, and I managed to fix this by adding timeouts of 10 seconds in 2 places:

  1. Delete the original elastic_index
  2. Create the original elastic_index
  3. 10-second timeout with await new Promise(resolve => setTimeout(resolve, 10000))
  4. Run the reindex from the temp index to the original index
  5. 10-second timeout with await new Promise(resolve => setTimeout(resolve, 10000))
  6. Delete the temp index

Without the timeouts, in our case it was always missing a few thousand records.

@simlevesque

This still happens on 8.3.1.
