
Reindexing losing data? #11435

Closed
ryanbaldwin opened this issue May 30, 2015 · 13 comments
Labels
:Distributed/CRUD, feedback_needed, :Search/Search

Comments

@ryanbaldwin

Hi,

I've been doing some playing around with ES for the purposes of introducing it into our organization for searching audit data. Part of my adventure is data modelling and playing with mappings.

I had a small index, called "audit_v1", which has 43,754 documents. I created a second index called "audit_v2" and did a scan & scroll, bulk-creating the 43,754 documents 500 at a time into the new index. I've done this 5 times now, and every single time I'm seeing fewer records showing up in the audit_v2 index, and it's the same (lower) number every single time. This is despite the _bulk API not reporting any errors in the response. From what I can tell all the documents should be there, but apparently they aren't.

Is this a legit bug, or is it possible I'm misunderstanding exactly what [index]/_count returns?

Sorry for the trivial question, but I can't find a clear answer online. I'm using v1.5.0.

Thanks

@clintongormley

Hi @ryanbaldwin

Could you tell us more about your mappings, and exactly how you do the bulk and scan/scroll? Also, in the scan/scroll responses, could you check for shard failures, and check your logs to see if any exceptions are reported?

See #11419 (comment) for a similar issue.

Also, could you give us the output of $JAVA_HOME/bin/java -version

@clintongormley clintongormley added :Search/Search, :Bulk, and feedback_needed labels May 31, 2015
@s1monw
Contributor

s1monw commented May 31, 2015

Are you calling refresh before you do the count call? How many docs are you missing, and how is the index created? Can you provide more info?
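
(For reference, an explicit refresh before counting looks roughly like this. A minimal sketch in Python with the requests library; the host and index name are placeholders, not the reporter's actual setup.)

import requests

base = "http://localhost:9200"   # placeholder node address

# Force a refresh so recently indexed documents become visible to search/_count.
requests.post(f"{base}/audit_v2/_refresh").raise_for_status()

# Now the count should reflect everything that was bulk-indexed.
count = requests.get(f"{base}/audit_v2/_count").json()["count"]
print(count)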

@ryanbaldwin
Author

Hey all. I'm not at home right now but will provide a detailed explanation when I'm able to. Maybe later tonight, or tomorrow at the latest. For now I'll go by memory while fat-thumbing on my phone.

High level answers: 

  • don't know what version of Java I'm using off the top of my head. I'm using the Elasticsearch Docker repo, which I think uses Java 7.
  • don't believe there were any shard failures. The bulk API didn't report any errors on any of the create responses, though I haven't checked the ES logs. Marvel is reporting all shards as good.
  • didn't call refresh (though that's a great suggestion) but I wasn't querying _count immediately afterwards either. My understanding is that by default everything is refreshed about every second. Even if I let the index sit for a bit (say, 10 minutes), count would still report fewer docs than the original index.
  • The number forever rested at the lower number. The original index had 44,754. When I migrated it with a scan size of 200 (x 5 shards = 1,000 docs per bulk post), the migrated index was 44,709. The odd thing is that if I did this multiple times, using that same scan size, it was ALWAYS 44,709. I decided to drop the scan size to 100, thinking perhaps I was overloading it or something, and migrated again. This time the count dropped to 43,665 (I think). Now that I'm looking at it, this seems like a pattern. It appears as though I'm missing about 1 document for every scan/bulk cycle.
  • the scan/bulk is done using a simple app I wrote in Clojure. In a nutshell (a rough sketch of this loop follows below):
  1. Scan with a scan size of x on the old index, using match_all.
  2. Using the scroll_id returned by 1, hit the scroll API (/_search/scroll?). I then build a "create" request for each document returned by the scroll, interleaving the "create" action with each doc I'm migrating. The "create" contains the target index, type, and existing ID for the respective doc, and the batch is posted to _bulk.
  3. After the bulk call I repeat step 2 using the scroll_id returned by the previous scroll call. Wash, rinse, repeat until the scroll call returns 0 hits.

Like I said, it seems as though I'm missing 1 doc for each scan/bulk cycle. Perhaps it's something in my script, but after specifying the scan size in the original scan call, I don't rely on the number ever again. I simply iterate over every document.
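
Roughly, the loop described above looks like this (a hedged Python sketch with the requests library, not the actual Clojure app; the host, index names, type name, and scan size are placeholders, and it assumes the 1.x scan/scroll and bulk APIs used in this thread):

import json
import requests

base = "http://localhost:9200"
src, dest, doc_type = "audit_v1", "audit_v2", "event"   # "event" is a made-up type name

# 1. Open a scan cursor on the old index, using match_all.
r = requests.post(
    f"{base}/{src}/_search?search_type=scan&scroll=1m",
    json={"query": {"match_all": {}}, "size": 200},
)
scroll_id = r.json()["_scroll_id"]

while True:
    # 2. Fetch the next tranche of hits for this scroll_id.
    r = requests.post(f"{base}/_search/scroll?scroll=1m", data=scroll_id)
    page = r.json()
    hits = page["hits"]["hits"]
    if not hits:
        break  # 3. scroll returned 0 hits: done
    scroll_id = page["_scroll_id"]

    # Interleave a "create" action line with each document's source,
    # then post the batch to _bulk.
    lines = []
    for hit in hits:
        lines.append(json.dumps({"create": {"_index": dest, "_type": doc_type, "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    body = "\n".join(lines) + "\n"   # the bulk body must end with a newline
    requests.post(f"{base}/_bulk", data=body)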

I can provide more details later, such as an excerpt of the bulk calls, etc. 

For now: thoughts?

  • ryan.


@bleskes
Contributor

bleskes commented May 31, 2015

Thx Ryan for the details. Quick question - do you use parent & child documents or custom routing?

@ryanbaldwin
Author

Negative. Setup is pretty stock: default 5 shards + 1 replica, and however they get routed is how they get routed. That said, I AM using a dynamic mapping template on the target index, but it's mostly just setting 99% of the incoming string values to not_analyzed, since this is explicit audit data and not something that really requires full-text search.
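
(For context, a dynamic template of that kind typically looks something like the following. A generic sketch via Python/requests, not the actual mapping from this issue; the index name, template name, and host are placeholders, using the 1.x string/not_analyzed syntax.)

import requests

# Create the target index with a dynamic template that maps every incoming
# string field as not_analyzed (exact values) instead of analyzed full text.
requests.put("http://localhost:9200/audit_v2", json={
    "mappings": {
        "_default_": {
            "dynamic_templates": [{
                "strings_not_analyzed": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "string", "index": "not_analyzed"},
                }
            }]
        }
    },
})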

As a side note, here's some possible useful information about the topology:

I have 2 ES servers, each in their own Docker container, each configured identically, and each running on the same host. Each ES server has its own persistent logs/data volumes on the host (i.e. they are not sharing the same logs/data directories). Sitting in front is an nginx instance doing simple round-robin balancing between the two. The Clojure app does everything through nginx, same with the manual queries I run via Sense.

As far as I'm aware that topology with docker should roughly approximate (at a minimum) what 2 separate instances on two separate hosts in a network should look like. 

  • ryan.


@ryanbaldwin
Author

Also - no parent/child docs. Just 45k documents, each one an audit event, and each one completely independent.

  • ryan.


@clintongormley

Hi @ryanbaldwin

  • The number forever rested at the lower number. The original index had 44,754. When I migrated it with a scan size of 200 (x 5 shards = 1,000 docs per bulk post), the migrated index was 44,709. The odd thing is that if I did this multiple times, using that same scan size, it was ALWAYS 44,709. I decided to drop the scan size to 100, thinking perhaps I was overloading it or something, and migrated again. This time the count dropped to 43,665 (I think). Now that I'm looking at it, this seems like a pattern. It appears as though I'm missing about 1 document for every scan/bulk cycle.

This sounds a lot like a bug in your code, perhaps:

  • not calling refresh (or waiting) on the destination index before retrieving the count
  • an off-by-one error on every scroll request
  • not collecting the last tranche of hits from scroll
  • not performing the final bulk write

An easy way to test this would be to use a module known to work. If you're familiar with Perl, you could install the Search::Elasticsearch module (see https://metacpan.org/pod/Search::Elasticsearch) and run the following script (updating the index names for your local setup):

#!/usr/bin/env perl

use strict;
use warnings;
use Search::Elasticsearch;

my $src  = 'source_index';
my $dest = 'dest_index';
my $node = 'localhost:9200';

my $e = Search::Elasticsearch->new( nodes => $node );

# Start from a clean destination index (ignore the 404 if it doesn't exist yet).
$e->indices->delete( index => $dest, ignore => 404 );

# Scan/scroll through the source index and bulk-index every document into the destination.
$e->bulk_helper( index => $dest, verbose => 1 )
  ->reindex( source => { index => $src } );

# Refresh so the newly indexed documents are visible to _count.
$e->indices->refresh( index => $dest );

print "\n\nNew index count: " . $e->count( index => $dest )->{count} . "\n";

@ryanbaldwin
Author

Ugh. Clinton. You are indeed correct. I made the classic _bulk error: I did not append a "\n" to the final document body. Hence the one missing document per scan/bulk cycle.
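
(In other words, the bulk body is newline-delimited JSON and the final line must end with "\n" as well; without it, the last document in each request can be silently dropped, which matches the one-missing-per-cycle pattern above. A minimal illustration with placeholder values:)

action = '{"create": {"_index": "audit_v2", "_type": "event", "_id": "1"}}'
source = '{"field": "value"}'

# Wrong: no trailing newline after the last line, so the final document may be ignored.
bad_body = action + "\n" + source

# Right: every line, including the last, ends with a newline.
good_body = action + "\n" + source + "\n"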

Very sorry, but thank you for your help.

Also - Clinton - I must congratulate you on the Elasticsearch: The Definitive Guide book. This is, by far, the best tech book I've read in over a decade. Extremely easy to understand, an excellent voice, and absolute gold on every page (including the "don't forget to put a \n after the last document when using the bulk API!", which I obviously, promptly, forgot). Huge kudos to you and Zach.

Thanks for your help, and again, my apologies for the false alarm.

@clintongormley

Kind words, @ryanbaldwin - thank you :) /cc @polyfractal

@lcawl lcawl added :Distributed/CRUD and removed :Bulk labels Feb 13, 2018
@sebastialonso

I'm experiencing the same issue, but I'm using the Reindex API. And there's an additional caveat: testing the reindex call in my local Docker environment never misses a document, but doing this in the Docker Swarm architecture for our development and beta environments loses all the data.

I'm using Elixir and its Tirexs client to communicate with Elasticsearch.

Let me show the very basic tests I'm trying:

  • Fill the index with a few documents (really few, like 4)
  • Create a new "tmp" index with the modified mappings
  • Reindex from the original index to the "tmp" index
  • Delete the original index
  • Create a 'new' original index (same name as the original one) with the same mappings as the "tmp" index
  • Reindex from the "tmp" index to the 'new' original index
  • Delete the temporary index

Pretty naive and straightforward. This is the feedback I get when running this procedure:

iex(2)> Document.Mappings.apply_mappings_changes()
"1) Building tmp index"
{:ok, 200, %{acknowledged: true, index: "tmp", shards_acknowledged: true}}
"2) Reindexing to tmp"
{:ok, 200,
 %{batches: 1, created: 4, deleted: 0, failures: [], noops: 0,
   requests_per_second: -1.0, retries: %{bulk: 0, search: 0},
   throttled_millis: 0, throttled_until_millis: 0, timed_out: false, took: 149,
   total: 4, updated: 0, version_conflicts: 0}}
"3) Deleting original index"
{:ok, 200, %{acknowledged: true}}
"4) Building new version of orignal elastic_index"
{:ok, 200, %{acknowledged: true, index: "patients", shards_acknowledged: true}}
"5) Reindexing to original elastic_index"
{:ok, 200,
 %{batches: 0, created: 0, deleted: 0, failures: [], noops: 0,
   requests_per_second: -1.0, retries: %{bulk: 0, search: 0},
   throttled_millis: 0, throttled_until_millis: 0, timed_out: false, took: 1,
   total: 0, updated: 0, version_conflicts: 0}}
"6) Deleting temporal index"
{:ok, 200, %{acknowledged: true}}

Look at the difference in the output for points 2) and 5). Something prevented my 4 documents from being reindexed.
If anybody would like to take a look at the code, let me know.

@clintongormley

@sebastialonso Please ask questions like these in the forum. The most likely thing is that your temp index hadn't refreshed before you started the second reindex, so no documents were visible to search.
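
(Roughly, that fix looks like the following. A minimal Python sketch using the index names from the output above; the host is a placeholder, and it assumes the _refresh and _reindex APIs of the version in use.)

import requests

base = "http://localhost:9200"

# ... after reindexing into "tmp" and recreating the "patients" index ...

# Refresh "tmp" so its documents are visible to the search that reindex runs.
requests.post(f"{base}/tmp/_refresh").raise_for_status()

# Now the second reindex will actually see the documents in "tmp".
requests.post(f"{base}/_reindex", json={
    "source": {"index": "tmp"},
    "dest": {"index": "patients"},
}).raise_for_status()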

@AleksandarTokarev

AleksandarTokarev commented Aug 26, 2020

We have around 700k records, and I managed to fix this by adding timeouts of 10 seconds in 2 places:

  1. Delete the original elastic_index
  2. Create the original elastic_index
  3. 10-second timeout with await new Promise(resolve => setTimeout(resolve, 10000))
  4. Run the reindex from the temp index to the original index
  5. 10-second timeout with await new Promise(resolve => setTimeout(resolve, 10000))
  6. Delete the temp index

Without the timeouts, in our case it was always missing a few thousand records.

@simlevesque

This still happens on 8.3.1.
