Delete by query should not silently refresh index #3593

Closed
uschindler opened this issue Aug 28, 2013 · 4 comments

@uschindler
Contributor

commented Aug 28, 2013

Hi, this issue caused a lot of trouble because it was not clear why it happened. I have some index updates that use a (quite common) approach:

I have to update a batch of documents that share a higher-level group key (not the uid), like:

doc1: { groupKey: 'foo', _id: 'bar1' }
doc2: { groupKey: 'foo', _id: 'bar2' }
doc3: { groupKey: 'foo', _id: 'bar3' }

The code that updates this group of documents does not know the real _id of the documents already in the index (it only knows that the whole group is being updated), so it first deletes all documents by using deleteByQuery on the group key. After that, it reindexes all documents in the group (with possibly different new _id values).

If you don't disable index refreshing, the whole group can disappear for a short time and then reappear. To make the reindex of the whole group "atomic", you would disable index refreshing beforehand and re-enable it afterwards (or refresh only manually, which is what I do for this index in any case).
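The intended sequence can be sketched as an ordered list of REST calls. This is a minimal sketch, not code from the project: the helper name `group_reindex_plan` and the index name are hypothetical, the endpoint paths follow the current Elasticsearch REST API (the delete-by-query endpoint looked different when this issue was filed), and the `"1s"` value used to re-enable refreshing is an assumption about the desired interval.

```python
import json


def group_reindex_plan(index, group_key, docs):
    """Return the ordered (method, path, body) calls for a group reindex.

    Hypothetical helper: it only builds the requests, it does not send them.
    """
    bulk_lines = []
    for doc in docs:
        # Bulk action line, followed by the document source on the next line.
        bulk_lines.append(json.dumps({"index": {"_index": index, "_id": doc["_id"]}}))
        source = {k: v for k, v in doc.items() if k != "_id"}
        bulk_lines.append(json.dumps(source))

    return [
        # 1. Stop periodic refreshes so searches keep seeing the old group.
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": "-1"}}),
        # 2. Delete every document of the group by its group key.
        ("POST", f"/{index}/_delete_by_query",
         {"query": {"term": {"groupKey": group_key}}}),
        # 3. Index the new version of the group (bulk body is NDJSON).
        ("POST", "/_bulk", "\n".join(bulk_lines) + "\n"),
        # 4. Make deletes and new documents visible in one step.
        ("POST", f"/{index}/_refresh", None),
        # 5. Re-enable periodic refreshes (assumed interval).
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": "1s"}}),
    ]
```

The sequence only behaves "atomically" if step 2 does not trigger a refresh of its own, which is exactly what this issue is about.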

Unfortunately, deleteByQuery forcefully refreshes the index. This is hard to understand because it's not documented. There is just a comment in the code saying that the refresh is needed, although it's heavy, because when executing a Lucene IndexWriter deleteByQuery, Elasticsearch does not know which documents were actually deleted, so its internal tracking does not work (it cannot maintain version consistency, ...).

I discussed this with Martijn on IRC (not even he was aware that deleteByQuery does not work with refreshing disabled). He suggested that the query could instead be executed inside Elasticsearch itself, which would then issue a bulk of deletes by _uid (this is also a possible workaround in our case if the number of deletes is small).
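On the client side, that workaround amounts to running the query, collecting the matching ids, and sending them as a bulk of delete actions. The sketch below only builds the request body and assumes the standard search response shape (`hits.hits[]._id`) and the bulk NDJSON format; the function names and the index name are hypothetical, and older versions may also require a `_type` in each action.

```python
import json


def ids_from_search_response(response):
    """Extract document ids from a search response (hits.hits[]._id)."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def build_bulk_delete_body(index, ids):
    """Build an NDJSON bulk body with one delete action per id.

    Every line, including the last one, ends with a newline, as the
    bulk format requires.
    """
    lines = [json.dumps({"delete": {"_index": index, "_id": doc_id}}) for doc_id in ids]
    return "".join(line + "\n" for line in lines)
```

Note that the search step only finds documents that are visible to the current searcher, so documents indexed since the last refresh would be missed by this approach.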

In my opinion, the better variant would be to do it as Apache Solr does. Solr keeps two different IndexReaders open: one for searching the index (this one is refreshed periodically), and a second NRT reader on the IndexWriter that is used to update internal data structures after the IndexWriter has written changes. So the ES internal data should be updated through a new NRT reader, not the one used for searching.
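The idea of two independent readers can be illustrated with a toy in-memory model. This is not Lucene, Solr, or Elasticsearch code; the class `ToyIndex` and all of its methods are hypothetical and only show that refreshing an internal view for bookkeeping does not have to change what searches see.

```python
class ToyIndex:
    """Toy model: one writer state plus two independent point-in-time views."""

    def __init__(self):
        self._live = {}           # writer state: _id -> document
        self._search_view = {}    # snapshot used by searches
        self._internal_view = {}  # snapshot used for internal bookkeeping

    def index(self, doc_id, doc):
        self._live[doc_id] = doc

    def refresh_internal(self):
        # Refresh only the bookkeeping snapshot.
        self._internal_view = dict(self._live)

    def refresh_search(self):
        # Refresh only the snapshot that searches use.
        self._search_view = dict(self._live)

    def delete_by_query(self, field, value):
        # Use the internal view to learn which ids match, so the search
        # view does not need to be refreshed as a side effect.
        self.refresh_internal()
        deleted = [i for i, d in self._internal_view.items() if d.get(field) == value]
        for doc_id in deleted:
            del self._live[doc_id]
        return deleted

    def search(self, field, value):
        return sorted(i for i, d in self._search_view.items() if d.get(field) == value)
```

In this model, a group can be deleted and reindexed while searches keep returning the old group until `refresh_search()` is called explicitly.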

@ghost ghost assigned martijnvg Aug 30, 2013

@martijnvg

Member

commented Aug 30, 2013

Thanks for writing this down! Patches welcome :)

@govindm


commented Dec 4, 2014

Any update on this issue? We are facing a similar problem.

@clintongormley

Member

commented Dec 31, 2014

Depends on #7052

@mikemccand

Contributor

commented Jul 13, 2015

Delete-by-query (DBQ) has been moved to a plugin in ES 2.0.

@mikemccand mikemccand closed this Jul 13, 2015

s1monw added a commit to s1monw/elasticsearch that referenced this issue Oct 11, 2017

Use separate searchers for "search visibility" vs "move indexing buffer to disk"

Today, when ES detects it's using too much heap vs the configured indexing
buffer (default 10% of JVM heap) it opens a new searcher to force Lucene to move
the bytes to disk, clear version map, etc.

But this has the unexpected side effect of making newly indexed/deleted
documents visible to future searches, which is not nice for users who are trying
to prevent that, e.g. elastic#3593.

This is also an indirect spinoff from elastic#26802, where we potentially pay a big
price for rebuilding caches etc. when updates / realtime-get are used. We are
refreshing the internal reader for realtime gets, which causes, for instance,
global ords to be rebuilt. I think we can gain quite a bit if we use a reader
that is only used for GETs and not for searches etc. That way we can also solve
problems of searchers being refreshed unexpectedly aside from replica recovery /
relocation.

Closes elastic#15768
Closes elastic#26912

s1monw added a commit that referenced this issue Oct 12, 2017

Use separate searchers for "search visibility" vs "move indexing buffer to disk" (#26972)

s1monw added a commit that referenced this issue Oct 13, 2017

Use separate searchers for "search visibility" vs "move indexing buffer to disk" (#26972)
