Delete By Query under heavy indexing load causes OOM errors #6025

Closed
xyu opened this issue May 2, 2014 · 6 comments

@xyu
Contributor

commented May 2, 2014

Hello,

We are running an ES 1.1.1 cluster, and when we bulk index into it under a high write load we have found that running Delete By Query calls can trigger OOM errors. The symptoms of the problem appear to be the following:

  1. Index into the cluster with a heavy write load; this causes merges to pile up. During the pileup the merge thread pool queue goes up and stays at ~120 per node. (Maybe related to #5779?)
  2. Merges are no longer able to keep up with the new segments created by indexing operations, and the number of segments per node shoots up. (We have seen over 30k segments per node.)
  3. Long GC cycles are triggered which may or may not cause the node to drop out of the cluster.
  4. When a Delete By Query call is issued, an internal refresh is triggered (#3593), which fails with an out of memory error and causes the node to be removed from the cluster for not responding. (See gist for logs. In this example es13.iad sends a call to es13.sat, which experiences an OOM, causing es13.iad to remove it from the cluster.)

Please let me know if there is any other info I can provide that may be helpful.

@xyu

Contributor Author

commented May 12, 2014

It appears that we are experiencing segment explosion during these times and are seeing queues build up in the merge thread pool. For now we are working around this in our indexing jobs by backing off on bulk indexing when merges fall behind. It looks like this issue is also being worked on in #6066.
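
The back-off workaround described above could look roughly like the following. This is only a minimal sketch, not the actual indexing job: `bulk_index_with_backoff`, `send_bulk`, `merge_backlog`, and the default thresholds are all hypothetical names and values, and how the merge backlog is measured (for example from node stats) is left to the caller.

```python
import time
from typing import Callable, Iterable, List


def bulk_index_with_backoff(
    batches: Iterable[List[dict]],
    send_bulk: Callable[[List[dict]], None],   # hypothetical: submits one bulk request
    merge_backlog: Callable[[], int],          # hypothetical: current merge queue size
    max_backlog: int = 50,                     # assumed threshold, tune per cluster
    pause_seconds: float = 5.0,
    max_waits: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send each batch, pausing while merges are behind.

    Returns the total number of pauses taken across all batches.
    """
    pauses = 0
    for batch in batches:
        waits = 0
        # Hold this batch back until the merge backlog drops to the threshold.
        while merge_backlog() > max_backlog:
            if waits >= max_waits:
                raise RuntimeError("merges did not catch up; giving up")
            sleep(pause_seconds)
            waits += 1
            pauses += 1
        send_bulk(batch)
    return pauses
```

Checking the backlog before every batch keeps the writer from adding new segments while the merge queue is already saturated, at the cost of lower indexing throughput.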

@clintongormley

Member

commented Jul 28, 2014

Fixed by #6066

@mikemccand

Contributor

commented Mar 4, 2015

Alas, deleteByQuery is not throttled when merges are falling behind.

@mikemccand mikemccand reopened this Mar 4, 2015

@mikemccand mikemccand self-assigned this Mar 4, 2015

@mikemccand

Contributor

commented Mar 4, 2015

I'll make this operation throttled like we do for index/create ops.

However I'm doubtful this is "enough" to always prevent segment explosion, i.e. #7052 is a better (trickier) long term solution.

@s1monw

Contributor

commented May 28, 2015

@mikemccand can we close this?

@mikemccand

Contributor

commented May 28, 2015

Yeah, I'll close it ... I think you can still easily provoke an OOME on 1.6, but in 2.0 we are switching to a plugin that runs a scan/scroll query and then bulk deletes the resulting IDs, which does fix it.
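
For illustration only, the scan/scroll-then-bulk-delete approach can be sketched as below. This is not the plugin's code: `delete_by_scroll`, `pages`, and `send_bulk` are hypothetical names, with `pages` standing in for the hit lists returned by successive scroll requests and `send_bulk` for whatever posts a body to the bulk endpoint. The action line follows the Bulk API's newline-delimited `delete` format, including `_type` as used by the ES versions discussed here.

```python
import json
from typing import Callable, Iterable, List


def build_bulk_delete_body(hits: List[dict]) -> str:
    """Build a newline-delimited bulk body with one delete action per hit."""
    lines = [
        json.dumps(
            {"delete": {"_index": h["_index"], "_type": h["_type"], "_id": h["_id"]}}
        )
        for h in hits
    ]
    # The bulk format requires a trailing newline after the last action.
    return "\n".join(lines) + "\n"


def delete_by_scroll(
    pages: Iterable[List[dict]],          # hypothetical: hits from each scroll page
    send_bulk: Callable[[str], None],     # hypothetical: posts one bulk body
    batch_size: int = 1000,               # assumed batch size
) -> int:
    """Submit delete actions for every scrolled hit in fixed-size batches.

    Returns the number of delete actions submitted (not confirmed deletions).
    """
    submitted = 0
    pending: List[dict] = []
    for page in pages:
        for hit in page:
            pending.append(hit)
            if len(pending) >= batch_size:
                send_bulk(build_bulk_delete_body(pending))
                submitted += len(pending)
                pending = []
    # Flush whatever is left after the last page.
    if pending:
        send_bulk(build_bulk_delete_body(pending))
        submitted += len(pending)
    return submitted
```

Because the deletes go through the bulk path as ordinary per-document operations, they are subject to the same throttling as other indexing traffic, which is the property the comment above relies on.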
