
FlushNotAllowedEngineException during optimize #3631

Closed
mattweber opened this Issue Sep 5, 2013 · 12 comments

@mattweber
Contributor

commented Sep 5, 2013

I have encountered the following exception a few times recently on 0.90.3 while doing some testing. It seems to happen sporadically on a random shard during an optimize; maybe I am hitting some weird race condition?

[2013-09-05 05:42:23,925][DEBUG][action.admin.indices.optimize] [samcro] [stackoverflow][0], node[i45bAah_SUa9zPPLmFYw4g], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.optimize.OptimizeRequest@5ef574db]
org.elasticsearch.index.engine.FlushNotAllowedEngineException: [stackoverflow][0] already flushing...
    at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:818)
    at org.elasticsearch.index.engine.robin.RobinEngine.optimize(RobinEngine.java:1014)
    at org.elasticsearch.index.shard.service.InternalIndexShard.optimize(InternalIndexShard.java:512)
    at org.elasticsearch.action.admin.indices.optimize.TransportOptimizeAction.shardOperation(TransportOptimizeAction.java:115)
    at org.elasticsearch.action.admin.indices.optimize.TransportOptimizeAction.shardOperation(TransportOptimizeAction.java:49)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:228)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:205)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:179)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
@javanna

Member

commented Sep 6, 2013

Hey @mattweber
can you share the test you ran to obtain the above stack trace? There's no way you could reproduce it consistently, right?

@ghost ghost assigned javanna Sep 6, 2013

@mattweber

Contributor Author

commented Sep 6, 2013

Hey @javanna,

I was playing around with bulk indexing the stackoverflow posts.xml data. I was repeatedly deleting and reindexing the data, tweaking settings and measuring the performance difference. I started to notice this with the following:

  • parent/child mapping described in this blog post
  • 5 shards, no replicas (I reproduced with fewer shards too)
  • refresh_interval set to -1
  • bulk size of 100
  • 8 feeding processes
  • translog flush settings doubled from the defaults
  • segments_per_tier of 30
  • optimize to 5 segments
  • increased index buffer (40% of a 6GB heap)
  • bulk threadpool fixed at size 24
  • no searching during feeding or optimize

Feeding would take ~45min for just under 11M docs. The blog post above has a python feeding script attached; I was basically using an updated version of that which uses the python multiprocessing module to split the work across 8 processes. Unfortunately I just left on vacation and forgot to commit those changes, so I can't provide that script until next week. Please let me know if you need any more information.
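
For reference, the per-index settings from the list above translate roughly to this (a sketch against the 0.90 settings API; the host and the use of the requests library are assumptions on my part, and node-level settings like the index buffer and bulk threadpool live in elasticsearch.yml instead):

    import json
    import requests

    ES = "http://localhost:9200"  # assumed host

    # Dynamic index settings applied before feeding.
    requests.put(ES + "/stackoverflow/_settings", data=json.dumps({
        "index.refresh_interval": "-1",                  # no refresh while feeding
        "index.merge.policy.segments_per_tier": 30,
        "index.translog.flush_threshold_ops": 10000,     # default 5000, doubled
        "index.translog.flush_threshold_size": "500mb",  # default 200mb, doubled
    }))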

Thanks!

@mattweber

Contributor Author

commented Sep 6, 2013

Oh, and es2graphite, which executes the cluster health, node stats, and node info REST calls, was running every 10 seconds.
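
So roughly this was running alongside the feed (a sketch of the polling, not es2graphite itself; host assumed, endpoint paths per the 0.90 REST API):

    import time
    import requests

    ES = "http://localhost:9200"  # assumed host

    # Poll the same endpoints es2graphite hits, every 10 seconds.
    while True:
        requests.get(ES + "/_cluster/health")
        requests.get(ES + "/_nodes/stats")
        requests.get(ES + "/_nodes")
        time.sleep(10)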

@mattweber

Contributor Author

commented Sep 16, 2013

@javanna

Ok, I can still reproduce this. Here is a gist of my feeding scripts, to be used with the posts.xml file downloaded from the torrent linked above: https://gist.github.com/mattweber/d969e9c91b45503ac452. Here is some info from the logs; maybe it has something to do with the fact that I am changing settings right before the optimize?

I am going to try to come up with an easier way to reproduce this so you don't need to download all the stackoverflow data. I will keep you updated. Thanks.

[2013-09-15 18:09:58,437][DEBUG][cluster.service          ] [samcro] cluster state updated, version [8], source [update-settings]
[2013-09-15 18:09:58,439][INFO ][index.merge.policy       ] [samcro] [stackoverflow][0] updating [segments_per_tier] from [30.0] to [10.0]
[2013-09-15 18:09:58,439][INFO ][index.shard.service      ] [samcro] [stackoverflow][0] updating refresh_interval from [-1] to [1s]
[2013-09-15 18:09:58,443][DEBUG][river.cluster            ] [samcro] processing [reroute_rivers_node_changed]: execute
[2013-09-15 18:09:58,444][DEBUG][river.cluster            ] [samcro] processing [reroute_rivers_node_changed]: no change in cluster_state
[2013-09-15 18:09:58,469][INFO ][index.translog           ] [samcro] [stackoverflow][0] updating flush_threshold_ops from [10000] to [5000]
[2013-09-15 18:09:58,469][INFO ][index.translog           ] [samcro] [stackoverflow][0] updating flush_threshold_size from [500mb] to [200mb]
[2013-09-15 18:09:58,470][INFO ][index.merge.policy       ] [samcro] [stackoverflow][1] updating [segments_per_tier] from [30.0] to [10.0]
[2013-09-15 18:09:58,470][INFO ][index.shard.service      ] [samcro] [stackoverflow][1] updating refresh_interval from [-1] to [1s]
[2013-09-15 18:09:58,470][INFO ][index.translog           ] [samcro] [stackoverflow][1] updating flush_threshold_ops from [10000] to [5000]
[2013-09-15 18:09:58,470][INFO ][index.translog           ] [samcro] [stackoverflow][1] updating flush_threshold_size from [500mb] to [200mb]
[2013-09-15 18:09:58,560][DEBUG][cluster.service          ] [samcro] processing [update-settings]: done applying updated cluster_state
[2013-09-15 18:09:58,785][DEBUG][action.admin.indices.optimize] [samcro] [stackoverflow][0], node[s9_7RSkTSUiNThZpuvLU6A], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.optimize.OptimizeRequest@598d2fb7]
org.elasticsearch.index.engine.FlushNotAllowedEngineException: [stackoverflow][0] already flushing...
    at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:818)
    at org.elasticsearch.index.engine.robin.RobinEngine.optimize(RobinEngine.java:1014)
    at org.elasticsearch.index.shard.service.InternalIndexShard.optimize(InternalIndexShard.java:512)
    at org.elasticsearch.action.admin.indices.optimize.TransportOptimizeAction.shardOperation(TransportOptimizeAction.java:115)
    at org.elasticsearch.action.admin.indices.optimize.TransportOptimizeAction.shardOperation(TransportOptimizeAction.java:49)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:228)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:205)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:179)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
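
The sequence in that log corresponds roughly to this (a sketch; host and the requests library are assumptions, setting values taken from the log lines above):

    import json
    import requests

    ES = "http://localhost:9200"  # assumed host

    # Revert the feeding settings (matches the update-settings log lines)...
    requests.put(ES + "/stackoverflow/_settings", data=json.dumps({
        "index.refresh_interval": "1s",
        "index.merge.policy.segments_per_tier": 10,
        "index.translog.flush_threshold_ops": 5000,
        "index.translog.flush_threshold_size": "200mb",
    }))

    # ...then optimize immediately. If the settings change kicks off a
    # translog flush that is still in flight, the optimize's implicit flush
    # fails with FlushNotAllowedEngineException.
    r = requests.post(ES + "/stackoverflow/_optimize",
                      params={"max_num_segments": 5})
    print(r.status_code, r.text)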

@kimchy

Member

commented Sep 16, 2013

This can happen during relocation; at that point, a flush is not allowed to execute. The optimize operation defaults to performing a flush, though you don't have to flush in order to optimize, in which case you can set the flush flag to false.
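
For example (a minimal sketch; host, index name, and the requests library are assumptions, with flush passed as an optimize request parameter):

    import requests

    # Optimize without the implicit flush by passing flush=false.
    requests.post("http://localhost:9200/stackoverflow/_optimize",
                  params={"max_num_segments": 5, "flush": "false"})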

@mattweber

Contributor Author

commented Sep 16, 2013

Hey @kimchy,

There are no relocations going on when this happens. It is a single-node cluster, and I have reproduced it on an index with a single shard. In fact, it happens more consistently when I have fewer shards: pretty much every time with 1 or 2 shards, less often once I get up to 3+. Maybe I am hitting some sort of race condition?

Thanks.

@mattweber

Contributor Author

commented Sep 16, 2013

I reproduced using 0.90.4 as well.

@kimchy

Member

commented Sep 16, 2013

we also periodically flush due to the transaction log (not just when relocating; apologies for not explaining in detail), and if you issue a flush while another flush is happening, we will throw the mentioned failure as well. We have an internal flag to wait for ongoing flushes; we can expose it through the API.
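
Until that flag is exposed, a client-side workaround is simply to retry (a sketch, not the eventual fix; host, index name, the requests library, and the error-detection heuristic are all assumptions):

    import time
    import requests

    def optimize_with_retry(url, attempts=5, delay=2.0):
        """Retry an optimize that collides with an ongoing flush."""
        for _ in range(attempts):
            r = requests.post(url, params={"max_num_segments": 5})
            # Broadcast actions report per-shard failures in the body.
            if "FlushNotAllowedEngineException" not in r.text:
                return r
            time.sleep(delay)  # give the ongoing flush time to finish
        return r

    optimize_with_retry("http://localhost:9200/stackoverflow/_optimize")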

@s1monw

Contributor

commented Sep 16, 2013

I am not sure if we should expose it, or if we should just set it to true (i.e. wait for ongoing flushes) on an optimize call. Folks are used to waiting on an optimize call, and if we fail it they might get confused and call optimize again. I think we should just wait by default?

@kimchy

Member

commented Sep 16, 2013

@s1monw agreed, I think it's a good default to have when calling optimize, will do it.

@s1monw

Contributor

commented Sep 16, 2013

+1

@mattweber

Contributor Author

commented Sep 16, 2013

Sounds good to me, thanks for the info!

@kimchy kimchy closed this in c8a72d9 Sep 16, 2013

kimchy added a commit that referenced this issue Sep 16, 2013

FlushNotAllowedEngineException during optimize
default to wait if a flush is ongoing when executed as part of an optimize request
closes #3631

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

FlushNotAllowedEngineException during optimize
default to wait if a flush is ongoing when executed as part of an optimize request
closes elastic#3631