java.lang.StackOverflowError for the entire cluster #24553
Comments
I'm assigning this to myself because I recognize the elements in the stack trace, but I'm not going to have a look until at least tomorrow morning. @moshe could you post a gist of the entire stack overflow? It is usually useful to have the root.
Hi @nik9000, thanks for your quick response.
…
So reproduced....
@moshe this stack overflow can happen when searching a very big regular expression or when using a regexp syntax that can lead to an explosion of states. Are you using …
… fails with a stack overflow error. We have a protection against the explosion of the number of states in a non-deterministic automaton, but for a deterministic one we don't check the size.
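As a rough illustration of the protection described here (a sketch, not code from the issue): Lucene caps the number of states created while determinizing a regexp's NFA and throws TooComplexToDeterminizeException when the cap is exceeded. The pattern below is a hypothetical example of such a state explosion; the class and variable names are mine:

```java
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;
import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;

public class RegexpStateExplosion {
    public static void main(String[] args) {
        // Classic exponential case: the NFA for "(a|b)*a(a|b){n}" needs about
        // 2^n states once determinized. n = 30 comfortably exceeds the default cap.
        StringBuilder pattern = new StringBuilder("(a|b)*a");
        for (int i = 0; i < 30; i++) {
            pattern.append("(a|b)");
        }
        try {
            // Determinization counts the states it creates and aborts at the
            // cap (10000 by default) instead of eating CPU and memory.
            new RegExp(pattern.toString())
                .toAutomaton(Operations.DEFAULT_MAX_DETERMINIZED_STATES);
        } catch (TooComplexToDeterminizeException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```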
One of our production clusters experienced the same issue recently because developers were abusing regex/fuzzy queries. Could Elasticsearch (or maybe Lucene) set a similar limit on deterministic automata, so that the cluster won't be entirely brought down? Thanks!
After more in-depth investigation, our particular issue was found to be caused by a prefix query on a very long query string. Looking at the source code, I see PrefixQuery does not put a limit on maxDeterminizedStates when instantiating an AutomatonQuery.
Would it be sensible to limit the prefix length and maxDeterminizedStates here?
Thanks for the investigation @xgwu. The automaton is already determinized, so the maxDeterminizedStates limit would not come into play here.
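To make the "already determinized" point concrete (again a sketch under my own assumptions, not code from the thread): PrefixQuery.toAutomaton builds roughly one state per prefix byte, so the result is deterministic by construction and a determinization cap never fires — but the automaton's depth grows with the prefix length, which is what a recursive traversal trips over. The 100k-character prefix is a hypothetical stand-in for the "very long query string":

```java
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automaton;

public class LongPrefixAutomaton {
    public static void main(String[] args) {
        // Hypothetical stand-in for the reported "very long query string".
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            prefix.append('a');
        }
        // The prefix automaton is essentially a straight chain of states, so it
        // is deterministic from the start; nothing gets determinized or capped.
        Automaton a = PrefixQuery.toAutomaton(new BytesRef(prefix.toString()));
        // A recursive walk over a chain this deep is what risks the stack.
        System.out.println("states: " + a.getNumStates());
    }
}
```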
I still think the fix is to put protections in Lucene to prevent the stack overflow, or to rewrite the method so it isn't recursive. I don't think we can allow any query to shoot the node like this. We could catch the StackOverflowError and try to recover, but I believe those errors are like OOMs: it is hard to be sure that you've fully recovered from them. If we investigate to the point where we are sure we have recovered, then we can just catch it and Lucene doesn't have to change.
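The non-recursive rewrite floated here could look roughly like the sketch below. This is purely illustrative — the adjacency-list transition representation is a made-up stand-in, not Lucene's packed transition arrays — but it shows the technique: a depth-first walk driven by an explicit heap-allocated stack, so traversal depth is bounded by heap size rather than by the thread stack.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.List;

final class IterativeDfs {
    // Visits every state reachable from `start`. `transitions.get(s)` lists
    // the target states of s; this adjacency-list shape is hypothetical.
    static void visitAll(List<int[]> transitions, int start) {
        BitSet seen = new BitSet(transitions.size());
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int state = stack.pop();
            if (seen.get(state)) {
                continue;
            }
            seen.set(state);
            // Pushing onto a heap-based deque replaces the recursive call, so
            // a million-state chain costs memory, not stack frames.
            for (int next : transitions.get(state)) {
                if (!seen.get(next)) {
                    stack.push(next);
                }
            }
        }
    }
}
```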
We filter long strings and do not use regex queries like those mentioned above. We have no idea which query produces this error, but it kills Elasticsearch several times a day. Any clue what kind of query shoots the node? It would be great if this bug could be fixed as soon as possible.
@hbrxa it sounds like you are suffering from a different bug. Please open a new issue with all of the relevant details.
+1 We had a similar issue (on v5.3.0), coming from Kibana users running a simple regex. The regex doesn't even have to actually run as a search in order to cause the StackOverflowError; just running _validate on such a query causes the same issue. To reproduce it, just type in Kibana's Discover:
…
This issue has been fixed in Lucene: |
Elasticsearch version: 5.2.1
Plugins installed:
- discovery-ec2
- repository-s3
- search-guard-5
- x-pack (xpack.security.enabled: false, xpack.monitoring.enabled: false, xpack.graph.enabled: false, xpack.watcher.enabled: false)
JVM version (java -version): 1.8.0_121
OS version (uname -a if on a Unix-like system): Linux ip-192-168-153-58 3.19.0-79-generic #87~14.04.1-Ubuntu SMP Wed Dec 21 18:12:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
I have no idea which query produced the error, but suddenly all 13 data nodes of the cluster got the same error:
…
and then an endless repetition of the same line:
…
More details: the field is a keyword field with a lowercase_normalizer applied.