Watcher: getting painless script errors having null in email admin watches #43184
Comments
Pinging @elastic/es-core-features
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue on Aug 7, 2019
data for this cluster in the last 2 minutes, a runtime error in the Watcher transform can occur [1]. This could be due to some slowdown in getting the .monitoring-es data from the source cluster to the monitoring cluster. The Watch condition can pass if there are unresolved alerts, but the transform assumes there is a cluster state (green, yellow, red) and will throw an error while trying to read it. This commit adds an "unknown" state to the possible cluster states. Based on the existing logic, the unknown state is effectively ignored, since only the email action uses it, and in this scenario the email action will not fire because the alert is neither new nor resolved.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```
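For illustration, a minimal Painless sketch of the kind of guard this commit describes: falling back to an assumed "unknown" state when no monitoring document is found, instead of indexing into an empty hits array. The `ctx.payload.check` field path follows the script_stack quoted in the later commit messages; the actual watch source may differ.

```
// Sketch only: read the cluster state defensively instead of assuming a hit exists.
// Field names are taken from the error output; the shipped watch may differ.
def hits = ctx.payload.check.hits.hits;
def state = hits.size() > 0
    ? hits[0]._source.cluster_state.status  // green, yellow, or red when a monitoring doc arrived
    : 'unknown';                            // fallback state added by this commit's approach
ctx.vars.cluster_state = state;
```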
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue on Aug 7, 2019
data for this cluster in the last 2 minutes, a runtime error in the Watcher transform can occur [1]. This could be due to some slowdown in getting the .monitoring-es data from the source cluster to the monitoring cluster. The Watch condition can pass if there are unresolved alerts, but the transform assumes there is a cluster state (green, yellow, red) and can throw an error while trying to read it. This commit prevents the condition from passing if the cluster state was not found.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```

From .watcher-history:
```
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
  "java.util.ArrayList.rangeCheck(ArrayList.java:657)",
  "java.util.ArrayList.get(ArrayList.java:433)",
  "state = ctx.payload.check.hits.hits[0]._source.cluster_state.status;",
  "        ^---- HERE"
],
```
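A minimal Painless sketch of a condition guard in the spirit of this commit, failing the condition so the transform never runs when no cluster state document came back. The field path follows the script_stack above; the real default watch source may differ.

```
// Sketch only: fail the condition when the monitoring document carrying the
// cluster state is missing, so the transform never reads hits[0].
def hits = ctx.payload.check.hits.hits;
return hits.size() > 0 && hits[0]._source.cluster_state != null;
```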
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue on Aug 7, 2019
If a cluster sending monitoring data is unhealthy, triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and the resulting behavior is actually correct, in part because of the exception. Simply fixing the exception introduces incorrect behavior: once the Watch no longer errors in this case, the alert is incorrectly marked as "resolved". The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that follows. Part a) is as easy as checking the size of the array before accessing it. Part b) is a bit more intrusive. Note that the UI depends on the success/met state of each condition to determine "OK" or "FIRING". In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the Watch should keep "FIRING" until it hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the alert is neither "new" nor "resolved", we do not want to keep sending an email (that would not be passive either), and without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep the Watch "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy, specifically when there is an unresolved alert and the cluster state can not be found. This action logs at debug level, so it should not be very noticeable. It serves as an 'anchor' for the UI, keeping the Watch in a "FIRING" status until the alert is resolved. This allows a scenario where a cluster starts firing, then goes completely silent forever, and the Watch stays "FIRING" forever. This is an edge case that already exists in some scenarios and requires manual intervention to remove that Watch.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```

From .watcher-history:
```
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
  "java.util.ArrayList.rangeCheck(ArrayList.java:657)",
  "java.util.ArrayList.get(ArrayList.java:433)",
  "state = ctx.payload.check.hits.hits[0]._source.cluster_state.status;",
  "        ^---- HERE"
],
```
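For part (b), a hedged Painless sketch of what the condition on such a debug-level logging action might look like. The `ctx.vars.alert_resolved` and `ctx.vars.state_found` names are hypothetical, invented here for illustration; the shipped watch uses its own variable names.

```
// Sketch only: condition for the 'anchor' logging action described above.
// It fires when an alert is still unresolved but no cluster state could be read,
// keeping the UI in "FIRING" without emailing or resolving the alert.
// Both ctx.vars names are hypothetical placeholders.
return ctx.vars.alert_resolved == false && ctx.vars.state_found == false;
```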
jakelandis added a commit that referenced this issue on Oct 8, 2019
If a cluster sending monitoring data is unhealthy, triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and the resulting behavior is actually correct, in part because of the exception. Simply fixing the exception introduces incorrect behavior: once the Watch no longer errors in this case, the alert is incorrectly marked as "resolved". The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that follows. Part a) is as easy as checking the size of the array before accessing it. Part b) is a bit more intrusive. Note that the UI depends on the success/met state of each condition to determine "OK" or "FIRING". In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the Watch should keep "FIRING" until it hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the alert is neither "new" nor "resolved", we do not want to keep sending an email (that would not be passive either), and without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep the Watch "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy, specifically when there is an unresolved alert and the cluster state can not be found. This action logs at debug level, so it should not be very noticeable. It serves as an 'anchor' for the UI, keeping the Watch in a "FIRING" status until the alert is resolved. This allows a scenario where a cluster starts firing, then goes completely silent forever, and the Watch stays "FIRING" forever. This is an edge case that already exists in some scenarios and requires manual intervention to remove that Watch. This change also uses a template-like method to populate version_created for the default monitoring watches. The version is set to 7.5 since that is where this change is first introduced. Fixes #43184
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue on Oct 8, 2019
…tic#45308) If a cluster sending monitoring data is unhealthy, triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and the resulting behavior is actually correct, in part because of the exception. Simply fixing the exception introduces incorrect behavior: once the Watch no longer errors in this case, the alert is incorrectly marked as "resolved". The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that follows. Part a) is as easy as checking the size of the array before accessing it. Part b) is a bit more intrusive. Note that the UI depends on the success/met state of each condition to determine "OK" or "FIRING". In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the Watch should keep "FIRING" until it hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the alert is neither "new" nor "resolved", we do not want to keep sending an email (that would not be passive either), and without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep the Watch "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy, specifically when there is an unresolved alert and the cluster state can not be found. This action logs at debug level, so it should not be very noticeable. It serves as an 'anchor' for the UI, keeping the Watch in a "FIRING" status until the alert is resolved. This allows a scenario where a cluster starts firing, then goes completely silent forever, and the Watch stays "FIRING" forever. This is an edge case that already exists in some scenarios and requires manual intervention to remove that Watch. This change also uses a template-like method to populate version_created for the default monitoring watches. The version is set to 7.5 since that is where this change is first introduced. Fixes elastic#43184
jakelandis added a commit that referenced this issue on Oct 9, 2019
If a cluster sending monitoring data is unhealthy, triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and the resulting behavior is actually correct, in part because of the exception. Simply fixing the exception introduces incorrect behavior: once the Watch no longer errors in this case, the alert is incorrectly marked as "resolved". The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that follows. Part a) is as easy as checking the size of the array before accessing it. Part b) is a bit more intrusive. Note that the UI depends on the success/met state of each condition to determine "OK" or "FIRING". In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the Watch should keep "FIRING" until it hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the alert is neither "new" nor "resolved", we do not want to keep sending an email (that would not be passive either), and without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep the Watch "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy, specifically when there is an unresolved alert and the cluster state can not be found. This action logs at debug level, so it should not be very noticeable. It serves as an 'anchor' for the UI, keeping the Watch in a "FIRING" status until the alert is resolved. This allows a scenario where a cluster starts firing, then goes completely silent forever, and the Watch stays "FIRING" forever. This is an edge case that already exists in some scenarios and requires manual intervention to remove that Watch. This change also uses a template-like method to populate version_created for the default monitoring watches. The version is set to 7.5 since that is where this change is first introduced. Fixes #43184
Elasticsearch version (bin/elasticsearch --version): 7.1.1
Plugins installed: no
JVM version (java -version): on Elastic Cloud
OS version (uname -a if on a Unix-like system): on Elastic Cloud
Description of the problem including expected versus actual behavior:
I thought this had been fixed in #32923, but I noticed a cluster running 7.1.1 is still getting the following error:
Steps to reproduce:
The admin email has not been set up, and only the default system watches are installed.