Watcher: getting painless script errors having null in email admin watches #43184
Comments
Pinging @elastic/es-core-features
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue on Aug 7, 2019
data for this cluster in the last 2 minutes, a runtime error in the Watcher transform can occur [1]. This could be due to some slowdown in getting the .monitoring-es data from the source cluster to the monitoring cluster. The Watch condition can pass if there are unresolved alerts, but the transform assumes there is a cluster state (green, yellow, red) and will throw an error while trying to read it. This commit adds an "unknown" state to the possible cluster states. Based on the existing logic, the unknown state is effectively ignored, since only the email action uses it, and in this scenario the email action will not fire because the alert is neither new nor resolved.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```
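For illustration, a minimal Painless sketch of the kind of guard this commit describes: falling back to an assumed "unknown" state when no monitoring document is found, instead of indexing into an empty hits array. The `ctx.payload.check` field path follows the script_stack quoted in the later commit messages; the actual watch source may differ.

```
// Sketch only: read the cluster state defensively instead of assuming a hit exists.
// Field names are taken from the error output; the shipped watch may differ.
def hits = ctx.payload.check.hits.hits;
def state = hits.size() > 0
    ? hits[0]._source.cluster_state.status  // green, yellow, or red when a monitoring doc arrived
    : 'unknown';                            // fallback state added by this commit's approach
ctx.vars.cluster_state = state;
```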
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue on Aug 7, 2019
data for this cluster in the last 2 minutes, a runtime error in the Watcher transform can occur [1]. This could be due to some slowdown in getting the .monitoring-es data from the source cluster to the monitoring cluster. The Watch condition can pass if there are unresolved alerts, but the transform assumes there is a cluster state (green, yellow, red) and can throw an error while trying to read it. This commit prevents the condition from passing if the cluster state was not found.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```

From .watcher-history:
```
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
  "java.util.ArrayList.rangeCheck(ArrayList.java:657)",
  "java.util.ArrayList.get(ArrayList.java:433)",
  "state = ctx.payload.check.hits.hits[0]._source.cluster_state.status;",
  "        ^---- HERE"
],
```
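A minimal Painless sketch of a condition guard in the spirit of this commit, failing the condition so the transform never runs when no cluster state document came back. The field path follows the script_stack above; the real default watch source may differ.

```
// Sketch only: fail the condition when the monitoring document carrying the
// cluster state is missing, so the transform never reads hits[0].
def hits = ctx.payload.check.hits.hits;
return hits.size() > 0 && hits[0]._source.cluster_state != null;
```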
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue on Aug 7, 2019
If a cluster sending monitoring data is unhealthy, triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and the resulting behavior is actually correct, in part because of the exception. Simply fixing the exception introduces incorrect behavior: once the Watch no longer errors in this case, the alert is incorrectly marked as "resolved". The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that follows. Part a) is as easy as checking the size of the array before accessing it. Part b) is a bit more intrusive. Note that the UI depends on the success/met state of each condition to determine "OK" or "FIRING". In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the Watch should keep "FIRING" until it hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the alert is neither "new" nor "resolved", we do not want to keep sending an email (that would not be passive either), and without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep the Watch "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy, specifically when there is an unresolved alert and the cluster state can not be found. This action logs at debug level, so it should not be very noticeable. It serves as an 'anchor' for the UI, keeping the Watch in a "FIRING" status until the alert is resolved. This allows a scenario where a cluster starts firing, then goes completely silent forever, and the Watch stays "FIRING" forever. This is an edge case that already exists in some scenarios and requires manual intervention to remove that Watch.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```

From .watcher-history:
```
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
  "java.util.ArrayList.rangeCheck(ArrayList.java:657)",
  "java.util.ArrayList.get(ArrayList.java:433)",
  "state = ctx.payload.check.hits.hits[0]._source.cluster_state.status;",
  "        ^---- HERE"
],
```
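For part (b), a hedged Painless sketch of what the condition on such a debug-level logging action might look like. The `ctx.vars.alert_resolved` and `ctx.vars.state_found` names are hypothetical, invented here for illustration; the shipped watch uses its own variable names.

```
// Sketch only: condition for the 'anchor' logging action described above.
// It fires when an alert is still unresolved but no cluster state could be read,
// keeping the UI in "FIRING" without emailing or resolving the alert.
// Both ctx.vars names are hypothetical placeholders.
return ctx.vars.alert_resolved == false && ctx.vars.state_found == false;
```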
jakelandis added a commit that referenced this issue on Oct 8, 2019
If a cluster sending monitoring data is unhealthy, triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and the resulting behavior is actually correct, in part because of the exception. Simply fixing the exception introduces incorrect behavior: once the Watch no longer errors in this case, the alert is incorrectly marked as "resolved". The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that follows. Part a) is as easy as checking the size of the array before accessing it. Part b) is a bit more intrusive. Note that the UI depends on the success/met state of each condition to determine "OK" or "FIRING". In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the Watch should keep "FIRING" until it hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the alert is neither "new" nor "resolved", we do not want to keep sending an email (that would not be passive either), and without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep the Watch "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy, specifically when there is an unresolved alert and the cluster state can not be found. This action logs at debug level, so it should not be very noticeable. It serves as an 'anchor' for the UI, keeping the Watch in a "FIRING" status until the alert is resolved. This allows a scenario where a cluster starts firing, then goes completely silent forever, and the Watch stays "FIRING" forever. This is an edge case that already exists in some scenarios and requires manual intervention to remove that Watch. This change also uses a template-like method to populate version_created for the default monitoring watches. The version is set to 7.5 since that is where this change is first introduced. Fixes #43184
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue on Oct 8, 2019
…tic#45308) If a cluster sending monitoring data is unhealthy, triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and the resulting behavior is actually correct, in part because of the exception. Simply fixing the exception introduces incorrect behavior: once the Watch no longer errors in this case, the alert is incorrectly marked as "resolved". The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that follows. Part a) is as easy as checking the size of the array before accessing it. Part b) is a bit more intrusive. Note that the UI depends on the success/met state of each condition to determine "OK" or "FIRING". In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the Watch should keep "FIRING" until it hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the alert is neither "new" nor "resolved", we do not want to keep sending an email (that would not be passive either), and without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep the Watch "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy, specifically when there is an unresolved alert and the cluster state can not be found. This action logs at debug level, so it should not be very noticeable. It serves as an 'anchor' for the UI, keeping the Watch in a "FIRING" status until the alert is resolved. This allows a scenario where a cluster starts firing, then goes completely silent forever, and the Watch stays "FIRING" forever. This is an edge case that already exists in some scenarios and requires manual intervention to remove that Watch. This change also uses a template-like method to populate version_created for the default monitoring watches. The version is set to 7.5 since that is where this change is first introduced. Fixes elastic#43184
jakelandis added a commit that referenced this issue on Oct 9, 2019
If a cluster sending monitoring data is unhealthy, triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and the resulting behavior is actually correct, in part because of the exception. Simply fixing the exception introduces incorrect behavior: once the Watch no longer errors in this case, the alert is incorrectly marked as "resolved". The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that follows. Part a) is as easy as checking the size of the array before accessing it. Part b) is a bit more intrusive. Note that the UI depends on the success/met state of each condition to determine "OK" or "FIRING". In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the Watch should keep "FIRING" until it hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the alert is neither "new" nor "resolved", we do not want to keep sending an email (that would not be passive either), and without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep the Watch "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy, specifically when there is an unresolved alert and the cluster state can not be found. This action logs at debug level, so it should not be very noticeable. It serves as an 'anchor' for the UI, keeping the Watch in a "FIRING" status until the alert is resolved. This allows a scenario where a cluster starts firing, then goes completely silent forever, and the Watch stays "FIRING" forever. This is an edge case that already exists in some scenarios and requires manual intervention to remove that Watch. This change also uses a template-like method to populate version_created for the default monitoring watches. The version is set to 7.5 since that is where this change is first introduced. Fixes #43184
Elasticsearch version (bin/elasticsearch --version): 7.1.1
Plugins installed: no
JVM version (java -version): on Elastic Cloud
OS version (uname -a if on a Unix-like system): on Elastic Cloud
Description of the problem including expected versus actual behavior:
I thought this had been fixed in #32923, but I noticed a cluster running 7.1.1 is still getting the following error:
Steps to reproduce:
The admin email has not been set up, and only the default system watches are installed.