
Watcher: getting painless script errors having null in email admin watches #43184

Closed
Leaf-Lin opened this issue Jun 13, 2019 · 1 comment · Fixed by #45308

Comments

@Leaf-Lin (Contributor)

Elasticsearch version (bin/elasticsearch --version): 7.1.1
Plugins installed: none
JVM version (java -version): On Elastic Cloud

OS version (uname -a if on a Unix-like system): On Elastic Cloud

Description of the problem including expected versus actual behavior:
I thought this had been fixed by #32923, but I noticed a cluster running 7.1.1 is still getting the following error:

```
[instance-0000000006] failed to execute [script] transform for [h0rBJUPOSgeB-ZGRbrQ60A_elasticsearch_cluster_status_77a4d766-3da2-4b2b-92d9-451de304e9cf-2019-06-13T02:31:50.556Z]
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.execute(ExecutableScriptTransform.java:38) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.execute(ExecutableScriptTransform.java:23) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	at org.elasticsearch.xpack.watcher.execution.ExecutionService.executeInner(ExecutionService.java:505) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:309) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$5(ExecutionService.java:410) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:605) [x-pack-watcher-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.1.1.jar:7.1.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at java.util.ArrayList.rangeCheck(ArrayList.java:653) ~[?:1.8.0_144]
	at java.util.ArrayList.get(ArrayList.java:429) ~[?:1.8.0_144]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:429) ~[?:?]
	... 11 more
```

Steps to reproduce:

The admin email has not been set up, and only the default system watches are in place.
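
For reference, the script that blows up is the email_recipient transform shown in the stack trace above: it indexes into `ctx.payload.kibana_settings.hits.hits[0]` even when the search returned no hits. A minimal Painless sketch of the kind of guard that avoids this (illustrative only, not the script shipped with the eventual fix):

```
// Illustrative sketch only, not the default watch script or the actual fix:
// guard the hits list before reading index 0, since hits.hits can be empty
// and hits[0] then throws IndexOutOfBoundsException: Index: 0, Size: 0.
def hits = ctx.payload.kibana_settings.hits.hits;
ctx.vars.email_recipient = (hits.size() > 0
    && hits[0]._source.kibana_settings.xpack != null)
  ? hits[0]._source.kibana_settings.xpack.default_admin_email
  : null;
```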

@elasticmachine (Collaborator)

Pinging @elastic/es-core-features

jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Aug 7, 2019
data for this cluster in the last 2 minutes, a runtime error in the
Watcher transform can occur [1]. This could be due to some slowdown in
getting the .monitoring-es data from the source cluster to the monitoring cluster.

The Watch condition can pass if there are unresolved alerts, but the
transform assumes there is a cluster state (green, yellow, red) and will
throw an error while trying to read it.

This commit adds an "unknown" state to the possible cluster states.
Based on the existing logic, the unknown state is effectively ignored
in this scenario: only the email action uses it, and the email action
will not fire because the alert is neither new nor resolved.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```
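
A hedged sketch of the "unknown" fallback this draft describes; the field names follow the script excerpt in the .watcher-history output quoted later in this thread, and this is not the committed diff:

```
// Hedged sketch of the "unknown" cluster state fallback, not the committed
// change: fall back to "unknown" when no cluster_state document was found.
def checkHits = ctx.payload.check.hits.hits;
def state = checkHits.size() > 0
  ? checkHits[0]._source.cluster_state.status
  : "unknown";
```
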
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Aug 7, 2019
data for this cluster in the last 2 minutes, a runtime error in the
Watcher transform can occur [1]. This could be due to some slowdown in
getting the .monitoring-es data from the source cluster to the monitoring cluster.

The Watch condition can pass if there are unresolved alerts, but the
transform assumes there is a cluster state (green, yellow, red) and can
throw an error while trying to read it.

This commit prevents the condition from passing if the cluster state was
not found.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```
from .watcher-history:
```
     "type" : "script_exception",
                "reason" : "runtime error",
                "script_stack" : [
                  "java.util.ArrayList.rangeCheck(ArrayList.java:657)",
                  "java.util.ArrayList.get(ArrayList.java:433)",
                  "state = ctx.payload.check.hits.hits[0]._source.cluster_state.status;",
                  "                                   ^---- HERE"
                ],
```
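
A hedged sketch of the guard this draft describes, placed at the top of the watch condition so nothing below it dereferences an empty hits list (illustrative only, not the shipped default-watch condition):

```
// Illustrative sketch, not the shipped default-watch condition:
// fail the condition early when no cluster_state document came back.
def checkHits = ctx.payload.check.hits.hits;
if (checkHits.size() == 0) {
  return false;              // no data -> the condition does not pass
}
def state = checkHits[0]._source.cluster_state.status;
return state != "green";     // stand-in for the real status/alert checks
```
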
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Aug 7, 2019
If a cluster sending monitoring data is unhealthy and triggers an
alert, and then stops sending data, the following exception [1] can occur.

This exception stops the current Watch, and that behavior is actually
correct, in part due to the exception. Simply fixing the exception
introduces some incorrect behavior: now that the Watch does not
error in this case, it results in an incorrectly "resolved"
alert. The fix here has two parts: a) fix the exception, and b) fix the
resulting incorrect behavior.

a) fixing the exception is as easy as checking the size of the
array before accessing it.

b) Fixing the resulting incorrect behavior is a bit more intrusive.

Note: the UI depends on the success/met state for each condition
to determine an "OK" or "FIRING" status.

In this scenario, where an unhealthy cluster triggers an alert and
then goes silent, the Watch should keep "FIRING" until it hears back that
the cluster is green. To keep the Watch "FIRING", either the index
action or the email action needs to fire. Since the alert is neither
"new" nor "resolved", we do not want to keep sending
an email (that would not be passive either). Without completely changing
the logic of how an alert is resolved, allowing the index action to
take place would result in the alert being resolved. Since we cannot
keep the Watch "FIRING" via either the email or index action (we don't
want to resolve the alert, nor re-write the logic for alert resolution),
we introduce a third action: a logging action that WILL fire when
the cluster is unhealthy. Specifically, it will fire when there is an
unresolved alert and the cluster state cannot be found.
This logging action is logged at the debug level, so it should not be noticed much.
It serves as an 'anchor' for the UI to keep the state
in a "FIRING" status until the alert is resolved.

This presents a possible scenario where a cluster starts firing and
then goes completely silent forever; the Watch will then be "FIRING"
forever. This edge case already exists in some scenarios
and requires manual intervention to remove that Watch.

Fixes elastic#43184

[1]
```
org.elasticsearch.script.ScriptException: runtime error
	at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:94) ~[?:?]
	at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ct ...:1152) ~[?:?]
	at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:49) ~[x-pack-watcher-7.1.1.jar:7.1.1]
	...
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
```
from .watcher-history:
```
     "type" : "script_exception",
                "reason" : "runtime error",
                "script_stack" : [
                  "java.util.ArrayList.rangeCheck(ArrayList.java:657)",
                  "java.util.ArrayList.get(ArrayList.java:433)",
                  "state = ctx.payload.check.hits.hits[0]._source.cluster_state.status;",
                  "                                   ^---- HERE"
                ],
```
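
A hedged sketch of what the per-action condition for that extra logging action could look like; the alert and check payload names here are assumptions for illustration, not the shipped watch source:

```
// Illustrative sketch of a per-action condition for the debug-level logging
// action described above; the "alert" and "check" payload names are assumed.
// Fire only when an alert is still unresolved AND no cluster_state document
// was found, so the UI keeps showing "FIRING" without re-sending email or
// rewriting the alert-resolution logic.
boolean unresolvedAlert = ctx.payload.alert.hits.hits.size() > 0;
boolean clusterStateMissing = ctx.payload.check.hits.hits.size() == 0;
return unresolvedAlert && clusterStateMissing;
```
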
jakelandis added the >bug label Oct 6, 2019
jakelandis added a commit that referenced this issue Oct 8, 2019
This change also uses a template-like method to populate the
version_created for the default monitoring watches. The version is
set to 7.5 since that is where this change is first introduced.

Fixes #43184
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Oct 8, 2019
jakelandis added a commit that referenced this issue Oct 9, 2019