[ML] Ignore exceptions while opening job after SIGTERM to JVM #75850

droberts195 · 2021-07-29T16:45:22Z

We observed that some jobs failed during a rolling upgrade
in Elastic Cloud. This happened because steps of the job
open sequence failed with exceptions after core Elasticsearch
services shut down in response to the SIGTERM.

This change makes the persistent task executor for anomaly
detection jobs ignore exceptions received after the JVM has
received a shutdown signal, for example a SIGTERM. By doing
nothing in response to such exceptions the persistent task
remains in cluster state and will get assigned to a different
node after the current node leaves the cluster.
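
To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not the code from this PR: the class `OpenJobFailureSketch` and its methods are hypothetical stand-ins, and only the idea of a "node is dying" check is taken from the diff discussed in the review. It shows a failure handler that records a job as failed under normal conditions but does nothing once a shutdown signal has been seen, so that the task would be left for reassignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these types are the real Elasticsearch classes.
class OpenJobFailureSketch {
    // Set once the JVM has been told to shut down (for example by SIGTERM).
    private final AtomicBoolean nodeDying = new AtomicBoolean(false);
    // Stands in for "mark the persistent task as failed".
    private final List<String> failedJobs = new ArrayList<>();

    // Assumed to be called from a shutdown hook or lifecycle callback.
    void onShutdownSignal() {
        nodeDying.set(true);
    }

    boolean isNodeDying() {
        return nodeDying.get();
    }

    // Called when a step of the job open sequence throws.
    void onOpenStepFailure(String jobId, Exception e) {
        if (isNodeDying()) {
            // Do nothing: the task stays as it is and can be
            // assigned to another node once this node has left.
            return;
        }
        failedJobs.add(jobId);
    }

    List<String> failedJobs() {
        return failedJobs;
    }
}
```

In this sketch a failure reported before `onShutdownSignal()` marks the job as failed, while a failure reported afterwards leaves `failedJobs()` unchanged.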

elasticmachine · 2021-07-29T16:45:25Z

Pinging @elastic/ml-core (Team:ML)

benwtrent · 2021-07-29T19:11:21Z

...gin/ml/src/main/java/org/elasticsearch/xpack/ml/job/task/OpenJobPersistentTasksExecutor.java

+        if (autodetectProcessManager.isNodeDying() == false) {
+            hasRunningDatafeedTask(jobTask.getJobId(), hasRunningDatafeedTaskListener);
+        }


I personally would like there to be a predicate before hasRunningDatafeedTaskListener is even created. Similar to the FAILED check.

if (autodetectProcessManager.isNodeDying()) { return; }

But that is not a huge deal, it just reads a little strange.
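
The shape suggested in this comment could look roughly like the sketch below. It is an illustration only: `EarlyReturnGuardSketch`, `continueOpen` and `createListener` are hypothetical names, and the real method in `OpenJobPersistentTasksExecutor` takes different arguments. The point is that the guard runs before the listener is constructed, so nothing is built when the node is shutting down.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the early-return style; not the real executor code.
class EarlyReturnGuardSketch {
    int listenersCreated = 0;
    int datafeedChecks = 0;

    void continueOpen(BooleanSupplier isNodeDying, String jobId) {
        // Guard first, so the listener below is never created on a dying node.
        if (isNodeDying.getAsBoolean()) {
            return;
        }
        Runnable listener = createListener(jobId);
        hasRunningDatafeedTask(jobId, listener);
    }

    // Stand-in for building hasRunningDatafeedTaskListener.
    private Runnable createListener(String jobId) {
        listenersCreated++;
        return () -> { };
    }

    // Stand-in for the datafeed check that notifies the listener.
    private void hasRunningDatafeedTask(String jobId, Runnable listener) {
        datafeedChecks++;
        listener.run();
    }
}
```

Compared with wrapping only the final call in an `if`, this keeps the shutdown check next to the other early exits at the top of the method.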

elasticsearchmachine · 2021-07-30T10:21:59Z

💚 Backport successful

Status	Branch	Result
✅	7.x

#75872) * [ML] Ignore exceptions while opening job after SIGTERM to JVM We observed that some jobs failed during a rolling upgrade in Elastic Cloud. This happened because steps of the job open sequence failed with exceptions after core Elasticsearch services shut down in response to the SIGTERM. This change makes the persistent task executor for anomaly detection jobs ignore exceptions received after the JVM has received a shutdown signal, for example a SIGTERM. By doing nothing in response to such exceptions the persistent task remains in cluster state and will get assigned to a different node after the current node leaves the cluster. * Address review comment Co-authored-by: David Roberts <dave.roberts@elastic.co>

droberts195 added >bug :ml Machine learning v8.0.0 v7.15.0 labels Jul 29, 2021

elasticmachine added the Team:ML Meta label for the ML team label Jul 29, 2021

benwtrent approved these changes Jul 29, 2021

Address review comment

ce720ac

droberts195 added auto-backport Automatically create backport pull requests when merged auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Jul 30, 2021

elasticsearchmachine merged commit c287841 into elastic:master Jul 30, 2021

elasticsearchmachine mentioned this pull request Jul 30, 2021

[7.x] [ML] Ignore exceptions while opening job after SIGTERM to JVM (#75850) #75872

Merged

droberts195 deleted the ignore_exceptions_when_node_dying branch July 30, 2021 10:23

mark-vieira added v8.0.0-alpha1 and removed v8.0.0 labels Aug 4, 2021
