[ML] Ignore exceptions while opening job after SIGTERM to JVM #75850

@droberts195 droberts195 commented Jul 29, 2021

We observed that some jobs failed during a rolling upgrade
in Elastic Cloud. This happened because steps of the job
open sequence failed with exceptions after core Elasticsearch
services shut down in response to a SIGTERM.

This change makes the persistent task executor for anomaly
detection jobs ignore exceptions received after the JVM has
received a shutdown signal, for example a SIGTERM. Because the
executor does nothing in response to such exceptions, the
persistent task remains in cluster state and is assigned to a
different node after the current node leaves the cluster.
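
The behaviour described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in this PR: `SketchJobTask`, `SketchOpenJobFailureHandler`, `setNodeDying` and `onOpenFailure` are hypothetical names, and only `isNodeDying` echoes a method name that appears in the diff under review.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a persistent task; not an Elasticsearch class.
class SketchJobTask {
    final String jobId;
    final List<String> failures = new ArrayList<>();

    SketchJobTask(String jobId) {
        this.jobId = jobId;
    }

    // Recording a failure is the step that would stop the task being reassigned.
    void markAsFailed(Exception e) {
        failures.add(e.getMessage());
    }
}

// Hypothetical handler illustrating the "do nothing when the node is dying" rule.
class SketchOpenJobFailureHandler {
    private final AtomicBoolean nodeDying = new AtomicBoolean(false);

    // Assumed to be called once the JVM has received a shutdown signal such as SIGTERM.
    void setNodeDying() {
        nodeDying.set(true);
    }

    boolean isNodeDying() {
        return nodeDying.get();
    }

    // Returns true if the failure was recorded, false if it was ignored.
    boolean onOpenFailure(SketchJobTask task, Exception e) {
        if (isNodeDying()) {
            // Leave the task untouched so it can be reassigned to another node.
            return false;
        }
        task.markAsFailed(e);
        return true;
    }
}
```

In this sketch, a failure that arrives before `setNodeDying()` is recorded on the task, while one that arrives afterwards is dropped and the task state is left unchanged.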

@elasticmachine elasticmachine added the Team:ML (Meta label for the ML team) label Jul 29, 2021
@elasticmachine

Pinging @elastic/ml-core (Team:ML)

Comment on lines 278 to 280
if (autodetectProcessManager.isNodeDying() == false) {
    hasRunningDatafeedTask(jobTask.getJobId(), hasRunningDatafeedTaskListener);
}

I personally would like there to be a predicate before hasRunningDatafeedTaskListener is even created, similar to the FAILED check:

if (autodetectProcessManager.isNodeDying()) {
    return;
}

But that is not a huge deal, it just reads a little strange.
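
The ordering suggested in this comment can be sketched as a standalone example. It rests on assumptions: `SketchGuardBeforeListener`, `afterJobOpened` and the two counters are hypothetical, and `hasRunningDatafeedTask` here is a simplified stand-in that only borrows the name from the snippet under review.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

// Hypothetical sketch of the guard-clause ordering; none of this is Elasticsearch API.
class SketchGuardBeforeListener {
    int listenersCreated = 0;
    int datafeedChecks = 0;

    // Simplified stand-in for the asynchronous datafeed lookup.
    void hasRunningDatafeedTask(String jobId, Consumer<Boolean> listener) {
        datafeedChecks++;
        listener.accept(false);
    }

    void afterJobOpened(String jobId, BooleanSupplier nodeDying) {
        // Guard first: when the node is dying, return before any listener is built.
        if (nodeDying.getAsBoolean()) {
            return;
        }
        listenersCreated++;
        Consumer<Boolean> listener = hasDatafeed -> {
            // The rest of the open sequence would continue here.
        };
        hasRunningDatafeedTask(jobId, listener);
    }
}
```

Compared with wrapping only the final call in `if (... == false)`, the early return keeps the shutdown check next to the other early exits and avoids constructing a listener that is never used.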

@droberts195 droberts195 added the auto-backport (Automatically create backport pull requests when merged) and auto-merge (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) labels Jul 30, 2021
@elasticsearchmachine elasticsearchmachine merged commit c287841 into elastic:master Jul 30, 2021
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 30, 2021
…c#75850)

* [ML] Ignore exceptions while opening job after SIGTERM to JVM

* Address review comment
@elasticsearchmachine

💚 Backport successful

Branch: 7.x

@droberts195 droberts195 deleted the ignore_exceptions_when_node_dying branch July 30, 2021 10:23
elasticsearchmachine added a commit that referenced this pull request Jul 30, 2021
#75872)

* [ML] Ignore exceptions while opening job after SIGTERM to JVM

* Address review comment

Co-authored-by: David Roberts <dave.roberts@elastic.co>
Labels
auto-backport (Automatically create backport pull requests when merged), auto-merge (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!), >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v7.15.0, v8.0.0-alpha1
5 participants