[ML] Fix possible race condition when starting datafeed #51302

Merged
merged 1 commit into elastic:master from droberts195:fix_datafeed_running_race on Jan 22, 2020

Conversation

droberts195
Contributor

The ID of the datafeed's associated job was being obtained
frequently by looking up the datafeed task in a map that
was being modified in other threads. This could lead to
NPEs if the datafeed stopped running at an unexpected time.

This change reduces the number of places where a datafeed's
associated job ID is looked up to avoid the possibility of
failures when the datafeed's task is removed from the map
of running tasks during multi-step operations in other
threads.

Fixes #51285
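
As an illustration of the failure mode, here is a minimal, hypothetical Java sketch (the class and method names are invented and are not the actual Elasticsearch code): the racy variant re-resolves the datafeed task from the shared map at each step, so a concurrent removal makes a later lookup return null and the chained call throw an NPE, while the fixed variant looks the job ID up once and passes it through the remaining steps.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names for illustration only; not the actual Elasticsearch classes.
final class DatafeedJobIdLookupSketch {

    static final class DatafeedTask {
        private final String jobId;
        DatafeedTask(String jobId) { this.jobId = jobId; }
        String getJobId() { return jobId; }
    }

    // Map of running datafeed tasks; other threads remove entries when a datafeed stops.
    private final Map<Long, DatafeedTask> runningTasks = new ConcurrentHashMap<>();

    // Racy pattern: every step re-resolves the task, so a removal between steps
    // makes get(...) return null and the chained getJobId() call throw an NPE.
    void racyMultiStepOperation(long allocationId) {
        doSomething(runningTasks.get(allocationId).getJobId());     // step 1
        doSomethingElse(runningTasks.get(allocationId).getJobId()); // step 2: NPE if the task was removed
    }

    // Pattern in the spirit of this fix: resolve the job ID once, handle the
    // "already stopped" case explicitly, and pass the ID through the later steps.
    void safeMultiStepOperation(long allocationId) {
        DatafeedTask task = runningTasks.get(allocationId);
        if (task == null) {
            return; // datafeed already stopped; nothing to do
        }
        String jobId = task.getJobId();
        doSomething(jobId);
        doSomethingElse(jobId);
    }

    private void doSomething(String jobId) { /* placeholder for one step of a multi-step operation */ }

    private void doSomethingElse(String jobId) { /* placeholder for a later step */ }
}
```

Capturing the job ID once means a concurrent stop can only affect the single initial lookup, which is handled with an explicit null check instead of surfacing as an NPE part-way through the operation.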

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@davidkyle
Member

I think #49467 may be caused by the same race

Contributor

@dimitris-athanasiou left a comment


LGTM

@droberts195
Contributor Author

I think #49467 may be caused by the same race

Maybe. If it was, there would have been an NPE logged on one of the external cluster nodes during the failed test. Unfortunately the Jenkins builds for the failures of #49467 have now gone, so we cannot download the external cluster logs to check.

@davidkyle
Member

I would have spotted an NPE in the logs when triaging #49467. That test timed out waiting for the datafeed to stop; my theory is that the NPE took down the worker thread that is vital to stopping the datafeed, so a stopped status was never reached.

@droberts195 droberts195 merged commit cd0857c into elastic:master Jan 22, 2020
@droberts195 droberts195 deleted the fix_datafeed_running_race branch January 22, 2020 11:30
droberts195 added a commit that referenced this pull request Jan 22, 2020
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Jan 29, 2020
Datafeeds being closed while starting could result in an NPE. This was
handled as any other failure, masking out the NPE. However, this
conflicts with the changes in elastic#50886.

Related to elastic#50886 and elastic#51302
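
For context, a minimal hypothetical sketch of the masking behaviour described above (invented names, not the actual Elasticsearch code): when every exception thrown while starting is funnelled into the same generic failure path, an NPE caused by a concurrent close is reported as an ordinary start failure and the race stays hidden.

```java
import java.util.function.Consumer;

// Hypothetical sketch only; not the actual Elasticsearch code.
final class MaskedNpeSketch {

    // Runs the start steps and routes any exception, including an NPE caused by the
    // datafeed being closed while starting, through the same generic failure handler,
    // which masks the underlying race instead of surfacing it as a bug.
    static void startDatafeed(Runnable startSteps, Consumer<Exception> onFailure) {
        try {
            startSteps.run();
        } catch (Exception e) {
            onFailure.accept(new RuntimeException("could not start datafeed", e));
        }
    }
}
```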
henningandersen added a commit that referenced this pull request Jan 30, 2020
henningandersen added a commit that referenced this pull request Jan 30, 2020
Development

Successfully merging this pull request may close these issues.

[CI] MachineLearningLicensingTests.testAutoCloseJobWithDatafeed fails with NPE
5 participants