Machine learning avoid thread pool deadlocks #109134

Closed · 2 tasks done
henningandersen opened this issue May 28, 2024 · 3 comments
Assignees
Labels
>bug · :ml Machine learning · Team:ML (Meta label for the ML team)

Comments

@henningandersen
Contributor

henningandersen commented May 28, 2024

Elasticsearch Version

8.15

Installed Plugins

No response

Java Version

bundled

OS Version

Linux

Problem Description

In #108934 we added assertions to ensure we do not complete a future on the same executor that waits for it, since this can lead to deadlocks. Two ML usages were identified that need to be fixed:

Ideally those would be converted to asynchronous waits instead.
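
The deadlock pattern the assertion guards against can be sketched in plain Java. This is a minimal sketch, not the Elasticsearch code: a single-thread pool stands in for an exhausted ml_utility pool, the class and method names are hypothetical, and a short timeout stands in for the real, indefinite hang:

```java
import java.util.concurrent.*;

class SameExecutorWait {
    // Returns true when the waiter times out: the task that would complete
    // the future is queued on the same pool whose only thread is blocked
    // in get(), so it can never run -- a deadlock in miniature.
    static boolean demonstratesDeadlock() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            CompletableFuture<Boolean> future = new CompletableFuture<>();
            Future<Boolean> waiter = pool.submit(() -> {
                // Schedule the completion on the same pool...
                pool.submit(() -> future.complete(true));
                try {
                    // ...then block waiting for it on the pool's only thread.
                    future.get(200, TimeUnit.MILLISECONDS);
                    return false; // would mean the completion ran: no deadlock
                } catch (TimeoutException e) {
                    return true;  // completion never ran while we waited
                }
            });
            return waiter.get();
        } catch (Exception e) {
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a larger pool the completion would usually sneak onto a free thread, which is why the bug only bites when the pool happens to be full.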

Steps to Reproduce

NA

Logs (if relevant)

No response

Tasks

@henningandersen henningandersen added >bug needs:triage Requires assignment of a team area label :ml Machine learning and removed needs:triage Requires assignment of a team area label labels May 28, 2024
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label May 28, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@dimkots dimkots assigned dimkots and unassigned dimkots May 29, 2024
@prwhelan prwhelan self-assigned this May 29, 2024
@prwhelan
Member

prwhelan commented May 30, 2024

InferenceRunner does not appear to wait for itself; it only appears to fork to another thread in TrainedModelProvider, which I am assuming is the transport thread?

Same as TrainedModel... in DeploymentManager.

This doesn't appear to be an immediate risk, but we can still remove the use of PlainActionFuture.

@henningandersen
Contributor Author

You may be right that it has a similar underlying cause, but InferenceRunner waits on a future on an ml_utility thread that is then notified on another ml_utility thread (I do not have the details, but if you convert the UnsafePlainActionFuture to a PlainActionFuture and run CI, you should see it, perhaps after a few runs). If you were unfortunate enough to run out of ml_utility threads because they were all blocked on such a future, it would be a deadlock.

I do have some details for TrainedModelAssignmentNodeService:

java.lang.AssertionError: cannot complete future on thread Thread[#114,elasticsearch[yamlRestTest-0][ml_utility][T#2],5,main] with waiter on thread Thread[#88,elasticsearch[yamlRestTest-0][ml_utility][T#1],5,main], could deadlock if pool was full
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:278)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:96)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:157)
        at org.elasticsearch.ml@8.15.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentNodeService.loadQueuedModels(TrainedModelAssignmentNodeService.java:212)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:223)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1570)
---
        at org.elasticsearch.action.support.PlainActionFuture.assertCompleteAllowed(PlainActionFuture.java:416) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.support.PlainActionFuture.set(PlainActionFuture.java:137) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.support.PlainActionFuture.onResponse(PlainActionFuture.java:37) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$3(DeploymentManager.java:181) ~[?:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:248) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$startAndLoad$2(DeploymentManager.java:546) ~[?:?]
        at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$loadModel$12(DeploymentManager.java:753) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1570) ~[?:?]

prwhelan added a commit to prwhelan/elasticsearch that referenced this issue May 30, 2024
A future change will refactor how InferenceRunner's `run` method
functions, and I want to reuse the existing unit tests to verify the
behavior does not change.

Relate elastic#109134
prwhelan added a commit that referenced this issue May 31, 2024
A future change will refactor how InferenceRunner's `run` method
functions, and I want to reuse the existing unit tests to verify the
behavior does not change.

Relate #109134
prwhelan added a commit to prwhelan/elasticsearch that referenced this issue Jun 10, 2024
InferenceStep now passes a Listener to InferenceRunner.  InferenceRunner
will chain the listener in the `run` method such that the nested model
loading and inference happen asynchronously without blocking threads.

Relate elastic#109134
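
The listener chaining this commit describes can be sketched in plain Java. The `Listener` interface below is a hypothetical stand-in for Elasticsearch's ActionListener, and the method names are illustrative, not the real InferenceRunner API:

```java
class ListenerChaining {
    // Hypothetical stand-in for ActionListener<T>.
    interface Listener<T> {
        void onResponse(T result);
    }

    // Stand-in for async model loading; the real code would hand the
    // listener to the loading machinery instead of blocking on a future.
    static void loadModel(Listener<String> listener) {
        listener.onResponse("model");
    }

    // Stand-in for async inference on a loaded model.
    static void infer(String model, Listener<String> listener) {
        listener.onResponse("inference(" + model + ")");
    }

    // run() chains the caller's listener: load the model, and only once
    // that completes, run inference -- no thread blocks in between.
    static void run(Listener<String> finalListener) {
        loadModel(model -> infer(model, finalListener));
    }
}
```

The key property is that no step calls `get()` on a future; each step hands control to the next via its listener, so no ml_utility thread is parked waiting.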
prwhelan added a commit to prwhelan/elasticsearch that referenced this issue Jun 12, 2024
The model loading scheduled thread iterates through the model
queue and deploys each model. Rather than block and wait on each
deployment, the thread will attach a listener that will either iterate
to the next model (if one is in the queue) or reschedule the thread.

This change should not impact:
1. the iterative nature of the model deployment process - each model is
   still deployed one at a time, and no additional threads are consumed
   per model.
2. the 1s delay between model deployment tries - if a deployment fails
   but can be retried, the retry is added to the next batch of models
   that are consumed after the 1s scheduled delay.

Relate elastic#109134
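
The queue-draining pattern this commit describes can be sketched as follows. This is a simplified, hypothetical model (synchronous "deployment", a log string instead of real scheduling), not the actual TrainedModelAssignmentNodeService code:

```java
import java.util.ArrayDeque;
import java.util.Queue;

class AsyncModelLoader {
    final Queue<String> queue = new ArrayDeque<>();
    final StringBuilder log = new StringBuilder();

    // Stand-in for an async deployment: the listener fires when the
    // deployment finishes, rather than the caller blocking on a future.
    void deployAsync(String model, Runnable onDone) {
        log.append("deploy:").append(model).append(';');
        onDone.run();
    }

    // Drain the queue one model at a time. Each completed deployment's
    // listener either moves on to the next queued model or, once the
    // queue is empty, reschedules the next pass (after 1s in the real code).
    void loadQueuedModels() {
        String next = queue.poll();
        if (next == null) {
            log.append("reschedule");
            return;
        }
        deployAsync(next, this::loadQueuedModels);
    }
}
```

Models are still deployed strictly one at a time and no extra thread is consumed per model; the only change is that the thread between deployments is released instead of parked in `actionGet()`.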
prwhelan added a commit that referenced this issue Jun 18, 2024
InferenceStep now passes a Listener to InferenceRunner.  InferenceRunner
will chain the listener in the `run` method such that the nested model
loading and inference happen asynchronously without blocking threads.

Relate #109134

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@prwhelan prwhelan closed this as completed Jul 5, 2024
4 participants