Machine learning avoid thread pool deadlocks #109134

Closed · 2 tasks done
henningandersen opened this issue May 28, 2024 · 3 comments
Assignees
Labels
>bug · :ml Machine learning · Team:ML (Meta label for the ML team)

Comments

@henningandersen
Contributor

henningandersen commented May 28, 2024

Elasticsearch Version

8.15

Installed Plugins

No response

Java Version

bundled

OS Version

Linux

Problem Description

In #108934 we added assertions to ensure we do not complete a future on the same executor that waits for it, since this can lead to deadlocks. Two ML usages were identified that need to be fixed:

Ideally those would be converted to asynchronous waits instead.
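
The deadlock pattern the assertion guards against can be sketched in plain Java. This is a minimal sketch, not the Elasticsearch code: a single-thread pool stands in for an exhausted ml_utility pool, the class and method names are hypothetical, and a short timeout stands in for the real, indefinite hang:

```java
import java.util.concurrent.*;

class SameExecutorWait {
    // Returns true when the waiter times out: the task that would complete
    // the future is queued on the same pool whose only thread is blocked
    // in get(), so it can never run -- a deadlock in miniature.
    static boolean demonstratesDeadlock() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            CompletableFuture<Boolean> future = new CompletableFuture<>();
            Future<Boolean> waiter = pool.submit(() -> {
                // Schedule the completion on the same pool...
                pool.submit(() -> future.complete(true));
                try {
                    // ...then block waiting for it on the pool's only thread.
                    future.get(200, TimeUnit.MILLISECONDS);
                    return false; // would mean the completion ran: no deadlock
                } catch (TimeoutException e) {
                    return true;  // completion never ran while we waited
                }
            });
            return waiter.get();
        } catch (Exception e) {
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a larger pool the completion would usually sneak onto a free thread, which is why the bug only bites when the pool happens to be full.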

Steps to Reproduce

NA

Logs (if relevant)

No response

Tasks

@henningandersen henningandersen added >bug needs:triage Requires assignment of a team area label :ml Machine learning and removed needs:triage Requires assignment of a team area label labels May 28, 2024
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label May 28, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@dimkots dimkots assigned dimkots and unassigned dimkots May 29, 2024
@prwhelan prwhelan self-assigned this May 29, 2024
@prwhelan
Member

prwhelan commented May 30, 2024

InferenceRunner does not appear to wait for itself; it only appears to fork to another thread in TrainedModelProvider, which I am assuming is the transport thread?

Same as TrainedModel... in DeploymentManager.

This doesn't appear to be an immediate risk, but we can still remove the use of PlainActionFuture.

@henningandersen
Contributor Author

You may be right that it has a similar underlying cause, but InferenceRunner waits on a future on an ml_utility thread that is then notified on another ml_utility thread (I do not have the details, but if you convert the UnsafePlainActionFuture to a PlainActionFuture and run CI, you should see it, perhaps after a few runs). If you were unfortunate enough to run out of ml_utility threads because they were all blocked on such a future, it would be a deadlock.

I do have some details for TrainedModelAssignmentNodeService:

java.lang.AssertionError: cannot complete future on thread Thread[#114,elasticsearch[yamlRestTest-0][ml_utility][T#2],5,main] with waiter on thread Thread[#88,elasticsearch[yamlRestTest-0][ml_utility][T#1],5,main], could deadlock if pool was full
        at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:278)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:96)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:157)
        at org.elasticsearch.ml@8.15.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentNodeService.loadQueuedModels(TrainedModelAssignmentNodeService.java:212)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:223)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
        at org.elasticsearch.server@8.15.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1570)
---
        at org.elasticsearch.action.support.PlainActionFuture.assertCompleteAllowed(PlainActionFuture.java:416) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.support.PlainActionFuture.set(PlainActionFuture.java:137) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.support.PlainActionFuture.onResponse(PlainActionFuture.java:37) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$3(DeploymentManager.java:181) ~[?:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:248) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$startAndLoad$2(DeploymentManager.java:546) ~[?:?]
        at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager$ProcessContext.lambda$loadModel$12(DeploymentManager.java:753) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.15.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1570) ~[?:?]

prwhelan added a commit to prwhelan/elasticsearch that referenced this issue May 30, 2024
A future change will refactor how InferenceRunner's `run` method
functions, and I want to reuse the existing unit tests to verify the
behavior does not change.

Relate elastic#109134
prwhelan added a commit that referenced this issue May 31, 2024
A future change will refactor how InferenceRunner's `run` method
functions, and I want to reuse the existing unit tests to verify the
behavior does not change.

Relate #109134
prwhelan added a commit to prwhelan/elasticsearch that referenced this issue Jun 10, 2024
InferenceStep now passes a Listener to InferenceRunner.  InferenceRunner
will chain the listener in the `run` method such that the nested model
loading and inference happen asynchronously without blocking threads.

Relate elastic#109134
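
The listener chaining this commit describes can be sketched in plain Java. The `Listener` interface below is a hypothetical stand-in for Elasticsearch's ActionListener, and the method names are illustrative, not the real InferenceRunner API:

```java
class ListenerChaining {
    // Hypothetical stand-in for ActionListener<T>.
    interface Listener<T> {
        void onResponse(T result);
    }

    // Stand-in for async model loading; the real code would hand the
    // listener to the loading machinery instead of blocking on a future.
    static void loadModel(Listener<String> listener) {
        listener.onResponse("model");
    }

    // Stand-in for async inference on a loaded model.
    static void infer(String model, Listener<String> listener) {
        listener.onResponse("inference(" + model + ")");
    }

    // run() chains the caller's listener: load the model, and only once
    // that completes, run inference -- no thread blocks in between.
    static void run(Listener<String> finalListener) {
        loadModel(model -> infer(model, finalListener));
    }
}
```

The key property is that no step calls `get()` on a future; each step hands control to the next via its listener, so no ml_utility thread is parked waiting.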
prwhelan added a commit to prwhelan/elasticsearch that referenced this issue Jun 12, 2024
The model loading scheduled thread iterates through the model
queue and deploys each model. Rather than block and wait on each
deployment, the thread will attach a listener that will either iterate
to the next model (if one is in the queue) or reschedule the thread.

This change should not impact:
1. the iterative nature of the model deployment process - each model is
   still deployed one at a time, and no additional threads are consumed
   per model.
2. the 1s delay between model deployment tries - if a deployment fails
   but can be retried, the retry is added to the next batch of models
   that are consumed after the 1s scheduled delay.

Relate elastic#109134
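
The queue-draining pattern this commit describes can be sketched as follows. This is a simplified, hypothetical model (synchronous "deployment", a log string instead of real scheduling), not the actual TrainedModelAssignmentNodeService code:

```java
import java.util.ArrayDeque;
import java.util.Queue;

class AsyncModelLoader {
    final Queue<String> queue = new ArrayDeque<>();
    final StringBuilder log = new StringBuilder();

    // Stand-in for an async deployment: the listener fires when the
    // deployment finishes, rather than the caller blocking on a future.
    void deployAsync(String model, Runnable onDone) {
        log.append("deploy:").append(model).append(';');
        onDone.run();
    }

    // Drain the queue one model at a time. Each completed deployment's
    // listener either moves on to the next queued model or, once the
    // queue is empty, reschedules the next pass (after 1s in the real code).
    void loadQueuedModels() {
        String next = queue.poll();
        if (next == null) {
            log.append("reschedule");
            return;
        }
        deployAsync(next, this::loadQueuedModels);
    }
}
```

Models are still deployed strictly one at a time and no extra thread is consumed per model; the only change is that the thread between deployments is released instead of parked in `actionGet()`.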
prwhelan added a commit that referenced this issue Jun 18, 2024
InferenceStep now passes a Listener to InferenceRunner.  InferenceRunner
will chain the listener in the `run` method such that the nested model
loading and inference happen asynchronously without blocking threads.

Relate #109134

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@prwhelan prwhelan closed this as completed Jul 5, 2024
4 participants