Assert not same executor when completing future #108934

henningandersen · 2024-05-23T06:20:45Z

A common deadlock pattern is waiting for and completing a future on the same executor. This only works until the executor is fully depleted of threads. Now assert that the wait for a future and its completion happen on different executors.

Introduced UnsafePlainActionFuture, used in all offending places, allowing those to be tackled independently.
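To make the failure mode concrete, here is a minimal sketch in plain `java.util.concurrent` (not Elasticsearch code). The class and method names are hypothetical, and a pool of one thread stands in for an executor that has run out of threads. A timeout is used only so the demo terminates; without it the wait would block forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class, not part of the PR.
final class SameExecutorDeadlockDemo {

    /** Returns true if a task timed out waiting for a future that is completed on its own executor. */
    static boolean waitOnSameExecutorTimesOut(long timeoutMillis) {
        // A pool of one thread is fully depleted as soon as a single task blocks.
        ExecutorService executor = Executors.newFixedThreadPool(1);
        try {
            CompletableFuture<Boolean> outcome = new CompletableFuture<>();
            executor.execute(() -> {
                CompletableFuture<String> inner = new CompletableFuture<>();
                // The completion is queued on the same executor, behind the task that waits for it.
                executor.execute(() -> inner.complete("done"));
                try {
                    inner.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    outcome.complete(false);
                } catch (TimeoutException e) {
                    // Without the timeout this wait would never return.
                    outcome.complete(true);
                } catch (InterruptedException | ExecutionException e) {
                    outcome.completeExceptionally(e);
                }
            });
            return outcome.join();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + waitOnSameExecutorTimesOut(200));
    }
}
```

With a larger pool the same code usually succeeds, which is why the pattern tends to go unnoticed until every thread of the executor is blocked at once.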

henningandersen · 2024-05-23T06:20:59Z

Opening as draft for now to see how CI behaves.

DaveCTurner

Nice idea. I think we can make this even stronger tho, we can still deadlock if we have a loop of executors all waiting on each other to complete. Can we declare up-front all pairs of executors that can be in a block/complete relationship and verify that this permits no such cycles?
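One way the suggested up-front declaration could be checked is a cycle search over a directed graph of executor names. The sketch below is an assumption about how such a check might look, not something this PR implements; `ExecutorWaitGraph`, `allow` and `hasCycle` are hypothetical names, and the executor names in `main` are purely illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: records which executor may block on futures completed by which other executor.
final class ExecutorWaitGraph {
    // waiter executor -> executors allowed to complete the futures it blocks on
    private final Map<String, Set<String>> edges = new HashMap<>();

    void allow(String waiter, String completer) {
        edges.computeIfAbsent(waiter, k -> new HashSet<>()).add(completer);
    }

    /** Depth-first search; a node seen again while still on the current path closes a cycle. */
    boolean hasCycle() {
        Set<String> done = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : edges.keySet()) {
            if (visit(node, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private boolean visit(String node, Set<String> done, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true;
        }
        if (done.contains(node)) {
            return false;
        }
        onPath.add(node);
        for (String next : edges.getOrDefault(node, Set.of())) {
            if (visit(next, done, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        ExecutorWaitGraph graph = new ExecutorWaitGraph();
        graph.allow("write", "generic");
        graph.allow("generic", "management");
        System.out.println("cycle: " + graph.hasCycle());
        graph.allow("management", "write"); // closes the loop write -> generic -> management -> write
        System.out.println("cycle: " + graph.hasCycle());
    }
}
```

A self-edge such as `allow("generic", "generic")` is reported as a cycle too, so the same-executor case covered by this PR is the length-one instance of the general check.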

DaveCTurner · 2024-05-23T07:52:27Z

server/src/main/java/org/elasticsearch/common/util/concurrent/EsExecutors.java

@@ -266,6 +267,27 @@ public static String threadName(final String nodeName, final String namePrefix)
        return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]";
    }

+    // to be used in assertions only.
+    public static boolean differentExecutors(Thread thread1, Thread thread2) {
+        assert thread1 != thread2 : "only call for different threads";


Probably want to return false here rather than failing with an unhelpful message?

This was sort of on purpose, in that the method should only be used in situations where you are clearly comparing different threads. If you use it for the same thread, your assertion is not written correctly. Hence it is more of an assertion on the assertion. Not too tied to this though.

Oh I see, yeah, we can't be completing on the waiting thread because that thread is waiting...

(deserves a comment to avoid redoing that thinking tho)

kingherc

Nice idea!

kingherc · 2024-05-23T08:06:18Z

server/src/main/java/org/elasticsearch/common/util/concurrent/EsExecutors.java

@@ -266,6 +267,26 @@ public static String threadName(final String nodeName, final String namePrefix)
        return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]";
    }

+    // to be used in assertions only.
+    public static boolean differentExecutors(Thread thread1, Thread thread2) {


nit

Suggested change

public static boolean differentExecutors(Thread thread1, Thread thread2) {

public static boolean assertDifferentExecutors(Thread thread1, Thread thread2) {

I had that first, but since it no longer asserts that the executors are different but rather returns whether they are, I renamed it to avoid confusion.

kingherc · 2024-05-23T08:10:52Z

server/src/main/java/org/elasticsearch/common/util/concurrent/EsExecutors.java

+    // visible for tests
+    static String executorName(Thread thread) {
+        String name = thread.getName();
+        int executorNameEnd = name.lastIndexOf(']', name.length() - 2);


nit: A comment on why name.length() - 2 is used would be helpful. Why is it like that? I understand that probably there's another ] you'd like to skip? But I don't remember how the thread name is formatted off the top of my head. Reading

    public static String threadName(final String nodeName, final String namePrefix) {
        // TODO missing node names should only be allowed in tests
        return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]";
    }

and assuming namePrefix is the executor name, I'm not sure where the other ] would be.

The thread naming is something like XXXX[executor-name][T#NNN]. We need to skip the running number part. The index passed into lastIndexOf is inclusive.

I added a short comment on this.

Maybe we could use a regex here? maybe that's clearer? (famous last words)

This time there are even two obligatory XKCDs:

https://xkcd.com/208/

https://xkcd.com/1171/

I prefer the manual parsing done here, seems simplest.
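For readers following the parsing discussion, here is a standalone sketch of the manual approach. It assumes only the `XXXX[executor-name][T#NNN]` shape described above. Everything after the first `lastIndexOf` call is a guess at how the rest of the method might look, and returning `null` for unparsable names, as well as treating unparsable names as "different", are assumptions of this sketch. `ThreadNameParsing` is a hypothetical class that takes names as strings rather than `Thread` objects to keep the example small.

```java
// Hypothetical standalone sketch; the real helpers live in EsExecutors and take Thread arguments.
final class ThreadNameParsing {

    /** Extracts "executor-name" from a name shaped like prefix[executor-name][T#N], or null if the shape does not match. */
    static String executorName(String threadName) {
        // The last character is the ']' that closes "[T#N]". The fromIndex of lastIndexOf is
        // inclusive, so starting at length - 2 skips that bracket and finds the one closing the executor name.
        int end = threadName.lastIndexOf(']', threadName.length() - 2);
        if (end < 0) {
            return null;
        }
        int start = threadName.lastIndexOf('[', end);
        if (start < 0) {
            return null;
        }
        return threadName.substring(start + 1, end);
    }

    /** Assumption of this sketch: names that cannot be parsed count as different executors. */
    static boolean differentExecutors(String threadName1, String threadName2) {
        String executor1 = executorName(threadName1);
        String executor2 = executorName(threadName2);
        return executor1 == null || executor2 == null || executor1.equals(executor2) == false;
    }

    public static void main(String[] args) {
        System.out.println(executorName("elasticsearch[node-1][write][T#3]"));
        System.out.println(differentExecutors("elasticsearch[node-1][write][T#3]", "elasticsearch[node-1][generic][T#1]"));
    }
}
```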

henningandersen · 2024-05-23T08:51:40Z

Nice idea. I think we can make this even stronger tho, we can still deadlock if we have a loop of executors all waiting on each other to complete. Can we declare up-front all pairs of executors that can be in a block/complete relationship and verify that this permits no such cycles?

Sounds like a good idea. For now, I'd like to ensure this can be made to pass CI at all, I think there could be many cases of just this first assertion failing. I might defer your suggestion to a follow-up.

As suggested in elastic#108934, we can extract the exact executor name from the thread name with some simple string manipulations. Using this utility, this commit tightens up the existing assertions about the current executor. Co-authored-by: Henning Andersen <henning.andersen@elastic.co>

fcofdez

This looks good

fcofdez · 2024-05-23T10:23:53Z

server/src/main/java/org/elasticsearch/common/util/concurrent/EsExecutors.java

+    // visible for tests
+    static String executorName(Thread thread) {
+        String name = thread.getName();
+        int executorNameEnd = name.lastIndexOf(']', name.length() - 2);


Maybe we could use a regex here? maybe that's clearer? (famous last words)

henningandersen · 2024-05-27T08:33:44Z

server/src/main/java/org/elasticsearch/action/support/UnsafePlainActionFuture.java

+ * This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This
+ * is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future.
+ */
+public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> {


I opted to add this unsafe variant that is then used in all places where I found conflicts, since fixing them all would make this PR size unmanageable. Using this we can now fix them one by one, spreading out the load too.

👍 can we mark this as @Deprecated(forRemoval = true) so that IDEs highlight its usages?

Do we need a task/issue/ticket on fixing them one-by-one?

I think one by one is too granular. I'll probably create area level ones (perhaps a bit more granular, for instance the one in AbstractClient needs a separate one).
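The opt-out pattern discussed in this thread can be sketched as follows. This is a simplified illustration, not the PR's code: `CheckedFuture`, `UnsafeCheckedFuture` and `allowedExecutors` are hypothetical names, the waiting and completion machinery is omitted, and only the permission check and the deprecation marker are modeled.

```java
// Hypothetical sketch of "safe by default, explicit unsafe variant"; not the actual PlainActionFuture API.
final class UnsafeFutureSketch {

    /** Base variant: completing on the executor that is being waited on is rejected. */
    static class CheckedFuture {
        boolean allowedExecutors(String waiterExecutor, String completerExecutor) {
            return waiterExecutor.equals(completerExecutor) == false;
        }
    }

    /**
     * Unsafe variant: allows the same-executor case so existing offenders keep working.
     * Marked for removal so that IDEs and the compiler flag every remaining usage.
     */
    @Deprecated(forRemoval = true)
    static class UnsafeCheckedFuture extends CheckedFuture {
        @Override
        boolean allowedExecutors(String waiterExecutor, String completerExecutor) {
            return true;
        }
    }

    @SuppressWarnings("removal") // the demo uses the deprecated variant on purpose
    public static void main(String[] args) {
        CheckedFuture safe = new CheckedFuture();
        CheckedFuture unsafe = new UnsafeCheckedFuture();
        System.out.println("safe, same executor allowed: " + safe.allowedExecutors("generic", "generic"));
        System.out.println("unsafe, same executor allowed: " + unsafe.allowedExecutors("generic", "generic"));
    }
}
```

Keeping the relaxation in a separately named, deprecated class means each remaining usage shows up in a usage search and can be fixed on its own.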

henningandersen · 2024-05-27T08:35:19Z

This now has a clean build so marking ready for review. I'll do more CI work to scrape up more occurrences before merge anyway (which can result in more unsafe future usages).

DaveCTurner

Great stuff, I like it.

DaveCTurner · 2024-05-27T08:45:34Z

server/src/main/java/org/elasticsearch/action/support/UnsafePlainActionFuture.java

+ * This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This
+ * is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future.
+ */
+public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> {


👍 can we mark this as @Deprecated(forRemoval = true) so that IDEs highlight its usages?

DaveCTurner · 2024-05-27T08:50:10Z

server/src/main/java/org/elasticsearch/client/internal/support/AbstractClient.java

-    private static class RefCountedFuture<R extends RefCounted> extends PlainActionFuture<R> {
+    // todo: the use of UnsafePlainActionFuture here is quite broad, we should find a better way to be more specific
+    // (unless making all usages safe is easy).
+    private static class RefCountedFuture<R extends RefCounted> extends UnsafePlainActionFuture<R> {


Yikes. I suspect we can/should move all usages of this into tests, and do the ref-counting (and asyncification) properly in prod code. But I see that's not a small change. Ok for now...

Yeah, I also pondered on this for a short while, but decided to defer this issue for now. I think I agree to move to all prod client interactions being async now. If it is too big we would need to stop notifying on generic thread pool (one more pool maybe, hopefully as an interim step). We can discuss more when we tackle it.

kingherc

LGTM as long as CI is happy

kingherc · 2024-05-27T08:56:10Z

server/src/main/java/org/elasticsearch/action/support/UnsafePlainActionFuture.java

+ * This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This
+ * is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future.
+ */
+public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> {


Do we need a task/issue/ticket on fixing them one-by-one?

tlrx

LGTM, nice idea!

henningandersen added >non-issue v8.15.0 labels May 23, 2024

Get proper debug info out of assertion.

d84d275

DaveCTurner reviewed May 23, 2024

Remove dummy assertion.

39450b7

kingherc reviewed May 23, 2024

henningandersen added 2 commits May 23, 2024 10:46

cs

ef2a6a4

comments.

a7b6a85

DaveCTurner mentioned this pull request May 23, 2024

Tighten up ThreadPool#assertCurrentThreadPool #108943

Merged

fcofdez reviewed May 23, 2024

nasty hack to see more issues.

41c54e1

henningandersen commented May 23, 2024

henningandersen added the :Distributed/Distributed label May 23, 2024

henningandersen added 11 commits May 24, 2024 07:45

Add and use UnsafePlainActionFuture

7722a9e

Too many issues to fix in one PR, add a class that is used where we rely on notifying on same thread to at least have visibility.

CCR unsafety

1508a6f

Completion stats and searchable snapshot

43a477b

Disruption IT

4cc0c59

Transport test cases

7896110

Transport test case plus corrupted blob store IT

199676e

spotless

2e031fb

right executor

8de5f11

ml

6554796

security

356a846

Merge remote-tracking branch 'origin/main' into assert_not_same_executor_when_completing_future

a6466a2

henningandersen added 4 commits May 25, 2024 21:16

Simplifiy

aec8017

ccr double encounter

c6f9c32

silly mistake

ce01fd9

inference runner

7205c23

henningandersen commented May 27, 2024

henningandersen marked this pull request as ready for review May 27, 2024 08:34

henningandersen requested review from fcofdez, kingherc and DaveCTurner May 27, 2024 08:35

DaveCTurner approved these changes May 27, 2024

kingherc approved these changes May 27, 2024

tlrx approved these changes May 27, 2024

henningandersen added 5 commits May 27, 2024 11:20

Mark deprecated.

ac36c97

Stateless prewarming

1edb078

spotless

1c1d48a

Disable assertion for merge.

bedbda8

Merge remote-tracking branch 'origin/main' into assert_not_same_executor_when_completing_future

0992112

henningandersen merged commit a4bd256 into elastic:main May 28, 2024
14 checks passed

henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request May 28, 2024

Enable not same executor assertion

732ffe7

Enable the assertion introduced in elastic#108934

henningandersen mentioned this pull request May 28, 2024

Enable not same executor assertion #109088

Merged

henningandersen added the Team:Distributed label May 28, 2024

henningandersen added a commit that referenced this pull request May 28, 2024

Enable not same executor assertion (#109088)

3fad074

Enable the assertion introduced in #108934

henningandersen mentioned this pull request May 28, 2024

Machine learning avoid thread pool deadlocks #109134

Open

JVerwolf pushed a commit to JVerwolf/elasticsearch that referenced this pull request May 28, 2024

Enable not same executor assertion (elastic#109088)

4ae30e5

Enable the assertion introduced in elastic#108934

craigtaverner pushed a commit to craigtaverner/elasticsearch that referenced this pull request Jun 11, 2024

Enable not same executor assertion (elastic#109088)

72f325b

Enable the assertion introduced in elastic#108934
