Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert not same executor when completing future #108934

Conversation

henningandersen
Copy link
Contributor

@henningandersen henningandersen commented May 23, 2024

A common deadlock pattern is waiting and completing a future on the same executor. This only works until the executor is fully depleted of threads. Now assert that waiting for a future to be completed and the completion happens on different executors.

Introduced UnsafePlainActionFuture, used in all offending places, allowing those to be tackled independently.

A common deadlock pattern is waiting and completing a future on the same executor.
This only works until the executor is fully depleted of threads. Now assert that
waiting for a future to be completed and the completion happens on different
executors.
@henningandersen
Copy link
Contributor Author

Opening as draft for now to see how CI behaves.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea. I think we can make this even stronger tho, we can still deadlock if we have a loop of executors all waiting on each other to complete. Can we declare up-front all pairs of executors that can be in a block/complete relationship and verify that this permits no such cycles?

@@ -266,6 +267,27 @@ public static String threadName(final String nodeName, final String namePrefix)
return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]";
}

// to be used in assertions only.
public static boolean differentExecutors(Thread thread1, Thread thread2) {
assert thread1 != thread2 : "only call for different threads";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to return false here rather than failing with an unhelpful message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was sort of on purpose in that, the method should only be used in situations where you are clearly comparing different threads. It seems that if you use this for the same thread your assertion is not written correctly. Hence more of an assertion on the assertion. Not too tied to this though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, yeah, we can't be completing on the waiting thread because that thread is waiting...

(deserves a comment to avoid redoing that thinking tho)

Copy link
Contributor

@kingherc kingherc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea!

@@ -266,6 +267,26 @@ public static String threadName(final String nodeName, final String namePrefix)
return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]";
}

// to be used in assertions only.
public static boolean differentExecutors(Thread thread1, Thread thread2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
public static boolean differentExecutors(Thread thread1, Thread thread2) {
public static boolean assertDifferentExecutors(Thread thread1, Thread thread2) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that first, but since it no longer asserts that the executors are different but rather returns whether they are, I renamed it to avoid confusion.

// visible for tests
static String executorName(Thread thread) {
String name = thread.getName();
int executorNameEnd = name.lastIndexOf(']', name.length() - 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit A comment on why name.length() - 2 would be helpful. Why is it like that? I understand that probably there's another ] you'd like to skip? But I don't remember how the thread name is formatted out of the top of my head. Reading

    public static String threadName(final String nodeName, final String namePrefix) {
        // TODO missing node names should only be allowed in tests
        return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]";
    }

and assuming namePrefix is the executor name, I'm not sure where the other ] would be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thread naming is something like XXXX[executor-name][T#NNN]. We need to skip the running number part. The index passed into lastIndexOf is inclusive.

I added a short comment on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could use a regex here? maybe that's clearer? (famous last words)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This time there's even two obligatory XCKDs:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the manual parsing done here, seems simplest.

@henningandersen
Copy link
Contributor Author

Nice idea. I think we can make this even stronger tho, we can still deadlock if we have a loop of executors all waiting on each other to complete. Can we declare up-front all pairs of executors that can be in a block/complete relationship and verify that this permits no such cycles?

Sounds like a good idea. For now, I'd like to ensure this can be made to pass CI at all, I think there could be many cases of just this first assertion failing. I might defer your suggestion to a follow-up.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 23, 2024
As suggested in elastic#108934, we can extract the exact executor name from the
thread name with some simple string manipulations. Using this utility,
this commit tightens up the existing assertions about the current
executor.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 23, 2024
As suggested in elastic#108934, we can extract the exact executor name from the
thread name with some simple string manipulations. Using this utility,
this commit tightens up the existing assertions about the current
executor.

Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
Copy link
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good

// visible for tests
static String executorName(Thread thread) {
String name = thread.getName();
int executorNameEnd = name.lastIndexOf(']', name.length() - 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could use a regex here? maybe that's clearer? (famous last words)

@henningandersen henningandersen added the :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. label May 23, 2024
elasticsearchmachine pushed a commit that referenced this pull request May 23, 2024
As suggested in #108934, we can extract the exact executor name from the
thread name with some simple string manipulations. Using this utility,
this commit tightens up the existing assertions about the current
executor.

Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
* This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This
* is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future.
*/
public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted to add this unsafe variant that is then used in all places where I found conflicts, since fixing them all would make this PR size unmanageable. Using this we can now fix them one by one, spreading out the load too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 can we mark this as @Deprecated(forRemoval = true) so that IDEs highlight its usages?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a task/issue/ticket on fixing them one-by-one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one by one is too granular. I'll probably create area level ones (perhaps a bit more granular, for instance the one in AbstractClient needs a separate one).

@henningandersen henningandersen marked this pull request as ready for review May 27, 2024 08:34
@henningandersen
Copy link
Contributor Author

This now has a clean build so marking ready for review. I'll do more CI work to scrape up more occurrences before merge anyway (which can result in more unsafe future usages).

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff, I like it.

* This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This
* is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future.
*/
public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 can we mark this as @Deprecated(forRemoval = true) so that IDEs highlight its usages?

private static class RefCountedFuture<R extends RefCounted> extends PlainActionFuture<R> {
// todo: the use of UnsafePlainActionFuture here is quite broad, we should find a better way to be more specific
// (unless making all usages safe is easy).
private static class RefCountedFuture<R extends RefCounted> extends UnsafePlainActionFuture<R> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes. I suspect we can/should move all usages of this into tests, and do the ref-counting (and asyncification) properly in prod code. But I see that's not a small change. Ok for now...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also pondered on this for a short while, but decided to defer this issue for now. I think I agree to move to all prod client interactions being async now. If it is too big we would need to stop notifying on generic thread pool (one more pool maybe, hopefully as an interim step). We can discuss more when we tackle it.

Copy link
Contributor

@kingherc kingherc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as long as CI is happy

* This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This
* is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future.
*/
public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a task/issue/ticket on fixing them one-by-one?

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice idea!

@henningandersen henningandersen merged commit a4bd256 into elastic:main May 28, 2024
14 checks passed
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request May 28, 2024
Enable the assertion introduced in elastic#108934
@henningandersen henningandersen added the Team:Distributed Meta label for distributed team label May 28, 2024
henningandersen added a commit that referenced this pull request May 28, 2024
Enable the assertion introduced in #108934
JVerwolf pushed a commit to JVerwolf/elasticsearch that referenced this pull request May 28, 2024
A common deadlock pattern is waiting and completing a future on the same executor.
This only works until the executor is fully depleted of threads. Now assert that
waiting for a future to be completed and the completion happens on different
executors.

Introduced UnsafePlainActionFuture, used in all offending places, allowing those
to be tackled independently.
JVerwolf pushed a commit to JVerwolf/elasticsearch that referenced this pull request May 28, 2024
craigtaverner pushed a commit to craigtaverner/elasticsearch that referenced this pull request Jun 11, 2024
A common deadlock pattern is waiting and completing a future on the same executor.
This only works until the executor is fully depleted of threads. Now assert that
waiting for a future to be completed and the completion happens on different
executors.

Introduced UnsafePlainActionFuture, used in all offending places, allowing those
to be tackled independently.
craigtaverner pushed a commit to craigtaverner/elasticsearch that referenced this pull request Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >non-issue Team:Distributed Meta label for distributed team v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants