-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assert not same executor when completing future #108934
Assert not same executor when completing future #108934
Conversation
A common deadlock pattern is waiting and completing a future on the same executor. This only works until the executor is fully depleted of threads. Now assert that waiting for a future to be completed and the completion happens on different executors.
Opening as draft for now to see how CI behaves. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea. I think we can make this even stronger tho, we can still deadlock if we have a loop of executors all waiting on each other to complete. Can we declare up-front all pairs of executors that can be in a block/complete relationship and verify that this permits no such cycles?
@@ -266,6 +267,27 @@ public static String threadName(final String nodeName, final String namePrefix) | |||
return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]"; | |||
} | |||
|
|||
// to be used in assertions only. | |||
public static boolean differentExecutors(Thread thread1, Thread thread2) { | |||
assert thread1 != thread2 : "only call for different threads"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably want to return false
here rather than failing with an unhelpful message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was sort of on purpose in that, the method should only be used in situations where you are clearly comparing different threads. It seems that if you use this for the same thread your assertion is not written correctly. Hence more of an assertion on the assertion. Not too tied to this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, yeah, we can't be completing on the waiting thread because that thread is waiting...
(deserves a comment to avoid redoing that thinking tho)
server/src/main/java/org/elasticsearch/common/util/concurrent/EsExecutors.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea!
@@ -266,6 +267,26 @@ public static String threadName(final String nodeName, final String namePrefix) | |||
return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]"; | |||
} | |||
|
|||
// to be used in assertions only. | |||
public static boolean differentExecutors(Thread thread1, Thread thread2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
public static boolean differentExecutors(Thread thread1, Thread thread2) { | |
public static boolean assertDifferentExecutors(Thread thread1, Thread thread2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had that first, but since it no longer asserts that the executors are different but rather returns whether they are, I renamed it to avoid confusion.
// visible for tests | ||
static String executorName(Thread thread) { | ||
String name = thread.getName(); | ||
int executorNameEnd = name.lastIndexOf(']', name.length() - 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit A comment on why name.length() - 2
would be helpful. Why is it like that? I understand that probably there's another ]
you'd like to skip? But I don't remember how the thread name is formatted out of the top of my head. Reading
public static String threadName(final String nodeName, final String namePrefix) {
// TODO missing node names should only be allowed in tests
return "elasticsearch" + (nodeName.isEmpty() ? "" : "[") + nodeName + (nodeName.isEmpty() ? "" : "]") + "[" + namePrefix + "]";
}
and assuming namePrefix is the executor name, I'm not sure where the other ]
would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thread naming is something like XXXX[executor-name][T#NNN]
. We need to skip the running number part. The index passed into lastIndexOf
is inclusive.
I added a short comment on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could use a regex here? maybe that's clearer? (famous last words)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This time there's even two obligatory XCKDs:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the manual parsing done here, seems simplest.
Sounds like a good idea. For now, I'd like to ensure this can be made to pass CI at all, I think there could be many cases of just this first assertion failing. I might defer your suggestion to a follow-up. |
As suggested in elastic#108934, we can extract the exact executor name from the thread name with some simple string manipulations. Using this utility, this commit tightens up the existing assertions about the current executor.
As suggested in elastic#108934, we can extract the exact executor name from the thread name with some simple string manipulations. Using this utility, this commit tightens up the existing assertions about the current executor. Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good
// visible for tests | ||
static String executorName(Thread thread) { | ||
String name = thread.getName(); | ||
int executorNameEnd = name.lastIndexOf(']', name.length() - 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could use a regex here? maybe that's clearer? (famous last words)
server/src/main/java/org/elasticsearch/index/engine/Engine.java
Outdated
Show resolved
Hide resolved
As suggested in #108934, we can extract the exact executor name from the thread name with some simple string manipulations. Using this utility, this commit tightens up the existing assertions about the current executor. Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
Too many issues to fix in one PR, add a class that is used where we rely on notifying on same thread to at least have visibility.
…tor_when_completing_future
* This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This | ||
* is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future. | ||
*/ | ||
public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted to add this unsafe variant that is then used in all places where I found conflicts, since fixing them all would make this PR size unmanageable. Using this we can now fix them one by one, spreading out the load too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 can we mark this as @Deprecated(forRemoval = true)
so that IDEs highlight its usages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a task/issue/ticket on fixing them one-by-one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one by one is too granular. I'll probably create area level ones (perhaps a bit more granular, for instance the one in AbstractClient
needs a separate one).
This now has a clean build so marking ready for review. I'll do more CI work to scrape up more occurrences before merge anyway (which can result in more unsafe future usages). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff, I like it.
* This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This | ||
* is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future. | ||
*/ | ||
public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 can we mark this as @Deprecated(forRemoval = true)
so that IDEs highlight its usages?
private static class RefCountedFuture<R extends RefCounted> extends PlainActionFuture<R> { | ||
// todo: the use of UnsafePlainActionFuture here is quite broad, we should find a better way to be more specific | ||
// (unless making all usages safe is easy). | ||
private static class RefCountedFuture<R extends RefCounted> extends UnsafePlainActionFuture<R> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes. I suspect we can/should move all usages of this into tests, and do the ref-counting (and asyncification) properly in prod code. But I see that's not a small change. Ok for now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I also pondered on this for a short while, but decided to defer this issue for now. I think I agree to move to all prod client interactions being async now. If it is too big we would need to stop notifying on generic thread pool (one more pool maybe, hopefully as an interim step). We can discuss more when we tackle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as long as CI is happy
* This future is unsafe, since it allows notifying the future on the same thread pool executor that it is being waited on. This | ||
* is a common deadlock scenario, since all threads may be waiting and thus no thread may be able to complete the future. | ||
*/ | ||
public class UnsafePlainActionFuture<T> extends PlainActionFuture<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a task/issue/ticket on fixing them one-by-one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice idea!
Enable the assertion introduced in elastic#108934
Enable the assertion introduced in #108934
A common deadlock pattern is waiting and completing a future on the same executor. This only works until the executor is fully depleted of threads. Now assert that waiting for a future to be completed and the completion happens on different executors. Introduced UnsafePlainActionFuture, used in all offending places, allowing those to be tackled independently.
Enable the assertion introduced in elastic#108934
A common deadlock pattern is waiting and completing a future on the same executor. This only works until the executor is fully depleted of threads. Now assert that waiting for a future to be completed and the completion happens on different executors. Introduced UnsafePlainActionFuture, used in all offending places, allowing those to be tackled independently.
Enable the assertion introduced in elastic#108934
A common deadlock pattern is waiting and completing a future on the same executor. This only works until the executor is fully depleted of threads. Now assert that waiting for a future to be completed and the completion happens on different executors.
Introduced UnsafePlainActionFuture, used in all offending places, allowing those to be tackled independently.