-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-15966][runtime] Capture callstacks for RPC ask() calls to improve exceptions. #11048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 04fe5b7 (Mon Feb 10 11:57:00 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
04fe5b7
to
7102a13
Compare
public class AkkaOptions { | ||
|
||
/** | ||
* Timeout for akka ask calls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forget to modify this :) ?
|
||
// remove the stack frames coming from the proxy interface invocation | ||
final StackTraceElement[] stackTrace = callStackCapture.getStackTrace(); | ||
newException.setStackTrace(Arrays.copyOfRange(stackTrace, 3, stackTrace.length)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we filter the stack trace containing o.a.f.runtime.rpc.akka instead of using 3? Because others may change call stack of this function unintentionally in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to initially keep it like that. There is a test that guards this, so we should see it if someone breaks this by accident.
The check is slightly trickier than it looks, because the package name of the proxy class depends on visibility keyword of the interface, etc.
I would suggest to keep it simple for now and fix it when the test tells us that this simple solution is not enough. Then we should also know which cases cause it to break, what the involved classnames and package names are, etc.
|
||
static Throwable resolveTimeoutException(Throwable exception, @Nullable Throwable callStackCapture, Method method) { | ||
if (callStackCapture == null || (!(exception instanceof akka.pattern.AskTimeoutException))) { | ||
return exception; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The null branch implies that the exception type is dependent on the configuration option, which sounds like an accident waiting to happen if someone wants to handle timeouts explicitly as they would have to know what there are 2 types of timeouts that can occur.
How problematic would it be to always create a TimeoutException, and only have different behaviors for how the stacktrace is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be fine.
The main reason that I wanted this configurable is to have a way to turn call stack capture off, because during large RPC storms, this might add significantly to the memory consumption.
But always using a TimeoutException
and referring to the RPC method name should be uncritical.
Merged in 2746e6a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit late to the party but the changes look really nice. This will help big time when debugging RPC timeout exceptions :-)
This is a really nice feature for debugging, indeed! |
Should we backport this to 1.10, switched off by default? |
I couldn't find any direct usages of Apart from that the danger might be that users rely on Hence, we could backport it but if we want to be on the safe side, then we only introduce it with Flink 1.11. |
@tillrohrmann actually when I internally implement a job client depend on RPC, to support properly retry it depends on |
Happy to do it, but not in a bug fix release. |
What is the purpose of the change
This change preserves the call stack from the invocation of RPC ask() calls. When the call fails, this allows us to add a meaningful exception pointing to the original call site, rather than having just some weird stack trace from akka.
Note: While it is possible to do this for all types of ask() failures, this PR uses this approach for now only for timeouts.
Former exception message and stack trace:
New exceptions and stack trace:
Brief change log
akka.ask.callstack
(default: true)Verifying this change
You can verify this with any Flink deployment where something times out, for example deploy/cancel calls when a TaskManager is killed
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation