[FLINK-5830] [Distributed Coordination] [FLIP-6] Handle OutOfMemory error during process async message in akka rpc actor #3360
Conversation
I would suggest that we adopt the following pattern for all the places like the one in this pull request where we catch Throwables:

```java
try {
    ...
} catch (Throwable t) {
    ExceptionUtils.rethrowIfFatalErrorOrOOM(t);
    // the other handling logic...
}
```

This requires adding the function `ExceptionUtils.rethrowIfFatalErrorOrOOM(Throwable)`. It would be even nicer if we could do something like Scala supports, but I think there is no better way to do this in Java than the way suggested above:

```scala
try {
    ...
} catch {
    case NonFatal(t) => // does not include OOM and internal errors
}
```
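For reference, a minimal sketch of what such a helper could look like (a sketch only; the actual Flink `ExceptionUtils` may classify fatal errors differently):

```java
public final class ExceptionUtils {

    /**
     * Rethrows the given Throwable if it is a fatal JVM error or an
     * OutOfMemoryError; otherwise returns so the caller can run its
     * regular handling logic.
     */
    public static void rethrowIfFatalErrorOrOOM(Throwable t) {
        if (t instanceof InternalError
                || t instanceof UnknownError
                || t instanceof ThreadDeath
                || t instanceof OutOfMemoryError) {
            throw (Error) t;
        }
    }

    private ExceptionUtils() {}
}
```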
@StephanEwen, thank you for the quick review! That is a good idea to add a uniform way in the utils so we can use it anywhere. I will fix it per your suggestion later today.
Force-pushed from 1365c6d to ad6362b (… process async message in akka rpc actor)
@StephanEwen, I have already submitted the modifications.
I think adding this safety net makes sense and protects against a corrupted state. However, isn't the root cause of the described problem that the JobMaster-TaskExecutor communication does not tolerate a lost message? Maybe we should introduce an acknowledge message which signals the correct reception of the status message. If the response times out, we can either retry sending it or fail.
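A rough sketch of how such an acknowledge/retry scheme could look on the TaskExecutor side; all names below are assumptions for illustration, not the actual Flink API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Illustrative interfaces, not the real Flink gateways.
interface JobMasterGateway {
    CompletableFuture<Void> updateTaskExecutionState(String finalState);
}

interface FatalErrorHandler {
    void onFatalError(Throwable cause);
}

class FinalStateNotifier {
    private final JobMasterGateway jobMasterGateway;
    private final FatalErrorHandler fatalErrorHandler;

    FinalStateNotifier(JobMasterGateway gateway, FatalErrorHandler handler) {
        this.jobMasterGateway = gateway;
        this.fatalErrorHandler = handler;
    }

    /** Resend the final-state message on timeout; fail once retries run out. */
    void notifyFinalStateWithRetry(String finalState, int retriesLeft) {
        jobMasterGateway.updateTaskExecutionState(finalState)
            .whenComplete((ack, failure) -> {
                if (failure == null) {
                    return; // JobMaster confirmed reception of the final state
                }
                if (failure instanceof TimeoutException && retriesLeft > 0) {
                    notifyFinalStateWithRetry(finalState, retriesLeft - 1);
                } else {
                    fatalErrorHandler.onFatalError(failure); // give up and fail
                }
            });
    }
}
```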
Hi @tillrohrmann, thank you for the review and the positive suggestions! Let me explain the root cause of this issue first. On the JobMaster side, it sends the cancel rpc message and gets the acknowledgement from the TaskExecutor, after which the execution state transitions to canceling. On the TaskExecutor side, it notifies the JobMaster of the final state before the task exits.

The problem is that an OOM error may occur before that notification is triggered, so the final-state message is silently lost. For this key interaction between TaskExecutor and JobMaster, a lost message should not be tolerated, and I agree with your suggestion above. So there may be two ideas for this improvement: handle the error on one side only (catch the fatal error in the TaskExecutor and let the process exit, as this pull request does), or introduce the acknowledge/retry mechanism between TaskExecutor and JobMaster that you described.

I prefer the first way, keeping the fix on one side and avoiding a more complex interaction between TaskExecutor and JobMaster. I look forward to your further suggestions or ideas.
Looking at this from another angle: if any scheduled Runnable ever lets an exception bubble out, can we still assume that the JobManager is in a sane state? Or should we actually make every uncaught exception in the RPC executors a fatal error and shut the process down?
Thanks for the clarification @zhijiangW. I now understand the problem that we effectively introduce by catching all Throwables in the actor. I agree with Stephan that it is hard to reason about the consistency of the internal state once an exception has bubbled out, so the conservative approach is to treat every uncaught exception as fatal. Related to this is also how we handle exceptions in the other rpc components; it would make sense to follow a similar pattern there.
@StephanEwen, if the exception bubbles out and causes the TaskExecutor to exit as a result, I think the JobMaster can be assumed to reach a sane state eventually, based on its detection of the TaskExecutor failure. The current solution only covers OOM errors, but maybe it can be extended to any exception, because it is difficult to confirm the consistency of the internal state and the conservative approach is to let it terminate, as @tillrohrmann said. If I understand @tillrohrmann's suggestion correctly, every uncaught exception in the RPC actor should then be treated as a fatal error.
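A small sketch of that idea, wrapping every scheduled Runnable so that an uncaught exception is escalated instead of swallowed (the `FatalErrorHandler` callback is assumed for illustration):

```java
// Sketch only: FatalErrorHandler is an assumed callback, not the real API.
interface FatalErrorHandler {
    void onFatalError(Throwable cause);
}

final class GuardedRunnables {
    /** Wraps a Runnable so any uncaught Throwable is escalated as fatal. */
    static Runnable guarded(Runnable runnable, FatalErrorHandler errorHandler) {
        return () -> {
            try {
                runnable.run();
            } catch (Throwable t) {
                // the internal state can no longer be trusted;
                // terminate rather than continue
                errorHandler.onFatalError(t);
            }
        };
    }
}
```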
Merging this...
…errors during process async message in akka rpc actor. This closes apache#3360
If an OOM error is caught while processing async messages in `AkkaRpcActor`, it will bring ambiguous behavior and may lose rpc messages. If the lost message is the one notifying the final state from the `TaskExecutor`, the `JobMaster` can no longer receive the final state while failing the job, and the job may get stuck in the final state transition.

The solution is to catch this special error in `AkkaRpcActor` and rethrow it, which results in shutting down the `ActorSystem` and exiting the `TaskExecutor` process. The `JobMaster` can then become aware of that and restart the job if necessary.

BTW, `mvn clean verify` is successful on my machine.
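For illustration, a minimal sketch of the described fix (simplified, not the exact patch; `handleAsyncMessage` is a placeholder for the actor's message-handling method):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: rethrow fatal errors / OOM so they escalate through Akka, shut
// down the ActorSystem, and let the TaskExecutor process exit, while
// non-fatal exceptions are logged and the actor keeps running.
class AsyncMessageHandlerSketch {
    private static final Logger LOG =
            LoggerFactory.getLogger(AsyncMessageHandlerSketch.class);

    void handleAsyncMessage(Runnable message) {
        try {
            message.run();
        } catch (Throwable t) {
            // helper from the review discussion above
            ExceptionUtils.rethrowIfFatalErrorOrOOM(t);
            LOG.error("Caught exception while processing an async message.", t);
        }
    }
}
```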