
[FLINK-5830] [Distributed Coordination] [FLIP-6] Handle OutOfMemory error during process async message in akka rpc actor #3360

Closed (wanted to merge 2 commits)

Conversation

@zhijiangW (Contributor) commented Feb 20, 2017

If an OOM error is caught while processing async messages in AkkaRpcActor, the behavior is ambiguous and RPC messages may be lost. If the lost message is the final-state notification from the TaskExecutor, the JobMaster can no longer receive the final state while failing the job, and the job may get stuck before reaching a final state.

The solution is to catch this specific error in AkkaRpcActor and rethrow it, which shuts down the ActorSystem and exits the TaskExecutor process. The JobMaster can then detect the failure and restart the job if necessary.

BTW, mvn clean verify succeeds on my machine.

@StephanEwen (Contributor) commented Feb 20, 2017

I would suggest that we adopt the following pattern for all the places like the one in this pull request where we catch Throwables:

try {
    ...
} catch (Throwable t) {
    ExceptionUtils.rethrowIfFatalErrorOrOOM(t);

     // the other handling logic...
}

This requires adding the function rethrowIfFatalErrorOrOOM(Throwable) to ExceptionUtils, similar to the method here: https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/util/ExceptionUtils.java#L109
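A minimal sketch of what such a utility could look like. The class name ExceptionUtilsSketch and the exact set of "fatal" error types are assumptions modeled on the linked ExceptionUtils, not the code that was eventually merged:

```java
// Hedged sketch of the proposed utility; the precise fatal-error
// classification here is an assumption, not Flink's final choice.
public final class ExceptionUtilsSketch {

    /**
     * Rethrows the given Throwable if it is an OutOfMemoryError or
     * another fatal JVM error; returns normally otherwise, so the
     * caller's catch block can continue with its regular handling.
     */
    public static void rethrowIfFatalErrorOrOOM(Throwable t) {
        // OutOfMemoryError extends VirtualMachineError, so this check
        // covers OOM, StackOverflowError, InternalError, etc.
        if (t instanceof VirtualMachineError || t instanceof ThreadDeath) {
            throw (Error) t;
        }
    }

    private ExceptionUtilsSketch() {}
}
```

With this shape, the pattern above stays a one-liner at the top of every catch (Throwable t) block: fatal errors escape immediately, everything else falls through to the existing handling logic.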

It would be even nicer if we could do something like what Scala supports, but I don't think there is a better way to do this in Java than the one suggested above:

try {
    ...
} catch {
    case NonFatal(t) => // does not include OOM and internal errors
}

@zhijiangW (Contributor, Author)

@StephanEwen, thank you for the quick review!

Adding a uniform way to the utils is a good idea, so we can use it anywhere.

I will fix it as you suggest later today.

淘江 added 2 commits February 21, 2017 14:36
@zhijiangW (Contributor, Author)

@StephanEwen, I have submitted the modifications.

@tillrohrmann (Contributor)

I think adding this safety net makes sense and protects against a corrupted state.

However, isn't the root cause of the described problem that the JobMaster-TaskExecutor communication does not tolerate a lost message? Maybe we should introduce an acknowledgement message which signals correct reception of the status message. If the response times out, we can either retry sending it or fail.

@zhijiangW (Contributor, Author) commented Feb 22, 2017

Hi @tillrohrmann, thank you for the review and the constructive suggestions!

Let me first explain the root cause of this issue:

From the JobMaster side, it sends the cancel RPC message and gets an acknowledgement from the TaskExecutor; the execution state then transitions to CANCELING.

From the TaskExecutor side, it notifies the JobMaster of the final state before the task exits. notifyFinalState can be divided into two steps:

  • The akka actor executes the RunAsync message, which is a Tell action, and this triggers unregisterTaskAndNotifyFinalState.
  • During unregisterTaskAndNotifyFinalState, the updateTaskExecutionState RPC message is triggered; this is an Ask action, so the mechanism can avoid lost messages.

The problem is that an OOM may occur before updateTaskExecutionState is triggered. This error is caught by AkkaRpcActor, which does nothing with it, so the subsequent processing is interrupted and updateTaskExecutionState is never executed.
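The failure mode described above can be illustrated outside of Flink in a few lines. All names here are stand-ins, not actual Flink code; the point is only that a catch (Throwable) block which swallows the error silently drops the follow-up notification:

```java
// Minimal illustration of the failure mode: a handler that catches
// Throwable and merely continues swallows the OutOfMemoryError, so the
// stand-in for updateTaskExecutionState never runs.
import java.util.concurrent.atomic.AtomicBoolean;

public class SwallowedErrorDemo {

    /** Simulates processing one RunAsync message the old way; returns
     *  whether the final-state notification was ever sent. */
    static boolean processWithSwallowingHandler() {
        AtomicBoolean finalStateSent = new AtomicBoolean(false);

        Runnable unregisterTaskAndNotifyFinalState = () -> {
            if (true) {                        // pretend an allocation fails here
                throw new OutOfMemoryError("simulated");
            }
            finalStateSent.set(true);          // never reached
        };

        try {
            unregisterTaskAndNotifyFinalState.run();
        } catch (Throwable t) {
            // old behavior: swallow and continue -- nothing is rethrown
        }
        return finalStateSent.get();
    }

    public static void main(String[] args) {
        System.out.println("final state sent: " + processWithSwallowingHandler());
    }
}
```

The notification is lost without any visible failure, which is exactly why the sender side cannot detect it with a plain Tell.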

For this key interaction between TaskExecutor and JobMaster we should not tolerate lost messages, and I agree with your suggestion above. There are two ideas for this improvement:

  • Enhance the robustness of notifyFinalState. The current rethrow of the OOM is an easy option, but it causes the TaskExecutor to exit; there should be other ways to reduce that cost.
  • After the JobMaster receives the cancel acknowledgement, it can start a timeout to check the execution's final state. If the execution has not entered a final state within the timeout, the JobMaster can send a message to the TaskExecutor again to confirm the status.

I prefer the first way, because it only touches one side and avoids complex interaction between TaskExecutor and JobMaster.

Looking forward to your further suggestions or ideas.

@StephanEwen (Contributor)

Looking at this from another angle: if any scheduled Runnable ever lets an exception bubble out, can we still assume that the JobManager is in a sane state? Or should we actually make every uncaught exception in the RPC executors a fatal error and send a notifyFatalError to the RpcEndpoint?

@tillrohrmann (Contributor)

Thanks for the clarification @zhijiangW. I now understand the problem: via RpcEndpoint.runAsync we effectively introduce another message which might get "lost" (e.g. due to an OOM error).

I agree with Stephan that it's hard to reason about the consistency of the AkkaRpcActor's internal state once we see an exception. The conservative approach would probably be to let it terminate or to call notifyFatalError to handle it.

Related to this is also how we handle exceptions in AkkaRpcActor.handleRpcInvocation. There we catch all exceptions and simply send them to the caller. I think this method should only send non-fatal exceptions back and terminate otherwise.

To follow a similar pattern for handleRunAsync, we could think about RpcEndpoint.runAsync returning a Future which is completed either exceptionally with non-fatal exceptions or normally once the Runnable has been executed. And in case we see a fatal exception, we terminate or call notifyFatalError. What do you think?
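A minimal sketch of this suggestion, with hypothetical names (runAsyncWithAck is not the actual Flink API, and the real version would complete the future from the actor thread rather than synchronously): the caller gets a future that acknowledges execution, non-fatal errors complete it exceptionally, and fatal errors escalate instead of being reported through the future.

```java
// Hedged sketch: acknowledge execution of a Runnable via a future,
// escalating fatal errors instead of completing the future with them.
import java.util.concurrent.CompletableFuture;

public class AckedRunAsyncSketch {

    static CompletableFuture<Void> runAsyncWithAck(Runnable runnable) {
        CompletableFuture<Void> ack = new CompletableFuture<>();
        try {
            runnable.run();
            ack.complete(null);               // Runnable executed: acknowledge
        } catch (Throwable t) {
            if (t instanceof VirtualMachineError) {
                // fatal (includes OutOfMemoryError): escalate, i.e. the
                // place to terminate or call notifyFatalError
                throw (Error) t;
            }
            ack.completeExceptionally(t);     // non-fatal: report to the caller
        }
        return ack;
    }
}
```

A caller that attaches a timeout to this future can distinguish "executed", "failed with a reportable exception", and "never acknowledged", which is what the plain Tell-based runAsync cannot express.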

@zhijiangW (Contributor, Author) commented Feb 23, 2017

@StephanEwen, if the exception bubbles out and causes the TaskExecutor to exit as a result, I think the JobMaster can be assumed to reach a sane state eventually, based on its detection of the TaskExecutor failure.

The current solution only covers the OOM error; maybe it can be extended to any exception, because it is difficult to confirm the consistency of the internal state, and the conservative approach is to let it terminate, as @tillrohrmann said.

If I understand @tillrohrmann's suggestion correctly, RpcEndpoint.runAsync would be modified to return a Future, similar to RpcEndpoint.callAsync but still using a Tell action in akka. This Future would be set as a field in RunAsync so it can be accessed when the message is handled in AkkaRpcActor. The Future can help determine whether the message was executed successfully or lost, strengthening the Tell mechanism. If a Future with a Tell action is better than the current RpcEndpoint.callAsync, which uses an Ask action, I will try to implement that. Another option is to simply tolerate lost messages in the current RpcEndpoint.runAsync and use it only in scenarios where efficiency matters and safety does not; important interactions should resort to RpcEndpoint.callAsync. What do you think?

@zhijiangW zhijiangW changed the title [FLINK-5830][Distributed Coordination] Handle OutOfMemory error during process async message in akka rpc actor [FLINK-5830] [Distributed Coordination] [FLIP-6] Handle OutOfMemory error during process async message in akka rpc actor Mar 3, 2017
@StephanEwen (Contributor)

Merging this...

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Mar 9, 2017
…errors during process async message in akka rpc actor

This closes apache#3360
@asfgit asfgit closed this in 527eabd Mar 9, 2017
p16i pushed a commit to p16i/flink that referenced this pull request Apr 16, 2017
…errors during process async message in akka rpc actor

This closes apache#3360
@zhijiangW zhijiangW deleted the FLINK-5830 branch August 6, 2017 13:57