[REEF-1447] Validate Task close failure => FailedEvaluator Event (task has… #1042
Conversation
@@ -110,9 +110,9 @@ public Thread StartTaskOnNewThread() | |||
"Task running result:\r\n" + System.Text.Encoding.Default.GetString(result)); | |||
} | |||
} | |||
catch (TaskStopHandlerException e) | |||
catch (TaskStopHandlerException stopException) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the clarification. But this is a pure syntactic change, maybe leave it out of this PR?
I've done a pass. |
@markusweimer replied to your comments. The current design only reports the earliest |
@markusweimer what are your thoughts on how to handle multiple Exceptions. Should we just log and ignore? Or should we create |
@@ -98,28 +98,48 @@ public void SetException(Exception e) | |||
{ | |||
lock (_heartBeatManager) | |||
{ | |||
try | |||
if (_lastException.IsPresent()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this to ensure failed task is only send once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
I have done a pass. |
@jwang98052 @markusweimer I'll resolve the conflict once we reach a consensus on how to handle multiple |
{ | ||
if (!_lastException.IsPresent()) | ||
{ | ||
throw new ApplicationException("Expected failed Task to already have set an Exception."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is optional in SetException...do we really want to generate new exception here?
To clarify, it seem to check for integrity of the Failed state in wrong place: If we want to fast fail we should fail in SetException().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only reach _state = Failed
in SetException
, and SetException
has to set _lastException
to a "present" Optional
. Everything is also done with a lock held on _heartbeatManager
. Thus, if we ever reach a state where _state == Failed
but _lastException
is not "present", we must have had some sort of application logic and we should not call SetException
again. Instead, we should crash the Evaluator with an ApplicationException
that is not caught in the upper layers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, can we crash evaluator in the SetException right away?
Basically if we want to guarantee state integrity, it is easier to do during state change, otherwise we may need to do these checks in multiply places afterwards.
In reply to: 67907809 [](ancestors = 67907809)
@jwang98052 @markusweimer @andrey-me This is ready for another round. Thanks! @jwang98052 note that we have changed the behavior of an |
@jwang98052 for the motivation of the change, please read the JIRA. Thanks! |
It does break the current implementation of IMRU fault tolerant. I have opened REEF-1466 for the solution/discussion. |
@jwang98052 As discussed over on JIRA, the reliance on exceptions thrown by a |
@@ -41,6 +41,7 @@ internal sealed class TaskRuntime : IObserver<ICloseEvent>, IObserver<ISuspendEv | |||
private readonly IInjectionFuture<IObserver<ISuspendEvent>> _suspendHandlerFuture; | |||
private readonly IInjectionFuture<IObserver<ICloseEvent>> _closeHandlerFuture; | |||
private int _taskRan = 0; | |||
private int _taskClosed = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these be bool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool
does not work with Interlocked
.
I've done another pass. |
} | ||
|
||
State = TaskState.Failed; | ||
_taskLifeCycle.Stop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if task exception happens, don't call TaskStopHandler, right? In another word, TaskStopHandler is only for happen scenario, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
@markusweimer After REEF-1466 is resolved, we should be fine with this change. In REEF-1466, we will use cancellation token to let the task return properly from close handler instead of throw exception. |
@markusweimer @jwang98052 I've addressed your comments, please have another look. Thanks! |
@jwang98052 is this ready to be merged? I've updated IMRUCloseTaskTest in this PR to handle the failed Evaluator scenario, that is why it is passing. |
@afchung I see you modified TaskCloseTest, but didn't find the change for IMRUCloseTaskTest. Did I miss it? |
@jwang98052 it seems like I didn't need to change it since the Tasks always complete. My understanding taking a glance as the code is that what the test does is it just runs broadcast and reduce without injecting any failure and hence there is no need for us to close the Tasks. Given that TestBroadcastAndReduce always runs until the Driver is completed, we always wait for all Tasks to finish. |
We should probably change that test to include deterministic, injected failures. |
@afchung The test send close event to every task. First task should close with CompletedTask, others, in most cases will stuck in data reading and eventually exception will be throw. With old behavior, it should end up to FailedTask. With new behavior, it will end up to FailedEvaluator, Now they all end up to CompletedTask, that is why I don't understand. Looks like they are all completed before receiving close event. I have changed it with REEF-1466, the behavior will be deterministic. |
Should we merge REEF-1466 first then? What do you suggest? I'm not familiar with the test so I will not be modifying it. |
Theoretically, we should merge REEF-1469 first, I will then need to update REEF-1466 to resolve the conflict. Then REEF-1447. But if you have any dependency on it, we might want to move quicker as you have limited days left and I will ensure this test work eventually. |
Yes, it will conflict with this PR. I'll merge if conflict occurs. Thanks! |
… not yet finished) This addressed the issue by * Failing the Evaluator on an `Exception` in `TaskCloseHandler`. * Adding a test to test for close event failure while Task is still running. Validating that the `TaskCloseHandler` `Exception` triggers a `FailedEvaluator`. * Modify `TaskRuntime` and related tests to verify that the `TaskStopHandler` is not called upon `Task` failure. JIRA: [REEF-1447](https://issues.apache.org/jira/browse/REEF-1447)
@jwang98052 @markusweimer I've resolved the conflict and rebased. Thanks! |
I will test and merge. |
… not yet finished)
This addressed the issue by
Exception
inTaskCloseHandler
.TaskCloseHandler
Exception
triggers aFailedEvaluator
.TaskRuntime
and related tests to verify that theTaskStopHandler
is not called uponTask
failure.JIRA:
REEF-1447