Skip to content
This repository has been archived by the owner on Aug 4, 2022. It is now read-only.

[REEF-1447] Validate Task close failure => FailedEvaluator Event (task has… #1042

Closed
wants to merge 1 commit into from

Conversation

afchung
Copy link
Contributor

@afchung afchung commented Jun 10, 2016

… not yet finished)

This addressed the issue by

  • Failing the Evaluator on an Exception in TaskCloseHandler.
  • Adding a test to test for close event failure while Task is still running. Validating that the TaskCloseHandler Exception triggers a FailedEvaluator.
  • Modify TaskRuntime and related tests to verify that the TaskStopHandler is not called upon Task failure.

JIRA:
REEF-1447

@@ -110,9 +110,9 @@ public Thread StartTaskOnNewThread()
"Task running result:\r\n" + System.Text.Encoding.Default.GetString(result));
}
}
catch (TaskStopHandlerException e)
catch (TaskStopHandlerException stopException)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the clarification. But this is a pure syntactic change, maybe leave it out of this PR?

@markusweimer
Copy link
Contributor

I've done a pass.

@afchung
Copy link
Contributor Author

afchung commented Jun 10, 2016

@markusweimer replied to your comments. The current design only reports the earliest Exception to the user. In terms of when we encounter multiple Exception s, which idea seems better? Would like your input before implementing the next iteration. Thanks!

@afchung
Copy link
Contributor Author

afchung commented Jun 14, 2016

@markusweimer what are your thoughts on how to handle multiple Exceptions. Should we just log and ignore? Or should we create AggregateExceptions?

@@ -98,28 +98,48 @@ public void SetException(Exception e)
{
lock (_heartBeatManager)
{
try
if (_lastException.IsPresent())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to ensure failed task is only send once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@jwang98052
Copy link
Contributor

I have done a pass.

@afchung
Copy link
Contributor Author

afchung commented Jun 17, 2016

@jwang98052 @markusweimer I'll resolve the conflict once we reach a consensus on how to handle multiple Exceptions on a failed Task. Thanks!

{
if (!_lastException.IsPresent())
{
throw new ApplicationException("Expected failed Task to already have set an Exception.");
Copy link
Contributor

@andrey-me andrey-me Jun 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is optional in SetException...do we really want to generate new exception here?
To clarify, it seem to check for integrity of the Failed state in wrong place: If we want to fast fail we should fail in SetException().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only reach _state = Failed in SetException, and SetException has to set _lastException to a "present" Optional. Everything is also done with a lock held on _heartbeatManager. Thus, if we ever reach a state where _state == Failed but _lastException is not "present", we must have had some sort of application logic and we should not call SetException again. Instead, we should crash the Evaluator with an ApplicationException that is not caught in the upper layers.

Copy link
Contributor

@andrey-me andrey-me Jun 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, can we crash evaluator in the SetException right away?
Basically if we want to guarantee state integrity, it is easier to do during state change, otherwise we may need to do these checks in multiply places afterwards.


In reply to: 67907809 [](ancestors = 67907809)

@afchung afchung changed the title [REEF-1447] Validate Task close failure => FailedTask Event (task has… [REEF-1447] Validate Task close failure => FailedEvaluator Event (task has… Jun 21, 2016
@afchung
Copy link
Contributor Author

afchung commented Jun 21, 2016

@jwang98052 @markusweimer @andrey-me This is ready for another round. Thanks!

@jwang98052 note that we have changed the behavior of an Exception occurring in TaskCloseHandler. This may change the behavior in fault tolerance, when you are forcing Tasks to close. I've modified the tests to fit the behavior, but there might be things I'm missing. Please take a closer look. Thanks!

@afchung
Copy link
Contributor Author

afchung commented Jun 21, 2016

@jwang98052 for the motivation of the change, please read the JIRA. Thanks!

@jwang98052
Copy link
Contributor

It does break the current implementation of IMRU fault tolerant. I have opened REEF-1466 for the solution/discussion.

@markusweimer
Copy link
Contributor

It does break the current implementation of IMRU fault tolerant. I have opened REEF-1466 for the solution/discussion.

@jwang98052 As discussed over on JIRA, the reliance on exceptions thrown by a StopHandler to stop a Task does not work, regardless of whether or not this gets merged. Hence, shall we go ahead with this change?

@@ -41,6 +41,7 @@ internal sealed class TaskRuntime : IObserver<ICloseEvent>, IObserver<ISuspendEv
private readonly IInjectionFuture<IObserver<ISuspendEvent>> _suspendHandlerFuture;
private readonly IInjectionFuture<IObserver<ICloseEvent>> _closeHandlerFuture;
private int _taskRan = 0;
private int _taskClosed = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these be bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool does not work with Interlocked.

@markusweimer
Copy link
Contributor

I've done another pass.

}

State = TaskState.Failed;
_taskLifeCycle.Stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if task exception happens, don't call TaskStopHandler, right? In another word, TaskStopHandler is only for happen scenario, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@jwang98052
Copy link
Contributor

@markusweimer After REEF-1466 is resolved, we should be fine with this change. In REEF-1466, we will use cancellation token to let the task return properly from close handler instead of throw exception.
@afchung with this change, I would expect IMRUCloseTaskTest fail as it throws exception in task close handler and expecting FailedTask if the task cannot return properly. I don't know how it passed test somehow. I have updated this test in REEF-1466.

@afchung
Copy link
Contributor Author

afchung commented Jun 24, 2016

@markusweimer @jwang98052 I've addressed your comments, please have another look. Thanks!

@afchung
Copy link
Contributor Author

afchung commented Jun 27, 2016

@jwang98052 is this ready to be merged? I've updated IMRUCloseTaskTest in this PR to handle the failed Evaluator scenario, that is why it is passing.

@jwang98052
Copy link
Contributor

@afchung I see you modified TaskCloseTest, but didn't find the change for IMRUCloseTaskTest. Did I miss it?

@afchung
Copy link
Contributor Author

afchung commented Jun 27, 2016

@jwang98052 it seems like I didn't need to change it since the Tasks always complete. My understanding taking a glance as the code is that what the test does is it just runs broadcast and reduce without injecting any failure and hence there is no need for us to close the Tasks. Given that TestBroadcastAndReduce always runs until the Driver is completed, we always wait for all Tasks to finish.

@afchung
Copy link
Contributor Author

afchung commented Jun 27, 2016

We should probably change that test to include deterministic, injected failures.

@jwang98052
Copy link
Contributor

@afchung The test send close event to every task. First task should close with CompletedTask, others, in most cases will stuck in data reading and eventually exception will be throw. With old behavior, it should end up to FailedTask. With new behavior, it will end up to FailedEvaluator, Now they all end up to CompletedTask, that is why I don't understand. Looks like they are all completed before receiving close event.

I have changed it with REEF-1466, the behavior will be deterministic.

@afchung
Copy link
Contributor Author

afchung commented Jun 27, 2016

Should we merge REEF-1466 first then? What do you suggest? I'm not familiar with the test so I will not be modifying it.

@jwang98052
Copy link
Contributor

Theoretically, we should merge REEF-1469 first, I will then need to update REEF-1466 to resolve the conflict. Then REEF-1447. But if you have any dependency on it, we might want to move quicker as you have limited days left and I will ensure this test work eventually.

@afchung
Copy link
Contributor Author

afchung commented Jun 27, 2016

I have a few more tests that depend on classes in this PR (see #1057). Alternatively we can try to merge #1057 first. Either way I would like to get WaitingTask and EventHandle into the main codebase first.

@jwang98052
Copy link
Contributor

I will look #1057 as early as I can. Would it conflict with this one #1042?

@afchung
Copy link
Contributor Author

afchung commented Jun 28, 2016

Yes, it will conflict with this PR. I'll merge if conflict occurs. Thanks!

… not yet finished)

This addressed the issue by
  * Failing the Evaluator on an `Exception` in `TaskCloseHandler`.
  * Adding a test to test for close event failure while Task is still running. Validating that the `TaskCloseHandler` `Exception` triggers a `FailedEvaluator`.
  * Modify `TaskRuntime` and related tests to verify that the `TaskStopHandler` is not called upon `Task` failure.

JIRA:
  [REEF-1447](https://issues.apache.org/jira/browse/REEF-1447)
@afchung
Copy link
Contributor Author

afchung commented Jul 1, 2016

@jwang98052 @markusweimer I've resolved the conflict and rebased. Thanks!

@jwang98052
Copy link
Contributor

I will test and merge.

@asfgit asfgit closed this in 4ab6a8d Jul 1, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants