Skip to content
This repository has been archived by the owner on Aug 4, 2022. It is now read-only.

[REEF-1343] Fix events received in case of evaluator failure #961

Closed
wants to merge 2 commits into from

Conversation

afchung
Copy link
Contributor

@afchung afchung commented Apr 20, 2016

This addressed the issue by

  • Invoke RuntimeStop on Exception in RuntimeClock.
  • Fix handling of Exceptions in EvaluatorRuntime.

JIRA:
REEF-1343

This addressed the issue by
  * Invoke RuntimeStop on Exception in RuntimeClock.
  * Fix handling of Exceptions in EvaluatorRuntime.

JIRA:
  [REEF-1343](https://issues.apache.org/jira/browse/REEF-1343)
{
lock (_heartBeatManager)
{
Logger.Log(Level.Error, string.Format(CultureInfo.InvariantCulture, "evaluator {0} failed with exception", _evaluatorId), e);
Logger.Log(Level.Error, "evaluator {0} failed with exception {1}.", _evaluatorId, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

evaluator -> Evaluator

@markusweimer
Copy link
Contributor

Done with a first pass, mostly questions :)

@afchung
Copy link
Contributor Author

afchung commented Apr 21, 2016

@tcNickolas would be great if you can have a pass as well.


if (_state == State.RUNNING)
{
_state = State.DONE;
_heartBeatManager.OnNext();
OnException(runtimeStop.Exception);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: check existence of Exception.

@tcNickolas
Copy link
Member

Could you please describe the root cause of the issue in JIRA? Currently JIRA has only a description of symptoms, not the cause.

@afchung
Copy link
Contributor Author

afchung commented Apr 21, 2016

@markusweimer Addressed your comments.

@afchung
Copy link
Contributor Author

afchung commented Apr 21, 2016

@tcNickolas sure.

@afchung
Copy link
Contributor Author

afchung commented Apr 21, 2016

@tcNickolas updated the JIRA description with the cause. Thanks!

@tcNickolas
Copy link
Member

There are several exceptions logged which I'm a bit concerned about.

  1. In evaluator log,
    Org.Apache.REEF.Common.Tasks.Exceptions.TaskCloseHandlerNotBoundException: No EventHandler<CloseEvent> registered.
    Is it a problem if task doesn't have CloseEvent handler?
  2. In driver log,
    org.apache.reef.exception.EvaluatorException: java.lang.Exception: Exception sent, but can't be deserialized
    Does it mean we need to define PoisonException in some special way for it to be serializable?

@afchung
Copy link
Contributor Author

afchung commented Apr 21, 2016

@tcNickolas both Exceptions are out of the scope of this PR, but answering your questions

  1. Yes. This is by design, same goes in the Java side. This is because if an Task receives a CloseEvent when it shouldn't, it should not be ignored. In this case, when the Evaluator is force-shut-down, a CloseEvent is sent to the Task to try to warn it such that a user may have a chance to perform some sort of state-preserving operation. If the user does not choose to implement the Handler for CloseEvent, the Task will be force-shut-down anyhow.
  2. This is relevant to the work to be done in REEF-1286. If the Evaluator fails on a .NET failure, Java cannot deserialize it.

@tcNickolas
Copy link
Member

I see. We probably need to add a CloseEvent handler to tasks which we know are going to get shut down. But I can do this in REEF-1304

From test behavior point of view, this change looks good to me.

}
catch (Exception e)
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check if the sate is already DONE or any other state that might reach here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any other State other than INIT and RUNNING are essentially DONE.

@afchung
Copy link
Contributor Author

afchung commented Apr 22, 2016

@jwang98052 @markusweimer any other comments?

@markusweimer
Copy link
Contributor

LGTM, will test and merge.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants