Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-4046] [runtime] Add direct state transition from RESTARTING to FAILED #2095

Closed

Conversation

tillrohrmann
Copy link
Contributor

A job can get stuck in FAILING if fail is called on a restarting job which has
not yet reset its ExecutionJobVertices, because these vertices would not call
jobVertexInFinalState. This method, however, must be called in order to transition
from FAILING to FAILED. In order to solve the problem, this PR introduces a direct
state transition from RESTARTING to FAILED, if fail is called when being in state
RESTARTING.

… FAILED

A job can get stuck in FAILING if fail is called on a restarting job which has
not yet reset its ExecutionJobVertices, because these vertices would not call
jobVertexInFinalState. This method, however, must be called in order to transition
from FAILING to FAILED.
@uce
Copy link
Contributor

uce commented Jun 22, 2016

+1 to merge. Both the direct state transition and test are good.

@uce
Copy link
Contributor

uce commented Jun 22, 2016

PS: Should the early abort check for the CANCELED state in ExecutionGraph.restart() be extended to any terminal state? The restart method assumes that it's the only one transitioning away from RESTARTING, which is not the case any more. I think it's not needed, but might lead to confusing log messages when fail and restart are called concurrently.

@tillrohrmann
Copy link
Contributor Author

Thanks for the review @uce. I think you're right and we should also accept the FAILED state when calling restart and simply print a log message instead of throwing the IllegalStateException. But I'm not so sure for any terminal state because we should not have reached the state FINISHED. This usually indicates a failure and should not go unnoticed.

I will add the FAILED check and then merge the PR.

@asfgit asfgit closed this in cfe6293 Jun 23, 2016
@uce
Copy link
Contributor

uce commented Jun 23, 2016

Agreed. I had the same thought regarding checking for any final state right after posting this. ;) +1 to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants