Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backport] [FLINK-4932] [exec graph] Failing in state RESTARTING only fails the EG if no more restarts are possible #2711

Conversation

tillrohrmann
Copy link
Contributor

This PR is a backport of #2710 for the release-1.1 branch.

If in state RESTARTING a failure occurs (ExecutionGraph.fail is called), then a new restart attempt is started. Only if the restart strategy no longer allows further restarts or if the thrown exception is of type SuppressRestartsException a job can go from RESTARTING into FAILED.

@uce
Copy link
Contributor

uce commented Oct 28, 2016

Code changes and updated figure look good to me. +1 to merge after Travis passes. It makes sense that for instance a job with infinite restarts will only be failed if the job is explicitly suspended or forced to suppress restarts. I was curious how you noticed this issue? Did a user run into this?

@tillrohrmann
Copy link
Contributor Author

Thanks for the review @uce. @StephanEwen actually discovered this problem. The problem was related to #2700. When the scheduleOrUpdateConsumers call was triggered when in state restarting due to a late rpc, it would fail the complete ExecutionGraph, because it could not find the respective Execution.

Will rebase the PR, because release-1.1 contained a problem which prevented it from building.

…EG if no more restarts are possible

If in state RESTARTING a failure occurs, then a new restart attempt is started. Only if the
restart strategy no longer allows further restarts or if the thrown exception is of type
SuppressRestartsException a job can go from RESTARTING into FAILED.

Fix failing test cases: ExecutionGraphMetricsTest and ExecutionGraphRestartTest
@StephanEwen
Copy link
Contributor

Thanks for the fix!
Looks good, merging this...

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Oct 31, 2016
…nly fails terminally if no more restarts are possible

If in state RESTARTING a failure occurs, then a new restart attempt is started. Only if the
restart strategy no longer allows further restarts or if the thrown exception is of type
SuppressRestartsException a job can go from RESTARTING into FAILED.

Fix failing test cases: ExecutionGraphMetricsTest and ExecutionGraphRestartTest

This closes apache#2711
@tillrohrmann
Copy link
Contributor Author

Closing this because it has been merged into the release 1.1 branch.

@tillrohrmann tillrohrmann deleted the backportFixRestartingState branch November 1, 2016 14:02
skidder pushed a commit to muxinc/flink that referenced this pull request Dec 27, 2016
…nly fails terminally if no more restarts are possible

If in state RESTARTING a failure occurs, then a new restart attempt is started. Only if the
restart strategy no longer allows further restarts or if the thrown exception is of type
SuppressRestartsException a job can go from RESTARTING into FAILED.

Fix failing test cases: ExecutionGraphMetricsTest and ExecutionGraphRestartTest

This closes apache#2711
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants