-
Notifications
You must be signed in to change notification settings - Fork 13.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-15918][FLINK-15917] Uptime Metric not reset on Job Restart / Root Exception not shown in Web UI #11032
Conversation
354d125
to
a1422e1
Compare
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 354d125 (Thu Feb 06 14:19:08 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
CI report:
Bot commandsThe @flinkbot bot supports the following commands:
|
@@ -210,6 +216,9 @@ private void restartTasksWithDelay(final FailureHandlingResult failureHandlingRe | |||
final Set<ExecutionVertexVersion> executionVertexVersions = | |||
new HashSet<>(executionVertexVersioner.recordVertexModifications(verticesToRestart).values()); | |||
|
|||
transitionExecutionGraphState(JobStatus.RUNNING, JobStatus.RESTARTING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract method
@@ -223,6 +232,11 @@ private Runnable restartTasks(final Set<ExecutionVertexVersion> executionVertexV | |||
return () -> { | |||
final Set<ExecutionVertexID> verticesToRestart = executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions); | |||
|
|||
verticesWaitingForRestart.removeAll(verticesToRestart); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract method
I've started a separate build on travis-ci.com. You can track the progress here: https://travis-ci.com/flink-ci/flink/builds/147751241 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this PR @GJL. I think at the moment we cannot directly merge the PR. The problem is that ExecutionGraph.cancel
will transition the ExecutionGraph
directly into a globally terminal state if it is in state RESTARTING
. Maybe we could remove this short cut and handle the RESTARTING
case as we handle RUNNING
and CREATED
.
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/DefaultSchedulerTest.java
Show resolved
Hide resolved
https://api.travis-ci.org/v3/job/646941244/log.txt
Shouldn't be related? |
Looks this is not an issue. Just that the upstream task was canceled earlier and triggered a CancelTaskException of the downstream task which is still processing data on the same TM. |
I was just about to post that this is not a problem. It has been fixed on master but not on release-1.10. See https://issues.apache.org/jira/browse/FLINK-15751?focusedCommentId=17024366&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17024366 |
https://travis-ci.com/flink-ci/flink/builds/147754663 seems to pass as well. Could we check whether we have a test ensuring that when we are in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR @GJL . The change looks good to me.
+1 to merge it.
This fix ensures the backward compatibility of several metrics. However, I think some of the metrics are not very accurate now since region failover and batch scheduling are used by more and more users. I think in 1.11 we should revisit these metrics to see how to fix them for all these new scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comments @GJL. LGTM. I had a single comment concerning the newly added test. Once this comment is resolved +1 for merging.
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/DefaultSchedulerTest.java
Show resolved
Hide resolved
…start The 'uptime' metric is the time difference between 'now' and the timestamp when the job transitioned to state RUNNING. The new scheduler until now never transitioned out of status RUNNING when restarting tasks which had the effect that the 'uptime' metric was not reset after a restart. This introduces new state transitions to the job. We transition the job status to RESTARTING if at least one ExecutionVertex is waiting to be restarted, and we transition from RESTARTING immediately to RUNNING again after the restart. This closes apache#11032.
…start The 'uptime' metric is the time difference between 'now' and the timestamp when the job transitioned to state RUNNING. The new scheduler until now never transitioned out of status RUNNING when restarting tasks which had the effect that the 'uptime' metric was not reset after a restart. This introduces new state transitions to the job. We transition the job status to RESTARTING if at least one ExecutionVertex is waiting to be restarted, and we transition from RESTARTING immediately to RUNNING again after the restart. This closes #11032.
…start The 'uptime' metric is the time difference between 'now' and the timestamp when the job transitioned to state RUNNING. The new scheduler until now never transitioned out of status RUNNING when restarting tasks which had the effect that the 'uptime' metric was not reset after a restart. This introduces new state transitions to the job. We transition the job status to RESTARTING if at least one ExecutionVertex is waiting to be restarted, and we transition from RESTARTING immediately to RUNNING again after the restart. This closes apache#11032.
…start The 'uptime' metric is the time difference between 'now' and the timestamp when the job transitioned to state RUNNING. The new scheduler until now never transitioned out of status RUNNING when restarting tasks which had the effect that the 'uptime' metric was not reset after a restart. This introduces new state transitions to the job. We transition the job status to RESTARTING if at least one ExecutionVertex is waiting to be restarted, and we transition from RESTARTING immediately to RUNNING again after the restart. This closes apache#11032.
What is the purpose of the change
(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)
Brief change log
(for example:)
Verifying this change
(Please pick either of the following options)
This change is a trivial rework / code cleanup without any test coverage.
(or)
This change is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (yes / no)Documentation