[FLINK-35159] Transition ExecutionGraph to RUNNING after slot assignment by zentol · Pull Request #24680 · apache/flink

zentol · 2024-04-18T13:14:58Z

Transitioning the ExecutionGraph to running implicitly starts periodic checkpoint triggering. This should only occur after all initialization steps have completed.
The existing code could leak the CheckpointCoordinator if '#handleExecutionGraphCreation' fails to 'tryToAssignSlots'; this eventually leads to a checkpoint being triggered for an old ExecutionGraph whose operator holders were not initialized yet, which causes an exception that crashes the JM.

We now force this to happen in the Executing constructor, at which point all the initialization steps must already be complete, and we can be sure that the EG gets transitioned into a terminal state when an error occurs (implicitly disabling checkpointing again).

We now do this only after slots have been assigned, right before the transition into Executing. At that point we either transition into executing (where we can be sure we'll transition into a terminal state eventually) or fail hard if an error occurs.
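The ordering problem can be illustrated with a minimal sketch. This is not Flink's code: `Graph`, `Outcome`, `createEagerly` and `createAfterSlotAssignment` are hypothetical stand-ins, and the only behaviour modelled is the one described above, namely that transitioning to RUNNING implicitly starts checkpoint triggering.

```java
final class SlotAssignmentSketch {
    enum JobStatus { INITIALIZING, RUNNING }

    /** Hypothetical stand-in for an ExecutionGraph; not Flink's class. */
    static final class Graph {
        private JobStatus state = JobStatus.INITIALIZING;
        private boolean checkpointingStarted = false;

        JobStatus getState() { return state; }
        boolean isCheckpointingStarted() { return checkpointingStarted; }

        /** Models the implicit side effect: entering RUNNING starts periodic checkpoints. */
        void transitionToRunning() {
            state = JobStatus.RUNNING;
            checkpointingStarted = true;
        }
    }

    /** Hypothetical result of graph creation plus the slot assignment attempt. */
    static final class Outcome {
        final Graph graph;
        final boolean enteredExecuting;

        Outcome(Graph graph, boolean enteredExecuting) {
            this.graph = graph;
            this.enteredExecuting = enteredExecuting;
        }

        /** True if the graph was abandoned while checkpoint triggering was already active. */
        boolean leaksCheckpointing() {
            return !enteredExecuting && graph.isCheckpointingStarted();
        }
    }

    /** Old ordering (assumed simplification): RUNNING first, slot assignment second. */
    static Outcome createEagerly(boolean slotsAssignable) {
        Graph graph = new Graph();
        graph.transitionToRunning();
        // If assignment fails, the graph is dropped with checkpointing still on.
        return new Outcome(graph, slotsAssignable);
    }

    /** New ordering: transition to RUNNING only once slots have been assigned. */
    static Outcome createAfterSlotAssignment(boolean slotsAssignable) {
        Graph graph = new Graph();
        if (slotsAssignable) {
            graph.transitionToRunning();
        }
        return new Outcome(graph, slotsAssignable);
    }

    public static void main(String[] args) {
        System.out.println("old ordering leaks on failure: "
                + createEagerly(false).leaksCheckpointing());
        System.out.println("new ordering leaks on failure: "
                + createAfterSlotAssignment(false).leaksCheckpointing());
    }
}
```

In this simplified model, a failed slot assignment under the new ordering leaves the graph in INITIALIZING, which matches the state asserted in the updated CreatingExecutionGraphTest.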

flinkbot · 2024-04-18T13:23:15Z

CI report:

368ba31 Azure: SUCCESS


zentol · 2024-04-18T17:01:26Z

meh, stop-with-savepoint failures need to be able to transition back into executing without recreating the EG :/

Blocking edges aren't supported by the AdaptiveScheduler in the first place, so there's no point in testing what happens when a savepoint is triggered for a job with blocking edges. This wasn't caught earlier because the test wasn't very good to start with.

dmvk

LGTM 👍 Thanks for digging up the root cause!

I was initially a bit worried about the testing part, but it feels like hardening testNotPossibleSlotAssignmentTransitionsToWaitingForResources should be enough, because the smoke tests should cover that this didn't break anything.

...me/src/test/java/org/apache/flink/runtime/scheduler/adaptive/CreatingExecutionGraphTest.java

dmvk · 2024-04-18T18:14:55Z

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java

    }

-    @Test
-    void testSavepointFailsWhenBlockingEdgeExists() throws Exception {


do you know why / when this was introduced? it indeed doesn't seem to be related to the AS; is this tested somewhere else?

This was added in FLINK-34371 and is covered by tests in the DefaultScheduler:

d4e0084#diff-b4bc1cd606feb86850a18371520b5dd63d02090b8567fdf80730d1d7dd6e693d

dmvk · 2024-04-18T18:16:23Z

...me/src/test/java/org/apache/flink/runtime/scheduler/adaptive/CreatingExecutionGraphTest.java

-                getGraph(new StateTrackingMockExecutionGraph()));
+                getGraph(executionGraph));
+
+        assertThat(executionGraph.getState()).isEqualTo(JobStatus.INITIALIZING);


iiuc, the graph would be running before the change; makes sense

the graph would be running before the change

yes

reswqa

Good catch! 👍

zentol force-pushed the 35159 branch from bac810f to 95ebfc1 Compare April 18, 2024 13:28

flinkbot added the component=Runtime/Coordination label Apr 18, 2024

zentol force-pushed the 35159 branch from 95ebfc1 to a80dddc Compare April 18, 2024 17:11

zentol changed the title ~~[FLINK-35159] Transition ExecutionGraph to RUNNING in Executing state~~ [FLINK-35159] Transition ExecutionGraph to RUNNING after slot assignment Apr 18, 2024

dmvk self-requested a review April 18, 2024 18:09

dmvk approved these changes Apr 18, 2024


[FLINK-35159] Transition ExecutionGraph to RUNNING after slot assignment

368ba31

zentol force-pushed the 35159 branch from a80dddc to 368ba31 Compare April 18, 2024 18:50

reswqa approved these changes Apr 19, 2024


zentol merged commit 131358b into apache:master Apr 19, 2024

zentol merged 2 commits into apache:master from zentol:35159
