Skip to content

[FLINK-35159] Transition ExecutionGraph to RUNNING after slot assignment#24680

Merged
zentol merged 2 commits intoapache:masterfrom
zentol:35159
Apr 19, 2024
Merged

[FLINK-35159] Transition ExecutionGraph to RUNNING after slot assignment#24680
zentol merged 2 commits intoapache:masterfrom
zentol:35159

Conversation

@zentol
Copy link
Contributor

@zentol zentol commented Apr 18, 2024

Transitioning the ExecutionGraph to running implicitly starts periodic checkpoint triggering. This should only occur after the all initialization steps have occurred.
The existing code could leak the CheckpointCoordinator if '#handleExecutionGraphCreation' fails to 'tryToAssignSlots'; this eventually leads to a checkpoint being triggered for old ExecutionGraph for which the operator holders were not initialized yet, which causes an exception that crashes the JM.

We now force this to happen in the Executing constructor, at which point all the initialization steps must already be complete, and we can be sure that the EG gets transitioned into a terminal state when an error occurs (implicitly disabling checkpointing again).

We now do this only after slots have been assigned, right before the transition into Executing. At that point we either transition into executing (where we can be sure we'll transition into a terminal state eventually) or fail hard if an error occurs.

@flinkbot
Copy link
Collaborator

flinkbot commented Apr 18, 2024

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@zentol
Copy link
Contributor Author

zentol commented Apr 18, 2024

meh, stop-with-savepoint failures need to be able to transition back into executing without recreating the EG :/

Blocking edges aren't supported by the AdaptiveScheduler in the first place, so there's no point in testing what happens when a savepoint is triggered for a job with blocking edges.
This wasn't caught earlier because the test wasn't very good to start with.
@zentol zentol changed the title [FLINK-35159] Transition ExecutionGraph to RUNNING in Executing state [FLINK-35159] Transition ExecutionGraph to RUNNING after slot assignment Apr 18, 2024
@dmvk dmvk self-requested a review April 18, 2024 18:09
Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Thanks for digging up the root cause!

I was initially bit worried about the testing part, but it feels that hardening testNotPossibleSlotAssignmentTransitionsToWaitingForResources should be enough, because smoke tests should cover that this didn't break anything.

}

@Test
void testSavepointFailsWhenBlockingEdgeExists() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know why / when this was introduced? it indeed doesn't seem to be related AS; is this tested somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added in FLINK-34371 and is covered by tests in the DefaultScheduler:

d4e0084#diff-b4bc1cd606feb86850a18371520b5dd63d02090b8567fdf80730d1d7dd6e693d

getGraph(new StateTrackingMockExecutionGraph()));
getGraph(executionGraph));

assertThat(executionGraph.getState()).isEqualTo(JobStatus.INITIALIZING);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc, the graph would be running before the change; makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the graph would be running before the change

yes

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! 👍

@zentol zentol merged commit 131358b into apache:master Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants