Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-21401] Consolidate JobGraph generation to properly instantiate streaming and batch JobGraphs #15093

Merged
merged 25 commits into from Mar 6, 2021

Conversation

tillrohrmann
Copy link
Contributor

This PR consolidates the JobGraph generation into as few places as possible. Moreover it introduces explicit factory methods for streaming and batch JobGraphs (JobType.STREAMING and JobType.BATCH) in order to better classify which tests can be run with the AdaptiveScheduler and which strictly require the DefaultScheduler.

This PR also introduces a JobGraphBuilder which can be used to make the JobGraph an immutable data structure. This builder is then used for the more complex JobGraph creations for tests.

Last but not least, this PR adapts some failing tests (mainly DispatcherTest) to also work with the AdaptiveScheduler.

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 5, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit ff8af80 (Fri Mar 05 09:05:42 UTC 2021)

Warnings:

  • 1 pom.xml files were touched: Check for build and licensing issues.
  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 5, 2021

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Copy link
Contributor

@rmetzger rmetzger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for all the cleanups! Except for one minor oversight, I couldn't find any major issues in the change.

+1 to merge once CI passed.

JobGraphBuilder.newStreamingJobGraphBuilder()
.addJobVertices(Arrays.asList(source, target))
.addClasspaths(Collections.singletonList(jarFileInJobGraph.toUri().toURL()))
.build();

jobGraph.setClasspaths(Collections.singletonList(jarFileInJobGraph.toUri().toURL()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
jobGraph.setClasspaths(Collections.singletonList(jarFileInJobGraph.toUri().toURL()));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Will remove it.

@@ -260,100 +245,6 @@ public static void teardownClass() {
}
}

@Test
public void testDeclineCheckpointInvocationWithUserException() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have added a reference to this discussion. Thanks for digging it up.

fail("Previous statement should have failed");
} catch (ExecutionException t) {
assertTrue(t.getCause() instanceof UnavailableDispatcherOperationException);
}

// submission has succeeded, let the initialization finish.
blockingJobGraph.f1.unblock();
latch.trigger();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much nicer than the Tuple2 with the vertex! Thanks for the cleanup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Tuple2 is actually quite ugly. We should try to never use it if possible.

.setJobId(jobId)
.addJobVertex(blockingJobVertex)
.build(),
blockingJobVertex);
}

private static class FailingJobVertex extends JobVertex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the various cleanups! I took some notes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added a couple of tests which relied on scheduler details. I think before we did not have the necessity to separate tests thoroughly.

tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Mar 5, 2021
The AdaptiveScheduler needs to make the JobVertices serializable and let them use
static fields to communicate with the test because the JobGraph is copied.

This closes apache#15093.
@tillrohrmann
Copy link
Contributor Author

Thanks for the review @rmetzger. I've addressed your comments. I will merge this PR once AZP gives green light.

tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Mar 5, 2021
The AdaptiveScheduler needs to make the JobVertices serializable and let them use
static fields to communicate with the test because the JobGraph is copied.

This closes apache#15093.
…hUserException to not need JobMaster instance
…endent of underlying JobMaster implementation

This commit makes the DispatcherTest.testErrorDuringInitialization independent of the underlying
JobMaster implementation by splitting it into DispatcherTest.testJobManagerRunnerInitializationFailureFailsJob
and JobManagerRunnerImplTest.testJobMasterCreationFailureCompletesJobManagerRunnerWithInitializationError which
tests in two steps what DispatcherTest.testErrorDuringInitialization tested by relying on the DefaultScheduler
to eagerly create an ExecutionGraph and to fail.
… not depend on underlying Scheduler implementation

By using the BlockingJobManagerRunnerFactory it is possible to decouple the DispatcherTest.testInvalidCallDuringInitialization
from the underlying Scheduler implementation which was used for blocking the creation of the JobManagerRunner before.
…ndependent of Scheduler implementation

By letting the DispatcherTest.testWaitingForJobMasterLeadership use the TestingJobManagerRunnerFactory we can abstract this
test from the implementation details of the Scheduler and when it shows which JobStatus.
…dToExecutionGraph independent of Scheduler implementation

Change Dispatcher.testInitializationTimestampForwardedToExecutionGraph to testInitializationTimestampForwardedToJobManagerRunner
which only tests that the initialization timestamp is forwarded to the JobManagerRunner. Additionally, this commit adds
AdaptiveSchedulerTest.testExecutionGraphGenerationSetsInitializationTimestamp, AdaptiveSchedulerTest.testInitializationTimestampForwarding
and DefaultSchedulerTest.testCorrectSettingOfInitializationTimestamp to test the timestamp forwarding for the different Scheduler
implementations.
…BeStarted independent of Scheduler implementation

This commit makes the DispatcherTest.testFatalErrorIfRecoveredJobsCannotBeStarted independent of the scheduler
implementation by using the TestingJobManagerRunnerFactory and letting the TestingJobManagerRunner complete
with an initialization error.
…ndent of underlying scheduler

Make DispatcherTest.testNonBlockingJobSubmission independent of underlying scheduler by using the
BlockingJobManagerRunnerFactory and blocking the creation of the JobManagerRunner.
…ot needed

The leader elections are only needed where we instantiate a proper JobManagerRunnerImpl which
needs leader election.
The AdaptiveScheduler needs to make the JobVertices serializable and let them use
static fields to communicate with the test because the JobGraph is copied.

This closes apache#15093.
…or the adaptive scheduler

The adaptive scheduler does not support stop with savepoint yet. Therefore, this test cannot work.
…otCreation and .testStopWithSavepointFailingAfterSnapshotCreation for the AdaptiveScheduler

The AdaptiveScheduler does not support stopping jobs with savepoint yet.
…ncreasing resource timeout to 30s for the AdaptiveScheduler

The test seems to take some time to start the required TaskManagers. This can sometimes lead to resource timeouts in the
WaitingForResources state if it takes too long. This commit fixes this problem by increasing the resource timeout to 30s.
@tillrohrmann tillrohrmann merged commit 364a8ef into apache:master Mar 6, 2021
@tillrohrmann tillrohrmann deleted the FLINK-21401 branch March 6, 2021 12:25
autophagy pushed a commit to autophagy/flink that referenced this pull request Mar 16, 2021
The AdaptiveScheduler needs to make the JobVertices serializable and let them use
static fields to communicate with the test because the JobGraph is copied.

This closes apache#15093.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants