[FLINK-13452] Ensure to fail global when exception happens during reseting tasks of regions #9268

Myasuka · 2019-07-30T04:30:18Z

What is the purpose of the change

Since FLINK-13060, createResetAndRescheduleTasksCallback runs inside another runnable, resetAndRescheduleTasks. Unfortunately, any exception thrown in createResetAndRescheduleTasksCallback makes the task terminate silently: the exception is only recorded in the outcome of the FutureTask. We should restore the previous behavior, in which failGlobal is called within the createResetAndRescheduleTasksCallback runnable.
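The failure mode described above can be reproduced with plain JDK classes. The following is a minimal sketch, not Flink code: it assumes a `ScheduledExecutorService` stands in for the executor that runs the callback, and the class name, method name, and exception message are hypothetical. It shows that an exception thrown by a scheduled runnable is stored in the returned future and only surfaces if someone inspects that future.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class SwallowedExceptionSketch {

    /**
     * Schedules a runnable that throws and returns the exception captured by the future,
     * or null if the runnable completed normally.
     */
    static Throwable captureScheduledFailure() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            // Hypothetical stand-in for the reset-and-reschedule callback failing.
            Runnable failingCallback = () -> {
                throw new IllegalStateException("reset failed");
            };
            ScheduledFuture<?> future = executor.schedule(failingCallback, 0, TimeUnit.MILLISECONDS);
            try {
                // Only this explicit get() reveals the failure; without it nothing is rethrown.
                future.get();
                return null;
            } catch (ExecutionException e) {
                return e.getCause();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return e;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("captured: " + captureScheduledFailure());
    }
}
```

If nobody calls `get()` on the returned future, the worker thread keeps running and the failure goes unnoticed, which is why the callback itself needs to trigger the global failover.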

Brief change log

59b1a6d5 :
- Let the runnable createResetAndRescheduleTasksCallback fail globally if it encounters any exception.
- Refine RegionFailoverITCase to mock a checkpoint store that fails the first time it recovers from a checkpoint.
b732229 : Refactor interface of RestartStrategy#restart and add UT to verify failGlobalIfErrorOnResetTasks
073d815
- Refactor the return value of RestartStrategy#restart to CompletableFuture
- Refactor the return value of method ExecutionGraph#allVerticesInTerminalState to align with the new RestartStrategy#restart
- Simplify AdaptedRestartPipelinedRegionStrategyNGFailoverTest with FailureMultiTimesRestartStrategy to throw exception directly
- Refactor RegionFailoverITCase with FailureMultiTimesRestartStrategy
- Add unit tests for FutureUtils#scheduleAsync

Verifying this change

This change added tests and can be verified as follows:

Refine RegionFailoverITCase to mock a checkpoint store that fails the first time it recovers from a checkpoint.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

flinkbot · 2019-07-30T04:32:23Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 3c6363e (Wed Aug 07 13:19:15 UTC 2019)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2019-07-30T04:36:20Z

CI report:

59b1a6d : FAILURE Build
b732229 : FAILURE Build
073d815 : CANCELED Build
d7c1f12 : FAILURE Build
7afbb0b : FAILURE Build
8e791fd : FAILURE Build
704c4d8 : CANCELED Build
958e248 : SUCCESS Build
92cf36a : CANCELED Build
3c6363e : SUCCESS Build

Myasuka · 2019-07-30T06:49:19Z

CI failed due to FLINK-13488 and because the Travis file cache was not found.

Myasuka · 2019-07-30T10:32:42Z

This fix still has a problem when failGlobal itself throws an exception, so I am closing this PR for now.

Myasuka · 2019-07-31T09:43:50Z

Reopen this wrt the comments from @zhuzhurk in FLINK-13452.

zhuzhurk

Thanks Yun for opening this PR. I have a few comments.

zhuzhurk · 2019-07-31T11:35:56Z

...rg/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java

 			cancelTasks(verticesToRestart)
-				.thenRunAsync(resetAndRescheduleTasks(globalModVersion, vertexVersions), executionGraph.getJobMasterMainThreadExecutor())
-				.handle(failGlobalOnError()));
+				.thenRunAsync(resetAndRescheduleTasks(globalModVersion, vertexVersions), executionGraph.getJobMasterMainThreadExecutor()));


I think failure handling for the first part is still needed.

zhuzhurk · 2019-07-31T11:38:44Z

...rg/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java

 				// re-schedule tasks
 				rescheduleTasks(unmodifiedVertices, globalModVersion);
 			} catch (GlobalModVersionMismatch e) {
 				throw new IllegalStateException(


All exceptions should be handled.

zhuzhurk · 2019-07-31T11:39:46Z

...rg/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java

-		};
-	}
-
-	private BiFunction<Object, Throwable, Object> failGlobalOnError() {


This method is still needed for failure handling of the first part of failover process.

zhuzhurk · 2019-07-31T11:42:16Z

...rg/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java

-			if (t != null) {
+			} catch (Throwable t) {
 				LOG.info("Unexpected error happens in region failover. Fail globally.", t);
 				failGlobal(t);


We can use FatalExitExceptionHandler to handle exceptions thrown from failGlobal directly.

tillrohrmann

Thanks for opening this PR @Myasuka. I think we cannot merge these changes because they change the handling of exceptionally completed cancel futures.

tillrohrmann · 2019-07-31T12:18:19Z

...rg/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java

 		FutureUtils.assertNoException(
 			cancelTasks(verticesToRestart)
-				.thenRunAsync(resetAndRescheduleTasks(globalModVersion, vertexVersions), executionGraph.getJobMasterMainThreadExecutor())
-				.handle(failGlobalOnError()));


I think we should not change this logic since cancelTasks can return an exceptionally completed future which would now cause the process to terminate.

Agreed, I am refactoring this PR by changing the result of RestartStrategy#restart to a CompletableFuture<?>
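The concern above is that the cancel future itself may complete exceptionally, so error handling has to stay at the end of the chain. The following is a minimal sketch with plain `CompletableFuture`, not the Flink implementation: `GlobalFailureRecorder` is a hypothetical stand-in for `failGlobal` that only remembers the error, and the chain runs synchronously instead of on the job master main thread executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BiFunction;

final class HandleAtEndOfChainSketch {

    /** Hypothetical stand-in for failGlobal: remembers the error instead of failing a job. */
    static final class GlobalFailureRecorder {
        final AtomicReference<Throwable> lastFailure = new AtomicReference<>();

        BiFunction<Object, Throwable, Object> failGlobalOnError() {
            return (ignored, t) -> {
                if (t != null) {
                    // Dependent stages see upstream failures wrapped in CompletionException.
                    Throwable cause =
                        (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
                    lastFailure.set(cause);
                }
                return null;
            };
        }
    }

    /** Runs cancel -> reset/reschedule -> handle and returns the recorded failure, if any. */
    static Throwable runChain(CompletableFuture<Void> cancelFuture, Runnable resetAndReschedule) {
        GlobalFailureRecorder recorder = new GlobalFailureRecorder();
        cancelFuture
            .thenRun(resetAndReschedule)          // skipped if cancelFuture failed
            .handle(recorder.failGlobalOnError()) // sees failures from either stage
            .join();                              // completes normally because handle() recovers
        return recorder.lastFailure.get();
    }
}
```

Because `handle` converts every failure into a normal completion, a wrapper that asserts "no exception" on the resulting future would only fire if the handler itself throws.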

tillrohrmann · 2019-07-31T12:20:14Z

...rg/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java

-			if (t != null) {
+			} catch (Throwable t) {
 				LOG.info("Unexpected error happens in region failover. Fail globally.", t);
 				failGlobal(t);


I think it is better to let the exception handling happen where it has been. Usually it is a good idea to do the exception handling at the highest possible level.

GJL · 2019-08-01T12:53:45Z

...k-runtime/src/main/java/org/apache/flink/runtime/executiongraph/restart/RestartStrategy.java

+	 * @return A {@link CompletableFuture} that will be completed when the restarting process is done.
 	 */
-	void restart(RestartCallback restarter, ScheduledExecutor executor);
+	CompletableFuture<?> restart(RestartCallback restarter, ScheduledExecutor executor);


I think that CompletableFuture<Void> would convey clearly that "nothing" is returned by the future.
Also, wildcards in return types should be avoided: https://stackoverflow.com/a/22815270

GJL · 2019-08-01T12:56:02Z

flink-runtime/src/main/java/org/apache/flink/runtime/concurrent/FutureUtils.java

+	 * @param <T> type of the result
+	 * @return Future which schedule the given operation with given delay.
+	 */
+	public static <T> CompletableFuture<T> scheduleAsync(


This method is not unit tested but I think it should be because it could be used outside of the restart strategies. See FutureUtilsTest

GJL · 2019-08-01T13:48:15Z

...apache/flink/runtime/executiongraph/AdaptedRestartPipelinedRegionStrategyNGFailoverTest.java

 			final JobGraph jobGraph,
-			final RestartStrategy restartStrategy) throws Exception {
+			final RestartStrategy restartStrategy,
+			final boolean failOnRecoverCheckpoint) throws Exception {


For unit testing the new behavior (i.e., failGlobal is invoked if an exception occurs while restarting), wouldn't it be enough to implement a RestartStrategy that returns an exceptionally completed future? I think it would be good enough because the exception doesn't have to be necessarily thrown while restoring checkpoints – in practice also an OOM can be thrown.

Agreed. I combined testFailGlobalIfErrorOnResetTasks and the previous testFailGlobalIfErrorOnRescheduleTasks into one unit test, testFailGlobalIfErrorOnRestartTasks, which uses a RestartStrategy that fails when restarting.
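The idea of a restart strategy that returns an exceptionally completed future can be sketched as follows. This is an illustration under stated assumptions, not the test class from this PR: the nested `RestartCallback` and `RestartStrategy` interfaces are simplified, hypothetical stand-ins (the real `restart` method also takes a `ScheduledExecutor`), and a counter plays the role of `failGlobal`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class FailingRestartStrategySketch {

    /** Simplified stand-in for the restart callback; not Flink's interface. */
    interface RestartCallback {
        void triggerFullRecovery();
    }

    /** Simplified stand-in for a restart strategy that reports completion through a future. */
    interface RestartStrategy {
        CompletableFuture<Void> restart(RestartCallback restarter);
    }

    /** Fails the first {@code failures} restart attempts, then restarts normally. */
    static final class FailFirstAttempts implements RestartStrategy {
        private final int failures;
        private final AtomicInteger attempts = new AtomicInteger();

        FailFirstAttempts(int failures) {
            this.failures = failures;
        }

        @Override
        public CompletableFuture<Void> restart(RestartCallback restarter) {
            if (attempts.incrementAndGet() <= failures) {
                CompletableFuture<Void> failed = new CompletableFuture<>();
                failed.completeExceptionally(new RuntimeException("restart attempt failed on purpose"));
                return failed;
            }
            restarter.triggerFullRecovery();
            return CompletableFuture.completedFuture(null);
        }
    }

    /** Counts how often the failGlobal stand-in fires across the given number of restarts. */
    static int countGlobalFailures(RestartStrategy strategy, int restarts) {
        AtomicInteger failGlobalCalls = new AtomicInteger();
        for (int i = 0; i < restarts; i++) {
            strategy.restart(() -> { })
                .whenComplete((ignored, t) -> {
                    if (t != null) {
                        failGlobalCalls.incrementAndGet();
                    }
                });
        }
        return failGlobalCalls.get();
    }
}
```

A test built this way does not depend on where the exception originates, which matches the point that an OOM during restart should be handled the same way as a checkpoint recovery failure.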

GJL · 2019-08-01T14:13:37Z

flink-tests/src/test/java/org/apache/flink/test/checkpointing/RegionFailoverITCase.java

 				unionListState = context.getOperatorStateStore().getUnionListState(unionStateDescriptor);
 				Set<Integer> actualIndices = StreamSupport.stream(unionListState.get().spliterator(), false).collect(Collectors.toSet());
-				Assert.assertTrue(CollectionUtils.isEqualCollection(EXPECTED_INDICES, actualIndices));
+				if (getRuntimeContext().getTaskName().contains(SINGLE_REGION_SOURCE_NAME)) {


Is this change really necessary to resolve FLINK-13452, or can it be in a separate ticket?

This is necessary if we want this integration test to cover the global failover scenario.

GJL · 2019-08-01T14:22:31Z

flink-tests/src/test/java/org/apache/flink/test/checkpointing/RegionFailoverITCase.java

+		HighAvailabilityServicesUtilsTest.TestHAFactory.haServices = new FailHaServices(new TestingCheckpointRecoveryFactory(new FailRecoverCompletedCheckpointStore(1, 1), new StandaloneCheckpointIDCounter()), TestingUtils.defaultExecutor());
 		Configuration configuration = new Configuration();
 		configuration.setString(JobManagerOptions.EXECUTION_FAILOVER_STRATEGY, "region");
+		configuration.setString(HighAvailabilityOptions.HA_MODE, HighAvailabilityServicesUtilsTest.TestHAFactory.class.getName());


Interesting feature, didn't know about it before. However, I have two questions regarding this integration test:

As before, isn't it enough to use a RestartStrategy that returns an exceptionally completed future?

We already test that failGlobal is invoked from the unit tests. Does the integration test add coverage?

Previously, RegionFailoverITCase did not catch the bug of FLINK-13452 because it only involved region failover, not global failover. I prefer to add global failover to this integration test to verify that the job can be restarted and that the result is still correct.

GJL · 2019-08-01T14:25:48Z

Thank you for your contribution to Apache Flink @Myasuka. I think the PR is going in the right direction.

Myasuka · 2019-08-02T07:33:01Z

@GJL new commit content is updated in description.

UPDATED:
The confusion below has been resolved by the new commit d7c1f12.

There is another question that might be out of the scope of this PR but confused me. Suppose I set the failure count of FailureMultiTimesRestartStrategy to 2, which means the restart strategy fails twice when restart is called. During AdaptedRestartPipelinedRegionStrategyNG#restartTasks, the job first hits an exception in restartStrategy.restart and then calls failGlobalOnError. However, in the call stack ExecutionGraph#failGlobal --> ExecutionGraph#allVerticesInTerminalState --> ExecutionGraph#tryRestartOrFail --> restartStrategy.restart, we hit the second exception, which is caught by FatalExitExceptionHandler, causing the process to exit.
In other words, if we hit an unchecked exception during ExecutionGraph#failGlobal, we just let the whole process exit instead of trying to fail globally again. Is this behavior expected?

GJL · 2019-08-05T10:12:55Z

Thanks for the update. I'll have a look.

GJL

All in all, the PR looks good. There are some minor things that require attention. If you want, I can apply the last changes myself. WDYT?

GJL · 2019-08-05T10:46:59Z

...time/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorFailureTest.java

 	}

-	private static final class FailingCompletedCheckpointStore implements CompletedCheckpointStore {
+	public static final class FailingCompletedCheckpointStore implements CompletedCheckpointStore {


Can be made private again.

GJL · 2019-08-05T10:50:07Z

...rg/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java

 	}

-	private Runnable resetAndRescheduleTasks(final long globalModVersion, final Set<ExecutionVertexVersion> vertexVersions) {
+	private CompletableFuture<?> resetAndRescheduleTasks(final long globalModVersion, final Set<ExecutionVertexVersion> vertexVersions) {


I would return Function<Object, CompletableFuture<Void>> here so that we can write:

	FutureUtils.assertNoException(
		cancelTasks(verticesToRestart)
			.thenComposeAsync(resetAndRescheduleTasks(globalModVersion, vertexVersions), executionGraph.getJobMasterMainThreadExecutor())
			.handle(failGlobalOnError()));

This is consistent with the style before.
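The effect of composing with a function that returns a future can be sketched with plain JDK classes. This is an illustration only, assuming a same-thread `Executor` in place of the job master main thread executor; the class name, the `fail` flag, and the returned strings are hypothetical. It shows that `thenComposeAsync` flattens the inner future, so a restart that completes exceptionally reaches the final `handle` stage.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

final class ComposeRestartSketch {

    /** Returns a function whose result future represents the outcome of the restart. */
    static Function<Object, CompletableFuture<Void>> resetAndReschedule(boolean fail) {
        return ignored -> {
            CompletableFuture<Void> restartFuture = new CompletableFuture<>();
            if (fail) {
                restartFuture.completeExceptionally(new IllegalStateException("restart failed"));
            } else {
                restartFuture.complete(null);
            }
            return restartFuture;
        };
    }

    /** Runs cancel -> compose(reset/reschedule) -> handle and reports which branch was taken. */
    static String run(boolean fail) {
        Executor sameThread = Runnable::run; // stand-in for the main thread executor
        return CompletableFuture.completedFuture(null) // stand-in for a finished cancelTasks(...)
            .thenComposeAsync(resetAndReschedule(fail), sameThread)
            .handle((ignored, t) -> t == null ? "restarted" : "failGlobal")
            .join();
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

With `thenRunAsync` and a `Runnable`, a future created inside the runnable would be dropped, and its failure would never reach `handle`.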

GJL · 2019-08-05T10:51:37Z

flink-tests/src/test/java/org/apache/flink/test/checkpointing/RegionFailoverITCase.java

 	private static class TestException extends IOException{
 		private static final long serialVersionUID = 1L;
 	}
+


I would revert this change (adding empty line).

GJL · 2019-08-05T10:52:36Z

...t/java/org/apache/flink/runtime/executiongraph/restart/FailureMultiTimesRestartStrategy.java

+	 *
+	 * @param configuration Configuration containing the parameter values for the restart strategy
+	 * @return Initialized instance of FixedDelayRestartStrategy
+	 * @throws Exception


GJL · 2019-08-05T12:20:39Z

flink-runtime/src/main/java/org/apache/flink/runtime/concurrent/FutureUtils.java

+	 * @param <T> type of the result
+	 * @return Future which schedule the given operation with given delay.
+	 */
+	public static <T> CompletableFuture<T> scheduleAsync(


I think we should add the following overload:

	public static CompletableFuture<Void> scheduleWithDelay(
			final Runnable operation,
			final Time delay,
			final ScheduledExecutor scheduledExecutor) {
		Supplier<Void> operationSupplier = () -> {
			operation.run();
			return null;
		};
		return scheduleWithDelay(operationSupplier, delay, scheduledExecutor);
	}

This will reduce duplication in the RestartStrategy implementations.
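How the two overloads fit together can be sketched with JDK types only. This is a hedged illustration, not the code merged into `FutureUtils`: it assumes `java.util.concurrent.ScheduledExecutorService` and a delay in milliseconds in place of Flink's `ScheduledExecutor` and `Time`, and the class name and `demo` method are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class ScheduleWithDelaySketch {

    /** Runs the supplier after the delay and completes the returned future with its outcome. */
    static <T> CompletableFuture<T> scheduleWithDelay(
            Supplier<T> operation, long delayMillis, ScheduledExecutorService executor) {
        CompletableFuture<T> result = new CompletableFuture<>();
        executor.schedule(() -> {
            try {
                result.complete(operation.get());
            } catch (Throwable t) {
                // Forward the failure instead of losing it inside the executor's own future.
                result.completeExceptionally(t);
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
        return result;
    }

    /** Runnable overload: adapts the runnable once so that callers do not repeat the conversion. */
    static CompletableFuture<Void> scheduleWithDelay(
            Runnable operation, long delayMillis, ScheduledExecutorService executor) {
        Supplier<Void> operationSupplier = () -> {
            operation.run();
            return null;
        };
        return scheduleWithDelay(operationSupplier, delayMillis, executor);
    }

    /** Schedules one succeeding and one failing runnable and summarizes what happened. */
    static String demo() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            StringBuilder log = new StringBuilder();
            Runnable restart = () -> log.append("restarted");
            scheduleWithDelay(restart, 10, executor).join();

            Runnable failing = () -> {
                throw new IllegalStateException("boom");
            };
            boolean failed = scheduleWithDelay(failing, 10, executor)
                .handle((ignored, t) -> t != null)
                .join();
            return log + ":" + failed;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Callers pass a variable typed as `Runnable` or `Supplier`; passing a bare lambda could be ambiguous between the two overloads.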

GJL · 2019-08-05T12:23:03Z

...t/java/org/apache/flink/runtime/executiongraph/restart/FailureMultiTimesRestartStrategy.java

+	@Override
+	public CompletableFuture<Void> restart(RestartCallback restarter, ScheduledExecutor executor) {
+		if (restartedTimes.incrementAndGet() <= failureMultiTimes) {
+			CompletableFuture<Void> exceptionFuture = new CompletableFuture<>();


You can use return FutureUtils.completedExceptionally(new FlinkRuntimeException(...))

GJL · 2019-08-05T13:35:49Z

flink-runtime/src/test/java/org/apache/flink/runtime/concurrent/FutureUtilsTest.java

+			TestingUtils.defaultScheduledExecutor());
+
+		int result = scheduleAsyncFuture.get();
+		long completionTime = System.currentTimeMillis() - start;


There are some issues that I see here:

- System.currentTimeMillis() is affected by system clock changes, which may cause test failures in conjunction with NTP. System.nanoTime() should be preferred.
- The test runs for at least 500 ms, which is on the slow side for a unit test. I think it's enough to use ManuallyTriggeredScheduledExecutor and check that the correct result is produced.
- I don't think using AtomicInteger adds test coverage. In my opinion, it's enough to assert on the value of result.
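A test that avoids wall-clock time entirely can be sketched as follows. This is not Flink's `ManuallyTriggeredScheduledExecutor`: `ManualScheduler` is a hypothetical, minimal stand-in that queues delayed tasks and runs them only when the test triggers them, so the delay value is recorded but never waited for.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class ManualTriggerSketch {

    /** Collects delayed tasks and runs them only when the test says so. */
    static final class ManualScheduler {
        private final Queue<Runnable> pending = new ArrayDeque<>();

        <T> CompletableFuture<T> scheduleWithDelay(Supplier<T> operation, long delayMillis) {
            // delayMillis is intentionally ignored: the test decides when the task runs.
            CompletableFuture<T> result = new CompletableFuture<>();
            pending.add(() -> {
                try {
                    result.complete(operation.get());
                } catch (Throwable t) {
                    result.completeExceptionally(t);
                }
            });
            return result;
        }

        void triggerAll() {
            Runnable task;
            while ((task = pending.poll()) != null) {
                task.run();
            }
        }
    }

    /** Schedules a task, checks that it has not run yet, triggers it, and returns its result. */
    static int scheduleAndTrigger() {
        ManualScheduler scheduler = new ManualScheduler();
        CompletableFuture<Integer> future = scheduler.scheduleWithDelay(() -> 42, 500);
        if (future.isDone()) {
            throw new AssertionError("must not complete before being triggered");
        }
        scheduler.triggerAll();
        return future.join();
    }
}
```

The test finishes immediately and asserts only on the produced value, which addresses both the clock sensitivity and the 500 ms runtime.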

GJL · 2019-08-05T13:39:06Z

flink-runtime/src/test/java/org/apache/flink/runtime/concurrent/FutureUtilsTest.java

+
+		final ScheduledFuture<?> scheduledFuture = scheduledTasks.iterator().next();
+
+		assertFalse(scheduledFuture.isDone());


I don't think this assertion is needed because you already test if the future can be cancelled:

assertTrue(scheduledFuture.isCancelled());

This assertion will fail if the future is finished normally.

GJL · 2019-08-05T13:44:56Z

...src/main/java/org/apache/flink/runtime/executiongraph/restart/FixedDelayRestartStrategy.java

+	public CompletableFuture<Void> restart(final RestartCallback restarter, ScheduledExecutor executor) {
 		currentRestartAttempt++;
-		executor.schedule(restarter::triggerFullRecovery, delayBetweenRestartAttempts, TimeUnit.MILLISECONDS);
+		Supplier<Void> restartSupplier = () -> {


This conversion to Supplier is duplicated multiple times. See my other comment in FutureUtils.

Myasuka · 2019-08-05T16:00:48Z

Sorry for the late reply.
@GJL, no problem, please go ahead and take over this PR to finish the remaining work.

GJL · 2019-08-05T20:10:12Z

Please review @tillrohrmann

tillrohrmann

Thanks for this fix @Myasuka and @GJL. LGTM. +1 for merging.

GJL · 2019-08-07T04:54:31Z

Merging as soon as build is green.

GJL · 2019-08-07T13:18:11Z

Merged manually.

rmetzger added the review=description? label Jul 30, 2019

rmetzger added the component=Runtime/Coordination label Jul 30, 2019

Myasuka closed this Jul 30, 2019

Myasuka reopened this Jul 31, 2019

zhuzhurk reviewed Jul 31, 2019

tillrohrmann requested changes Jul 31, 2019

GJL self-assigned this Aug 1, 2019

GJL requested changes Aug 1, 2019

GJL requested changes Aug 5, 2019

GJL force-pushed the fix-schedule-NG branch from 7afbb0b to 8e791fd Compare August 5, 2019 19:37

tillrohrmann approved these changes Aug 6, 2019

GJL force-pushed the fix-schedule-NG branch from 958e248 to 92cf36a Compare August 7, 2019 04:50

GJL mentioned this pull request Aug 7, 2019

[1.9][FLINK-13452][runtime] Ensure to fail global when exception happens during reseting tasks of regions #9376

Merged

GJL approved these changes Aug 7, 2019

[FLINK-13452][runtime] Fail the job globally when exception happens during reseting tasks of a region (3c6363e)

GJL force-pushed the fix-schedule-NG branch from 92cf36a to 3c6363e Compare August 7, 2019 05:22

GJL closed this Aug 7, 2019

