Skip to content

Conversation

@Myasuka
Copy link
Member

@Myasuka Myasuka commented Jul 30, 2019

What is the purpose of the change

After FLINK-13060, we would run createResetAndRescheduleTasksCallback within another runnable resetAndRescheduleTasks. Unfortunately, any exception happened in createResetAndRescheduleTasksCallback would cause the thread terminated silently but record the exception in outcome of FutureTask. We should change the code back to previously that would failGlobal within the createResetAndRescheduleTasksCallback runnable.

Brief change log

  • 59b1a6d5 :

    • Let runnable createResetAndRescheduleTasksCallback fail global if come across any exception.
    • Refine RegionFailoverITCase to mock the exception that checkpoint store would failed when recover from checkpoint for the 1st time.
  • b732229 : Refactor interface of RestartStrategy#restart and add UT to verify failGlobalIfErrorOnResetTasks

  • 073d815

    • Refactor the return value of RestartStrategy#restart to CompletableFuture
    • Refatcor return valud of method ExecutionGraph#allVerticesInTerminalState to align with new RestartStrategy#restart
    • Simplify AdaptedRestartPipelinedRegionStrategyNGFailoverTest with FailureMultiTimesRestartStrategy to throw exception directly
    • Refactor RegionFailoverITCase with FailureMultiTimesRestartStrategy
    • Add unit tests for FutureUtils#scheduleAsync

Verifying this change

This change added tests and can be verified as follows:

  • Refine RegionFailoverITCase to mock the exception that checkpoint store would failed when recover from checkpoint for the 1st time.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Copy link
Collaborator

flinkbot commented Jul 30, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 3c6363e (Wed Aug 07 13:19:15 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Jul 30, 2019

CI report:

@Myasuka
Copy link
Member Author

Myasuka commented Jul 30, 2019

CI failed due to FLINK-13488 and travis file cache not found.

@Myasuka
Copy link
Member Author

Myasuka commented Jul 30, 2019

This fix still has problem when failGlobal meet exception, close this PR first.

@Myasuka Myasuka closed this Jul 30, 2019
@Myasuka Myasuka reopened this Jul 31, 2019
@Myasuka
Copy link
Member Author

Myasuka commented Jul 31, 2019

Reopen this wrt the comments from @zhuzhurk in FLINK-13452.

Copy link
Contributor

@zhuzhurk zhuzhurk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Yun for opening this PR. I have a few comments.

cancelTasks(verticesToRestart)
.thenRunAsync(resetAndRescheduleTasks(globalModVersion, vertexVersions), executionGraph.getJobMasterMainThreadExecutor())
.handle(failGlobalOnError()));
.thenRunAsync(resetAndRescheduleTasks(globalModVersion, vertexVersions), executionGraph.getJobMasterMainThreadExecutor()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think failure handling for the first part is still needed.

// re-schedule tasks
rescheduleTasks(unmodifiedVertices, globalModVersion);
} catch (GlobalModVersionMismatch e) {
throw new IllegalStateException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All exceptions should be handled.

};
}

private BiFunction<Object, Throwable, Object> failGlobalOnError() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is still needed for failure handling of the first part of failover process.

if (t != null) {
} catch (Throwable t) {
LOG.info("Unexpected error happens in region failover. Fail globally.", t);
failGlobal(t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use FatalExitExceptionHandler to handle exceptions thrown from failGlobal directly.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this PR @Myasuka. I think we cannot merge these changes because they change the handling of exceptionally completed cancel futures.

FutureUtils.assertNoException(
cancelTasks(verticesToRestart)
.thenRunAsync(resetAndRescheduleTasks(globalModVersion, vertexVersions), executionGraph.getJobMasterMainThreadExecutor())
.handle(failGlobalOnError()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not change this logic since cancelTasks can return an exceptionally completed future which would now cause the process to terminate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I am refactoring this PR by changing the result of RestartStrategy#restart to a CompletableFuture<?>

if (t != null) {
} catch (Throwable t) {
LOG.info("Unexpected error happens in region failover. Fail globally.", t);
failGlobal(t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to let the exception handling happen where it has been. Usually it is a good idea to do the exception handling at the highest possible level.

@GJL GJL self-assigned this Aug 1, 2019
* @return A {@link CompletableFuture} that will be completed when the restarting process is done.
*/
void restart(RestartCallback restarter, ScheduledExecutor executor);
CompletableFuture<?> restart(RestartCallback restarter, ScheduledExecutor executor);
Copy link
Member

@GJL GJL Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that CompletableFuture<Void> would convey clearly that "nothing" is returned by the future.
Also, wildcards in return types should be avoided: https://stackoverflow.com/a/22815270

* @param <T> type of the result
* @return Future which schedule the given operation with given delay.
*/
public static <T> CompletableFuture<T> scheduleAsync(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is not unit tested but I think it should be because it could be used outside of the restart strategies. See FutureUtilsTest

final JobGraph jobGraph,
final RestartStrategy restartStrategy) throws Exception {
final RestartStrategy restartStrategy,
final boolean failOnRecoverCheckpoint) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For unit testing the new behavior (i.e., failGlobal is invoked if an exception occurs while restarting), wouldn't it be enough to implement a RestartStrategy that returns an exceptionally completed future? I think it would be good enough because the exception doesn't have to be necessarily thrown while restoring checkpoints – in practice also an OOM can be thrown.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, combine testFailGlobalIfErrorOnResetTasks and previous testFailGlobalIfErrorOnRescheduleTasks into one unit test as testFailGlobalIfErrorOnRestartTasks with given RestartStrategy which would fail when restarting.

unionListState = context.getOperatorStateStore().getUnionListState(unionStateDescriptor);
Set<Integer> actualIndices = StreamSupport.stream(unionListState.get().spliterator(), false).collect(Collectors.toSet());
Assert.assertTrue(CollectionUtils.isEqualCollection(EXPECTED_INDICES, actualIndices));
if (getRuntimeContext().getTaskName().contains(SINGLE_REGION_SOURCE_NAME)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change really necessary to resolve FLINK-13452, or can it be in a separate ticket?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to let this integration test to include global failover scenario, this is necessary.

HighAvailabilityServicesUtilsTest.TestHAFactory.haServices = new FailHaServices(new TestingCheckpointRecoveryFactory(new FailRecoverCompletedCheckpointStore(1, 1), new StandaloneCheckpointIDCounter()), TestingUtils.defaultExecutor());
Configuration configuration = new Configuration();
configuration.setString(JobManagerOptions.EXECUTION_FAILOVER_STRATEGY, "region");
configuration.setString(HighAvailabilityOptions.HA_MODE, HighAvailabilityServicesUtilsTest.TestHAFactory.class.getName());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting feature, didn't know about it before. However, I have two questions regarding this integration test:

  1. As before, isn't it enough to use a RestartStrategy that returns an exceptionally completed future?
  2. We already test that failGlobal is invoked from the unit tests. Does the integration test add coverage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, the RegionFailoverITCase did not catch bug of FLINK-13452 due to it did not involve global failover but only region failover. I prefer to add global failover in this integration test to verify the job could be restarted and result still correct.

@GJL
Copy link
Member

GJL commented Aug 1, 2019

Thank you for your contribution to Apache Flink @Myasuka. I think the PR is going in the right direction.

@Myasuka
Copy link
Member Author

Myasuka commented Aug 2, 2019

@GJL new commit content is updated in description.

UPDATED:
Confusion below have been fixed with new commit d7c1f12.


There is another question which might be out of the scope of this PR but confused me. If I set the failure times of FailureMultiTimesRestartStrategy as 2, which means the restartStrategy would failed twice when calling restart. During AdaptedRestartPipelinedRegionStrategyNG#restartTasks, job would first meet exception when restartStrategy.restart and then call failGlobalOnError. However, in the call stack of ExecutionGraph#failGlobal --> ExecutionGraph#allVerticesInTerminalState --> ExecutionGraph#tryRestartOrFail --> restartStrategy.restart, we would meet the 2nd exception and caught by FatalExitExceptionHandler, resulting in the process exited.
In other words, if we meet unchecked exception during ExecutionGraph#failGlobal, we would just let the whole process exit instead of trying to fail global again. Is this behavior expected?

@GJL
Copy link
Member

GJL commented Aug 5, 2019

Thanks for the update. I'll have a look.

Copy link
Member

@GJL GJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all, the PR looks good. There are some minor things that require attention. If you want, I can apply the last changes myself. WDYT?

}

private static final class FailingCompletedCheckpointStore implements CompletedCheckpointStore {
public static final class FailingCompletedCheckpointStore implements CompletedCheckpointStore {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be made private again.

}

private Runnable resetAndRescheduleTasks(final long globalModVersion, final Set<ExecutionVertexVersion> vertexVersions) {
private CompletableFuture<?> resetAndRescheduleTasks(final long globalModVersion, final Set<ExecutionVertexVersion> vertexVersions) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would return Function<Object, CompletableFuture<Void>> here so that we can write:

FutureUtils.assertNoException(
			cancelTasks(verticesToRestart)
				.thenComposeAsync(resetAndRescheduleTasks(globalModVersion, vertexVersions), executionGraph.getJobMasterMainThreadExecutor())
				.handle(failGlobalOnError()));

This is consistent with the style before.

private static class TestException extends IOException{
private static final long serialVersionUID = 1L;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would revert this change (adding empty line).

*
* @param configuration Configuration containing the parameter values for the restart strategy
* @return Initialized instance of FixedDelayRestartStrategy
* @throws Exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

* @param <T> type of the result
* @return Future which schedule the given operation with given delay.
*/
public static <T> CompletableFuture<T> scheduleAsync(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add the following overload:

	public static CompletableFuture<Void> scheduleWithDelay(
			final Runnable operation,
			final Time delay,
			final ScheduledExecutor scheduledExecutor) {
		Supplier<Void> operationSupplier = () -> {
			operation.run();
			return null;
		};
		return scheduleWithDelay(operationSupplier, delay, scheduledExecutor);
	}

This will reduce in the RestartStrategy implementations.

@Override
public CompletableFuture<Void> restart(RestartCallback restarter, ScheduledExecutor executor) {
if (restartedTimes.incrementAndGet() <= failureMultiTimes) {
CompletableFuture<Void> exceptionFuture = new CompletableFuture<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use return FutureUtils.completedExceptionally(new FlinkRuntimeException(...))

TestingUtils.defaultScheduledExecutor());

int result = scheduleAsyncFuture.get();
long completionTime = System.currentTimeMillis() - start;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There some issues that I see here:

  1. System.currentTimeMillis() is impacted by system clock changes – this may cuase test failures in conjunction with NTP. Therefore System.nanoTime() should be preferred.
  2. Test will run at least 500ms which is on the slow side for unit tests. I think it's enough to use ManuallyTriggeredScheduledExecutor and see if the correct result is produced.
  3. I don't think using AtomicInteger adds test coverage. Imo it's enough to assert on the value of result.


final ScheduledFuture<?> scheduledFuture = scheduledTasks.iterator().next();

assertFalse(scheduledFuture.isDone());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this assertion is needed because you already test if the future can be cancelled:

assertTrue(scheduledFuture.isCancelled());

This assertion will fail if the future is finished normally.

public CompletableFuture<Void> restart(final RestartCallback restarter, ScheduledExecutor executor) {
currentRestartAttempt++;
executor.schedule(restarter::triggerFullRecovery, delayBetweenRestartAttempts, TimeUnit.MILLISECONDS);
Supplier<Void> restartSupplier = () -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion to Supplier is duplicated multiple times. See my other comment in FutureUtils.

@Myasuka
Copy link
Member Author

Myasuka commented Aug 5, 2019

Sorry for late reply.
@GJL , No problem, please go ahead to take over this PR to finish the rest part.

@GJL GJL force-pushed the fix-schedule-NG branch from 7afbb0b to 8e791fd Compare August 5, 2019 19:37
@GJL
Copy link
Member

GJL commented Aug 5, 2019

Please review @tillrohrmann

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix @Myasuka and @GJL. LGTM. +1 for merging.

@GJL
Copy link
Member

GJL commented Aug 7, 2019

Merging as soon as build is green.

@GJL GJL force-pushed the fix-schedule-NG branch from 92cf36a to 3c6363e Compare August 7, 2019 05:22
@GJL
Copy link
Member

GJL commented Aug 7, 2019

Merged manually.

@GJL GJL closed this Aug 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants