Skip to content

Conversation

@GJL
Copy link
Member

@GJL GJL commented Jun 28, 2019

What is the purpose of the change

Based on #8851

Brief change log

Verifying this change

This change is already covered by existing tests, such as RegionFailoverITCase.

This change added tests and can be verified as follows:

  • Added unit tests for AdaptedRestartPipelinedRegionStrategyNG

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Copy link
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

final List<ExecutionVertex> executionVertices = StreamSupport
.stream(getAllExecutionVertices().spliterator(), false)
.collect(Collectors.toList());
return SchedulingUtils.scheduleLazy(executionVertices, this);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SchedulingUtils could accept Iterable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return new RestartAllStrategy.Factory();

case PIPELINED_REGION_RESTART_STRATEGY_NAME:
return new AdaptedRestartPipelinedRegionStrategyNG.Factory();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

private static void assertNoPendingCheckpoints(final CheckpointCoordinator checkpointCoordinator) {
assertThat(checkpointCoordinator.getPendingCheckpoints().keySet(), hasSize(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also check that CheckpointCoordinator#getSuccessfulCheckpoints() is empty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not needed since we are not acknowledging any checkpoints from the executions.

assertThat(failoverStrategy.getLastTasksToCancel(), containsInAnyOrder(
new ExecutionVertexID(ev11.getJobvertexId(), 0),
new ExecutionVertexID(ev21.getJobvertexId(), 0),
new ExecutionVertexID(ev21.getJobvertexId(), 1)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be easier to understand if this were ev22.getJobvertexId

Copy link
Member Author

@GJL GJL Jun 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, I am waiting for the PR that enables me to write ev22.getID()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Commit cherry picked from #8804.

ev12.getCurrentExecutionAttempt().markFinished();

// force FINISHED ev11 to fail to reset its partition
strategy.onTaskFailure(ev11.getCurrentExecutionAttempt(), new FlinkException("Fail for testing"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this scenario to be possible in production. FAILED to FINISHED is not a valid state transition. Could we not fail one of the consumers instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's delete the test. The adapter is not calling IntermediateResultPartition#resetForNewExecution()

final List<ExecutionVertex> executionVertices = StreamSupport
.stream(getAllExecutionVertices().spliterator(), false)
.collect(Collectors.toList());
return SchedulingUtils.scheduleLazy(executionVertices, this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed


private final SlotProvider delegate;


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated empty line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed


final FailoverStrategy failoverStrategy = executionGraph.getFailoverStrategy();
failoverStrategy.onTaskFailure(firstVertex.getCurrentExecutionAttempt(), new Exception("Test Exception"));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manualMainThreadExecutor.triggerAll() is needed here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't hurt but not needed since we are only interested in the state transition to ExecutionState.CANCELED.

Copy link
Contributor

@zhuzhurk zhuzhurk Jun 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vertex directly transitions to CANCELED because it was no scheduled yet and the graph gets cancelled. It does not prove whether the local failover is skipped. (Even though a local failover is not skipped, the recovery action will be queued in main thread executor and not able to change the vertex state out from CANCELED).

If we want to verify "skip local failover if the job status is not Running", how about trigger task failure directly without graph.cancel()? In this case, the local failover will be skipped since the graph is in CREATED state, and the vertex will remain in CREATED state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add triggerAll() since it is a more realistic scenario, and it is less invasive (createExecutionGraph() already schedules the ExecutionGraph).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// trigger downstream regions to schedule
testingMainThreadExecutor.execute(() -> {
componentMainThreadExecutor.execute(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can invoke markFinished directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed it.

@GJL
Copy link
Member Author

GJL commented Jun 29, 2019

rebased to master

/** Config name for the {@link AdaptedRestartPipelinedRegionStrategyNG} */
public static final String PIPELINED_REGION_RESTART_STRATEGY_NAME = "region";

/** Config name for the {@link RestartPipelinedRegionStrategy} */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A "." is needed at the end of the statement.
Also for the other 3 config name comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR @GJL and @zhuzhurk. I've looked at the functional changes and they look good to me. I had some minor comments which we could address.

I will be looking now at the tests in detail.

}

/**
* Schedule vertices lazy. That means all vertices will be scheduled at once.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eagerly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finished my pass over the tests. The changes look good to me. Thanks a lot for the very good work @GJL and @zhuzhurk as well as for the review @zentol. I really like how you tested the code. One can really see that you put some good thoughts into it.

+1 for merging.

/**
* Adapts {@link ScheduledExecutor} to {@link ComponentMainThreadExecutor}.
*/
public class ScheduledExecutorToComponentMainThreadExecutorAdapter implements ComponentMainThreadExecutor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class duplicates ComponentMainThreadExecutorServiceAdapter. A related class is TestingComponentMainThreadExecutorServiceAdapter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so.

ComponentMainThreadExecutorServiceAdapter adapts ScheduledExecutorService. ScheduledExecutor is not a ScheduledExecutorService. There is no adapter from ScheduledExecutor to ScheduledExecutorService.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there should be ScheduledExecutorServiceAdapter for ScheduledExecutorService -> ScheduledExecutor, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made ComponentMainThreadExecutorServiceAdapter accept ScheduledExecutor and deleted my adapter. However, there are more pending clean ups that require a separate PR – for example, it does not make sense that TestingComponentMainThreadExecutorServiceAdapter extends ComponentMainThreadExecutorServiceAdapter.

@GJL
Copy link
Member Author

GJL commented Jul 3, 2019

rebased to master

@GJL
Copy link
Member Author

GJL commented Jul 3, 2019

Rebased to master and squashed commits. Merging as soon as build is green.

GJL added 2 commits July 3, 2019 20:19
…gacy scheduling

Implement adapter (AdaptedRestartPipelinedRegionStrategyNG) that adapts
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
to the legacy failover strategy interface
(org.apache.flink.runtime.executiongraph.failover.FailoverStrategy).

The new AdaptedRestartPipelinedRegionStrategyNG is chosen if config option
jobmanager.execution.failover-strategy is set to "region". The legacy behavior
can be enabled by setting the config option to "region-legacy".
@asfgit asfgit closed this in 6a682ff Jul 3, 2019
@GJL GJL deleted the FLINK-12876 branch September 3, 2019 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants