[FLINK-12530][network] Move Task.inputGatesById to NetworkEnvironment #8463

azagrebin · 2019-05-16T09:24:23Z

What is the purpose of the change

Task.inputGatesById indexes SingleInputGates by id. The end user of this indexing is NetworkEnvironment, in two cases:

- SingleInputGate triggers the producer partition readiness check, and the successful result of the check is then dispatched back to this SingleInputGate by id. We can just return a future from TaskActions.triggerPartitionProducerStateCheck. SingleInputGate can use the future to re-trigger the partition request if the producer is ready. Then inputGatesById is not needed for dispatching.
- TaskExecutor.updatePartitions uses inputGatesById to dispatch a PartitionInfo update to the right SingleInputGate. If inputGatesById is moved to NetworkEnvironment, which should be a better place for gate management, and NetworkEnvironment.updatePartitionInfo is added, then TaskExecutor.updatePartitions can call NetworkEnvironment.updatePartitionInfo directly.

Additional refactoring:

- TaskActions.triggerPartitionProducerStateCheck is separated into another interface, PartitionProducerStateProvider. TaskActions is too broad an interface and is also used for other purposes. The Shuffle API needs only PartitionProducerStateProvider.
- PartitionProducerStateProvider returns a future with the ResponseHandle, which contains the producer state and accepts callbacks to cancel or fail consumption as a result of the state check.
- Task.triggerPartitionProducerStateCheck is also refactored into a RemoteChannelStateChecker, which becomes an internal detail of NetworkEnvironment. RemoteChannelStateChecker accepts a ResponseHandle and either confirms that the producer is ready for consumption or aborts consumption using ResponseHandle.cancelConsumption or ResponseHandle.failConsumption.

Brief change log

- Change TaskActions.triggerPartitionProducerStateCheck to react on a future instead of inputGatesById
- Add NetworkEnvironment.updatePartitionInfo
- Use NetworkEnvironment.updatePartitionInfo in TaskExecutor.updatePartitions instead of Task.inputGatesById
- Move Task.inputGatesById to NetworkEnvironment
- Add SingleInputGate.close future and add a callback for it to remove the gate from NetworkEnvironment.inputGatesById
- Move TaskActions.triggerPartitionProducerStateCheck to a separate interface PartitionProducerChecker
- Refactor Task.triggerPartitionProducerStateCheck into RemoteChannelStateChecker

Verifying this change

the change is a refactoring

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not applicable)

flinkbot · 2019-05-16T09:26:27Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit ac96671 (Wed Aug 07 15:49:06 UTC 2019)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❗ 3. Needs [attention] from.
- Needs attention by @zhijiangW
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

azagrebin · 2019-05-16T09:27:51Z

@flinkbot attention @zhijiangW

tillrohrmann

Thanks for opening this PR @azagrebin. I had some comments concerning the callbacks. Maybe we can do the same with a termination future and a future which is returned when calling triggerPartitionProducerStateCheck.

tillrohrmann · 2019-05-20T15:04:05Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

@@ -110,6 +117,8 @@ public NetworkEnvironment(

 		this.resultPartitionManager = new ResultPartitionManager();

+		this.inputGatesById = new ConcurrentHashMap<>();


Which threads do access this structure concurrently?

Canceler thread might close SingleInputGate while rpc thread is creating SingleInputGate.
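
The scenario in this reply can be sketched roughly as follows. This is a minimal, self-contained illustration rather than Flink code: `GateRegistry`, `GateRegistryDemo`, the `String` ids, and the `Object` gate stand-ins are all hypothetical; only `ConcurrentHashMap` is the real JDK class. One thread registers gates, as the rpc thread would, while another removes gates, as the canceler thread would.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the gate index; String ids and Object gates replace the Flink types.
class GateRegistry {
    private final Map<String, Object> gatesById = new ConcurrentHashMap<>();

    void register(String id, Object gate) { gatesById.put(id, gate); }

    void unregister(String id) { gatesById.remove(id); }

    int size() { return gatesById.size(); }
}

class GateRegistryDemo {
    // Registers `count` new gates on one thread while another thread removes `count` old ones.
    static int runConcurrently(int count) {
        GateRegistry registry = new GateRegistry();
        for (int i = 0; i < count; i++) {
            registry.register("old-" + i, new Object());
        }
        Thread rpcThread = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                registry.register("new-" + i, new Object());
            }
        });
        Thread cancelerThread = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                registry.unregister("old-" + i);
            }
        });
        rpcThread.start();
        cancelerThread.start();
        try {
            rpcThread.join();
            cancelerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Only the newly registered gates remain in the map.
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(1000));
    }
}
```

With a plain `HashMap` the same interleaving would be unsafe, which is the reason a concurrent map is chosen here.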

tillrohrmann · 2019-05-20T15:04:42Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

@@ -150,6 +159,11 @@ public NetworkEnvironmentConfiguration getConfiguration() {
 		return config;
 	}

+	@VisibleForTesting
+	public Map<IntermediateDataSetID, SingleInputGate> getInputGatesById() {


Do we need to expose the map or would it be enough to have getInputGate(IntermediateDataSetID)?

We should actually inject it as dependency to check in tests

tillrohrmann · 2019-05-20T15:06:57Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

+	 * @throws PartitionException the input gate with the id from the partitionInfo is not found
+	 */
+	public void updatePartitionInfo(PartitionInfo partitionInfo)
+		throws IOException, InterruptedException, PartitionException {


Personally I'd prefer to put throws on the same line as ). The problem is that it now looks as if the throws line belongs to the body. Just personal taste, though.

tillrohrmann · 2019-05-20T15:08:33Z

...me/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGate.java

@@ -634,7 +633,8 @@ void notifyChannelNonEmpty(InputChannel channel) {
 	}

 	void triggerPartitionStateCheck(ResultPartitionID partitionId) {
-		taskActions.triggerPartitionProducerStateCheck(jobId, consumedResultId, partitionId);
+		taskActions.triggerPartitionProducerStateCheck(consumedResultId, partitionId,
+			() -> retriggerPartitionRequest(partitionId.getPartitionId()));


Line breaking

tillrohrmann · 2019-05-20T15:15:41Z

...rc/test/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGateTest.java

+		SingleInputGate[] inputGates = network.createInputGates("",
+			new NoOpTaskActions(), Collections.singletonList(igdd),
+			new UnregisteredMetricsGroup(), new UnregisteredMetricsGroup(), new UnregisteredMetricsGroup(),
+			new SimpleCounter());


line breaks

tillrohrmann · 2019-05-20T15:16:21Z

...rc/test/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGateTest.java

+		assertTrue(network.getInputGatesById().containsKey(id));
+		inputGates[0].close();
+		assertEquals(0, network.getInputGatesById().size());
+		assertFalse(network.getInputGatesById().containsKey(id));


The two last assertions seem to be redundant.

I will rewrite the test a bit to make it more generic.

tillrohrmann · 2019-05-20T15:47:56Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/TaskActions.java

 		IntermediateDataSetID intermediateDataSetId,
-		ResultPartitionID resultPartitionId);
+		ResultPartitionID resultPartitionId,
+		ThrowingRunnable<Exception> producerReadyCallback);


Instead of adding this callback, we could also let this method return a CompletableFuture<Void> which if completed indicates to retrigger the partition request. That way we would not have to pass in the callback which is forwarded to some other place.
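
A rough sketch of this suggestion, under simplifying assumptions: `ProducerStateCheck` and `GateSketch` are hypothetical stand-ins for `TaskActions` and `SingleInputGate`, and partition ids are plain strings. The check returns a `CompletableFuture<Void>`, and the gate itself decides what to do once it completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simplification: the check returns a future instead of accepting a callback.
interface ProducerStateCheck {
    CompletableFuture<Void> triggerPartitionProducerStateCheck(String partitionId);
}

// Hypothetical stand-in for SingleInputGate.
class GateSketch {
    private final ProducerStateCheck check;
    final AtomicInteger retriggerCount = new AtomicInteger();

    GateSketch(ProducerStateCheck check) {
        this.check = check;
    }

    void triggerPartitionStateCheck(String partitionId) {
        // The reaction is registered by the gate; no callback is forwarded to another component.
        check.triggerPartitionProducerStateCheck(partitionId)
            .thenRun(() -> retriggerPartitionRequest(partitionId));
    }

    void retriggerPartitionRequest(String partitionId) {
        retriggerCount.incrementAndGet();
    }
}

class FutureInsteadOfCallbackDemo {
    public static void main(String[] args) {
        CompletableFuture<Void> producerReady = new CompletableFuture<>();
        GateSketch gate = new GateSketch(id -> producerReady);

        gate.triggerPartitionStateCheck("p-1");
        System.out.println("before completion: " + gate.retriggerCount.get());

        // Completing the future stands in for "the producer is ready".
        producerReady.complete(null);
        System.out.println("after completion: " + gate.retriggerCount.get());
    }
}
```

The design point is that the component issuing the check no longer needs to know which action the caller wants to run afterwards.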

tillrohrmann · 2019-05-20T15:52:36Z

...me/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGate.java

 		IntermediateDataSetID consumedResultId,
 		final ResultPartitionType consumedPartitionType,
 		int consumedSubpartitionIndex,
 		int numberOfInputChannels,
 		TaskActions taskActions,
 		Counter numBytesIn,
-		boolean isCreditBased) {
+		boolean isCreditBased,
+		Runnable closeListener) {


For the close listener we could maybe apply a similar trick as with the partition request retriggering. We could for example add a termination future to the SingleInputGate which is completed once the gate gets closed. On this future we can register the removal from the inputGatesById map. WDYT?
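
The termination-future idea could look roughly like the sketch below. `ClosableGate` and `GateIndex` are hypothetical stand-ins for `SingleInputGate` and the gate-tracking part of `NetworkEnvironment`; ids are plain strings for brevity.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for SingleInputGate with a termination future.
class ClosableGate {
    private final CompletableFuture<Void> closeFuture = new CompletableFuture<>();

    CompletableFuture<Void> getCloseFuture() {
        return closeFuture;
    }

    void close() {
        // Resources would be released here; completing twice is harmless.
        closeFuture.complete(null);
    }
}

// Hypothetical stand-in for the part of NetworkEnvironment that tracks gates.
class GateIndex {
    private final Map<String, ClosableGate> inputGatesById = new ConcurrentHashMap<>();

    ClosableGate createGate(String id) {
        ClosableGate gate = new ClosableGate();
        inputGatesById.put(id, gate);
        // remove(key, value) only drops the mapping if it still points to this gate.
        gate.getCloseFuture().thenRun(() -> inputGatesById.remove(id, gate));
        return gate;
    }

    boolean contains(String id) {
        return inputGatesById.containsKey(id);
    }
}

class CloseFutureDemo {
    public static void main(String[] args) {
        GateIndex index = new GateIndex();
        ClosableGate gate = index.createGate("gate-1");
        System.out.println("registered: " + index.contains("gate-1"));
        gate.close();
        System.out.println("registered after close: " + index.contains("gate-1"));
    }
}
```

Compared with a `Runnable closeListener` constructor argument, the gate does not need to know who is interested in its termination.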

tillrohrmann · 2019-05-20T15:55:28Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

+		IntermediateDataSetID intermediateResultPartitionID = partitionInfo.getIntermediateDataSetID();
+		SingleInputGate inputGate = inputGatesById.get(intermediateResultPartitionID);
+		if (inputGate == null) {
+			throw new PartitionException("No reader with ID " + intermediateResultPartitionID + " was found.");


Should this maybe be an IllegalStateException because this should actually not happen?

This is how it was reported before, but I agree IllegalStateException fits better.

tillrohrmann · 2019-05-20T15:57:55Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java

-									log.error("Failed canceling task with execution ID {} after task update failure.", executionAttemptID, re);
-								}
+								task.failExternally(e);
+							} catch (RuntimeException re) {


I would suggest removing this catch block and doing the following instead:

FutureUtils.assertNoException( CompletableFuture.runAsync(() -> ..., getRpcService()));

zhijiangW · 2019-05-21T08:05:01Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java

+					() -> {
+						try {
+							networkEnvironment.updatePartitionInfo(partitionInfo);
+						} catch (IOException | PartitionException | InterruptedException e) {


We changed some previous behavior here, but it seems it was not covered by related tests before.
Previously the PartitionException would trigger cancelling the task, so the final task state might be CANCELED. Now we fail the task directly, so the final task state might be FAILED. Do we need to add a unit test for this?

I think before it would send the PartitionException back to the JobMaster, which would then fail the Execution. This should have the same effect. Verifying whether this is guarded by a test makes sense, though.

Yes, I agree the effect is the same; my only concern is the unit test. If we ever had a unit test verifying that the task is in the CANCELED state after a PartitionException, then this change should make that test fail, because the task would now be in the FAILED state.

EDIT: never mind, wrong comment answered :)
True, TaskExecutorSubmissionTest.testUpdateTaskInputPartitionsFailure needs to be adjusted.

zhijiangW · 2019-05-21T08:36:49Z

...rc/test/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGateTest.java

@@ -542,6 +543,29 @@ public void testUpdateUnknownInputChannel() throws Exception {
 		}
 	}

+	@Test
+	public void checkInputGateRemoveInNetworkEnvironment() throws IOException {
+		NetworkEnvironment network = createNetworkEnvironment();


Do we need network.shutdown() in finally part?

zhijiangW · 2019-05-21T08:40:06Z

...rc/test/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGateTest.java

+		IntermediateDataSetID id = new IntermediateDataSetID();
+		InputGateDeploymentDescriptor igdd = new InputGateDeploymentDescriptor(id,
+			ResultPartitionType.PIPELINED, 0, channelDescs);
+		SingleInputGate[] inputGates = network.createInputGates("",


For the array case, it seems better to make size more than 1, such as 2? I am not very sure.

zhijiangW · 2019-05-21T08:42:25Z

...rc/test/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGateTest.java

@@ -542,6 +543,29 @@ public void testUpdateUnknownInputChannel() throws Exception {
 		}
 	}

+	@Test


In addition, do you think we should also cover the current case for closing SingleInputGate and creatingInputGates in different threads?

I think we can rely on ConcurrentHashMap provided in constructor (I changed the approach a bit), then there is no need to test ConcurrentHashMap.

zhijiangW

Thanks for opening this PR @azagrebin !

The changes look almost good to me, only left some inline comments.

azagrebin · 2019-05-21T14:14:02Z

Thanks for the review @tillrohrmann @zhijiangW !
I have addressed the comments.
I have also rebased it on #8416 because it is about to be merged.

azagrebin · 2019-05-21T14:55:38Z

I have also pushed one more hotfix to separate PartitionProducerChecker interface from TaskActions because it is the only thing which InputGate requires for shuffle service API.

zhijiangW · 2019-05-22T03:03:27Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

@@ -119,11 +118,21 @@ private NetworkEnvironment(
 		this.isShutdown = false;
 	}

+	public static NetworkEnvironment create(
+		NetworkEnvironmentConfiguration config,


Keep the same indentation as the below create method?

zhijiangW · 2019-05-22T03:19:02Z

...rc/test/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGateTest.java

-		inputGates[0].close();
-		assertEquals(0, network.getInputGatesById().size());
-		assertFalse(network.getInputGatesById().containsKey(id));
+	public void checkInputGateRemoveInNetworkEnvironment() throws Exception {


checkInputGateRemoveInNetworkEnvironment -> testInputGateRemoveInNetworkEnvironment ?

zhijiangW · 2019-05-22T05:28:40Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

@@ -86,37 +89,90 @@

 	private final ResultPartitionManager resultPartitionManager;

+	private final Map<IntermediateDataSetID, SingleInputGate> inputGatesById;


Maybe it is clearer to declare this as ConcurrentHashMap instead of Map here.

Let's keep Map because NetworkEnvironment does not depend on ConcurrentHashMap methods at the moment.

zhijiangW · 2019-05-22T05:49:22Z

Thanks for the updates @azagrebin!

I left several minor comments. My only concern is the newly proposed PartitionProducerChecker, which seems to provide a similar function to the current PartitionProducerStateChecker. The only difference is the JobID parameter in PartitionProducerStateChecker#requestPartitionProducerState. The JobID is not used in the current process, so we might remove this parameter, and then there is no need to introduce the new interface PartitionProducerChecker. WDYT?

tillrohrmann

Thanks for updating this PR @azagrebin. I had some more comments.

tillrohrmann · 2019-05-22T15:05:31Z

...me/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGate.java

@@ -441,6 +444,7 @@ public void close() throws IOException {
 				finally {
 					isReleased = true;


I think we could remove this field because we now have the closeFuture.

tillrohrmann · 2019-05-22T15:08:27Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java

+					() -> {
+						try {
+							networkEnvironment.updatePartitionInfo(partitionInfo);
+						} catch (Throwable t) {


I think we should not catch Throwable here. If an unchecked exception occurs, it should simply bubble up and cause the component to fail.

tillrohrmann · 2019-05-22T15:17:25Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java

+								() -> task.failExternally(t),
+								getRpcService().getExecutor()));
+						}
+					});


I would suggest to exchange this block with:

FutureUtils.assertNoException(
	CompletableFuture.runAsync(
		() -> {
			try {
				networkEnvironment.updatePartitionInfo(partitionInfo);
			} catch (IOException | InterruptedException e) {
				log.error("Could not update input data location for task {}. Trying to fail task.", task.getTaskInfo().getTaskName(), e);
				task.failExternally(e);
			}
		},
		getRpcService().getExecutor()));
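
To illustrate the intent of this pattern without depending on Flink, here is a small sketch. The `assertNoException` helper below is a hypothetical stand-in written for this example, not Flink's `FutureUtils` implementation, and `Update`, `failTask`, and `fatalHandler` are assumed names. The expected checked failure fails the task locally, while anything unexpected escapes the lambda and reaches a fatal handler instead of being swallowed.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class AssertNoExceptionSketch {

    // Hypothetical operation that may fail with an expected, checked exception.
    interface Update {
        void run() throws IOException;
    }

    // Stand-in helper: forwards any exceptional completion to a fatal handler.
    static void assertNoException(CompletableFuture<?> future, Consumer<Throwable> fatalHandler) {
        future.whenComplete((ignored, throwable) -> {
            if (throwable != null) {
                fatalHandler.accept(throwable);
            }
        });
    }

    static void runUpdate(Update update, Consumer<Exception> failTask, Consumer<Throwable> fatalHandler) {
        assertNoException(
            CompletableFuture.runAsync(
                () -> {
                    try {
                        update.run();
                    } catch (IOException e) {
                        // Expected failure: fail only the affected task.
                        failTask.accept(e);
                    }
                    // Unchecked exceptions are deliberately not caught here.
                },
                Runnable::run), // direct executor, used only to keep the sketch deterministic
            fatalHandler);
    }

    public static void main(String[] args) {
        runUpdate(
            () -> { throw new IOException("connection lost"); },
            e -> System.out.println("task failed: " + e.getMessage()),
            t -> System.out.println("fatal: " + t));
        runUpdate(
            () -> { throw new IllegalStateException("bug"); },
            e -> System.out.println("task failed: " + e.getMessage()),
            t -> System.out.println("fatal: " + t));
    }
}
```

Note that the throwable seen by the fatal handler may be wrapped in a `CompletionException`, so a real handler would typically unwrap it before reporting.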

tillrohrmann · 2019-05-22T15:25:38Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

@@ -1102,11 +1093,13 @@ public void triggerPartitionProducerStateCheck(
 					} else {
 						failExternally(throwable);
 					}
-				} catch (IOException | InterruptedException e) {
-					failExternally(e);
+				} catch (Throwable t) {


We should not catch Throwable because it could be a legitimate Flink problem which should make us fail the process.

tillrohrmann · 2019-05-22T15:44:18Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

 				}
 			},
 			executor);
+
+		return producerReadyFuture;


What about the following alternative:

final CompletableFuture<Boolean> producerReadyFuture = new CompletableFuture<>();

FutureUtils.assertNoException(
	futurePartitionState.whenCompleteAsync(
		(ExecutionState executionState, Throwable throwable) -> {
			if (executionState != null || throwable instanceof TimeoutException) {
				final boolean producerReady = onPartitionStateUpdate(
					resultPartitionId,
					executionState != null ? executionState : ExecutionState.RUNNING);
				producerReadyFuture.complete(producerReady);
			} else {
				if (throwable instanceof PartitionProducerDisposedException) {
					String msg = String.format("Producer %s of partition %s disposed. Cancelling execution.",
						resultPartitionId.getProducerId(), resultPartitionId.getPartitionId());
					LOG.info(msg, throwable);
					cancelExecution();
				} else {
					failExternally(throwable);
				}
				producerReadyFuture.complete(false);
			}
		},
		executor));

return producerReadyFuture;

tillrohrmann · 2019-05-22T15:49:36Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

 			ResultPartitionID resultPartitionId,
-			ExecutionState producerState) throws IOException, InterruptedException {
+			ExecutionState producerState,
+			CompletableFuture<Void> producerReadyFuture) {


Instead of passing in this future, I would return a boolean indicating whether the producer is ready or not

tillrohrmann · 2019-05-22T15:55:13Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

@@ -1221,53 +1214,46 @@ public void run() {
 	 */
 	@VisibleForTesting
 	void onPartitionStateUpdate(


Maybe rename into isProducerReadyAndHandlePartitionStateUpdate

tillrohrmann · 2019-05-22T16:00:20Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

@@ -97,13 +104,15 @@ private NetworkEnvironment(
 			NetworkBufferPool networkBufferPool,
 			ConnectionManager connectionManager,
 			ResultPartitionManager resultPartitionManager,
+			Map<IntermediateDataSetID, SingleInputGate> inputGatesById,


Passing the map for storing the input gates into NetworkEnvironment exposes, in my opinion, a bit too many implementation details of this structure. It basically says that the object needs to store the SingleInputGates in this structure; otherwise the test will fail. I think it would be better to have a getInputGate(IntermediateDataSetID) and let the way the gates are stored internally be an implementation detail.
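
A minimal sketch of the accessor-based alternative, assuming a simplified `EnvironmentSketch` class with `String` ids and `Object` gates in place of the Flink types. Returning `Optional` is a choice made for this example; the comment itself only asks for a `getInputGate(IntermediateDataSetID)` lookup.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for NetworkEnvironment.
class EnvironmentSketch {
    // The storage is an implementation detail and can change without touching callers or tests.
    private final Map<String, Object> inputGatesById = new ConcurrentHashMap<>();

    void registerInputGate(String id, Object gate) {
        inputGatesById.put(id, gate);
    }

    void unregisterInputGate(String id) {
        inputGatesById.remove(id);
    }

    // Tests query a single gate instead of inspecting the whole map.
    Optional<Object> getInputGate(String id) {
        return Optional.ofNullable(inputGatesById.get(id));
    }
}

class EnvironmentSketchDemo {
    public static void main(String[] args) {
        EnvironmentSketch environment = new EnvironmentSketch();
        environment.registerInputGate("gate-1", new Object());
        System.out.println(environment.getInputGate("gate-1").isPresent());
        environment.unregisterInputGate("gate-1");
        System.out.println(environment.getInputGate("gate-1").isPresent());
    }
}
```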

azagrebin · 2019-05-24T08:33:37Z

Thanks for the further review @tillrohrmann. I have addressed comments.

azagrebin · 2019-05-24T08:43:43Z

Thanks for the review @zhijiangW, I've addressed comments.
I think PartitionProducerStateChecker and PartitionProducerChecker have different levels of abstraction. The state checker is used by Task to query the execution state of the producer from the JM; it is basically their RPC interface. The task then makes further decisions based on it. The Shuffle API (remote channel) is basically interested only in whether the partition producer is producing at the moment. I think there is a bit of coupling between Task decisions and the netty implementation at the moment. We might rethink it later and maybe move the Task logic completely into the netty implementation, based on just the producer execution state.

zhijiangW · 2019-05-24T10:21:11Z

Thanks for the explanation @azagrebin !

Setting the implementation details aside, we should think through the partition checker logic and make clear who owns it. The points below are just my personal thoughts and may not be entirely correct:

- While requesting a partition, RemoteInputChannel/InputGate might receive a PartitionNotFoundException, so RemoteInputChannel/InputGate should decide how to handle this exception. It could throw the exception to the outside directly to make the task fail, or it could further query the partition's state to make the final decision.
- The checker/query should target the partition's state, not the producer's state. If the producer state is FINISHED but the partition state is RELEASED, then only the partition's state can give the right decision.
- ShuffleMaster could provide the ability to query the partition's state as a future, just as ShuffleMaster would communicate with ShuffleService to release a partition. As a simple implementation, we could use the RPC between TM and JM for the communication.

If the current partition checker refactoring in this PR might not be the final direction, it is better not to touch it now, since it is not closely related to the scope of moving inputGatesById. Or we could move forward step by step and keep the current refactoring in this PR.

azagrebin · 2019-05-27T08:15:14Z

Thanks for the thoughts @zhijiangW !

True, this state might not be final. Further steps on this topic are probably more related to partition lifecycle management and are out of scope for this PR.

The main motivation for introducing the PartitionProducerChecker now was to separate it from TaskActions. TaskActions is a very broad interface used in other components for unrelated purposes. The Shuffle API does not need it in the scope of the first refactoring. Introducing PartitionProducerChecker improves Shuffle API decoupling because it reflects only what is needed at the moment. I can move this hotfix into another PR if that is preferred, but it is small and has already been reviewed. The related code is also touched in this PR anyway.

zhijiangW · 2019-05-27T08:32:10Z

Thanks for the confirmation, @azagrebin !
I do not mind keeping that hotfix in this PR; no need to submit it separately. :)

tillrohrmann

Thanks for updating this PR @azagrebin. The changes look good to me. I had one last comment about the consumerExecutionState and whether we can move it from the RemoteChannelStateChecker to the Task.

tillrohrmann · 2019-05-28T20:15:49Z

...ntime/src/main/java/org/apache/flink/runtime/io/network/partition/PartitionStateChecker.java

+	 * Result of partition state check, accepts check callbacks.
+	 */
+	interface CheckResult {
+		ExecutionState getConsumerExecutionState();


Do we really need the consumer's execution state here?

At the moment we use it as a shortcut: if the consumer is done before the check has completed, we skip the check. Ideally we should use closeFuture in SingleInputGate instead of ConsumerExecutionState for this. This shortcut is best effort anyway, because the state change or gate close can happen concurrently right after this check. I suggest we consider it as a separate refactoring, as it would slightly change the current behaviour: https://issues.apache.org/jira/browse/FLINK-12672

tillrohrmann · 2019-05-28T20:16:37Z

...untime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/InputGateID.java

+/**
+ * Runtime identifier of a consumed {@link org.apache.flink.runtime.executiongraph.IntermediateResult}.
+ *
+ * <p>In runtime the {@link org.apache.flink.runtime.jobgraph.IntermediateDataSetID} is not enough to uniquely


nit: At runtime

tillrohrmann · 2019-05-28T20:26:39Z

...n/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteChannelStateChecker.java

+		ExecutionState consumerState = checkResult.getConsumerExecutionState();
+		Either<ExecutionState, Throwable> result = checkResult.getProducerExecutionState();
+		ExecutionState producerState = result.isLeft() ? result.left() : ExecutionState.RUNNING;
+		return consumerState == ExecutionState.RUNNING &&


Can we check the consumerState outside of the RemoteChannelStateChecker? If the consumer (==this) is not running, then we should simply ignore the update message.

https://issues.apache.org/jira/browse/FLINK-12672

tillrohrmann · 2019-05-28T20:27:26Z

...n/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteChannelStateChecker.java

+	private boolean isProducerConsumerReady(CheckResult checkResult) {
+		ExecutionState consumerState = checkResult.getConsumerExecutionState();
+		Either<ExecutionState, Throwable> result = checkResult.getProducerExecutionState();
+		ExecutionState producerState = result.isLeft() ? result.left() : ExecutionState.RUNNING;


This logic is duplicated here and in abortConsumptionOrIgnoreCheckResult. Would be good to deduplicate it.

tillrohrmann · 2019-05-28T20:28:28Z

...n/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteChannelStateChecker.java

+
+				checkResult.failConsumption(new IllegalStateException(msg));
+			}
+		} else {


The else branch would not be needed if we check the consumer state before calling into the RemoteChannelStateChecker.

tillrohrmann · 2019-05-28T20:35:12Z

...n/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteChannelStateChecker.java

+		this.taskNameWithSubtask = taskNameWithSubtask;
+	}
+
+	public boolean isProducerConsumerReadyOrAbortConsumption(CheckResult checkResult) {


Not super important and out of scope for this PR, but this method either triggers some action (fail or cancel) or returns a decision (trigger new partition check or not). I think it would be more symmetric if this class did not trigger any action but only returned a decision on what to do:

enum Action { FAIL(Throwable cause), CANCEL(String msg), TRIGGER_PARTITION_CHECK, NOOP }

Then the caller would be responsible for making the action. That way this class would only need access to checkResult.getProducerExecutionState() and not checkResult itself.
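
One way this proposal might be expressed in plain Java is sketched below. Java enum constants cannot carry per-use payloads the way the shorthand above is written, so the sketch wraps an enum in a small `Decision` class. `Decision`, `ProducerState`, and `StateCheckerSketch` are hypothetical names, and the state-to-decision mapping is illustrative only; it is not the mapping used by Flink's `RemoteChannelStateChecker`.

```java
// Hypothetical decision type: an enum kind plus an optional payload.
final class Decision {
    enum Kind { FAIL, CANCEL, TRIGGER_PARTITION_CHECK, NOOP }

    final Kind kind;
    final Throwable cause; // set for FAIL only
    final String message;  // set for CANCEL only

    private Decision(Kind kind, Throwable cause, String message) {
        this.kind = kind;
        this.cause = cause;
        this.message = message;
    }

    static Decision fail(Throwable cause) { return new Decision(Kind.FAIL, cause, null); }

    static Decision cancel(String message) { return new Decision(Kind.CANCEL, null, message); }

    static Decision triggerPartitionCheck() { return new Decision(Kind.TRIGGER_PARTITION_CHECK, null, null); }

    static Decision noop() { return new Decision(Kind.NOOP, null, null); }
}

// Simplified, hypothetical producer states (not Flink's ExecutionState).
enum ProducerState { RUNNING, FINISHED, CANCELED, FAILED }

class StateCheckerSketch {
    // Pure function: returns what to do and performs no side effects itself.
    static Decision decide(ProducerState producerState) {
        switch (producerState) {
            case RUNNING:
            case FINISHED:
                return Decision.triggerPartitionCheck();
            case CANCELED:
                return Decision.cancel("Producer was canceled.");
            case FAILED:
                return Decision.fail(new IllegalStateException("Producer failed."));
            default:
                return Decision.noop();
        }
    }

    public static void main(String[] args) {
        for (ProducerState state : ProducerState.values()) {
            System.out.println(state + " -> " + decide(state).kind);
        }
    }
}
```

The caller would then switch on `kind` and perform the fail, cancel, or re-trigger action itself, so the checker only needs the producer state as input.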

True, I would suggest we consider this refactoring as another issue

https://issues.apache.org/jira/browse/FLINK-13638

tillrohrmann

Thanks for addressing my comments @azagrebin. LGTM. +1 for merging.

zhijiangW

Also LGTM on my side!

azagrebin · 2019-05-29T20:13:41Z

Thanks for the reviews @tillrohrmann @zhijiangW !
I squashed the commits and adjusted the comments.

rmetzger added the review=description? label May 16, 2019

rmetzger added the component=Runtime/Network label May 16, 2019

tillrohrmann self-assigned this May 20, 2019

tillrohrmann requested changes May 20, 2019

zhijiangW reviewed May 21, 2019

zhijiangW requested changes May 21, 2019

azagrebin force-pushed the FLINK-12530 branch from 5ee6380 to e4f68ec Compare May 21, 2019 14:11

azagrebin force-pushed the FLINK-12530 branch from 3628086 to f380bbe Compare May 21, 2019 15:40

zhijiangW reviewed May 22, 2019

azagrebin force-pushed the FLINK-12530 branch 3 times, most recently from fb57607 to 9ccba19 Compare May 22, 2019 12:22

tillrohrmann requested changes May 22, 2019

azagrebin force-pushed the FLINK-12530 branch 2 times, most recently from a627eb9 to 0f41bbd Compare May 24, 2019 08:32

azagrebin force-pushed the FLINK-12530 branch 5 times, most recently from 92b5c32 to 3bc7873 Compare May 28, 2019 12:34

azagrebin mentioned this pull request May 28, 2019

[FLINK-12201][network,metrics] Introduce InputGateWithMetrics in Task to increment numBytesIn metric #8320

Merged

tillrohrmann requested changes May 28, 2019

azagrebin force-pushed the FLINK-12530 branch 2 times, most recently from fb3d560 to a469b69 Compare May 29, 2019 14:15

tillrohrmann approved these changes May 29, 2019

zhijiangW approved these changes May 29, 2019

azagrebin force-pushed the FLINK-12530 branch from a469b69 to ac96671 Compare May 29, 2019 20:12

zentol merged commit 809e40d into apache:master May 30, 2019

