[FLINK-18094][checkpointing] Unifies the creation of BarrierHandlers and CheckpointedInputGate. #12575

AHeise · 2020-06-10T08:10:15Z

What is the purpose of the change

For multi-gate inputs, there existed inconsistent offset handling of BarrierHandlers and CheckpointedInputGates for unaligned checkpoints. This PR unifies the creation of BarrierHandlers and CheckpointedInputGate.

Brief change log

Use the non-unioned (Indexed)InputGates as the main information source that is needed to create handlers.
Unioned input gates are only used to create the CheckpointedInputGate.

Verifying this change

Covered by existing (modified) tests.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2020-06-10T08:13:42Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit e78203a (Wed Jun 10 08:13:41 UTC 2020)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

_{Mention the bot in a comment to re-run the automated checks.}

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into to Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-06-10T08:40:15Z

CI report:

6713d2a UNKNOWN
0cd0ba6 Azure: CANCELED

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

rkhachatryan

Thanks for the fix @AHeise , the code also looks simpler now.

I think it can be improved a bit further, please see my comments below.

Can you also point me out which test exactly verifies the correctness of indices?

...k-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/InputProcessorUtil.java

...ing-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierUnaligner.java

flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/InputGateUtil.java

pnowojski

I've skimmed over the code (I didn't do the full review). Mostly it LGTM. General remark is that first couple of comments are missing explanation what are they doing in the commit message. They look like maybe they could be kept as individual commits, but it's hard to understand them now with the context, which such commit message could provide.

...ime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/UnionInputGate.java

...k-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/InputProcessorUtil.java

The method assumed that the gates have consecutive indexes starting at 0.

In the following commits, this method will be used to fetch information about all channels without explicitly needing to access the channels. Thus, for tests mocks just need to return meaningful InputChannelInfos instead of actually creating the respective channels.

…reateCheckpointedInputGate to createCheckpointedMultipleInputGate.

…ile creating checkpoint handlers. The actual implementation have been lists all along and we assume ordering anyways.

…lInfo. This removes the need to translate the InputChannelInfo back and forth to flattened indexes across all InputGates. All index-based data structures are replaced by maps that associate a certain state to a given InputChannelInfo. For performance reasons, these maps are fully initialized upon construction, such that no nodes need to be added/removed during runtime and only values are updated. Additionally, this commit unifies the creation of BarrierHandlers (similar signature) and removes the error-prone offset handling from CheckpointedInputGate.

AHeise · 2020-06-16T07:36:44Z

I've skimmed over the code (I didn't do the full review). Mostly it LGTM. General remark is that first couple of comments are missing explanation what are they doing in the commit message. They look like maybe they could be kept as individual commits, but it's hard to understand them now with the context, which such commit message could provide.

Thank you for your feedback. I expanded the commit messages.

rkhachatryan

Thanks for this difficult (I guess) re-work @AHeise !:)

The changes LGTM overall, I have only a question about using Map instead of Set.

rkhachatryan · 2020-06-16T07:49:42Z

...aming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierAligner.java

-
 	/** Flags that indicate whether a channel is currently blocked/buffered. */
-	private final boolean[] blockedChannels;
+	private final Map<InputChannelInfo, Boolean> blockedChannels;


Why do we use Map<Key, Boolean> instead of just Set<Key> (also in CheckpointBarrierUnaligner, ThreadSafeUnaligner)?

(I guess we can avoid the penalty of dynamic reallocation by setting set capacity in advance)

Yes, you are right (I tried to explain it in the commit message). Also internally HashSet uses LinkedHashMap, so it's even then slower than using HashMap directly even with dynamic reallocation.

Also internally HashSet uses LinkedHashMap

Can you explain what do you mean?
In my OpenJDK 11 HashSet uses HashMap under the hood, and I believe this is what most implementations do.

Never mind, I misread the code. It's only used while calling LinkedHashSet -> super(HashSet).

Anyways, the current way is used to avoid dynamic reallocations. I can also write a small wrapper Set implementation (similar to EnumSet) if it's too hard to read right now.

Performance-wise there is no gain on a HashSet#containsKey on an empty set to a HashMap#get on a filled set with same capacity. However, I'd we avoid object creations and deletions if we just update the nodes on write access.

I'm not sure if it's not a premature optimization (I think contains is faster for set but put/remove slower; and it all may not be visible).

But I'm also fine with the current approach.

HashSet#contains uses HashMap#containsKey, which fetches the node of the key. Without hash collisions and overflow lists, that is exactly the same. The question is how many collisions we have. It may also depend on whether the respective Set is mostly filled or empty (e.g., blockedChannels should be mostly empty). We might want to potentially invert the current semantics to keep them empty as long as possible.

My main motivation was to avoid adding anything to GC pressure and using mostly a constant data structure as before to not change too much. I think I'd leave it as is for now. We can revise it when we adjust the threading model of Unaligner, which will be a major change on the code anyways.

AHeise · 2020-06-17T12:30:24Z

@flinkbot run azure

pnowojski

Thanks for the fix @AHeise . Merging it to master, will probably merge it to release-1.11 after RC2 cut.

Build failure caused by e2e time out issues that we had yesterday/today morning.

rmetzger added the review=description? label Jun 10, 2020

rmetzger added the component=Runtime/Checkpointing label Jun 10, 2020

rkhachatryan reviewed Jun 10, 2020

View reviewed changes

pnowojski reviewed Jun 16, 2020

View reviewed changes

...ime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/UnionInputGate.java Outdated Show resolved Hide resolved

...k-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/InputProcessorUtil.java Show resolved Hide resolved

Arvid Heise added 5 commits June 16, 2020 09:35

[FLINK-18094][network] Fixed UnionInputGate#getChannel.

685bc59

The method assumed that the gates have consecutive indexes starting at 0.

[FLINK-18094][network] Simplifying InputProcessorUtil by delegating c…

4d7deb3

…reateCheckpointedInputGate to createCheckpointedMultipleInputGate.

[FLINK-18094][network] Using lists instead of collections of gates wh…

ae87825

…ile creating checkpoint handlers. The actual implementation have been lists all along and we assume ordering anyways.

rkhachatryan reviewed Jun 16, 2020

View reviewed changes

AHeise requested a review from pnowojski June 17, 2020 07:22

rkhachatryan approved these changes Jun 17, 2020

View reviewed changes

pnowojski reviewed Jun 17, 2020

View reviewed changes

pnowojski merged commit 1bacaee into apache:master Jun 17, 2020

AHeise mentioned this pull request Jun 17, 2020

[FLINK-18094][checkpointing] Unifies the creation of BarrierHandlers and CheckpointedInputGate. #12697

Merged

AHeise deleted the FLINK-18094 branch June 17, 2020 13:55

[FLINK-18094][checkpointing] Unifies the creation of BarrierHandlers and CheckpointedInputGate. #12575

[FLINK-18094][checkpointing] Unifies the creation of BarrierHandlers and CheckpointedInputGate. #12575

Uh oh!

Conversation

AHeise commented Jun 10, 2020

What is the purpose of the change

Brief change log

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

Uh oh!

flinkbot commented Jun 10, 2020

Automated Checks

Review Progress

Uh oh!

flinkbot commented Jun 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI report:

Uh oh!

rkhachatryan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pnowojski left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

AHeise commented Jun 16, 2020

Uh oh!

rkhachatryan left a comment

Choose a reason for hiding this comment

Uh oh!

rkhachatryan Jun 16, 2020

Choose a reason for hiding this comment

Uh oh!

AHeise Jun 16, 2020

Choose a reason for hiding this comment

Uh oh!

rkhachatryan Jun 16, 2020

Choose a reason for hiding this comment

Uh oh!

AHeise Jun 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkhachatryan Jun 16, 2020

Choose a reason for hiding this comment

Uh oh!

AHeise Jun 17, 2020

Choose a reason for hiding this comment

Uh oh!

AHeise commented Jun 17, 2020

Uh oh!

pnowojski left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

flinkbot commented Jun 10, 2020 •

edited

Loading

AHeise Jun 16, 2020 •

edited

Loading

pnowojski left a comment •

edited

Loading