Skip to content

Conversation

@AHeise
Copy link
Contributor

@AHeise AHeise commented Jun 10, 2020

What is the purpose of the change

For multi-gate inputs, there existed inconsistent offset handling of BarrierHandlers and CheckpointedInputGates for unaligned checkpoints. This PR unifies the creation of BarrierHandlers and CheckpointedInputGate.

Brief change log

  • Use the non-unioned (Indexed)InputGates as the main information source that is needed to create handlers.
  • Unioned input gates are only used to create the CheckpointedInputGate.

Verifying this change

Covered by existing (modified) tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Copy link
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit e78203a (Wed Jun 10 08:13:41 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Jun 10, 2020

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Copy link
Contributor

@rkhachatryan rkhachatryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @AHeise , the code also looks simpler now.

I think it can be improved a bit further, please see my comments below.

Can you also point me out which test exactly verifies the correctness of indices?

Copy link
Contributor

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've skimmed over the code (I didn't do the full review). Mostly it LGTM. General remark is that first couple of comments are missing explanation what are they doing in the commit message. They look like maybe they could be kept as individual commits, but it's hard to understand them now with the context, which such commit message could provide.

Arvid Heise added 5 commits June 16, 2020 09:35
The method assumed that the gates have consecutive indexes starting at 0.
In the following commits, this method will be used to fetch information about all channels without explicitly needing to access the channels. Thus, for tests mocks just need to return meaningful InputChannelInfos instead of actually creating the respective channels.
…reateCheckpointedInputGate to createCheckpointedMultipleInputGate.
…ile creating checkpoint handlers.

The actual implementation have been lists all along and we assume ordering anyways.
…lInfo.

This removes the need to translate the InputChannelInfo back and forth to flattened indexes across all InputGates.
All index-based data structures are replaced by maps that associate a certain state to a given InputChannelInfo. For performance reasons, these maps are fully initialized upon construction, such that no nodes need to be added/removed during runtime and only values are updated.
Additionally, this commit unifies the creation of BarrierHandlers (similar signature) and removes the error-prone offset handling from CheckpointedInputGate.
@AHeise
Copy link
Contributor Author

AHeise commented Jun 16, 2020

I've skimmed over the code (I didn't do the full review). Mostly it LGTM. General remark is that first couple of comments are missing explanation what are they doing in the commit message. They look like maybe they could be kept as individual commits, but it's hard to understand them now with the context, which such commit message could provide.

Thank you for your feedback. I expanded the commit messages.

Copy link
Contributor

@rkhachatryan rkhachatryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this difficult (I guess) re-work @AHeise !:)

The changes LGTM overall, I have only a question about using Map instead of Set.


/** Flags that indicate whether a channel is currently blocked/buffered. */
private final boolean[] blockedChannels;
private final Map<InputChannelInfo, Boolean> blockedChannels;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use Map<Key, Boolean> instead of just Set<Key> (also in CheckpointBarrierUnaligner, ThreadSafeUnaligner)?

(I guess we can avoid the penalty of dynamic reallocation by setting set capacity in advance)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right (I tried to explain it in the commit message). Also internally HashSet uses LinkedHashMap, so it's even then slower than using HashMap directly even with dynamic reallocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also internally HashSet uses LinkedHashMap

Can you explain what do you mean?
In my OpenJDK 11 HashSet uses HashMap under the hood, and I believe this is what most implementations do.

Copy link
Contributor Author

@AHeise AHeise Jun 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, I misread the code. It's only used while calling LinkedHashSet -> super(HashSet).

Anyways, the current way is used to avoid dynamic reallocations. I can also write a small wrapper Set implementation (similar to EnumSet) if it's too hard to read right now.

Performance-wise there is no gain on a HashSet#containsKey on an empty set to a HashMap#get on a filled set with same capacity. However, I'd we avoid object creations and deletions if we just update the nodes on write access.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's not a premature optimization (I think contains is faster for set but put/remove slower; and it all may not be visible).

But I'm also fine with the current approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashSet#contains uses HashMap#containsKey, which fetches the node of the key. Without hash collisions and overflow lists, that is exactly the same. The question is how many collisions we have. It may also depend on whether the respective Set is mostly filled or empty (e.g., blockedChannels should be mostly empty). We might want to potentially invert the current semantics to keep them empty as long as possible.

My main motivation was to avoid adding anything to GC pressure and using mostly a constant data structure as before to not change too much. I think I'd leave it as is for now. We can revise it when we adjust the threading model of Unaligner, which will be a major change on the code anyways.

@AHeise AHeise requested a review from pnowojski June 17, 2020 07:22
@AHeise
Copy link
Contributor Author

AHeise commented Jun 17, 2020

@flinkbot run azure

Copy link
Contributor

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @AHeise . Merging it to master, will probably merge it to release-1.11 after RC2 cut.

Build failure caused by e2e time out issues that we had yesterday/today morning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants