@AHeise
Contributor

@AHeise AHeise commented Jan 28, 2021

This PR depends on #14797 (@AHeise commits at the bottom)

This PR fixes two bugs in unaligned checkpoints. First:

    If the previous checkpoint is declined, a task can receive both an older and a newer
    checkpoint barrier on two different channels before processing any checkpoint cancellation
    message/RPC. If the newer checkpoint barrier happens to be processed before the obsolete one,
    an incorrect `checkState` in ChannelStatePersister would cause a job failure. This checkState
    assumed that the previous checkpoint would have been aborted/stopped before the new one was
    triggered, while in reality the previous checkpoint was never triggered on this task,
    so it could not have been stopped either.
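
A minimal sketch of the tolerant behaviour described above, assuming a per-channel persister that only tracks the highest checkpoint ID it has seen. The class `PersisterSketch`, its methods, and the `Status` enum are hypothetical names for illustration and are not the actual ChannelStatePersister code. The point is that starting a newer checkpoint does not require the previous one to have been stopped, and that late barriers of an obsolete checkpoint are ignored.

```java
/** Hypothetical, simplified persister for one input channel; not Flink's actual class. */
final class PersisterSketch {
    enum Status { COMPLETED, BARRIER_PENDING, BARRIER_RECEIVED }

    private Status status = Status.COMPLETED;
    private long lastSeenBarrier = -1;

    /** The task triggers a checkpoint and asks the channel to persist in-flight data. */
    void startPersisting(long checkpointId) {
        if (checkpointId < lastSeenBarrier) {
            return; // obsolete checkpoint: ignore it instead of failing
        }
        if (checkpointId > lastSeenBarrier) {
            // No check that the previous checkpoint was stopped: it may never have been
            // triggered on this task, so nobody could have stopped it.
            status = Status.BARRIER_PENDING;
            lastSeenBarrier = checkpointId;
        }
        // checkpointId == lastSeenBarrier: the state for this checkpoint is already set.
    }

    /** A barrier with the given checkpoint ID arrived on this channel. */
    void onBarrier(long checkpointId) {
        if (checkpointId >= lastSeenBarrier) {
            status = Status.BARRIER_RECEIVED;
            lastSeenBarrier = checkpointId;
        }
        // A barrier older than lastSeenBarrier belongs to a declined checkpoint and is ignored.
    }

    /** True while data arriving on this channel still belongs to the pending checkpoint. */
    boolean shouldPersist() {
        return status == Status.BARRIER_PENDING;
    }

    public static void main(String[] args) {
        PersisterSketch channel = new PersisterSketch();
        channel.onBarrier(1);       // barrier of the declined checkpoint 1 arrives first
        channel.startPersisting(2); // checkpoint 2 is triggered without checkpoint 1 being stopped
        System.out.println("persisting for checkpoint 2: " + channel.shouldPersist());
    }
}
```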

Second:

    This commit fixes a bug where RemoteInputChannel incorrectly decided which
    buffers should be spilled if it had received an obsolete CheckpointBarrier
    that had not (yet) been cancelled.
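
The spilling decision can be illustrated with a small sketch, under the assumption that a channel keeps its received elements in a queue with sequence numbers and spills every data buffer that precedes the barrier of the checkpoint being started. `SpillSketch`, `Entry`, and `buffersToSpill` are hypothetical names and not the RemoteInputChannel API. The relevant detail is that a barrier of an older checkpoint still sitting in the queue must not act as the cut-off.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of a channel's receive queue; not Flink's RemoteInputChannel. */
final class SpillSketch {
    /** One queued element: a data buffer (barrierId == -1) or a checkpoint barrier. */
    static final class Entry {
        final int sequenceNumber;
        final long barrierId;

        Entry(int sequenceNumber, long barrierId) {
            this.sequenceNumber = sequenceNumber;
            this.barrierId = barrierId;
        }

        boolean isBarrier() {
            return barrierId >= 0;
        }
    }

    private final Deque<Entry> received = new ArrayDeque<>();
    private int nextSequenceNumber = 0;

    void onData() {
        received.add(new Entry(nextSequenceNumber++, -1));
    }

    void onBarrier(long checkpointId) {
        received.add(new Entry(nextSequenceNumber++, checkpointId));
    }

    /** Sequence numbers of the queued data buffers that belong to the given checkpoint. */
    List<Integer> buffersToSpill(long checkpointId) {
        List<Integer> result = new ArrayList<>();
        for (Entry e : received) {
            if (e.isBarrier()) {
                if (e.barrierId == checkpointId) {
                    break; // data behind the matching barrier is not part of this checkpoint
                }
                continue; // obsolete barrier: skip it, it must not cut off the spilled range
            }
            result.add(e.sequenceNumber);
        }
        return result;
    }

    public static void main(String[] args) {
        SpillSketch channel = new SpillSketch();
        channel.onData();
        channel.onBarrier(1); // obsolete barrier that was never cancelled
        channel.onData();
        channel.onBarrier(2);
        channel.onData();
        System.out.println("spill for checkpoint 2: " + channel.buffersToSpill(2));
    }
}
```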

Both commits are tested by the existing UnalignedCheckpointITCase and some freshly added unit tests.

Further, it addresses some issues in cancellation:

  • During cancellation, CheckpointedInputGate may fail to poll a priority event if the corresponding channel has already been released. Until the race conditions are removed, it is safest to simply ignore an empty poll.
  • Released channels are no longer enqueued into the input gate.
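
Both cancellation fixes can be sketched together, assuming a gate that keeps a queue of channels which announced a priority event. `GateSketch`, `Channel`, and the method names are hypothetical and do not mirror the CheckpointedInputGate API. A released channel is never enqueued, and a poll that comes back empty because the channel was released in the meantime is skipped rather than treated as an error.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

/** Hypothetical input gate model; not Flink's CheckpointedInputGate. */
final class GateSketch {
    static final class Channel {
        private final Deque<String> priorityEvents = new ArrayDeque<>();
        private boolean released;

        void addPriorityEvent(String event) {
            priorityEvents.add(event);
        }

        void release() {
            released = true;
            priorityEvents.clear(); // releasing drops everything that is still queued
        }

        boolean isReleased() {
            return released;
        }

        Optional<String> pollPriorityEvent() {
            return Optional.ofNullable(priorityEvents.poll());
        }
    }

    private final Deque<Channel> channelsWithPriorityEvents = new ArrayDeque<>();

    /** A channel announces a priority event; released channels are not enqueued. */
    void notifyPriorityEvent(Channel channel) {
        if (!channel.isReleased()) {
            channelsWithPriorityEvents.add(channel);
        }
    }

    /** Polls all announced channels and returns how many priority events were handled. */
    int processPriorityEvents() {
        int handled = 0;
        Channel channel;
        while ((channel = channelsWithPriorityEvents.poll()) != null) {
            Optional<String> event = channel.pollPriorityEvent();
            if (!event.isPresent()) {
                continue; // channel was released after the notification: ignore the empty poll
            }
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        GateSketch gate = new GateSketch();
        Channel a = new Channel();
        Channel b = new Channel();
        a.addPriorityEvent("barrier");
        b.addPriorityEvent("barrier");
        gate.notifyPriorityEvent(a);
        gate.notifyPriorityEvent(b);
        b.release(); // cancellation races with the gate
        System.out.println("handled events: " + gate.processPriorityEvents());
    }
}
```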

Both commits are also tested by the existing UnalignedCheckpointITCase and covered by one new unit test each.

Lastly, there is a fix for UnalignedCheckpointITCase itself, which could have run indefinitely if a cancellation occurred after the final expected checkpoint.

Two related side commits improve the debuggability of the network code, especially in conjunction with unaligned checkpoints.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Collaborator

flinkbot commented Jan 28, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit f7f5fb6 (Fri May 28 08:15:12 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jan 28, 2021

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Contributor

@pnowojski pnowojski left a comment

LGTM % a couple of minor comments.

Arvid Heise and others added 5 commits January 29, 2021 20:57
…n there is a failure while finishing.

Also improved logging and check for data corruption.
… UnalignedCheckpoints

If the previous checkpoint is declined, a task can receive both an older and a newer
checkpoint barrier on two different channels before processing any checkpoint cancellation
message/RPC. If the newer checkpoint barrier happens to be processed before the obsolete one,
an incorrect `checkState` in ChannelStatePersister would cause a job failure. This checkState
assumed that the previous checkpoint would have been aborted/stopped before the new one was
triggered, while in reality the previous checkpoint was never triggered on this task,
so it could not have been stopped either.
…oteInputChannel

This commit fixes a bug where RemoteInputChannel incorrectly decided which
buffers should be spilled if it had received an obsolete CheckpointBarrier
that had not (yet) been cancelled.

During cancellation, CheckpointedInputGate may fail to poll a priority event if the corresponding channel has already been released. Until the race conditions are removed, it is safest to simply ignore an empty poll.
@AHeise AHeise changed the title from "[FLINK-21104][network] Various fixes for UC" to "[FLINK-20654][FLINK-21104][network] Fix couple bugs in the handling of unaligned checkpoints and cancellations." on Jan 29, 2021
@AHeise AHeise marked this pull request as ready for review January 29, 2021 20:04
Contributor

@pnowojski pnowojski left a comment

LGTM. Azure is green, merging.

4 participants