
[FLINK-19462][checkpointing] Update failed checkpoint stats #14635

Merged
11 commits merged into apache:master on Jan 25, 2021

Conversation

rkhachatryan
Contributor

What is the purpose of the change

Update checkpoint statistics (shown in the web UI) even after a checkpoint fails
(this would facilitate investigation of issues with slow checkpointing).

With this change, failed checkpoint stats are updated when:

  1. A subtask acks a checkpoint too late or after some other failure. AsyncCheckpointRunnable completes normally and reports the snapshot as usual; CheckpointCoordinator was updated to handle these calls.
  2. A subtask receives the abort notification and cancels the runnable before it completes. In this case it only reports the metrics; both the TM and JM sides were updated and a new RPC was added (both paths are sketched below).
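A minimal, self-contained sketch of these two reporting paths, using hypothetical class and method names (this is not Flink's actual CheckpointStatsTracker; it only models a stats holder that keeps accepting per-subtask metric reports after the checkpoint has been marked as failed):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only - hypothetical names, not Flink's real classes.
public class FailedCheckpointStatsSketch {

    // Hypothetical per-subtask report; in Flink these values would come from CheckpointMetrics.
    static final class SubtaskReport {
        final long syncDurationMs;
        final long asyncDurationMs;
        final boolean aborted;

        SubtaskReport(long syncDurationMs, long asyncDurationMs, boolean aborted) {
            this.syncDurationMs = syncDurationMs;
            this.asyncDurationMs = asyncDurationMs;
            this.aborted = aborted;
        }
    }

    static final class CheckpointStats {
        final long checkpointId;
        boolean failed;
        final Map<Integer, SubtaskReport> reportsBySubtask = new HashMap<>();

        CheckpointStats(long checkpointId) {
            this.checkpointId = checkpointId;
        }
    }

    private final Map<Long, CheckpointStats> statsByCheckpointId = new HashMap<>();

    // The coordinator marks the checkpoint as failed (e.g. on timeout) but keeps the entry.
    void markFailed(long checkpointId) {
        statsByCheckpointId.computeIfAbsent(checkpointId, CheckpointStats::new).failed = true;
    }

    // Path 1: AsyncCheckpointRunnable completes and acks normally, but the checkpoint
    // has already failed - the reported metrics are still recorded.
    void reportLateAck(long checkpointId, int subtaskIndex, SubtaskReport report) {
        statsByCheckpointId.computeIfAbsent(checkpointId, CheckpointStats::new)
                .reportsBySubtask.put(subtaskIndex, report);
    }

    // Path 2: the subtask receives the abort notification, cancels the runnable and
    // reports only the metrics it has; unknown durations stay at -1 ("n/a").
    void reportAborted(long checkpointId, int subtaskIndex, long syncDurationMs) {
        statsByCheckpointId.computeIfAbsent(checkpointId, CheckpointStats::new)
                .reportsBySubtask.put(subtaskIndex, new SubtaskReport(syncDurationMs, -1, true));
    }
}
```

The key point of the sketch is that markFailed does not close the stats entry, so both the late ack and the abort-time report still land in the per-subtask map that backs the web UI view.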

Verifying this change

This change added tests and can be verified as follows:

  • CheckpointCoordinatorTest.testCheckpointStatsUpdatedAfterFailure
  • CheckpointCoordinatorTest.testAbortedCheckpointStatsUpdatedAfterFailure
  • Manually verified the change by running DataStreamAllroundTestProgram on a local cluster with the following configuration (a roughly equivalent programmatic setup is sketched below):

```yaml
execution.checkpointing.interval: 10s
execution.checkpointing.min-pause: 1s
execution.checkpointing.timeout: 1s
execution.checkpointing.tolerable-failed-checkpoints: 1000000
execution.checkpointing.unaligned: true
taskmanager.numberOfTaskSlots: 8
web.checkpoints.history: 100
```
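For reference, a roughly equivalent programmatic setup using the standard DataStream API (a sketch; the manual test above was driven purely by the configuration file, and the slot count and web history size have no per-job equivalent):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFailureTestSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // execution.checkpointing.interval: 10s
        env.enableCheckpointing(10_000);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // execution.checkpointing.min-pause: 1s
        checkpointConfig.setMinPauseBetweenCheckpoints(1_000);
        // execution.checkpointing.timeout: 1s - deliberately short so checkpoints keep failing
        checkpointConfig.setCheckpointTimeout(1_000);
        // execution.checkpointing.tolerable-failed-checkpoints: 1000000 - keep the job running
        checkpointConfig.setTolerableCheckpointFailureNumber(1_000_000);
        // execution.checkpointing.unaligned: true
        checkpointConfig.enableUnalignedCheckpoints();

        // ... build and execute the test pipeline here ...
    }
}
```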

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Jan 13, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit bd8d680 (Fri May 28 08:06:00 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jan 13, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Contributor

@pnowojski pnowojski left a comment


Thanks for the change @rkhachatryan. I've left a couple of comments and questions (I might not have fully understood some things).

@rkhachatryan rkhachatryan force-pushed the f19462-v2 branch 2 times, most recently from 0a44c3a to fcc06b3 Compare January 14, 2021 17:00
@rkhachatryan
Contributor Author

Thanks for reviewing, @pnowojski.
I've addressed your feedback, PTAL.

@rkhachatryan rkhachatryan force-pushed the f19462-v2 branch 2 times, most recently from 026f0b5 to 2e65254 Compare January 15, 2021 00:53
…arguments

This is a preparatory step for FLINK-19462.

CheckpointStatsTracker is created with a fixed set of vertices.
At the time of checkpoint creation this set can be different.
As the checkpoint already carries the vertices, there is no need to store
them as state.

Besides that, changing the type from List<ExecutionJobVertex>
to Map<JobVertexID, Integer> simplifies writing the tests.
This is a preparatory step for FLINK-19462.

Motivation:
1. Ability to report metrics without state snapshot (subsequent commit)
2. Consistency with other metrics
This is a preparatory step for FLINK-19462.
…stent

This is a preparatory step for FLINK-19462.

Access to checkpoint stats by ID without calling history.snapshot()
allows updating failed checkpoint stats without a PendingCheckpoint
instance.
…heckpointStats

Inherit mutation semantics from PendingCheckpointStats
to allow updates even after the checkpoint has failed.
Contributor

@pnowojski pnowojski left a comment


I've tried out these changes and I think they are not working well as-is. I modified the WordCount example to randomly time out the async checkpoint phase and wanted to see what it looks like.

Two screenshots: https://imgur.com/a/sv6w2IJ

With this PR there is no way to know which task's checkpoint has failed. Also, the async phase timed out (5 seconds), yet the end-to-end duration is in the milliseconds range for all tasks/subtasks? The async duration is also in the [0 ms, 5 ms] range for all tasks, even the one that timed out.

I think it's strictly necessary to:

  • clearly mark which checkpoint for which subtask has failed
  • if we were not able to collect/calculate a metric, it must be N/A, not just 0 ms

I think it's almost a must-have to:

  • correctly calculate the durations (end-to-end, sync, async, etc.) also for failed checkpoints, not just N/A

Otherwise this is very misleading :(

(I haven't tried it with different kinds of failures, but this should be tried out before merging.)

@rkhachatryan
Contributor Author

Thanks a lot for trying it out.

I think it's strictly necessary to:
clearly mark which checkpoint for which subtask has failed

It is not always the task that fails a checkpoint; the timeout decision is made by the CheckpointCoordinator.
Multiple tasks can fail independently as well.
I agree that marking "failed" tasks would be useful, but I don't think it's directly related to this feature, or at least to this PR.

if we were not able to collect/calculate a metric, it must be N/A - not just 0ms

I don't see 0 ms on your screenshots or while running locally. Do you mean 0 B per operator?
If so, why is it incorrect? (I do see non-zero sizes when running on a cluster.)

correctly calculate the durations (end to end, sync, async, etc...) also for failed checkpoints, not just N/A

A checkpoint can be cancelled before even being started on some subtasks.

@rkhachatryan
Contributor Author

I've updated the PR (adding 4 new commits):

  1. Tasks reporting via the abort RPC are marked as aborted in the end-to-end duration column
  2. Only tasks that actually ACKed the checkpoint are counted for ackCount and lastAckTime
  3. -1 B is shown as - (the same way as durations)
  4. Fixed the docs

[screenshot: updated checkpoint stats in the web UI]

cc: @NicoK

@NicoK
Contributor

NicoK commented Jan 22, 2021

So, as soon as we are through the sync phase, we will get stats (if the CP is aborted during the sync phase, that won't interrupt the sync part anyway and will wait for it to complete). If we didn't reach the sync phase yet, the timeout could be because of slowly moving barriers (no barrier was received yet) or slow alignment (some barriers received but not all). These could be derived from looking at backpressure or data skew or starting times of other subtasks or timings from previous subtasks.

I think, the current state is a good step forward and the stats look good 👍

Contributor

@pnowojski pnowojski left a comment


Thanks for the update @rkhachatryan. Apart from some minor comments, there is potentially one more issue: if, for example, AsyncCheckpointRunnable fails (throws an exception), I cannot see any stats for any subtasks that have finished after the failure. In that case I see only n/a for the whole subtask, which is a bit inconsistent with the timeout behaviour.

This change introduces a new RPC from TM to JM.
The existing one can't be used because it:
a) confirms the checkpoint
b) requires a task state snapshot

The call is issued after cancelling the task's state-persisting
futures upon receiving the abort notification. This way
more precise metrics are available (compared to reporting
from AsyncCheckpointRunnable after cancellation).
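A sketch of what such a metrics-only RPC could look like on the JobManager side; the interface and method names here are illustrative and may not match the code actually added in this PR:

```java
import org.apache.flink.api.common.JobID;
import org.apache.flink.runtime.checkpoint.CheckpointMetrics;
import org.apache.flink.runtime.executiongraph.ExecutionAttemptID;

// Illustrative only: a TM -> JM call that carries checkpoint metrics but, unlike the
// existing acknowledge RPC, neither confirms the checkpoint nor requires a TaskStateSnapshot.
public interface CheckpointMetricsReportingGateway {

    void reportCheckpointMetrics(
            JobID jobId,
            ExecutionAttemptID executionAttemptId,
            long checkpointId,
            CheckpointMetrics checkpointMetrics);
}
```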
@rkhachatryan
Contributor Author

Thanks for the review @pnowojski .
I've added the space and created a ticket to translate the docs.
I've also squashed the commits.

for example AsyncCheckpointRunnable fails (throws an exception), I cannot see any stats for any subtasks that have finished after the failure

As discussed offline, this happens because the failed upstream doesn't send the barrier downstream.

@pnowojski pnowojski merged commit 6e77cfd into apache:master Jan 25, 2021