
[FLINK-19462][checkpointing] Update failed checkpoint stats #14635

Merged
11 commits merged into apache:master on Jan 25, 2021

Conversation

rkhachatryan
Contributor

What is the purpose of the change

Update checkpoint statistics (shown in the web UI) even after a checkpoint fails
(this would facilitate investigation of issues with slow checkpointing).

With this change, failed checkpoint stats are updated when:

  1. A subtask acks a checkpoint too late or after some other failure. AsyncCheckpointRunnable completes normally and reports the snapshot as usual; CheckpointCoordinator was updated to handle these calls.
  2. A subtask receives the abort notification and cancels the runnable before it completes. In this case it only reports the metrics; both the TM and JM sides were updated and a new RPC was added (both paths are sketched below).
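A minimal, self-contained sketch of these two reporting paths, using hypothetical class and method names (this is not Flink's actual CheckpointStatsTracker; it only models a stats holder that keeps accepting per-subtask metric reports after the checkpoint has been marked as failed):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only - hypothetical names, not Flink's real classes.
public class FailedCheckpointStatsSketch {

    // Hypothetical per-subtask report; in Flink these values would come from CheckpointMetrics.
    static final class SubtaskReport {
        final long syncDurationMs;
        final long asyncDurationMs;
        final boolean aborted;

        SubtaskReport(long syncDurationMs, long asyncDurationMs, boolean aborted) {
            this.syncDurationMs = syncDurationMs;
            this.asyncDurationMs = asyncDurationMs;
            this.aborted = aborted;
        }
    }

    static final class CheckpointStats {
        final long checkpointId;
        boolean failed;
        final Map<Integer, SubtaskReport> reportsBySubtask = new HashMap<>();

        CheckpointStats(long checkpointId) {
            this.checkpointId = checkpointId;
        }
    }

    private final Map<Long, CheckpointStats> statsByCheckpointId = new HashMap<>();

    // The coordinator marks the checkpoint as failed (e.g. on timeout) but keeps the entry.
    void markFailed(long checkpointId) {
        statsByCheckpointId.computeIfAbsent(checkpointId, CheckpointStats::new).failed = true;
    }

    // Path 1: AsyncCheckpointRunnable completes and acks normally, but the checkpoint
    // has already failed - the reported metrics are still recorded.
    void reportLateAck(long checkpointId, int subtaskIndex, SubtaskReport report) {
        statsByCheckpointId.computeIfAbsent(checkpointId, CheckpointStats::new)
                .reportsBySubtask.put(subtaskIndex, report);
    }

    // Path 2: the subtask receives the abort notification, cancels the runnable and
    // reports only the metrics it has; unknown durations stay at -1 ("n/a").
    void reportAborted(long checkpointId, int subtaskIndex, long syncDurationMs) {
        statsByCheckpointId.computeIfAbsent(checkpointId, CheckpointStats::new)
                .reportsBySubtask.put(subtaskIndex, new SubtaskReport(syncDurationMs, -1, true));
    }
}
```

The key point of the sketch is that markFailed does not close the stats entry, so both the late ack and the abort-time report still land in the per-subtask map that backs the web UI view.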

Verifying this change

This change added tests and can be verified as follows:

  • CheckpointCoordinatorTest.testCheckpointStatsUpdatedAfterFailure
  • CheckpointCoordinatorTest.testAbortedCheckpointStatsUpdatedAfterFailure
  • Manually verified the change by running DataStreamAllroundTestProgram on a local cluster with the following configuration (a roughly equivalent programmatic setup is sketched below):

```yaml
execution.checkpointing.interval: 10s
execution.checkpointing.min-pause: 1s
execution.checkpointing.timeout: 1s
execution.checkpointing.tolerable-failed-checkpoints: 1000000
execution.checkpointing.unaligned: true
taskmanager.numberOfTaskSlots: 8
web.checkpoints.history: 100
```
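For reference, a roughly equivalent programmatic setup using the standard DataStream API (a sketch; the manual test above was driven purely by the configuration file, and the slot count and web history size have no per-job equivalent):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFailureTestSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // execution.checkpointing.interval: 10s
        env.enableCheckpointing(10_000);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // execution.checkpointing.min-pause: 1s
        checkpointConfig.setMinPauseBetweenCheckpoints(1_000);
        // execution.checkpointing.timeout: 1s - deliberately short so checkpoints keep failing
        checkpointConfig.setCheckpointTimeout(1_000);
        // execution.checkpointing.tolerable-failed-checkpoints: 1000000 - keep the job running
        checkpointConfig.setTolerableCheckpointFailureNumber(1_000_000);
        // execution.checkpointing.unaligned: true
        checkpointConfig.enableUnalignedCheckpoints();

        // ... build and execute the test pipeline here ...
    }
}
```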

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Jan 13, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit bd8d680 (Fri May 28 08:06:00 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jan 13, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Contributor

@pnowojski pnowojski left a comment


Thanks for the change @rkhachatryan. I've left a couple of comments and questions (I might not have fully understood some things).

@rkhachatryan rkhachatryan force-pushed the f19462-v2 branch 2 times, most recently from 0a44c3a to fcc06b3 Compare January 14, 2021 17:00
@rkhachatryan
Contributor Author

Thanks for reviewing, @pnowojski.
I've addressed your feedback, PTAL.

@rkhachatryan rkhachatryan force-pushed the f19462-v2 branch 2 times, most recently from 026f0b5 to 2e65254 Compare January 15, 2021 00:53
…arguments

This is a preparatory step for FLINK-19462.

CheckpointStatsTracker is created with a fixed set of vertices.
At the time of checkpoint creation this set can be different.
As the checkpoint already carries the vertices, there is no need to store
them as state.

Besides that, changing the type from List<ExecutionJobVertex>
to Map<JobVertexID, Integer> simplifies writing the tests.
This is a preparatory step for FLINK-19462.

Motivation:
1. Ability to report metrics without state snapshot (subsequent commit)
2. Consistency with other metrics
This is a preparatory step for FLINK-19462.
…stent

This is a preparatory step for FLINK-19462.

Access to checkpoint stats by ID without calling history.snapshot()
allows updating failed checkpoint stats without a PendingCheckpoint
instance.
…heckpointStats

Inherit mutation semantics from PendingCheckpointStats
to allow updates even after the checkpoint has failed.
Contributor

@pnowojski pnowojski left a comment


I've tried out these changes and I think they are not working well as-is. I modified the WordCount example to randomly time out the async checkpoint phase and wanted to see what it looks like.

Two screenshots: https://imgur.com/a/sv6w2IJ

With this PR there is no way to know which task's checkpoint has failed. Also, the async phase timed out (5 seconds), yet the end-to-end duration is in the milliseconds range for all tasks/subtasks? The async duration is also in the [0 ms, 5 ms] range for all tasks, even the one that timed out.

I think it's strictly necessary to:

  • clearly mark which checkpoint for which subtask has failed
  • if we were not able to collect/calculate a metric, it must be N/A, not just 0 ms

I think it's almost a must-have to:

  • correctly calculate the durations (end-to-end, sync, async, etc.) also for failed checkpoints, not just N/A

Otherwise this is very misleading :(

(I haven't tried it with different kinds of failures, but this should be tried out before merging.)

@rkhachatryan
Contributor Author

Thanks a lot for trying it out.

I think it's strictly necessary to:
clearly mark which checkpoint for which subtask has failed

It is not always the task that fails a checkpoint; the timeout decision is made by the CheckpointCoordinator.
Multiple tasks can fail independently as well.
I agree that marking "failed" tasks would be useful, but I don't think it's directly related to this feature, or at least to this PR.

if we were not able to collect/calculate a metric, it must be N/A - not just 0ms

I don't see 0 ms on your screenshots or while running locally. Do you mean 0 B per operator?
If so, why is it incorrect? (I do see non-zero sizes when running on a cluster.)

correctly calculate the durations (end to end, sync, async, etc...) also for failed checkpoints, not just N/A

A checkpoint can be cancelled before even being started on some subtasks.

@rkhachatryan
Contributor Author

I've updated the PR (adding 4 new commits):

  1. Tasks reporting via the abort RPC are marked as aborted in the end-to-end duration column
  2. Only tasks that actually ACKed the checkpoint are counted for ackCount and lastAckTime
  3. -1 B is shown as - (the same way as durations)
  4. Fixed the docs

[screenshot: updated checkpoint stats in the web UI]

cc: @NicoK

@NicoK
Contributor

NicoK commented Jan 22, 2021

So, as soon as we are through the sync phase, we will get stats (if the CP is aborted during the sync phase, that won't interrupt the sync part anyway and will wait for it to complete). If we didn't reach the sync phase yet, the timeout could be because of slowly moving barriers (no barrier was received yet) or slow alignment (some barriers received but not all). These could be derived from looking at backpressure or data skew or starting times of other subtasks or timings from previous subtasks.

I think, the current state is a good step forward and the stats look good 👍

Contributor

@pnowojski pnowojski left a comment


Thanks for the update @rkhachatryan. Apart from some minor comments, there is potentially one more issue: if, for example, AsyncCheckpointRunnable fails (throws an exception), I cannot see any stats for any subtasks that have finished after the failure. In that case I see only n/a for the whole subtask, which is a bit inconsistent with the timeout behaviour.

This change introduces a new RPC from TM to JM.
The existing one can't be used because it:
a) confirms the checkpoint
b) requires a task state snapshot

The call is issued after cancelling the task's state-persisting
futures upon receiving the abort notification. This way
more precise metrics are available (compared to reporting
from AsyncCheckpointRunnable after cancellation).
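A sketch of what such a metrics-only RPC could look like on the JobManager side; the interface and method names here are illustrative and may not match the code actually added in this PR:

```java
import org.apache.flink.api.common.JobID;
import org.apache.flink.runtime.checkpoint.CheckpointMetrics;
import org.apache.flink.runtime.executiongraph.ExecutionAttemptID;

// Illustrative only: a TM -> JM call that carries checkpoint metrics but, unlike the
// existing acknowledge RPC, neither confirms the checkpoint nor requires a TaskStateSnapshot.
public interface CheckpointMetricsReportingGateway {

    void reportCheckpointMetrics(
            JobID jobId,
            ExecutionAttemptID executionAttemptId,
            long checkpointId,
            CheckpointMetrics checkpointMetrics);
}
```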
@rkhachatryan
Contributor Author

Thanks for the review @pnowojski .
I've added the space and created a ticket to translate the docs.
I've also squashed the commits.

for example AsyncCheckpointRunnable fails (throws an exception), I cannot see any stats for any subtasks that have finished after the failure

As discussed offline, this happens because the failed upstream doesn't send the barrier downstream.

@pnowojski pnowojski merged commit 6e77cfd into apache:master Jan 25, 2021