Conversation

@azagrebin
Contributor

What is the purpose of the change

This PR fixes double counting of in-progress checkpoints in the checkpoint statistics.

Brief change log

  • If a savepoint fails, restart the checkpoint coordinator only if the job is still running (prevents stopping the checkpoint coordinator twice in case of a global failure)
  • Clear pending checkpoints in the checkpoint coordinator's stop method before aborting them (prevents aborting them twice if stop is called from inside abort)
  • Log an error if the number of in-progress checkpoints becomes negative, but do not throw an exception and do not fail the job (prevents stats bugs from failing the job); a sketch of this guard follows the list
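
As a rough illustration of the last bullet, here is a minimal sketch of such a guard, using simplified, hypothetical class, method, and field names rather than the exact CheckpointStatsCounts code:

// Hedged sketch of the "log instead of throw" guard; names are simplified
// stand-ins, not the actual Flink implementation.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class CheckpointStatsCountsSketch {
    private static final Logger LOG = LoggerFactory.getLogger(CheckpointStatsCountsSketch.class);

    private long numInProgressCheckpoints;
    private long numCompletedCheckpoints;

    void incrementCompletedCheckpoints() {
        if (canDecrementInProgressCheckpoints()) {
            numInProgressCheckpoints--;
        }
        numCompletedCheckpoints++;
    }

    private boolean canDecrementInProgressCheckpoints() {
        if (numInProgressCheckpoints <= 0) {
            // Only log the inconsistency; do not throw, so a stats bug cannot fail the job.
            LOG.warn("Incremented the completed number of checkpoints "
                + "without incrementing the in progress checkpoints before.");
            return false;
        }
        return true;
    }
}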

Verifying this change

Submit DataStreamAllroundTestProgram in a loop on AWS EMR, wait for a failure, and check that there are no logs about a negative number of in-progress checkpoints.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

Contributor

@tillrohrmann left a comment

Thanks for opening this PR @azagrebin. I had some minor comments and a question. Before merging this PR, it would be great if you could address them.

numFailedCheckpoints);
}

private boolean assertDecrementOfInProgressCheckpointsNumber() {
Contributor

Maybe rename the method to canDecrementOfInProgressCheckpointsNumber.

String errorMessage = "Incremented the completed number of checkpoints " +
	"without incrementing the in progress checkpoints before.";
LOG.warn(errorMessage);
LOG.debug("Inconsistent CheckpointStatsCounts", new IllegalStateException(errorMessage));
Contributor

Let's remove this logging statement.

Contributor Author

Thanks, that was a debugging leftover.

}, getMainThreadExecutor())
.exceptionally(throwable -> {
-	if (cancelJob) {
+	if (cancelJob && executionGraph.getState() == JobStatus.RUNNING) {
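
For context, a minimal, self-contained sketch of what the guarded handler roughly does after this change (hypothetical, simplified types; not the actual JobMaster code):

// Hedged sketch: after a failed stop-with-savepoint, only restart the periodic
// checkpoint scheduler while the job is still RUNNING, so a global failure
// (which already tears down the coordinator) is not handled a second time.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class SavepointFailureHandlerSketch {
    enum JobStatus { RUNNING, FAILING, CANCELED }

    private JobStatus jobStatus = JobStatus.RUNNING;
    private boolean checkpointSchedulerRunning;

    CompletableFuture<String> attachHandler(CompletableFuture<String> savepointFuture, boolean cancelJob) {
        return savepointFuture.exceptionally(throwable -> {
            if (cancelJob && jobStatus == JobStatus.RUNNING) {
                checkpointSchedulerRunning = true; // stand-in for startCheckpointScheduler()
            }
            throw new CompletionException(throwable);
        });
    }
}
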
Contributor

What situation are we filtering out with this change?

new ArrayList<>(pendingCheckpoints.values());
pendingCheckpoints.clear();

for (PendingCheckpoint p : pendingCheckpointsSnapshot) {
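
In isolation, the clear-before-abort pattern above looks roughly like this hedged sketch (generic stand-in types instead of the real PendingCheckpoint): the map is copied and cleared first, so an abort callback that re-enters stop finds nothing left to abort a second time.

// Hedged sketch of clearing pending checkpoints before aborting them,
// using simplified stand-in types rather than the actual Flink classes.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ClearBeforeAbortSketch {
    private final Map<Long, Runnable> pendingCheckpoints = new HashMap<>();

    void stop() {
        // Snapshot and clear first: if aborting triggers a callback that calls
        // stop() again, the map is already empty and nothing is aborted twice.
        List<Runnable> snapshot = new ArrayList<>(pendingCheckpoints.values());
        pendingCheckpoints.clear();

        for (Runnable abortAction : snapshot) {
            abortAction.run();
        }
    }
}
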
Contributor

Just as a note, I guess we could also have solved the problem by changing JobMaster.java:965-975 into a handleAsync part, because the problem seems to be that aborting a PendingCheckpoint fails the savepoint future, which then directly triggers the execution of the exceptionally handler. By decoupling the exceptionally handler from the stopCheckpointScheduler call, there should not be any concurrency problem.
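
A hedged sketch of that decoupling idea with plain CompletableFuture (not the actual JobMaster code): handleAsync schedules the handler on the given executor instead of running it synchronously in the thread that fails the future, so it cannot re-enter the call stack that is aborting the PendingCheckpoint.

// Hedged sketch: decouple the failure handling from the thread that fails the
// savepoint future by scheduling it on the main-thread executor.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

class HandleAsyncSketch {
    static <T> CompletableFuture<T> decoupled(CompletableFuture<T> savepointFuture, Executor mainThreadExecutor) {
        return savepointFuture.handleAsync((result, throwable) -> {
            if (throwable != null) {
                // e.g. restart the checkpoint scheduler here, outside the aborting call stack
                throw new CompletionException(throwable);
            }
            return result;
        }, mainThreadExecutor);
    }
}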

Contributor Author

Agreed, this fix would solve the original problem without the other changes.

@azagrebin
Contributor Author

Thank you for the review, @tillrohrmann! I addressed the comments.

Contributor

@tillrohrmann left a comment

Thanks for updating the PR @azagrebin. LGTM. Merging.

@tillrohrmann
Contributor

Manually merged.
