@Zakelly Zakelly commented Jul 10, 2024

What is the purpose of the change

When file-merging is enabled, subsequent checkpoints can wrongly reuse restored files. Files from a previous job should never be reused after a restore. This PR fixes that.

Brief change log

  • When creating or referencing a PhysicalFile during recovery, mark it as not reusable.
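
The idea behind this change can be illustrated with a minimal sketch. Apart from the `couldReuse` flag name, which comes up in the review discussion, every class and method below is a hypothetical stand-in and not Flink's actual code: files registered during recovery are tracked but flagged as not reusable, so a later checkpoint never writes into them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a merged physical file; NOT Flink's actual PhysicalFile. */
final class SketchPhysicalFile {
    private final String path;
    // False for files restored from a previous job: later checkpoints must not write into them.
    private final boolean couldReuse;

    SketchPhysicalFile(String path, boolean couldReuse) {
        this.path = path;
        this.couldReuse = couldReuse;
    }

    String path() {
        return path;
    }

    boolean couldReuse() {
        return couldReuse;
    }
}

/** Hypothetical manager that picks the file a new checkpoint writes into. */
final class SketchFileManager {
    private final List<SketchPhysicalFile> knownFiles = new ArrayList<>();
    private int created = 0;

    /** Files seen during recovery are tracked, but flagged as not reusable. */
    SketchPhysicalFile registerRestoredFile(String path) {
        SketchPhysicalFile restored = new SketchPhysicalFile(path, false);
        knownFiles.add(restored);
        return restored;
    }

    /** Returns a reusable file owned by the current job, creating one if none exists. */
    SketchPhysicalFile getOrCreateFileForCheckpoint() {
        for (SketchPhysicalFile file : knownFiles) {
            if (file.couldReuse()) {
                return file;
            }
        }
        SketchPhysicalFile fresh = new SketchPhysicalFile("new-file-" + (created++), true);
        knownFiles.add(fresh);
        return fresh;
    }
}

class ReuseSketchDemo {
    public static void main(String[] args) {
        SketchFileManager manager = new SketchFileManager();
        SketchPhysicalFile restored = manager.registerRestoredFile("restored-0");
        SketchPhysicalFile target = manager.getOrCreateFileForCheckpoint();
        // The restored file is skipped; the checkpoint gets a file created by the current job.
        System.out.println(target.path() + " reusesRestored=" + (target == restored));
    }
}
```

In this sketch the flag is fixed at construction time, so a restored file cannot become reusable later by accident. Whether the real implementation sets the flag the same way is an assumption.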

Verifying this change

This change is already covered by existing integration tests, such as ResumeCheckpointManuallyITCase.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

flinkbot commented Jul 10, 2024

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

@1996fanrui 1996fanrui left a comment

Thanks @Zakelly for the quick fix~

I ran ResumeCheckpointManuallyITCase locally multiple times with this PR, and it works well.

LGTM assuming the CI is green.

fredia commented Jul 10, 2024

Thanks for the PR. couldReuse does have a problem, and the fix LGTM.

As for the failed test testSwitchFromEnablingToDisablingFileMerging, the check fails at the 3rd checkpoint. I think it has something to do with the test itself.

When restoring a job from file-merging enabled to file-merging disabled in CLAIM mode, the restored state handle may be reused by subsequent checkpoints, so it is uncertain when the restored files will actually be deleted.

Update: I found that the test itself is fixed in #25066 🚀

@Zakelly Zakelly merged commit 2cad548 into apache:master Jul 11, 2024
@Zakelly Zakelly deleted the f35803 branch July 11, 2024 02:30