[FLINK-28265][k8s] Do not discard state when the AlreadyExistException is caused by retries #20590

wangyang0918 · 2022-08-16T06:50:14Z

What is the purpose of the change

If the JobManager network is temporarily unstable, Fabric8FlinkKubeClient#checkAndUpdateConfigMap can fail with a KubernetesException on the first attempt and be retried. However, the HTTP request may actually have been sent successfully and handled by the K8s APIServer, which means the entry was already added to the ConfigMap. The retry then fails with AlreadyExistException and the state is discarded. If the JobManager crashes right at that point, subsequent attempts throw FileNotFoundException: No such file or directory: s3://xxx/flink-ha/xxx/completedCheckpoint72e30229420c, because the added entry was not cleaned up.

By making the AlreadyExistException in KubernetesStateHandleStore#addAndLock caused by PossibleInconsistentStateException, we can avoid discarding the state.
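
The failure sequence described above can be reproduced in a small, self-contained sketch. Everything here is a hypothetical stand-in (the in-memory map, the `addWithRetry` helper, the nested exception class) and not Flink's or Fabric8's real code; it only illustrates how a write that was applied but reported as failed turns into a spurious conflict on retry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; these are not Flink's real classes. */
public class RetryDuplicateSketch {

    /** Plays the role of the data stored in the HA ConfigMap. */
    static final Map<String, String> configMap = new HashMap<>();

    /** Stand-in for the exception named in the PR description. */
    static class AlreadyExistException extends Exception {
        AlreadyExistException(String message) {
            super(message);
        }
    }

    /** Non-idempotent add: any existing key is treated as a conflict. */
    static void addEntry(String key, String value) throws AlreadyExistException {
        if (configMap.containsKey(key)) {
            throw new AlreadyExistException(key + " already exists");
        }
        configMap.put(key, value);
    }

    /**
     * Simulates a flaky network: the first attempt is applied on the server,
     * but the client only sees a failure and retries.
     *
     * @return true if the caller would go on to discard the state
     */
    static boolean addWithRetry(String key, String value) {
        for (int attempt = 1; attempt <= 2; attempt++) {
            try {
                addEntry(key, value);
                if (attempt == 1) {
                    continue; // write applied, response lost: the client retries
                }
                return false;
            } catch (AlreadyExistException e) {
                return true; // the retry is mistaken for a real conflict
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean discarded = addWithRetry("checkpoint-1", "handle-1");
        System.out.println("state discarded: " + discarded);
        System.out.println("entry still present: " + configMap.containsKey("checkpoint-1"));
    }
}
```

In this sketch the entry stays in the map while the caller believes the add failed, which mirrors the dangling reference to an already-deleted checkpoint file.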

Brief change log

Do not discard state when the AlreadyExistException is caused by retries

Verifying this change

Add a new unit test testAddWithAlreadyExistExceptionCausedByRetriesShouldNotDiscardState

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2022-08-16T06:58:51Z

CI report:

abfa850 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

wangyang0918 · 2022-08-16T08:44:18Z

cc @rmetzger Would you like to have a look at this PR?

rmetzger · 2022-08-18T10:48:55Z

Thanks a lot for fixing this!

XComp

Thanks @wangyang0918 for looking into it and @rmetzger for checking the PR. I had a glimpse into it as well and was wondering whether we actually need to expose an error in that case. Please find my comment below...

...s/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java

wangyang0918 · 2022-08-19T02:31:38Z

Pushed a new fix that ignores the AlreadyExistException instead of throwing PossibleInconsistentStateException.
@XComp Would you like to have a look again?

XComp · 2022-08-19T09:47:43Z

@wangyang0918 thanks for the change and sorry for reiterating over it. I should have investigated "the other direction" already yesterday. I'm wondering whether we could make the addEntry call idempotent, so that the caller doesn't have to deal with it. We would just have to do an equality check on the content before throwing the AlreadyExistException in KubernetesStateHandleStore:663. WDYT?

FYI: The ZooKeeper implementation handles the very same problem around retry handling (see ZooKeeper:185)
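
A minimal sketch of what such an equality check could look like, assuming the ConfigMap data can be modeled as a plain `Map<String, String>`. The class name, the `addEntry` signature, and its boolean return value are hypothetical and chosen for illustration; they do not mirror the actual method in KubernetesStateHandleStore.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of an idempotent add; not Flink's real code. */
public class IdempotentAddEntrySketch {

    /** Stand-in for the exception discussed in this PR. */
    static class AlreadyExistException extends Exception {
        AlreadyExistException(String message) {
            super(message);
        }
    }

    /**
     * Adds the entry unless an identical one is already present.
     *
     * @return true if the map was modified, false if the same content already existed
     * @throws AlreadyExistException if the key exists with different content
     */
    static boolean addEntry(Map<String, String> data, String key, String contentToBeAdded)
            throws AlreadyExistException {
        String existing = data.get(key);
        if (existing == null) {
            data.put(key, contentToBeAdded);
            return true;
        }
        if (existing.equals(contentToBeAdded)) {
            // Same content: treat it as a replayed write, whatever caused the duplicate.
            return false;
        }
        throw new AlreadyExistException(key + " already exists with different content");
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> data = new HashMap<>();
        System.out.println(addEntry(data, "checkpoint-1", "handle-1")); // first write
        System.out.println(addEntry(data, "checkpoint-1", "handle-1")); // replayed write
        try {
            addEntry(data, "checkpoint-1", "handle-2");
        } catch (AlreadyExistException e) {
            System.out.println("conflict: " + e.getMessage());
        }
    }
}
```

With this shape, a retried write of the same content is a no-op, and only a genuinely different value under the same key surfaces as a conflict.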

wangyang0918 · 2022-08-19T10:48:55Z

@XComp Thanks for the nice suggestion. I will integrate your comments soon.

We should not care whether the duplicated entry is caused by retries or something else if the content is the same as the contentToBeAdded.

This closes apache#20590.

wangyang0918 · 2022-08-22T08:08:24Z

@XComp I have addressed your comments. Please have a look.

XComp

I did another pass over the PR. The change looks good. I just had a few minor cosmetic comments. WDYT?

...s/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java

...c/test/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStoreTest.java

XComp

LGTM 👍

XComp · 2022-08-23T14:52:04Z

Something odd happened with the most recent build for that branch. But it looks like it's unrelated to the change. I'm wondering whether there's something wrong with AlibabaCI002-agent01. Publishing the build artifacts in compile_ci timed out. The upload didn't make any progress. Do you have access to the machine, @wangyang0918?

wangyang0918 · 2022-08-24T03:34:25Z

@XComp I will try to log in and find out what's happening on AlibabaCI002-agent01.

wangyang0918 · 2022-08-24T03:46:29Z

@flinkbot run azure

wangyang0918 · 2022-08-24T07:41:09Z

From the alicloud ECS monitoring, I didn't find any network issues on AlibabaCI002. I still could not manage to log in for further analysis. It seems that other agents on the same machine work well.

flinkbot added the component=Runtime/Coordination label Aug 16, 2022

wangyang0918 force-pushed the FLINK-28265-1 branch from 1c195f3 to 6414a42 Compare August 16, 2022 08:31

rmetzger approved these changes Aug 18, 2022

XComp reviewed Aug 18, 2022

...s/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java

wangyang0918 force-pushed the FLINK-28265-1 branch from 6414a42 to b55b842 Compare August 19, 2022 02:27

wangyang0918 requested a review from XComp August 19, 2022 02:28

wangyang0918 force-pushed the FLINK-28265-1 branch from b55b842 to 6414a42 Compare August 22, 2022 08:03

wangyang0918 added a commit to wangyang0918/flink that referenced this pull request Aug 22, 2022

[FLINK-28265][k8s] Make KubernetesStateHandleStore#addEntry idempotent

4146121

This closes apache#20590.

wangyang0918 force-pushed the FLINK-28265-1 branch from 6414a42 to 4146121 Compare August 22, 2022 08:05

XComp requested changes Aug 22, 2022

[FLINK-28265][k8s] Make KubernetesStateHandleStore#addEntry idempotent

abfa850

This closes apache#20590.

wangyang0918 force-pushed the FLINK-28265-1 branch from 4146121 to abfa850 Compare August 22, 2022 15:14

wangyang0918 requested a review from XComp August 22, 2022 15:18

XComp approved these changes Aug 23, 2022

wangyang0918 mentioned this pull request Aug 24, 2022

[BP-1.15][FLINK-28265][k8s] Make KubernetesStateHandleStore#addEntry idempotent #20673

Merged

wangyang0918 merged commit aae96d0 into apache:master Aug 24, 2022

huangxiaofeng10047 pushed a commit to huangxiaofeng10047/flink that referenced this pull request Nov 3, 2022

[FLINK-28265][k8s] Make KubernetesStateHandleStore#addEntry idempotent

c8d48d8

This closes apache#20590.
