-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-28265][k8s] Do not discard state when the AlreadyExistException is caused by retries #20590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1c195f3 to
6414a42
Compare
|
cc @rmetzger Would you like to have a look on this PR? |
|
Thanks a lot for fixing this! |
XComp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wangyang0918 for looking into it and @rmetzger for checking the PR. I had a glimpse into it as well and was wondering whether we actually need to expose an error in that case. Please find my comment below...
...s/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java
Outdated
Show resolved
Hide resolved
6414a42 to
b55b842
Compare
|
Push a new fix which ignores the |
|
@wangyang0918 thanks the change and sorry for reiterating over it. I should have investigated "the other direction" already yesterday. I'm wondering whether we could make the FYI: The ZooKeeper implementation handles the very same problem around retry handling (see ZooKeeper:185) |
|
@XComp Thanks for the nice suggestion. I will integrate your comments soon. We should not care whether the duplicated entry is caused by retries or something else if the content is same as the |
b55b842 to
6414a42
Compare
6414a42 to
4146121
Compare
|
@XComp I have addressed your comments. Please have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did another pass over the PR. The change looks good. I just had a few minor cosmetic comments. WDYT?
...s/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java
Outdated
Show resolved
Hide resolved
...s/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStore.java
Outdated
Show resolved
Hide resolved
...c/test/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStoreTest.java
Outdated
Show resolved
Hide resolved
...c/test/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStoreTest.java
Outdated
Show resolved
Hide resolved
...c/test/java/org/apache/flink/kubernetes/highavailability/KubernetesStateHandleStoreTest.java
Outdated
Show resolved
Hide resolved
4146121 to
abfa850
Compare
XComp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
|
Something odd happened with the most recent build for that branch. But it looks like it's unrelated to the change. I'm wondering whether there's something wrong with |
|
@XComp I will try to login and find out what's happening on the |
|
@flinkbot run azure |
|
From the alicloud ECS monitoring, I didn't find any network issues in the |
What is the purpose of the change
If something is temporarily wrong with the JobManager network,
Fabric8FlinkKubeClient#checkAndUpdateConfigMapfailed withKubernetesExceptionin the first run and retried again. However, the http request is actually sent successfully and handled by the K8s APIServer, which means the entry was added to the ConfigMap. This will cause the second retry fails withAlreadyExistExceptionand then discard the state. If the JobManager crashed exactly, it will throw theFileNotFoundException: No such file or directory: s3://xxx/flink-ha/xxx/completedCheckpoint72e30229420cin the following attempts since added entry is not cleaned up.By make the
AlreadyExistExceptioninKubernetesStateHandleStore#addAndLockcaused byPossibleInconsistentStateExceptionwe could avoid discarding the state.Brief change log
Verifying this change
testAddWithAlreadyExistExceptionCausedByRetriesShouldNotDiscardStateDoes this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)Documentation