@wangyang0918
Contributor

What is the purpose of the change

If the JobManager network is temporarily unavailable, Fabric8FlinkKubeClient#checkAndUpdateConfigMap can fail with a KubernetesException on the first attempt and then be retried. However, the HTTP request may actually have been sent successfully and handled by the K8s API server, which means the entry was already added to the ConfigMap. The retry then fails with an AlreadyExistException and the state is discarded. If the JobManager crashes at exactly this point, subsequent attempts throw FileNotFoundException: No such file or directory: s3://xxx/flink-ha/xxx/completedCheckpoint72e30229420c, because the added entry was never cleaned up.

By having KubernetesStateHandleStore#addAndLock throw a PossibleInconsistentStateException when the AlreadyExistException is caused by a retry, we can avoid discarding the state.
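
The failure sequence described in this section can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `LostResponseRetryDemo`, its in-memory map, and the exception types used here are hypothetical stand-ins, not the Fabric8 client or Flink classes. It shows how a write that is applied server-side but whose response is lost makes a naive retry fail with an "already exists" error.

```java
import java.util.HashMap;
import java.util.Map;

public class LostResponseRetryDemo {
    // Hypothetical stand-in for the ConfigMap data held by the API server.
    public final Map<String, String> configMap = new HashMap<>();
    // When true, the next applied write loses its response on the way back.
    public boolean dropNextResponse = true;

    /** Applies the write server-side, but may fail to report success to the caller. */
    public void addEntry(String key, String value) {
        if (configMap.containsKey(key)) {
            throw new IllegalStateException("AlreadyExist: " + key);
        }
        configMap.put(key, value); // the server has applied the write
        if (dropNextResponse) {
            dropNextResponse = false;
            throw new RuntimeException("response lost"); // the client only sees a network error
        }
    }

    /** Naive retry loop that treats "already exists" as a hard failure. */
    public String addWithRetry(String key, String value, int maxAttempts) {
        String outcome = "not attempted";
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                addEntry(key, value);
                return "added on attempt " + attempt;
            } catch (IllegalStateException e) {
                return "failed on attempt " + attempt + ": " + e.getMessage();
            } catch (RuntimeException e) {
                outcome = "gave up after: " + e.getMessage();
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        LostResponseRetryDemo demo = new LostResponseRetryDemo();
        System.out.println(demo.addWithRetry("checkpoint-1", "handle-1", 3));
        System.out.println(demo.configMap.containsKey("checkpoint-1"));
    }
}
```

In this sketch the second attempt fails even though the entry it wanted to add is already present, which is the situation in which discarding the state would be wrong.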

Brief change log

  • Do not discard state when the AlreadyExistException is caused by retries

Verifying this change

  • Add a new unit test testAddWithAlreadyExistExceptionCausedByRetriesShouldNotDiscardState

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Collaborator

flinkbot commented Aug 16, 2022

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@wangyang0918
Contributor Author

cc @rmetzger Would you like to have a look at this PR?

@rmetzger
Contributor

Thanks a lot for fixing this!

Contributor

@XComp XComp left a comment

Thanks @wangyang0918 for looking into it and @rmetzger for checking the PR. I had a look at it as well and was wondering whether we actually need to expose an error in that case. Please find my comment below...

@wangyang0918
Contributor Author

Pushed a new fix that ignores the AlreadyExistException instead of throwing PossibleInconsistentStateException.
@XComp Would you like to have a look again?

@XComp
Contributor

XComp commented Aug 19, 2022

@wangyang0918 thanks for the change, and sorry for iterating over it again. I should have investigated "the other direction" yesterday already. I'm wondering whether we could make the addEntry call idempotent, so that the caller doesn't have to deal with it. We would just have to do an equality check on the content before throwing the AlreadyExistException in KubernetesStateHandleStore:663. WDYT?

FYI: The ZooKeeper implementation handles the very same problem around retry handling (see ZooKeeper:185)
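
A minimal sketch of the idempotent add suggested above, assuming the stored content can be compared as plain strings. `IdempotentAddSketch`, its in-memory map, and the use of `IllegalStateException` are hypothetical and only illustrate the equality check; this is not the actual KubernetesStateHandleStore code.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentAddSketch {
    // Hypothetical stand-in for the key/value data of the ConfigMap.
    private final Map<String, String> data = new HashMap<>();

    /**
     * Adds the entry. If the key already exists with identical content, the call
     * is treated as a repeated (for example, retried) add and succeeds silently.
     * Only a key holding different content is reported as a conflict.
     */
    public void addEntry(String key, String contentToBeAdded) {
        String existing = data.get(key);
        if (existing == null) {
            data.put(key, contentToBeAdded);
            return;
        }
        if (existing.equals(contentToBeAdded)) {
            return; // same content: nothing to do, and nothing to discard
        }
        throw new IllegalStateException(key + " already exists with different content");
    }

    public String get(String key) {
        return data.get(key);
    }
}
```

With this shape, the caller does not need to know whether a duplicate was produced by a retry; it only sees an error when the existing content really differs.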

@wangyang0918
Contributor Author

@XComp Thanks for the nice suggestion. I will integrate your comments soon.

We should not care whether the duplicated entry is caused by retries or something else if the content is the same as the contentToBeAdded.

@wangyang0918
Contributor Author

@XComp I have addressed your comments. Please have a look.

Contributor

@XComp XComp left a comment

I did another pass over the PR. The change looks good. I just had a few minor cosmetic comments. WDYT?

Contributor

@XComp XComp left a comment

LGTM 👍

@XComp
Contributor

XComp commented Aug 23, 2022

Something odd happened with the most recent build for that branch. But it looks like it's unrelated to the change. I'm wondering whether there's something wrong with AlibabaCI002-agent01. Publishing the build artifacts in compile_ci timed out. The upload didn't make any progress. Do you have access to the machine, @wangyang0918?

@wangyang0918
Contributor Author

@XComp I will try to log in and find out what's happening on AlibabaCI002-agent01.

@wangyang0918
Contributor Author

@flinkbot run azure

@wangyang0918
Contributor Author

wangyang0918 commented Aug 24, 2022

From the Alicloud ECS monitoring, I didn't find any network issues on AlibabaCI002. I still could not manage to log in for further analysis. It seems that the other agents on the same machine work well.

@wangyang0918 wangyang0918 merged commit aae96d0 into apache:master Aug 24, 2022
huangxiaofeng10047 pushed a commit to huangxiaofeng10047/flink that referenced this pull request Nov 3, 2022
4 participants