Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-11850][zk] Tolerate concurrent child deletions when deleting owned zNode #7928

Merged
merged 3 commits into from Mar 7, 2019

Conversation

tillrohrmann
Copy link
Contributor

What is the purpose of the change

When calling ZooKeeperHaServices#closeAndCleanupAllData it can happen that a child of the owned
zNode of the ZooKeeperHaServices is being concurrently deleted (e.g. a LeaderElectionService has
been shut down). In order to tolerate concurrent deletions, we use now ZKPaths#deleteChildren.

Verifying this change

  • Covered by existing ZooKeeperHaServicesTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 7, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ✅ 1. The [description] looks good.
  • ✅ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ✅ 4. The change fits into the overall [architecture].
  • ✅ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me :-)

@zentol
Copy link
Contributor

zentol commented Mar 7, 2019

Do you have a reference for the supposed behavior (failing if a child is deleted) that is not caused by ZKPaths.deleteChildren?

The only related issue I could only find is https://issues.apache.org/jira/browse/CURATOR-430.
Based on this client.delete().deletingChildrenIfNeeded().forPath("/") calls ZKPaths.deleteChildren internally (and the source seems to confirm that).
This issue was fixed by curator within ZKPaths.deleteChildren as well. (pr)
The fix was not backported to the version we're using.

@zentol zentol self-assigned this Mar 7, 2019
@tillrohrmann
Copy link
Contributor Author

I think you are right @zentol. Our curator version 2.12.0 does not fix the problem and it looked only like this when running the tests a couple of times. I think the only solution atm is to handle this exception at the call site and retry the operation until it succeeds. I will update the PR accordingly.

@tisonkun
Copy link
Member

tisonkun commented Mar 7, 2019

@tillrohrmann isn't it an option that we bump our shaded curator version?

@tillrohrmann
Copy link
Contributor Author

@tisonkun I wouldn't do this so close to the actual release. Rather, I prefer to do this after the 1.8 release to give it a bit more exposure.

Copy link
Contributor

@zentol zentol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 minor comment.

@flinkbot approve all

client.delete().deletingChildrenIfNeeded().forPath("/");
zNodeDeleted = true;
} catch (KeeperException.NoNodeException ignored) {
// concurrent delete operation. Try again.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could log this on debug just in case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, will add it.

…ission

This commit changes the cleanup logic of the Dispatcher to only clean up job HA files
if the job is not a duplicate (meaning that it is either running or has already been
executed by the same JobMaster).

This closes apache#7918.
…nd MiniCluster

The io executor is responsible for running io operations like discarding checkpoints.
By using the io executor, we don't risk that the RpcService is blocked by blocking
io operations.

This closes apache#7924.
@tillrohrmann
Copy link
Contributor Author

Thanks for the review @tisonkun and @zentol. Addressing Chesnay's last comment and then merging this PR.

…wned zNode

When calling ZooKeeperHaServices#closeAndCleanupAllData it can happen that a child of the owned
zNode of the ZooKeeperHaServices is being concurrently deleted (e.g. a LeaderElectionService has
been shut down). In order to tolerate concurrent deletions, we use now ZKPaths#deleteChildren.

This closes apache#7928.
@asfgit asfgit merged commit b464df2 into apache:master Mar 7, 2019
@tillrohrmann tillrohrmann deleted the FLINK-11850 branch March 7, 2019 21:52
HuangZhenQiu pushed a commit to HuangZhenQiu/flink that referenced this pull request Apr 22, 2019
…wned zNode

When calling ZooKeeperHaServices#closeAndCleanupAllData it can happen that a child of the owned
zNode of the ZooKeeperHaServices is being concurrently deleted (e.g. a LeaderElectionService has
been shut down). In order to tolerate concurrent deletions, we use now ZKPaths#deleteChildren.

This closes apache#7928.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants