Skip to content

Fix ZkHelixPropertyStore loses Zookeeper notification issue#924

Merged
jiajunwang merged 10 commits intoapache:masterfrom
kaisun2000:PropertyStoreLosingCallback
Apr 14, 2020
Merged

Fix ZkHelixPropertyStore loses Zookeeper notification issue#924
jiajunwang merged 10 commits intoapache:masterfrom
kaisun2000:PropertyStoreLosingCallback

Conversation

@kaisun2000
Copy link
Contributor

@kaisun2000 kaisun2000 commented Mar 31, 2020

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Fixes #921

Description

  • Here are some details about my PR, including screenshots of any UI changes:

    ZkHelixPropertyStore loses ZK notification after session expires.
    The issue was caused by a bug in Share ZkClient code path. More
    specifically, Share ZkClient would not call fireAllEvent when ZK
    session expires. Thus, ZkHelixPropertyStore would not install
    watches for corresponding ZkPath. Thus, lose Zookeeper
    nofiticaition when changes happens.

Tests

  • The following tests are written for this issue:

testSessionExpirationWithSharedZkClient

  • The following is the result of the "mvn test" command on the appropriate module:

ksun-mn1:helix-core ksun$ mvn test

Failed tests:
TestZkConnectionLost.testLostZkConnection » ThreadTimeout Method org.testng.in...
TestJobFailureDependence.testWorkflowFailureJobThreshold » ThreadTimeout Metho...
TestHelixAdminCli.testInstanceOperations:469 » ZkClient Failed to delete /Test...

Tests run: 1093, Failures: 3, Errors: 0, Skipped: 4

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:37 h
[INFO] Finished at: 2020-04-06T18:55:19-07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project helix-core: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/ksun/dev_branch_helix/helix/helix-core/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Individual test case run will all succeed.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

ZkHelixPropertyStore loses ZK notification after session expires.
THe issue was caused by a bug in Share ZkClient code path. More
specifically, Share ZkClient would not call fireAllEvent when ZK
session expires. Thus, ZkHelixPropertyStore would not install
watches for corresponding ZkPath. Thus, lose Zookeeper
nofiticaition when changes happens.
Copy link
Contributor

@jiajunwang jiajunwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the main issue is about the additional public method for fetching the client. I don't like that either.
An alternative way is that, since you are for sure given a shared zkclient in that property store, you can call ShareZkClientFactory with the same parameters and it will return you with another shared ZkClient instance. But, since they will use the same connection manager, when you expires the newly created client, the one that in the HelixPropertyStore will also be expired for once.
Can you have a try? In this way, no new method is required.

@kaisun2000
Copy link
Contributor Author

kaisun2000 commented Mar 31, 2020

@jiajunwang

I see the main issue is about the additional public method for fetching the client. I don't like that either.
An alternative way is that, since you are for sure given a shared zkclient in that property store, you can call ShareZkClientFactory with the same parameters and it will return you with another shared ZkClient instance. But, since they will use the same connection manager, when you expires the newly created client, the one that in the HelixPropertyStore will also be expired for once.
Can you have a try? In this way, no new method is required.

How do we get the zkconnection manager from the new sharedZkClient? It is not public.

If we make zkConnectionManager protected, we need to access it from the module in zookeeper-api, which means move (rewrite) this test case there. But this case the test case uses the code from HelixPropertyStore, which is helix-core. This is circular dependency.

@jiajunwang
Copy link
Contributor

@jiajunwang

I see the main issue is about the additional public method for fetching the client. I don't like that either.
An alternative way is that, since you are for sure given a shared zkclient in that property store, you can call ShareZkClientFactory with the same parameters and it will return you with another shared ZkClient instance. But, since they will use the same connection manager, when you expires the newly created client, the one that in the HelixPropertyStore will also be expired for once.
Can you have a try? In this way, no new method is required.

How do we get the zkconnection manager from the new sharedZkClient? It is not public.

If we make zkConnectionManager protected, we need to access it from the module in zookeeper-api, which means move (rewrite) this test case there. But this case use the code from HelixPropertyStore, which is helix-core. This is circular dependency.

Why you need to get zkconnection manager? The object you operate on is the send shared zkclient that you get from the factory.

@kaisun2000
Copy link
Contributor Author

As discussed offline, ZkTestHelper.expireSession() only works with zkConnectionmanager, but not shared zkclient.

But we can add another expireSharedZkClientSession() though for ZkTestHelper.

Kai Sun added 4 commits April 1, 2020 00:10
1/ move the test from TestZkHelixPropertyStore to TestZkCacheAsyncOpSingleThread
This is to use TestHelper.VerifyWithTimeout()
2/ add expireSession with SharedZkClient featue in ZkTestHelper and remove all
the public accessor to get ZkClient for test purpose
use the concise logic in zkClient.
@kaisun2000 kaisun2000 changed the title WIP: fix ZkHelixPropertyStore loses Zookeeper notification issue Fix ZkHelixPropertyStore loses Zookeeper notification issue Apr 7, 2020
Copy link
Contributor

@huizhilu huizhilu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaisun2000 There are a few places that need to be changed. But the threads are resolved without any changes. Can you take a look again? Thanks.

Copy link
Contributor

@jiajunwang jiajunwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jiajunwang
Copy link
Contributor

@kaisun2000, please follow the merge steps to finish this PR merge process if you think it is good enough.

@pkuwm, please take another look to ensure this looks good to you.

@kaisun2000
Copy link
Contributor Author

This PR is ready to be merged, approved by @jiajunwang

Final message:
fix ZkHelixPropertyStore loses Zookeeper notification issue

ZkHelixPropertyStore loses ZK notification after session expires.
THe issue was caused by a bug in Share ZkClient code path. More
specifically, Share ZkClient would not call fireAllEvent when ZK
session expires. Thus, ZkHelixPropertyStore would not install
watches for corresponding ZkPath. Thus, lose Zookeeper
nofiticaition when changes happens.

@kaisun2000 kaisun2000 closed this Apr 11, 2020
@kaisun2000 kaisun2000 reopened this Apr 11, 2020
Copy link
Contributor

@huizhilu huizhilu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@huizhilu
Copy link
Contributor

@kaisun2000 Just a reminder: please run the tests in an appropriate module before merging the PR. Thanks.

@kaisun2000
Copy link
Contributor Author

I did run another round of mvn test; same results. 4 flaking test. Running independently they succeed. The same as before this fix and added test,

@jiajunwang jiajunwang merged commit 108dfc6 into apache:master Apr 14, 2020
huizhilu pushed a commit to huizhilu/helix that referenced this pull request Aug 16, 2020
ZkHelixPropertyStore loses ZK notification after session expires.
THe issue was caused by a bug in Share ZkClient code path. More
specifically, Share ZkClient would not call fireAllEvent when ZK
session expires. Thus, ZkHelixPropertyStore would not install
watches for corresponding ZkPath. Thus, lose Zookeeper
nofiticaition when changes happens.

Co-authored-by: Kai Sun <ksun@ksun-mn1.linkedin.biz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ZkHelixPropertyStore miss ZK notification

5 participants