Add async call retry to resolve the transient ZK connection issue. by jiajunwang · Pull Request #970 · apache/helix

jiajunwang · 2020-04-23T23:25:10Z

Issues

My PR addresses the following Helix issues and references them in the PR description:

Description

Here are some details about my PR, including screenshots of any UI changes:

If any exception happens during an async call, the current design fails the operation and may eventually return a partial result.
This change makes the ZkClient retry the operation if the error is caused by a temporary ZK connection issue (CONNECTIONLOSS, SESSIONEXPIRED, SESSIONMOVED),
so the async call has a better chance of finishing the operation. Note that if the exception is due to business logic, the async call will still fail and the right return code will be sent to the callback handler.
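To make the distinction between transient and business-logic errors concrete, here is a minimal sketch of how return codes might be classified. It is not the Helix implementation: the class `RetryableCodeSketch`, the local `Code` enum, and the method `needRetry` are hypothetical names for illustration, and the enum only mirrors the numeric values of a few ZooKeeper `KeeperException.Code` constants so that the example runs without the ZooKeeper library.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; names here are illustrative, not taken from the Helix code. */
class RetryableCodeSketch {
  /** Local stand-in for a few ZooKeeper return codes (values mirror KeeperException.Code). */
  enum Code {
    OK(0), CONNECTIONLOSS(-4), NONODE(-101), NODEEXISTS(-110),
    SESSIONEXPIRED(-112), SESSIONMOVED(-118);

    final int value;

    Code(int value) {
      this.value = value;
    }
  }

  // The transient connection/session errors listed in the PR description.
  static final Set<Code> RETRYABLE =
      EnumSet.of(Code.CONNECTIONLOSS, Code.SESSIONEXPIRED, Code.SESSIONMOVED);

  /** Returns true only for errors that are assumed to be transient. */
  static boolean needRetry(Code rc) {
    return RETRYABLE.contains(rc);
  }

  public static void main(String[] args) {
    for (Code c : Code.values()) {
      System.out.println(c + " (" + c.value + ") retry=" + needRetry(c));
    }
  }
}
```

Under this sketch, a business-logic result such as NONODE or NODEEXISTS is passed straight to the callback handler, while a connection-level error is eligible for another attempt.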

Tests

The following tests are written for this issue:

TestZkClientAsyncRetry

Ran the test 75 times to ensure it's stable.
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.558 s - in org.apache.helix.zookeeper.impl.client.TestZkClientAsyncRetry
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.896 s
[INFO] Finished at: 2020-04-25T15:56:37-07:00
[INFO] ------------------------------------------------------------------------
//======================================================================
Attempt 75 TestZkClientAsyncRetry
//======================================================================

The following is the result of the "mvn test" command on the appropriate module:

zookeeper-api

[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.118 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13.684 s
[INFO] Finished at: 2020-04-25T16:20:32-07:00
[INFO] ------------------------------------------------------------------------

helix-core

[INFO] Tests run: 1144, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4,621.077 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 1144, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:17 h
[INFO] Finished at: 2020-04-27T12:34:24-07:00
[INFO] ------------------------------------------------------------------------

helix-rest

[INFO] Tests run: 159, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 193.83 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 159, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:19 min
[INFO] Finished at: 2020-04-25T16:51:55-07:00
[INFO] ------------------------------------------------------------------------

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation (Optional)

In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

junkaixue · 2020-04-28T21:05:55Z

zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java

+              data == null ? 0 : data.length, false) {
+            @Override
+            protected void doRetry() {
+              doAsyncSetData(path, data, version, System.currentTimeMillis(), cb);


Is this recursive self-call OK?

This is the design. If there is no connectivity issue, it won't be triggered. The assumption here is that the connectivity issue is transient and won't happen continuously.

I am not quite familiar with the approach you use. I just wonder whether it could cause infinite calls and then a stack overflow?

If I say it won't, I guess you won't believe it so easily. Please take a look at the code carefully. In general, it is not a recursive call; it is a callback triggered in a different thread after this method is done. Even if we keep retrying, only one call exists on the stack.
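The point about the stack can be illustrated with a small stand-alone sketch. It does not use the Helix classes: `AsyncRetryStackSketch`, `run`, and `attempt` are hypothetical names, and a plain single-thread `ExecutorService` stands in for the retry thread. Each failed attempt hands the next attempt to the executor and returns, so the number of `attempt` frames on any one stack stays at one no matter how many retries happen.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; a plain executor stands in for the retry thread. */
class AsyncRetryStackSketch {
  /**
   * Simulates an operation that fails `failures` times before succeeding.
   * Returns the largest number of attempt() frames seen on a single stack.
   */
  static int run(int failures) {
    ExecutorService retryThread = Executors.newSingleThreadExecutor();
    AtomicInteger remainingFailures = new AtomicInteger(failures);
    AtomicInteger maxDepth = new AtomicInteger(0);
    CountDownLatch done = new CountDownLatch(1);
    try {
      attempt(retryThread, remainingFailures, maxDepth, done);
      if (!done.await(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("operation did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      retryThread.shutdownNow();
    }
    return maxDepth.get();
  }

  static void attempt(ExecutorService retryThread, AtomicInteger remainingFailures,
      AtomicInteger maxDepth, CountDownLatch done) {
    // Count how many attempt() frames are on the current thread's stack.
    int depth = 0;
    for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
      if (frame.getMethodName().equals("attempt")) {
        depth++;
      }
    }
    maxDepth.accumulateAndGet(depth, Math::max);

    if (remainingFailures.getAndDecrement() > 0) {
      // Simulated transient failure: schedule the retry elsewhere, then return.
      retryThread.execute(() -> attempt(retryThread, remainingFailures, maxDepth, done));
    } else {
      done.countDown();
    }
  }
}
```

Calling `AsyncRetryStackSketch.run(1000)` performs a thousand retries, yet every invocation of `attempt` starts from a fresh stack on the executor thread, which is why repeated retries do not grow the stack the way a true recursive call would.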

junkaixue · 2020-04-28T21:12:56Z

zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/callback/ZkAsyncCallbacks.java

+            LOG.error("Failed to request to retry the operation.", t);
+          }
+        } else {
+          LOG.warn(


Would this introduce too many log messages?

I don't think so.
The first two logs only happen when retry is not possible because of some unknown error. We need to know in this case.
The third one happens if Helix devs changed the Helix code in a strange way. It never happens with the existing PR code.
Overall, if everything works fine, we won't see any error messages.

About "Helix devs changed the Helix code in a strange way":
It would be great if you could identify some of the places where a code change can cause the issue and put comments there, so people will be careful about the changes they make later.

@alirezazamani I understand your concern, but there are many ways to make it happen, and I won't be able to comment on every case. Moreover, strictly speaking, this final check is not like an NPE check; it is not a failure case. It just means that the caller does not want a retry. This is not something we have to avoid; if the logic requires it, we shall still do it. It simply does not happen today, and it would be a little strange for the current logic. But we shall not discourage people from using it in a different way in the future. At that point, if this log becomes too verbose, we can downgrade or remove it.

alirezazamani

Good job. If we are sure that we will not get stuck retrying and will not cause a stack overflow (which I believe you already explained in the reviews), I don't see any issue with this PR. However, since this is a critical part of the code, I would suggest waiting for feedback from a few other members before merging.

The current async callback fails the operation and may eventually return partial results if any exception happens during the call. This change makes the ZkClient retry on temporary ZK connection issues (CONNECTIONLOSS, OPERATIONTIMEOUT, SESSIONEXPIRED, SESSIONMOVED), so it has a better chance of finishing the operation. Note that if the exception is due to business logic, the operation will still fail and the same return code will be sent to the callback handler.

jiajunwang · 2020-05-04T19:35:45Z

This PR is ready to be merged, approved by @alirezazamani @dasahcc
I also re-ran the helix-core and helix-rest tests; all passed.

…ssue. (#970)" This reverts commit 96ebb27.

…ection issue. (#970)"" This reverts commit 370e277.

kaisun2000 · 2020-05-05T06:24:50Z

zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java

    _operationRetryTimeoutInMillis = operationRetryTimeout;
    _isNewSessionEventFired = false;

+    _asyncCallRetryThread = new ZkAsyncRetryThread(zkConnection.getServers());


We should give this thread a name that can be tied to the ZkEvent thread name. That way, when we debug, we know the relation. Otherwise it would be very hard to correlate and reason about.

Good point, let me do it in a separate PR.

Sorry, I confused myself. The name is already given in this PR:
"ZkClient-AsyncCallback-Retry-" + getId() + "-" + name.

kaisun2000 · 2020-05-05T06:55:27Z

zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/callback/ZkAsyncCallbacks.java

+        case CONNECTIONLOSS:
+          /** The session has been expired by the server */
+        case SESSIONEXPIRED:
+          /** Session moved to another server, so operation is ignored */


These async calls are normally used for batch access from ZkBaseDataAccessor, I believe. Here, the idea is to not create ephemeral nodes because SESSIONEXPIRED can be retried. Then we should probably fail ephemeral node creation in async calls too, right?

I think we are not using the async call to create ephemeral nodes. But this might be a concern. If that becomes the case, let's make the same change for async call.

kaisun2000 · 2020-05-05T07:01:11Z

zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/callback/ZkAsyncCallbacks.java

-    int _rc = -1;
+  public static abstract class DefaultCallback implements CancellableZkAsyncCallback {
+    AtomicBoolean _isOperationDone = new AtomicBoolean(false);
+    int _rc = UNKNOWN_RET_CODE;


Why change this value from -1 to 255?

-1 is a valid error code.
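A short sketch of why the sentinel matters, under stated assumptions: `ReturnCodeSentinelSketch` and its methods are hypothetical, `SYSTEMERROR = -1` mirrors the value of ZooKeeper's `KeeperException.Code.SYSTEMERROR`, and 255 is the value mentioned in the question above. If the "no result yet" default were -1, a callback that actually finished with SYSTEMERROR would look the same as one that never finished.

```java
/** Hypothetical sketch; not the DefaultCallback class from the PR. */
class ReturnCodeSentinelSketch {
  // Mirrors the numeric value of ZooKeeper's KeeperException.Code.SYSTEMERROR.
  static final int SYSTEMERROR = -1;
  // Sentinel meaning "no result yet"; 255 is the value cited in the review thread.
  static final int UNKNOWN_RET_CODE = 255;

  private int rc = UNKNOWN_RET_CODE;

  /** Records the return code delivered by the (simulated) async call. */
  void complete(int returnCode) {
    rc = returnCode;
  }

  /** True once any real return code, including SYSTEMERROR, has been recorded. */
  boolean hasResult() {
    return rc != UNKNOWN_RET_CODE;
  }

  int getRc() {
    return rc;
  }
}
```

With this sentinel, `hasResult()` is false on a fresh instance and true after `complete(SYSTEMERROR)`, which would not be distinguishable if the initial value were -1.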

kaisun2000 · 2020-05-05T07:20:54Z

zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/callback/ZkAsyncCallbacks.java

+                  "Cannot request to retry the operation. The retry request thread may have been stopped.");
+            }
+          } catch (Throwable t) {
+            LOG.error("Failed to request to retry the operation.", t);


A retry is needed and the context is retriable, but the retry operation failed? What should we do here? Marking it done and returning a retriable RC value like CONNECTIONLOSS is not what the customer expects to handle, right?

It is, because retry is not possible. We have to return a result instead of leaving the caller pending forever.
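The fallback described in this reply can be sketched as follows. This is an illustration only, not the code under review: `RetryFallbackSketch`, `Callback`, `processResult`, and `demo` are hypothetical names, a shut-down `ExecutorService` stands in for a stopped retry thread, and the sketch catches `RuntimeException` where the PR's diff catches `Throwable`. When the retry cannot be scheduled, the callback completes with the original return code so the waiting caller is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of "complete with the error if the retry cannot be requested". */
class RetryFallbackSketch {
  static final int CONNECTIONLOSS = -4;   // mirrors KeeperException.Code.CONNECTIONLOSS
  static final int UNKNOWN_RET_CODE = 255; // "no result yet" sentinel

  static class Callback {
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile int rc = UNKNOWN_RET_CODE;

    void processResult(int returnCode, ExecutorService retryThread, Runnable retry) {
      if (returnCode == CONNECTIONLOSS) {
        try {
          retryThread.execute(retry);
          return; // Retry scheduled: stay pending until a later result arrives.
        } catch (RuntimeException e) {
          // For example RejectedExecutionException after shutdown: fall through
          // and finish with the original code instead of pending forever.
        }
      }
      rc = returnCode;
      done.countDown();
    }

    boolean waitForResult(long timeoutMs) {
      try {
        return done.await(timeoutMs, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }

    int getRc() {
      return rc;
    }
  }

  /** Delivers CONNECTIONLOSS to a callback whose retry executor is already stopped. */
  static int demo() {
    ExecutorService stopped = Executors.newSingleThreadExecutor();
    stopped.shutdown(); // Further execute() calls are rejected.
    Callback cb = new Callback();
    cb.processResult(CONNECTIONLOSS, stopped, () -> { });
    return cb.waitForResult(1000) ? cb.getRc() : UNKNOWN_RET_CODE;
  }
}
```

In this sketch the caller sees CONNECTIONLOSS as the final return code, which matches the trade-off stated above: a retriable code surfaces to the caller only when retrying is no longer possible.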

jiajunwang changed the title ~~[WIP] Add async call retry to resolve the transient ZK connection issue.~~ Add async call retry to resolve the transient ZK connection issue. Apr 25, 2020

jiajunwang force-pushed the retryOnAsync branch from 9a433bb to 10a8459 Compare April 25, 2020 23:52

jiajunwang requested a review from lei-xia April 28, 2020 05:41

junkaixue reviewed Apr 28, 2020

jiajunwang mentioned this pull request Apr 28, 2020

Enforce result check for data accessors batch get calls to prevent partial batch read. #974

Merged

7 tasks

alirezazamani approved these changes Apr 30, 2020

junkaixue approved these changes May 1, 2020

Jiajun Wang added 4 commits May 4, 2020 10:56

Cancel pending retries when ZkClient is closed.

7a8d4b1

Add test.

6a5e144

Refine the method definition.

0f5944b

jiajunwang force-pushed the retryOnAsync branch from 10a8459 to 0f5944b Compare May 4, 2020 18:08

jiajunwang merged commit 96ebb27 into apache:master May 4, 2020

jiajunwang deleted the retryOnAsync branch May 4, 2020 19:36

asfgit pushed a commit that referenced this pull request May 4, 2020

Revert "Add async call retry to resolve the transient ZK connection i…

370e277

…ssue. (#970)" This reverts commit 96ebb27.

asfgit pushed a commit that referenced this pull request May 5, 2020

Revert "Revert "Add async call retry to resolve the transient ZK conn…

eb4d99f

…ection issue. (#970)"" This reverts commit 370e277.

kaisun2000 reviewed May 5, 2020

jiajunwang mentioned this pull request May 5, 2020

We should give name of this thread that can be tied to the ZkEvent thread name. This way, when we debug it, we know the relation. Otherwise it would be very hard to correlate and reason. #997

Closed

huizhilu pushed a commit to huizhilu/helix that referenced this pull request Aug 16, 2020

Revert "Add async call retry to resolve the transient ZK connection i…

b30c013

…ssue. (apache#970)" This reverts commit 96ebb27.

huizhilu pushed a commit to huizhilu/helix that referenced this pull request Aug 16, 2020

Revert "Revert "Add async call retry to resolve the transient ZK conn…

cac0b8f

…ection issue. (apache#970)"" This reverts commit 370e277.
