Skip to content

Add async call retry to resolve the transient ZK connection issue.#970

Merged
jiajunwang merged 4 commits intoapache:masterfrom
jiajunwang:retryOnAsync
May 4, 2020
Merged

Add async call retry to resolve the transient ZK connection issue.#970
jiajunwang merged 4 commits intoapache:masterfrom
jiajunwang:retryOnAsync

Conversation

@jiajunwang
Copy link
Contributor

@jiajunwang jiajunwang commented Apr 23, 2020

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

#969

Description

  • Here are some details about my PR, including screenshots of any UI changes:

If any exceptions happen during the async call, the current design will fail the operation and may eventually return a partial result.
This change makes the ZkClient retry operation if the error is because of a temporary ZK connection issue (CONNECTIONLOSS, SESSIONEXPIRED, SESSIONMOVED).
So the async call has a larger chance to finish the operation. Note that if the exception is due to business logic, the async call will still fail and the right return code will be sent to the callback handler.

Tests

  • The following tests are written for this issue:

TestZkClientAsyncRetry

Keep running 75 times to ensure it's stable.
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.558 s - in org.apache.helix.zookeeper.impl.client.TestZkClientAsyncRetry
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.896 s
[INFO] Finished at: 2020-04-25T15:56:37-07:00
[INFO] ------------------------------------------------------------------------
//======================================================================
Attempt 75 TestZkClientAsyncRetry
//======================================================================

  • The following is the result of the "mvn test" command on the appropriate module:

zookeeper-api

[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.118 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13.684 s
[INFO] Finished at: 2020-04-25T16:20:32-07:00
[INFO] ------------------------------------------------------------------------

helix-core

[INFO] Tests run: 1144, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4,621.077 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 1144, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:17 h
[INFO] Finished at: 2020-04-27T12:34:24-07:00
[INFO] ------------------------------------------------------------------------

helix-rest

[INFO] Tests run: 159, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 193.83 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 159, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:19 min
[INFO] Finished at: 2020-04-25T16:51:55-07:00
[INFO] ------------------------------------------------------------------------

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@jiajunwang jiajunwang changed the title [WIP] Add async call retry to resolve the transient ZK connection issue. Add async call retry to resolve the transient ZK connection issue. Apr 25, 2020
@jiajunwang jiajunwang requested a review from lei-xia April 28, 2020 05:41
data == null ? 0 : data.length, false) {
@Override
protected void doRetry() {
doAsyncSetData(path, data, version, System.currentTimeMillis(), cb);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this recursively self calling OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the design. If no connectivity issue, it won't be triggered. The assumption here is that the connectivity issue is transient and won't happen continuously.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite familiar with the way you use. I just wonder would that cause infinite call then stack overflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I say it won't, I guess you won't believe it so easily. Please take a look at the code carefully. In general, it is not a recursive call, it is a callback triggered in a different thread after this method is done. Even we keep retrying, only one call exists in the stack.

LOG.error("Failed to request to retry the operation.", t);
}
} else {
LOG.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be to many introduced for log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so.
The first two logs only happen when retry is not possible because of some unknown error. We need to know in this case.
The third one happens if Helix devs changed the Helix code in a strange way. It never happens with the existing PR code.
Overall, if everything works fine, we won't see any error messages.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about "Helix devs changed the Helix code in a strange way":
It would be great if you can identify some of the places that code change can cause the issue and put some comments in that parts? so people will be careful about their changes that they are making later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alirezazamani I understand your concern. But there are many many ways to make it happen. I won't be able to comment about every case. Moreover, strictly speaking, this final check is not like an NPE check, this is not a failure case. This is just meaning that the caller does not want a retry. And this is not something we have to avoid, if logic requires, we shall still do it. It is just not happening, and a little bit strang for the current logic. But we shall not discourage people to use it in a different way in the future. At that moment, if this log becomes too verbose, we can downgrade it or remove it.

Copy link

@alirezazamani alirezazamani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job. If we are sure that we will not be stuck in retying and we are not causing stack overflow (which I believe you already explained the reason in the reviews), I don't see any issue with this PR. However, since this is critical part of the code, I would suggest to wait to get feedbacks from few other members before merging.

Jiajun Wang added 4 commits May 4, 2020 10:56
The current asyn callback will fail the operation and may return partial results eventually, if any exceptions happen during the call.
This change will make the ZkClient retry on the temporary ZK connection issues (CONNECTIONLOSS, OPERATIONTIMEOUT. SESSIONEXPIRED, SESSIONMOVED). So it has a larger chance to finish the operation if possible.
Note that if the exception is due to business logic, the operation will still fail and the same return code will be sent to the callback handler.
@jiajunwang
Copy link
Contributor Author

This PR is ready to be merged, approved by @alirezazamani @dasahcc
Also, re-run the helix-core and helix-rest tests, all passed.

@jiajunwang jiajunwang merged commit 96ebb27 into apache:master May 4, 2020
@jiajunwang jiajunwang deleted the retryOnAsync branch May 4, 2020 19:36
asfgit pushed a commit that referenced this pull request May 4, 2020
asfgit pushed a commit that referenced this pull request May 5, 2020
_operationRetryTimeoutInMillis = operationRetryTimeout;
_isNewSessionEventFired = false;

_asyncCallRetryThread = new ZkAsyncRetryThread(zkConnection.getServers());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should give name of this thread that can be tied to the ZkEvent thread name. This way, when we debug it, we know the relation. Otherwise it would be very hard to correlate and reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, let me do it in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was confused by myself. Name already given in this PR.
"ZkClient-AsyncCallback-Retry-" + getId() + "-" + name.

case CONNECTIONLOSS:
/** The session has been expired by the server */
case SESSIONEXPIRED:
/** Session moved to another server, so operation is ignored */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These Aync call is normally for batch access from ZkBaseDataAccessor I believe. Here, the idea is to not create ephemeral nodes because SESSIONEXPIRED can be retry. Then we should probably fail ephemeral code creating asyncly too, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are not using the async call to create ephemeral nodes. But this might be a concern. If that becomes the case, let's make the same change for async call.

int _rc = -1;
public static abstract class DefaultCallback implements CancellableZkAsyncCallback {
AtomicBoolean _isOperationDone = new AtomicBoolean(false);
int _rc = UNKNOWN_RET_CODE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change this value from -1 to 255?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 is a valid error code.

"Cannot request to retry the operation. The retry request thread may have been stopped.");
}
} catch (Throwable t) {
LOG.error("Failed to request to retry the operation.", t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need retry, retri-able context, but retry operation failed? What to do here? Mark done, return some retriable RC value like CONNECTIONLOSS is not what the customer expect to handle right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. Because retry is not possible. So we have to return with a result instead of letting the caller pending forever.

huizhilu pushed a commit to huizhilu/helix that referenced this pull request Aug 16, 2020
…pache#970)

If any exceptions happen during the async call, the current design will fail the operation and may eventually return a partial result.
This change makes the ZkClient retry operation if the error is because of a temporary ZK connection issue (CONNECTIONLOSS, SESSIONEXPIRED, SESSIONMOVED).
So the async call has a larger chance to finish the operation. Note that if the exception is due to business logic, the async call will still fail and the right return code will be sent to the callback handler.
huizhilu pushed a commit to huizhilu/helix that referenced this pull request Aug 16, 2020
huizhilu pushed a commit to huizhilu/helix that referenced this pull request Aug 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants