Add async call retry to resolve the transient ZK connection issue. #970
jiajunwang merged 4 commits into apache:master
Conversation
9a433bb to 10a8459
| data == null ? 0 : data.length, false) {
|   @Override
|   protected void doRetry() {
|     doAsyncSetData(path, data, version, System.currentTimeMillis(), cb);
Is this recursive self-call OK?
This is the design. If there is no connectivity issue, the retry won't be triggered. The assumption here is that the connectivity issue is transient and won't happen continuously.
I am not quite familiar with this pattern. I just wonder whether it could cause an infinite call chain and then a stack overflow?
If I say it won't, I guess you won't believe it so easily. Please take a look at the code carefully. In general, it is not a recursive call; it is a callback triggered on a different thread after this method is done. Even if we keep retrying, only one call exists on the stack.
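The point above can be demonstrated with a minimal, self-contained sketch (not the Helix code): each "retry" re-submits the operation to a dedicated retry thread's queue and returns, so the caller's frame pops before the retry runs and the stack never grows, even across thousands of retries.

```java
import java.util.concurrent.*;

// Minimal sketch of callback-driven retry: re-submitting to a single-threaded
// executor instead of calling the method directly keeps stack depth bounded.
public class CallbackRetrySketch {
    static final ExecutorService RETRY_THREAD = Executors.newSingleThreadExecutor();

    static void doAsyncOp(int attemptsLeft, CompletableFuture<Integer> result) {
        if (attemptsLeft > 0) {
            // "doRetry": hand the operation to the retry thread and return.
            // This frame is gone before the next attempt starts.
            RETRY_THREAD.submit(() -> doAsyncOp(attemptsLeft - 1, result));
        } else {
            // Report the stack depth observed on the final attempt.
            result.complete(Thread.currentThread().getStackTrace().length);
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Integer> result = new CompletableFuture<>();
        // 10,000 "retries" would overflow the stack if this were real recursion;
        // here the final stack depth stays small and constant.
        doAsyncOp(10_000, result);
        System.out.println("final stack depth: " + result.get());
        RETRY_THREAD.shutdown();
    }
}
```

A direct recursive call here would accumulate 10,000 frames; the executor-based version reports a depth of only the executor's own few frames.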
|     LOG.error("Failed to request to retry the operation.", t);
|   }
| } else {
|   LOG.warn(
Would this introduce too many log messages?
I don't think so.
The first two logs only happen when retry is not possible because of some unknown error. We need to know in this case.
The third one happens only if Helix devs change the Helix code in a strange way. It never happens with the code in this PR.
Overall, if everything works fine, we won't see any error messages.
About "Helix devs changed the Helix code in a strange way":
It would be great if you could identify some of the places where a code change can cause the issue and put comments in those parts, so people will be careful about the changes they make later.
@alirezazamani I understand your concern, but there are many ways to make it happen, and I won't be able to comment on every case. Moreover, strictly speaking, this final check is not like an NPE check; it is not a failure case. It just means that the caller does not want a retry. That is not something we have to avoid; if the logic requires it, we shall still do it. It simply does not happen today, and would be a little strange given the current logic. But we should not discourage people from using it in a different way in the future. If this log becomes too verbose at that point, we can downgrade or remove it.
alirezazamani left a comment
Good job. If we are sure that we will not be stuck retrying and will not cause a stack overflow (which I believe you already explained in the reviews), I don't see any issue with this PR. However, since this is a critical part of the code, I would suggest waiting for feedback from a few other members before merging.
The current async callback will fail the operation and may eventually return partial results if any exceptions happen during the call. This change makes the ZkClient retry on temporary ZK connection issues (CONNECTIONLOSS, OPERATIONTIMEOUT, SESSIONEXPIRED, SESSIONMOVED), so it has a larger chance to finish the operation when possible. Note that if the exception is due to business logic, the operation will still fail and the same return code will be sent to the callback handler.
This PR is ready to be merged, approved by @alirezazamani and @dasahcc.
| _operationRetryTimeoutInMillis = operationRetryTimeout;
| _isNewSessionEventFired = false;
|
| _asyncCallRetryThread = new ZkAsyncRetryThread(zkConnection.getServers());
We should give this thread a name that can be tied to the ZkEvent thread name. That way, when we debug, we know the relation between the two; otherwise it would be very hard to correlate and reason about.
Good point, let me do it in a separate PR.
Sorry, I confused myself. The name is already given in this PR:
"ZkClient-AsyncCallback-Retry-" + getId() + "-" + name.
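To illustrate the naming convention, here is a tiny sketch; `getId()` and `name` are stand-ins for the ZkClient instance id and ZK server string used in the real code, and the helper below is hypothetical.

```java
// Hypothetical sketch: naming the async-retry thread so it can be correlated
// with the owning ZkClient when reading thread dumps or logs.
public class RetryThreadNaming {
    // clientId and servers stand in for getId() and the ZK server address string.
    static String retryThreadName(long clientId, String servers) {
        return "ZkClient-AsyncCallback-Retry-" + clientId + "-" + servers;
    }

    public static void main(String[] args) {
        Thread t = new Thread(() -> { }, retryThreadName(7, "zk-host:2181"));
        System.out.println(t.getName()); // ZkClient-AsyncCallback-Retry-7-zk-host:2181
    }
}
```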
| case CONNECTIONLOSS:
|   /** The session has been expired by the server */
| case SESSIONEXPIRED:
|   /** Session moved to another server, so operation is ignored */
These async calls are normally for batch access from ZkBaseDataAccessor, I believe. Here, the idea is to not create ephemeral nodes, because SESSIONEXPIRED can be retried. Then we should probably fail ephemeral node creation in async calls too, right?
I think we are not using the async call to create ephemeral nodes. But this might be a concern. If that becomes the case, let's make the same change for the async call.
| int _rc = -1;
| public static abstract class DefaultCallback implements CancellableZkAsyncCallback {
|   AtomicBoolean _isOperationDone = new AtomicBoolean(false);
|   int _rc = UNKNOWN_RET_CODE;
Why change this value from -1 to 255?
-1 is a valid error code.
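To expand on why -1 collides: ZooKeeper's `KeeperException.Code` enum assigns -1 to SYSTEMERROR (an assumption worth verifying against the ZooKeeper version in use), so -1 cannot double as a "no result yet" sentinel. The sketch below mirrors a few of those values; `UNKNOWN_RET_CODE = 255` is the PR's choice of a value outside the range ZooKeeper uses.

```java
// Sketch of the sentinel collision: the constants mirror ZooKeeper's
// KeeperException.Code integer values (OK = 0, SYSTEMERROR = -1,
// CONNECTIONLOSS = -4), which is why -1 is a real, meaningful error code.
public class RetCodeSentinel {
    static final int OK = 0;
    static final int SYSTEMERROR = -1;     // valid ZK error code; unusable as sentinel
    static final int CONNECTIONLOSS = -4;  // transient, retryable
    static final int UNKNOWN_RET_CODE = 255; // safe "not yet set" sentinel

    // True if rc is one of the (abbreviated) real ZK return codes above.
    static boolean isRealCode(int rc) {
        return rc == OK || rc == SYSTEMERROR || rc == CONNECTIONLOSS;
    }

    public static void main(String[] args) {
        System.out.println(isRealCode(-1));               // true: -1 means SYSTEMERROR
        System.out.println(isRealCode(UNKNOWN_RET_CODE)); // false: safe sentinel
    }
}
```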
|     "Cannot request to retry the operation. The retry request thread may have been stopped.");
|   }
| } catch (Throwable t) {
|   LOG.error("Failed to request to retry the operation.", t);
A retry is needed and the context is retriable, but the retry operation itself failed? What do we do here? Marking it done and returning some retriable RC value like CONNECTIONLOSS is not what the customer expects to handle, right?
It is, because retry is not possible in that case, so we have to return with a result instead of leaving the caller pending forever.
…pache#970) If any exceptions happen during the async call, the current design will fail the operation and may eventually return a partial result. This change makes the ZkClient retry operation if the error is because of a temporary ZK connection issue (CONNECTIONLOSS, SESSIONEXPIRED, SESSIONMOVED). So the async call has a larger chance to finish the operation. Note that if the exception is due to business logic, the async call will still fail and the right return code will be sent to the callback handler.
…ssue. (apache#970)" This reverts commit 96ebb27.
…ection issue. (apache#970)"" This reverts commit 370e277.
Issues
#969
Description
If any exceptions happen during the async call, the current design will fail the operation and may eventually return a partial result.
This change makes the ZkClient retry operation if the error is because of a temporary ZK connection issue (CONNECTIONLOSS, SESSIONEXPIRED, SESSIONMOVED).
So the async call has a larger chance to finish the operation. Note that if the exception is due to business logic, the async call will still fail and the right return code will be sent to the callback handler.
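The retry decision described above can be sketched as a classification over return codes. This is a minimal, self-contained illustration, not the Helix implementation; the enum mirrors ZooKeeper's `KeeperException.Code` names, and note that OPERATIONTIMEOUT appears in one description of this PR but not the other, so its inclusion here is an assumption.

```java
// Sketch of the retry decision: transient connectivity failures are retried,
// while business-logic failures (e.g. NONODE, BADVERSION) are passed through
// to the callback handler with their original return code.
public class RetryDecisionSketch {
    enum Code {
        OK, CONNECTIONLOSS, OPERATIONTIMEOUT, SESSIONEXPIRED, SESSIONMOVED,
        NONODE, BADVERSION
    }

    static boolean needRetry(Code rc) {
        switch (rc) {
            case CONNECTIONLOSS:
            case OPERATIONTIMEOUT:  // assumption: listed in one PR description only
            case SESSIONEXPIRED:
            case SESSIONMOVED:
                return true;        // transient connection issue: retry the call
            default:
                return false;       // success or business-logic failure: report as-is
        }
    }

    public static void main(String[] args) {
        System.out.println(needRetry(Code.CONNECTIONLOSS)); // true
        System.out.println(needRetry(Code.NONODE));         // false
    }
}
```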
Tests
TestZkClientAsyncRetry
Kept running it 75 times to ensure it's stable.
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.558 s - in org.apache.helix.zookeeper.impl.client.TestZkClientAsyncRetry
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.896 s
[INFO] Finished at: 2020-04-25T15:56:37-07:00
[INFO] ------------------------------------------------------------------------
//======================================================================
Attempt 75 TestZkClientAsyncRetry
//======================================================================
zookeeper-api
[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.118 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13.684 s
[INFO] Finished at: 2020-04-25T16:20:32-07:00
[INFO] ------------------------------------------------------------------------
helix-core
[INFO] Tests run: 1144, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4,621.077 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 1144, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:17 h
[INFO] Finished at: 2020-04-27T12:34:24-07:00
[INFO] ------------------------------------------------------------------------
helix-rest
[INFO] Tests run: 159, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 193.83 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 159, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:19 min
[INFO] Finished at: 2020-04-25T16:51:55-07:00
[INFO] ------------------------------------------------------------------------
Commits