Fix ZkBucketDataAccessor failure due to concurrent modification.#1107
Merged
jiajunwang merged 2 commits intoapache:masterfrom Jun 23, 2020
Merged
Fix ZkBucketDataAccessor failure due to concurrent modification.#1107jiajunwang merged 2 commits intoapache:masterfrom
jiajunwang merged 2 commits intoapache:masterfrom
Conversation
The concurrent modification causes two issues. 1. Regular GC task fails due to concurrent list modification and the stale versions are not removed at all. 2. If, by coincident, there is newer version in the list other then the current version, then because of the modification of the list inside the loop, the final element (the newer version) won't be filtered but being left in the to-be-removed list. Then the GC task removes the most recent version. For example, a) Input, current version "2" b) Children = [1, 2, 3] c) The task avoids checking "2", so the list for loop is: [1, 3] d) When check "1", it is removed from the list. So the list becomes [3]. Then the loop ends, because the first item has already been looped from the for iteration perspective. e) The version to be removed is "3"! This PR fix the issue by avoiding concurrent modification. Also it simplies the logic so as to reduce the pending GC tasks. The test is also updated accordingly.
narendly
reviewed
Jun 20, 2020
helix-core/src/test/java/org/apache/helix/manager/zk/TestZkBucketDataAccessor.java
Outdated
Show resolved
Hide resolved
narendly
approved these changes
Jun 20, 2020
narendly
reviewed
Jun 20, 2020
Comment on lines
+334
to
+336
| if (_gcTaskFuture != null) { | ||
| _gcTaskFuture.cancel(false); | ||
| } |
Contributor
There was a problem hiding this comment.
Do we really need to cancel? What if that causes incomplete deletion of stale versions? I think we should let it queue up?
Contributor
Author
There was a problem hiding this comment.
This cancel will only cancel the pending tasks. For the pending tasks that are queued up, they will only cause extra children list read. The later operation will be the same. And if there is risk of incomplete deleting, then that concern lives no matter we let it queued up or not. That issue (if exists) should be addressed separately.
narendly
approved these changes
Jun 23, 2020
Contributor
Author
|
This PR is ready to be merged, approved by @narendly |
huizhilu
pushed a commit
to huizhilu/helix
that referenced
this pull request
Aug 16, 2020
…che#1107) Concurrent modification causes two issues. 1. Regular GC task fails due to concurrent list modification and the stale versions are not removed at all. 2. If, by coincident, there is a newer version in the list other then the current version, then because of the modification of the list inside the loop, the final element (the newer version) won't be filtered but being left in the to-be-removed list. Then the GC task removes the most recent version. For example, a) Input, current version "2" b) Children = [1, 2, 3] c) The task avoids checking "2", so the list for loop is: [1, 3] d) When checking "1", it is removed from the list. So the list becomes [3]. Then the loop ends, because the first item has already been looped from the for iteration perspective. e) The version to be removed is "3"! This PR fixes the issue by avoiding concurrent modification. Also, it simplifies the logic so as to reduce the pending GC tasks. The test is also updated accordingly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issues
#767
Description
Concurrent modification causes two issues.
a) Input, current version "2"
b) Children = [1, 2, 3]
c) The task avoids checking "2", so the list for loop is: [1, 3]
d) When checking "1", it is removed from the list. So the list becomes [3]. Then the loop ends, because the first item has already been looped from the for iteration perspective.
e) The version to be removed is "3"!
This PR fixes the issue by avoiding concurrent modifications. Also, it simplifies the logic so as to reduce the pending GC tasks.
The test is also updated accordingly.
Tests
TestZkBucketDataAccessor
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR] TestJobQueueCleanUp.testJobQueueAutoCleanUp:111 Sets differ: expected [testJobQueueAutoCleanUp_JOB8, testJobQueueAutoCleanUp_JOB7, testJobQueueAutoCleanUp_JOB6, testJobQueueAutoCleanUp_JOB5, testJobQueueAutoCleanUp_JOB9] but got [testJobQueueAutoCleanUp_JOB4, testJobQueueAutoCleanUp_JOB8, testJobQueueAutoCleanUp_JOB7, testJobQueueAutoCleanUp_JOB6, testJobQueueAutoCleanUp_JOB5, testJobQueueAutoCleanUp_JOB9]
[INFO]
[ERROR] Tests run: 1145, Failures: 1, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:19 h
[INFO] Finished at: 2020-06-19T19:27:46-07:00
[INFO] ------------------------------------------------------------------------
Rerun
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.293 s - in org.apache.helix.integration.task.TestJobQueueCleanUp
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21.621 s
[INFO] Finished at: 2020-06-19T21:53:23-07:00
[INFO] ------------------------------------------------------------------------
Commits
Documentation (Optional)
(Link the GitHub wiki you added)
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)