Fix race condition when 2 segment upload occurred for the same segment #9905
Codecov Report
@@ Coverage Diff @@
## master #9905 +/- ##
============================================
- Coverage 70.46% 70.44% -0.02%
+ Complexity 5535 4983 -552
============================================
Files 1982 1982
Lines 106449 106460 +11
Branches 16131 16134 +3
============================================
- Hits 75006 74996 -10
- Misses 26213 26218 +5
- Partials 5230 5246 +16
Is this an attempt to fix the flakiness in ZKOperatorTest? If so, the root cause is actually a Helix bug, where multiple IS changes cause the last change to be ignored. More context here: #9921. I don't fully follow this race condition. It seems to be within the parallel push protection, where only one thread should be able to push the segment.
Let me explain a bit here. Yes, only one thread should push the segment on the client side. However, the 1st attempt can spend too much time uploading the segment (e.g. due to very slow access to PinotFS), so the client gives up on it and retries the segment upload (2nd attempt), and the 2nd attempt succeeds. When the 1st attempt finally finishes its work, it finds that it can no longer update the ZK metadata. Since the client side has already given up on the 1st attempt (which is what led to the 2nd attempt), the 1st segment upload shouldn't blindly delete the segment. Instead, the controller handling the 1st attempt should validate the crc one more time, and if the crc remains the same, segment deletion should be skipped.
@Jackie-Jiang This PR doesn't intend to fix the |
LGTM otherwise
pinot-controller/src/test/java/org/apache/pinot/controller/api/upload/ZKOperatorTest.java
pinot-controller/src/main/java/org/apache/pinot/controller/api/upload/ZKOperator.java
There is a race condition when the 1st segment upload takes too long to process (due to slow PinotFS access): the client makes a 2nd upload attempt for the same segment, and the 2nd upload succeeds. In this case, when the 1st upload comes back and fails to update the ZK metadata, it shouldn't blindly delete the segment. Instead, the 1st attempt should validate the upload start time one more time. If its upload start time doesn't match the one persisted in the ZK metadata, segment deletion should be skipped.
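The check described above can be illustrated with a minimal, self-contained sketch. It is not the actual ZKOperator code: `SegmentRecord`, `MetadataStore`, `writeIfVersionMatches` and `shouldDeleteAfterFailedUpdate` are hypothetical names, the in-memory map merely stands in for ZK with version-checked writes, and treating a missing record as "skip deletion" is an assumption made for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

class UploadRaceSketch {

  /** Hypothetical stand-in for a segment's ZK metadata; not Pinot's real class. */
  static final class SegmentRecord {
    final long uploadStartTimeMs;
    final int version;

    SegmentRecord(long uploadStartTimeMs, int version) {
      this.uploadStartTimeMs = uploadStartTimeMs;
      this.version = version;
    }
  }

  /** Hypothetical in-memory store with version-checked writes, standing in for ZK. */
  static final class MetadataStore {
    private final Map<String, SegmentRecord> records = new HashMap<>();

    synchronized SegmentRecord get(String segmentName) {
      return records.get(segmentName);
    }

    /** Writes only if the stored version equals expectedVersion (-1 means "no record yet"). */
    synchronized boolean writeIfVersionMatches(String segmentName, long uploadStartTimeMs,
        int expectedVersion) {
      SegmentRecord current = records.get(segmentName);
      int currentVersion = current == null ? -1 : current.version;
      if (currentVersion != expectedVersion) {
        return false;
      }
      records.put(segmentName, new SegmentRecord(uploadStartTimeMs, currentVersion + 1));
      return true;
    }
  }

  /**
   * After a failed metadata update, delete the segment only if the persisted record
   * still carries this attempt's own upload start time. A different start time means
   * another attempt owns the segment, so deletion is skipped. A missing record is
   * also treated as "skip" (an assumption of this sketch).
   */
  static boolean shouldDeleteAfterFailedUpdate(SegmentRecord persisted, long ownUploadStartTimeMs) {
    return persisted != null && persisted.uploadStartTimeMs == ownUploadStartTimeMs;
  }

  /** Replays the scenario: a slow 1st attempt is overtaken by a successful 2nd attempt. */
  static boolean firstAttemptWouldDelete() {
    MetadataStore store = new MetadataStore();
    String segment = "mySegment";
    long firstStartMs = 1000L;
    long secondStartMs = 2000L;

    // Both attempts observed "no record yet" before doing their (slow) work.
    int versionSeenByFirst = -1;
    int versionSeenBySecond = -1;

    // The 2nd attempt finishes first and persists its metadata.
    store.writeIfVersionMatches(segment, secondStartMs, versionSeenBySecond);

    // The 1st attempt comes back late; its write is rejected because the version moved on.
    boolean firstUpdated = store.writeIfVersionMatches(segment, firstStartMs, versionSeenByFirst);

    // Instead of deleting blindly, re-check the persisted upload start time.
    return !firstUpdated && shouldDeleteAfterFailedUpdate(store.get(segment), firstStartMs);
  }

  public static void main(String[] args) {
    System.out.println("1st attempt deletes segment: " + firstAttemptWouldDelete());
  }
}
```

In this replay the late 1st attempt sees the 2nd attempt's start time in the store, so it leaves the segment alone. If its own start time were still the persisted one, the failure would belong to the attempt that owns the segment, and cleanup would proceed.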