Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition when 2 segment upload occurred for the same segment #9905

Merged
merged 1 commit into from Dec 7, 2022

Conversation

jackjlli
Copy link
Contributor

@jackjlli jackjlli commented Dec 2, 2022

There is a race condition when it took too much time for the 1st segment upload to process (due to slow PinotFS access), which leads to the 2nd attempt of segment upload, and the 2nd segment upload succeeded. In this case, when the 1st upload comes back, it shouldn't blindly delete the segment when it failed to update the zk metadata. Instead, the 1st attempt should validate the upload start time one more time. If the upload start time doesn't match with the one persisted in zk metadata, segment deletion should be skipped.

@codecov-commenter
Copy link

codecov-commenter commented Dec 2, 2022

Codecov Report

Merging #9905 (7960c20) into master (173916e) will decrease coverage by 0.01%.
The diff coverage is 52.38%.

@@             Coverage Diff              @@
##             master    #9905      +/-   ##
============================================
- Coverage     70.46%   70.44%   -0.02%     
+ Complexity     5535     4983     -552     
============================================
  Files          1982     1982              
  Lines        106449   106460      +11     
  Branches      16131    16134       +3     
============================================
- Hits          75006    74996      -10     
- Misses        26213    26218       +5     
- Partials       5230     5246      +16     
Flag Coverage Δ
integration1 25.03% <9.52%> (-0.20%) ⬇️
integration2 24.57% <9.52%> (+0.07%) ⬆️
unittests1 67.94% <ø> (-0.01%) ⬇️
unittests2 15.84% <52.38%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../controller/helix/core/SegmentDeletionManager.java 73.58% <ø> (-0.17%) ⬇️
...apache/pinot/controller/api/upload/ZKOperator.java 69.66% <52.38%> (-0.22%) ⬇️
...nction/DistinctCountBitmapAggregationFunction.java 41.45% <0.00%> (-11.40%) ⬇️
...perator/filter/SortedIndexBasedFilterOperator.java 85.71% <0.00%> (-6.67%) ⬇️
...mmon/request/context/predicate/NotInPredicate.java 87.50% <0.00%> (-6.25%) ⬇️
...lix/core/realtime/PinotRealtimeSegmentManager.java 75.39% <0.00%> (-5.24%) ⬇️
...l/segment/index/readers/OnHeapFloatDictionary.java 86.36% <0.00%> (-4.55%) ⬇️
...lter/predicate/NotInPredicateEvaluatorFactory.java 73.94% <0.00%> (-4.23%) ⬇️
...ct/raw/RawIntSingleColumnDistinctOnlyExecutor.java 81.48% <0.00%> (-3.71%) ⬇️
...re/accounting/PerQueryCPUMemAccountantFactory.java 66.77% <0.00%> (-2.38%) ⬇️
... and 23 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jackjlli jackjlli force-pushed the fix-segment-upload-race-condition branch 2 times, most recently from 952bba9 to 52d21d3 Compare December 6, 2022 07:48
@Jackie-Jiang
Copy link
Contributor

Is this an attempt to fix the flakiness in ZKOperatorTest? If so, the root cause is actually a Helix bug, where multiple IS changes is causing the last change being ignored. More context here: #9921

I don't fully follow this race condition. Seems it is within the parallel push protection, where only one thread should be able to push the segment

@jackjlli
Copy link
Contributor Author

jackjlli commented Dec 7, 2022

Seems it is within the parallel push protection, where only one thread should be able to push the segment.

Let me explain a bit here. Yes, there should be only one thread that pushes the segment on the client side. While it could happen that the 1st attempt spends too much time (e.g. due to very slow access to PinotFS) on uploading segment, which made the thread gave up its 1st attempt and retry the segment upload (2nd attempt), and the 2nd attempt succeeded. When the 1st attempt finally finished its work, it turned out that it failed to update the ZK metadata any more.

Since the client side has already given up the 1st attempt (which leads to the 2nd attempt), the 1st segment upload shouldn't blindly delete the segment. Instead, the 1st controller should validate the crc one more time, and if crc remains the same, segment deletion should be skipped.

@jackjlli
Copy link
Contributor Author

jackjlli commented Dec 7, 2022

@Jackie-Jiang This PR doesn't intend to fix the ZKOperatorTest but it faces the same test failure on ZKOperatorTest in Github Actions. That's why I tried to see if it can be fixed in this same PR. I can remove the code change for ZKOperatorTest once your PR is merged to master branch.

@jackjlli jackjlli force-pushed the fix-segment-upload-race-condition branch from 52d21d3 to 8900ac3 Compare December 7, 2022 01:35
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@jackjlli jackjlli force-pushed the fix-segment-upload-race-condition branch from 8900ac3 to 9c83694 Compare December 7, 2022 06:50
@jackjlli jackjlli force-pushed the fix-segment-upload-race-condition branch from 9c83694 to 7960c20 Compare December 7, 2022 06:59
@jackjlli jackjlli merged commit edffe1a into master Dec 7, 2022
@jackjlli jackjlli deleted the fix-segment-upload-race-condition branch December 7, 2022 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants