Skip to content

Fix thread safety issue and add cache to EmptySegmentPruner#7828

Merged
Jackie-Jiang merged 2 commits intoapache:masterfrom
Jackie-Jiang:empty_segment_pruner_cache
Nov 29, 2021
Merged

Fix thread safety issue and add cache to EmptySegmentPruner#7828
Jackie-Jiang merged 2 commits intoapache:masterfrom
Jackie-Jiang:empty_segment_pruner_cache

Conversation

@Jackie-Jiang
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang commented Nov 25, 2021

Currently EmptySegmentPruner might have race condition where _emptySegments can be modified and accessed at the same time. Fix it by using a concurrent set for _emptySegments.
Also add cache to the pruner so that no need to recalculate the result given the same input segments.

@Jackie-Jiang Jackie-Jiang requested a review from npawar November 25, 2021 00:17
@richardstartin
Copy link
Member

Why not use a Sets.newConcurrentHashSet() rather than copy on write? That would simplify this a lot.

@codecov-commenter
Copy link

codecov-commenter commented Nov 25, 2021

Codecov Report

Merging #7828 (0b26ea9) into master (1e25962) will decrease coverage by 0.02%.
The diff coverage is 84.37%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7828      +/-   ##
============================================
- Coverage     71.67%   71.64%   -0.03%     
  Complexity     4089     4089              
============================================
  Files          1579     1579              
  Lines         80843    80851       +8     
  Branches      12010    12017       +7     
============================================
- Hits          57944    57929      -15     
- Misses        18995    19018      +23     
  Partials       3904     3904              
Flag Coverage Δ
integration1 29.29% <31.25%> (-0.17%) ⬇️
integration2 27.92% <31.25%> (-0.12%) ⬇️
unittests1 68.66% <ø> (+<0.01%) ⬆️
unittests2 14.62% <84.37%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...oker/routing/segmentpruner/EmptySegmentPruner.java 91.37% <84.37%> (-8.63%) ⬇️
...data/manager/realtime/SegmentCommitterFactory.java 64.70% <0.00%> (-29.42%) ⬇️
...er/api/resources/LLCSegmentCompletionHandlers.java 53.46% <0.00%> (-8.92%) ⬇️
...altime/ServerSegmentCompletionProtocolHandler.java 49.52% <0.00%> (-8.58%) ⬇️
...e/data/manager/realtime/SplitSegmentCommitter.java 53.84% <0.00%> (-7.70%) ⬇️
...nction/DistinctCountBitmapAggregationFunction.java 47.66% <0.00%> (-7.26%) ⬇️
.../pinot/core/query/scheduler/PriorityScheduler.java 80.82% <0.00%> (-2.74%) ⬇️
...ot/common/protocols/SegmentCompletionProtocol.java 93.80% <0.00%> (-1.91%) ⬇️
...r/validation/RealtimeSegmentValidationManager.java 74.28% <0.00%> (-1.43%) ⬇️
...a/org/apache/pinot/core/common/DataBlockCache.java 96.21% <0.00%> (-0.76%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e25962...0b26ea9. Read the comment docs.

@Jackie-Jiang
Copy link
Contributor Author

@richardstartin I actually considered that, and decided to go with the snapshot approach because we want to optimize the query runtime method as much as possible

@richardstartin
Copy link
Member

richardstartin commented Nov 25, 2021

I actually considered that, and decided to go with the snapshot approach because we want to optimize the query runtime method as much as possible

I don’t think this is likely to be an optimisation unless the removeAll call is made frequently and the number of empty segments is large.

I think when an optimisation makes the code much harder to read we need 4 bits of data:

  1. evidence that there is a problem - e.g. wall time thresholded tracing events around EmptySegmentPruner.prune, or a sampling profiler reporting lots of samples in this method
  2. evidence that the thing being optimised causes the problem - assuming we have data that there is a problem, is it the removeAll call (which using a HashSet may optimise relative to a concurrent variant) or the copy on line 151 which is the cause?
  3. evidence the problem is common - e.g. statistics about number of empty segments from a real cluster, do we have a metric for this?
  4. evidence the optimisation is effective - for widely observed set sizes, does it actually make a difference to have a HashSet instead of a concurrent HashSet?

I raise this only because the code is much harder to read than if it used a threadsafe Set. If it were as readable I would be inclined to assume that the optimisation is effective.


// Return the cached result when the input is the same reference
ResultCache resultCache = _resultCache;
if (resultCache != null && resultCache._inputSegments == segments) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use resultCache != null && resultCache._inputSegments.equals(segments) here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the COW mechanism, it's checking that the result cache wasn't built by some other racing thread, it has to be the same reference and the contents are irrelevant. This is why I challenge whether the performance justifies the complexity, because it's fairly subtle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to compare the actual content because it can be expensive

@Jackie-Jiang
Copy link
Contributor Author

@richardstartin I didn't think from the readability perspective. I agree there is no clear evidence on the optimization. Changed it to use the concurrent set.

Copy link
Member

@richardstartin richardstartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even I can read it now 👍

@Jackie-Jiang Jackie-Jiang merged commit 9ba03e4 into apache:master Nov 29, 2021
@Jackie-Jiang Jackie-Jiang deleted the empty_segment_pruner_cache branch November 29, 2021 23:14
kriti-sc pushed a commit to kriti-sc/incubator-pinot that referenced this pull request Dec 12, 2021
)

Currently EmptySegmentPruner might have race condition where _emptySegments can be modified and accessed at the same time. Fix it by using a concurrent set for _emptySegments.
Also add cache to the pruner so that no need to recalculate the result given the same input segments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants