Skip to content

Conversation

@klsince
Copy link
Contributor

@klsince klsince commented Feb 23, 2022

Description

Narrowing the scope of change to skip empty segmentsFrom when checking lineage conflict, as in some use cases lineages always have empty segmentsFrom set. More conflict checks on segmentsTo set to be added with next PR with more discussions noted there.

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • Yes (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

  • Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Documentation

@codecov-commenter
Copy link

codecov-commenter commented Feb 23, 2022

Codecov Report

Merging #8242 (5b23417) into master (27d690d) will decrease coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #8242      +/-   ##
============================================
- Coverage     70.74%   70.69%   -0.06%     
- Complexity     4238     4241       +3     
============================================
  Files          1629     1629              
  Lines         85213    85217       +4     
  Branches      12830    12830              
============================================
- Hits          60287    60246      -41     
- Misses        20767    20807      +40     
- Partials       4159     4164       +5     
Flag Coverage Δ
integration1 28.68% <10.00%> (-0.16%) ⬇️
integration2 27.54% <0.00%> (-0.07%) ⬇️
unittests1 66.98% <ø> (+0.03%) ⬆️
unittests2 14.10% <100.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ntroller/helix/core/PinotHelixResourceManager.java 66.36% <100.00%> (+0.30%) ⬆️
...data/manager/realtime/DefaultSegmentCommitter.java 0.00% <0.00%> (-80.00%) ⬇️
...er/api/resources/LLCSegmentCompletionHandlers.java 43.56% <0.00%> (-18.82%) ⬇️
...data/manager/realtime/SegmentCommitterFactory.java 88.23% <0.00%> (-11.77%) ⬇️
...pinot/common/utils/fetcher/HttpSegmentFetcher.java 61.53% <0.00%> (-10.26%) ⬇️
...altime/ServerSegmentCompletionProtocolHandler.java 51.42% <0.00%> (-6.67%) ⬇️
...nction/DistinctCountBitmapAggregationFunction.java 41.96% <0.00%> (-5.19%) ⬇️
...rg/apache/pinot/server/starter/ServerInstance.java 88.78% <0.00%> (-2.81%) ⬇️
.../pinot/core/query/scheduler/PriorityScheduler.java 80.82% <0.00%> (-2.74%) ⬇️
...core/query/pruner/SelectionQuerySegmentPruner.java 86.36% <0.00%> (-2.28%) ⬇️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27d690d...5b23417. Read the comment docs.

@klsince klsince force-pushed the allow_empty_segmentfroms branch from f4991b4 to 0ddec73 Compare February 24, 2022 21:29
@klsince klsince requested a review from jackjlli February 24, 2022 23:05
Copy link
Member

@jackjlli jackjlli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for making the changes!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow the logic here. Say I have a refresh table, and I scheduled multiple segment ingestion tasks in parallel, why should one task kill another?
IMO, we should only perform the disjoint check because having common source segments can result in duplicate data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @snleee to help chime in, in case there are any corner cases for REFRESH table. Otherwise, I'll cool with just using disjoint()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can elaborate on the reason why multiple segment ingestion tasks cannot be done in parallel for a refresh table. That's mainly due to how the segment name is generated.
For refresh table, since there is no time column specified, there will be no min/max value shown in the segment name; the generated segment name will be like testTable_postfix_0. The suffix _0 only denotes the identifier for the segment within a batch ingestion job.
And since there is no relationship between the input raw file names and output segment names, there is no way to tell which raw file maps to which output segment within a batch. Thus, there is no way to backfill only a subset of segments for refresh table, i.e. if we want to update some data in one segment, we need to replace the whole data of it, because we don't know which segment we actually need to replace with the new data.
That's why if there are two ingestion tasks running in parallel for the same refresh table, either one of them should be banned (either the latter push job which detects the same segment names from segmentFrom field if forceCleanup = false, or the former push job with forceCleanup = true so that its lineage state will be marked as reverted).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be okay to just use disjoint() method here since the purpose is just to find out whether there is some segment names appearing in multiple segment lineages or not. That doesn't harm to make the logic of the validation stricter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackjlli We should decouple the logic of lineage from table push type. Lineage is designed to achieve atomic segments replacement, and we don't want to mix it with push type which can make it very confusing. If we want to do atomic full table refresh, we should generate segments with different names, and put old segments as segmentsFrom. Empty segmentsFrom means I want to add new segments without replacing existing ones, and we should allow multiple of such operations in parallel.

Copy link
Contributor

@snleee snleee Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jackie-Jiang

My reasoning was based on the following assumption:

For REFRESH table, each offline ingestion flow would push the full data snapshot.

If this is the case, the correct behavior should be blocking if 2 flows with segmentsFrom=empty are running at the same time because this can result in the REFRESH table having 2 snapshots of data. (Even if the input files are the same, we now put timestamp to the segment name so all data from 2 flows will be uploaded)

If we can make the assumption that we have 1 flow per REFRESH table, the above case can only happen if some user runs 2 push flows at the same time for an empty table, which will not likely happen. Then, I see your point that it's cleaner if we don't handle special logic for different push types. I'm fine with removing special handling for REFRESH.

On the other hand, if you have REFRESH tables with multiple flows set up, I think that this model will be hard to support the consistent push feature because you will have the hard time differentiating the segments from one flow to another as @jackjlli mentioned above. IMO, we should not support such model (having 2+ flows pushing to a single REFRESH table).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackjlli @snleee In order to handle such case, we should also check if segmentsTo are disjoint. We don't want to override the just pushed segments. These checks are already there for non-force-cleanup branch, not sure why it is not done in the same way for force-cleanup branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sry folks, I was distracted from this PR. IIUC, we'd like to avoid the specific handling on refresh table and simply do the disjoint check on segmentsFrom for now. So I just left the UT cases in this PR. I'll probably close the PR if we still need more discussions to converge on this. but lemme know your thoughts.

@klsince klsince force-pushed the allow_empty_segmentfroms branch from 5b23417 to 8722b9d Compare March 18, 2022 17:38
@klsince klsince changed the title handle lineage with empty segmentsFrom add UT for disjoint check on lineage segmentFroms Mar 18, 2022
@Jackie-Jiang Jackie-Jiang merged commit 114e4c5 into apache:master Mar 18, 2022
KKcorps pushed a commit to KKcorps/incubator-pinot that referenced this pull request Mar 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants