add UT for disjoint check on lineage segmentFroms #8242
Codecov Report
@@             Coverage Diff              @@
##             master    #8242     +/-    ##
============================================
- Coverage     70.74%   70.69%   -0.06%
- Complexity     4238     4241      +3
============================================
  Files          1629     1629
  Lines         85213    85217      +4
  Branches      12830    12830
============================================
- Hits          60287    60246     -41
- Misses        20767    20807     +40
- Partials       4159     4164      +5
...ller/src/test/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManagerTest.java
...ntroller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java
f4991b4 to 0ddec73
jackjlli
left a comment
LGTM. Thanks for making the changes!
I don't follow the logic here. Say I have a refresh table, and I scheduled multiple segment ingestion tasks in parallel, why should one task kill another?
IMO, we should only perform the disjoint check, because having common source segments can result in duplicate data.
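A minimal sketch of the disjoint check being suggested here, assuming each lineage entry's source segments are available as a list of segment names. `Collections.disjoint` is the standard JDK call; the class and method names are hypothetical and are not taken from PinotHelixResourceManager.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not the actual Pinot code.
class SegmentsFromDisjointCheck {
  // Returns true when the two replacement requests share at least one source segment.
  static boolean sharesSourceSegments(List<String> existingSegmentsFrom, List<String> newSegmentsFrom) {
    return !Collections.disjoint(existingSegmentsFrom, newSegmentsFrom);
  }

  public static void main(String[] args) {
    List<String> inProgress = Arrays.asList("seg_0", "seg_1");
    // Overlaps on seg_1, so replacing both could duplicate data.
    System.out.println(sharesSourceSegments(inProgress, Arrays.asList("seg_1", "seg_2")));
    // No common source segment, so the two replacements are independent.
    System.out.println(sharesSourceSegments(inProgress, Arrays.asList("seg_3")));
  }
}
```

Under this check, two tasks only block each other when they try to replace the same source segment.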
cc @snleee to chime in, in case there are any corner cases for REFRESH tables. Otherwise, I'm cool with just using disjoint().
I can elaborate on the reason why multiple segment ingestion tasks cannot be done in parallel for a refresh table. That's mainly due to how the segment name is generated.
For a refresh table, no time column is specified, so the segment name carries no min/max time values; the generated segment name looks like testTable_postfix_0. The suffix _0 only identifies the segment within a batch ingestion job.
And since there is no relationship between the input raw file names and the output segment names, there is no way to tell which raw file maps to which output segment within a batch. Thus, there is no way to backfill only a subset of segments for a refresh table: if we want to update some data in one segment, we need to replace the table's whole data, because we don't know which segment actually needs to be replaced with the new data.
That's why, if two ingestion tasks run in parallel for the same refresh table, one of them should be blocked: either the later push job, which detects the same segment names in the segmentsFrom field when forceCleanup = false, or the earlier push job when forceCleanup = true, so that its lineage state is marked as reverted.
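To illustrate the naming problem, here is a small sketch. It assumes the name pattern `<table>_<postfix>_<sequenceId>` implied by the testTable_postfix_0 example; the `namesForBatch` helper is hypothetical and is not Pinot's segment name generator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of refresh-table segment naming; not Pinot code.
class RefreshSegmentNames {
  // Builds names like testTable_postfix_0 for a batch that produces numSegments segments.
  static List<String> namesForBatch(String table, String postfix, int numSegments) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < numSegments; i++) {
      names.add(table + "_" + postfix + "_" + i);
    }
    return names;
  }

  public static void main(String[] args) {
    // Two independent jobs for the same table, possibly reading different raw files.
    List<String> jobA = namesForBatch("testTable", "postfix", 3);
    List<String> jobB = namesForBatch("testTable", "postfix", 2);

    // The sequence id restarts at 0 in every job, so the names collide.
    List<String> common = new ArrayList<>(jobA);
    common.retainAll(jobB);
    System.out.println(common); // [testTable_postfix_0, testTable_postfix_1]
  }
}
```

Because the names say nothing about which raw files produced them, the overlap cannot be resolved per segment.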
I think we should be okay to just use the disjoint() method here, since the purpose is only to find out whether any segment names appear in multiple segment lineages. It doesn't hurt to make the validation logic stricter.
@jackjlli We should decouple the lineage logic from the table push type. Lineage is designed to achieve atomic segment replacement, and we don't want to mix it with push type, which can make it very confusing. If we want to do an atomic full table refresh, we should generate segments with different names and put the old segments in segmentsFrom. An empty segmentsFrom means I want to add new segments without replacing existing ones, and we should allow multiple such operations in parallel.
My reasoning was based on the following assumption:
For a REFRESH table, each offline ingestion flow pushes the full data snapshot.
If this is the case, the correct behavior should be to block when 2 flows with segmentsFrom=empty are running at the same time, because this can result in the REFRESH table having 2 snapshots of data. (Even if the input files are the same, we now put a timestamp in the segment name, so all data from both flows will be uploaded.)
If we can make the assumption that we have 1 flow per REFRESH table, the above case can only happen if some user runs 2 push flows at the same time for an empty table, which will not likely happen. Then, I see your point that it's cleaner if we don't handle special logic for different push types. I'm fine with removing special handling for REFRESH.
On the other hand, if you have REFRESH tables with multiple flows set up, I think it will be hard for this model to support the consistent push feature, because you will have a hard time differentiating the segments of one flow from another, as @jackjlli mentioned above. IMO, we should not support such a model (having 2+ flows pushing to a single REFRESH table).
sry folks, I was distracted from this PR. IIUC, we'd like to avoid the specific handling for refresh tables and simply do the disjoint check on segmentsFrom for now. So I just left the UT cases in this PR. I'll probably close the PR if we still need more discussion to converge on this, but lemme know your thoughts.
5b23417 to 8722b9d
Description
Narrowing the scope of the change to skip empty segmentsFrom when checking for lineage conflicts, since in some use cases lineage entries always have an empty segmentsFrom set. More conflict checks on the segmentsTo set will be added in the next PR, with further discussion noted there.
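A rough sketch of the narrowed check, under the assumption that in-progress lineage entries can be reduced to their segmentsFrom lists. The class and method names are hypothetical; the real validation lives in PinotHelixResourceManager and may differ in detail.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the conflict check; not the actual Pinot implementation.
class LineageConflictCheck {
  // Returns true if newSegmentsFrom overlaps the segmentsFrom of any in-progress entry.
  // Empty segmentsFrom (on either side) is skipped and never counts as a conflict.
  static boolean conflicts(List<List<String>> inProgressSegmentsFrom, List<String> newSegmentsFrom) {
    if (newSegmentsFrom.isEmpty()) {
      return false;
    }
    for (List<String> existing : inProgressSegmentsFrom) {
      if (existing.isEmpty()) {
        continue; // entry only adds segments; nothing to conflict on
      }
      if (!Collections.disjoint(existing, newSegmentsFrom)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<List<String>> inProgress = Arrays.asList(
        Collections.<String>emptyList(),   // an add-only operation
        Arrays.asList("seg_0", "seg_1"));  // a replacement in progress

    System.out.println(conflicts(inProgress, Collections.<String>emptyList())); // false
    System.out.println(conflicts(inProgress, Arrays.asList("seg_2")));          // false
    System.out.println(conflicts(inProgress, Arrays.asList("seg_1")));          // true
  }
}
```

With this shape, any number of add-only operations can run in parallel, and only replacements that name a common source segment are rejected.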