New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-1800] Exclude file slices in pending compaction when performing small file sizing #2902
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2902 +/- ##
=============================================
- Coverage 69.75% 54.96% -14.80%
- Complexity 375 3844 +3469
=============================================
Files 54 485 +431
Lines 1997 23436 +21439
Branches 236 2494 +2258
=============================================
+ Hits 1393 12881 +11488
- Misses 473 9401 +8928
- Partials 131 1154 +1023
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@nsivabalan did you hav chance to review this? |
I didn't know I was assigned to review this. Will get to this in a day or two. |
WorkloadProfile profile = new WorkloadProfile(buildProfile(jsc.parallelize(insertRecords))); | ||
|
||
HoodieSparkTable table = HoodieSparkTable.create(config, context, metaClient); | ||
SparkUpsertDeltaCommitPartitioner partitioner = new SparkUpsertDeltaCommitPartitioner(profile, context, table, config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I don't see any assertions with the partitioner. Did you miss to update the patch by any chance. If not, would be good to assert that records are routed to file groups of interest. I believe UpsertPartitioner has public apis to assist in this assertion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I am fairly certain that this will not miss out new file groups where it has only log files w/o any base files. But can we add a test to cover this scenario as well. just to have a comprehensive test.
scenario:
create 1 or 2 delta commits to a new file group(FG1).
also have few other file groups w/ base files and some delta files.
ingest new batch. inserts should be routed to file group FG1 as well if adheres to small file sizing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO its sufficient to cover just the case that this PR handles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you mention, can be a bonus @nsivabalan ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add assertion because I was testing that error was no longer thrown but I agree I will add to make sure files got routed to correct file groups.
I can also add this test case
@rmpifer fix itself looks good! Are you able to add the test case here, so we can land? |
Can you check why CI is failing. we can land once fixed. |
Tips
What is the purpose of the pull request
Small file size handling right now considers file groups in pending compaction. However when a compaction is in progress, there are no base or log files present yet in the file group. Small file sizing code expects either a base or log file is present and will throw an exception
JIRA: https://issues.apache.org/jira/browse/HUDI-1800
GH Issue: #2633
Brief change log
Excludes pending compactions from consideration when performing small file sizing. Based on conversation in GH decided to move forward with this approach
(for example:)
Verify this pull request
Was able to reproduce error in the unit test added
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.