Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-1800] Exclude file slices in pending compaction when performing small file sizing #2902

Merged
merged 1 commit into from May 29, 2021

Conversation

rmpifer
Copy link
Contributor

@rmpifer rmpifer commented Apr 30, 2021

Tips

What is the purpose of the pull request

Small file size handling right now considers file groups in pending compaction. However when a compaction is in progress, there are no base or log files present yet in the file group. Small file sizing code expects either a base or log file is present and will throw an exception

JIRA: https://issues.apache.org/jira/browse/HUDI-1800
GH Issue: #2633

Brief change log

Excludes pending compactions from consideration when performing small file sizing. Based on conversation in GH decided to move forward with this approach

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

Was able to reproduce error in the unit test added

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@rmpifer rmpifer changed the title Exclude file slices in pending compaction when performing small file sizing [HUDI-1800] Exclude file slices in pending compaction when performing small file sizing Apr 30, 2021
@codecov-commenter
Copy link

codecov-commenter commented Apr 30, 2021

Codecov Report

Merging #2902 (f97167a) into master (3ca9030) will decrease coverage by 14.79%.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #2902       +/-   ##
=============================================
- Coverage     69.75%   54.96%   -14.80%     
- Complexity      375     3844     +3469     
=============================================
  Files            54      485      +431     
  Lines          1997    23436    +21439     
  Branches        236     2494     +2258     
=============================================
+ Hits           1393    12881    +11488     
- Misses          473     9401     +8928     
- Partials        131     1154     +1023     
Flag Coverage Δ
hudicli 39.55% <ø> (?)
hudiclient ∅ <ø> (∅)
hudicommon 50.29% <ø> (?)
hudiflink 63.40% <ø> (?)
hudihadoopmr 51.54% <ø> (?)
hudisparkdatasource 73.33% <ø> (?)
hudisync 46.44% <ø> (?)
huditimelineservice 64.36% <ø> (?)
hudiutilities 70.83% <ø> (+1.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...s/deltastreamer/HoodieMultiTableDeltaStreamer.java 76.19% <0.00%> (-2.34%) ⬇️
...apache/hudi/utilities/deltastreamer/DeltaSync.java 70.84% <0.00%> (-0.59%) ⬇️
...udi/utilities/deltastreamer/BootstrapExecutor.java 82.35% <0.00%> (ø)
...common/model/HoodieFailedWritesCleaningPolicy.java 0.00% <0.00%> (ø)
...a/org/apache/hudi/common/util/CompactionUtils.java 93.65% <0.00%> (ø)
.../org/apache/hudi/common/engine/EngineProperty.java 0.00% <0.00%> (ø)
...java/org/apache/hudi/common/util/NetworkUtils.java 0.00% <0.00%> (ø)
...apache/hudi/common/config/DefaultHoodieConfig.java 60.00% <0.00%> (ø)
.../org/apache/hudi/io/storage/HoodieHFileReader.java 0.00% <0.00%> (ø)
...n/java/org/apache/hudi/common/model/FileSlice.java 76.19% <0.00%> (ø)
... and 427 more

@nsivabalan nsivabalan added the priority:critical production down; pipelines stalled; Need help asap. label Apr 30, 2021
@vinothchandar vinothchandar added this to Opened PRs in PR Tracker Board May 6, 2021
@vinothchandar
Copy link
Member

@nsivabalan did you hav chance to review this?

@vinothchandar vinothchandar moved this from Opened PRs to Ready for Review in PR Tracker Board May 11, 2021
@nsivabalan
Copy link
Contributor

I didn't know I was assigned to review this. Will get to this in a day or two.

WorkloadProfile profile = new WorkloadProfile(buildProfile(jsc.parallelize(insertRecords)));

HoodieSparkTable table = HoodieSparkTable.create(config, context, metaClient);
SparkUpsertDeltaCommitPartitioner partitioner = new SparkUpsertDeltaCommitPartitioner(profile, context, table, config);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I don't see any assertions with the partitioner. Did you miss to update the patch by any chance. If not, would be good to assert that records are routed to file groups of interest. I believe UpsertPartitioner has public apis to assist in this assertion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am fairly certain that this will not miss out new file groups where it has only log files w/o any base files. But can we add a test to cover this scenario as well. just to have a comprehensive test.

scenario:
create 1 or 2 delta commits to a new file group(FG1).
also have few other file groups w/ base files and some delta files.
ingest new batch. inserts should be routed to file group FG1 as well if adheres to small file sizing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO its sufficient to cover just the case that this PR handles.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you mention, can be a bonus @nsivabalan ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add assertion because I was testing that error was no longer thrown but I agree I will add to make sure files got routed to correct file groups.

I can also add this test case

PR Tracker Board automation moved this from Ready for Review to Nearing Landing May 17, 2021
@vinothchandar
Copy link
Member

@rmpifer fix itself looks good! Are you able to add the test case here, so we can land?

@nsivabalan
Copy link
Contributor

nsivabalan commented May 25, 2021

Can you check why CI is failing. we can land once fixed.

@nsivabalan nsivabalan merged commit 0709c62 into apache:master May 29, 2021
PR Tracker Board automation moved this from Nearing Landing to Done May 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:critical production down; pipelines stalled; Need help asap.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants