Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-1879] Support Partition Prune For MergeOnRead Snapshot Table #2926

Merged
merged 1 commit into from May 29, 2021

Conversation

pengzhiwei2018
Copy link
Contributor

@pengzhiwei2018 pengzhiwei2018 commented May 8, 2021

What is the purpose of the pull request

Support partition pruned for merge on read snapshot query. This issue was exposed by #2283 which read hudi as spark DataSource table.

Brief change log

  • Convert filters in MergeOnReadSnapshotRelation#buildScan to Catalyst Expression and do the partition prune by HoodieFileIndex

Verify this pull request

  • Add TestConvertFilterToCatalystExpression to test the convention for filter.
  • Add TestMORDataSource#testMORPartitionPrune to test partition prune for MOR table.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter
Copy link

codecov-commenter commented May 8, 2021

Codecov Report

Merging #2926 (edf6c6b) into master (a5789c4) will increase coverage by 2.10%.
The diff coverage is 95.00%.

❗ Current head edf6c6b differs from pull request most recent head c4b07cb. Consider uploading reports for the commit c4b07cb to get more accurate results
Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2926      +/-   ##
============================================
+ Coverage     53.00%   55.11%   +2.10%     
- Complexity     3743     3845     +102     
============================================
  Files           488      485       -3     
  Lines         23435    23512      +77     
  Branches       2500     2519      +19     
============================================
+ Hits          12422    12958     +536     
+ Misses         9913     9400     -513     
- Partials       1100     1154      +54     
Flag Coverage Δ
hudicli 39.55% <ø> (+0.01%) ⬆️
hudiclient ∅ <ø> (∅)
hudicommon 50.31% <ø> (-0.35%) ⬇️
hudiflink 63.40% <ø> (+4.28%) ⬆️
hudihadoopmr 51.54% <ø> (+18.21%) ⬆️
hudisparkdatasource 74.30% <95.00%> (+0.96%) ⬆️
hudisync 46.44% <ø> (+0.33%) ⬆️
huditimelineservice 64.36% <ø> (ø)
hudiutilities 70.88% <ø> (+1.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../org/apache/hudi/MergeOnReadSnapshotRelation.scala 89.62% <88.88%> (-0.38%) ⬇️
.../main/scala/org/apache/hudi/HoodieSparkUtils.scala 91.57% <95.38%> (+8.24%) ⬆️
...c/main/scala/org/apache/hudi/HoodieFileIndex.scala 81.01% <100.00%> (+1.92%) ⬆️
.../main/java/org/apache/hudi/util/AvroConvertor.java 0.00% <0.00%> (-82.36%) ⬇️
.../hudi/table/format/mor/MergeOnReadInputFormat.java 65.14% <0.00%> (-9.86%) ⬇️
...he/hudi/common/bootstrap/index/BootstrapIndex.java 94.11% <0.00%> (-5.89%) ⬇️
...he/hudi/sink/partitioner/BucketAssignFunction.java 77.88% <0.00%> (-5.84%) ⬇️
...va/org/apache/hudi/sink/utils/HiveSyncContext.java 91.89% <0.00%> (-5.26%) ⬇️
...common/table/log/HoodieMergedLogRecordScanner.java 82.35% <0.00%> (-5.15%) ⬇️
.../java/org/apache/hudi/table/HoodieTableSource.java 58.60% <0.00%> (-4.77%) ⬇️
... and 65 more

@umehrot2 umehrot2 self-requested a review May 8, 2021 11:33
@vinothchandar vinothchandar added this to Opened PRs in PR Tracker Board May 10, 2021
@vinothchandar vinothchandar self-assigned this May 11, 2021
@vinothchandar vinothchandar moved this from Opened PRs to Ready for Review in PR Tracker Board May 11, 2021
@nsivabalan nsivabalan added the priority:major degraded perf; unable to move forward; potential bugs label May 11, 2021
@ParameterizedTest
@ValueSource(booleans = Array(true, false))
def testMORPartitionPrune(partitionEncode: Boolean): Unit = {
val partitions = Array("2021/03/01", "2021/03/02", "2021/03/03", "2021/03/04", "2021/03/05")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the hive style partitioning year=2021/xxx work as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the hive style partition is supported.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a test for hive style as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice suggestion! done!

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nits. LGTM overall. We can land once @umehrot2 also take a pass and @garyli1019 is happy with the change


// Get partition filter and convert to catalyst expression
val partitionColumns = hoodieFileIndex.partitionSchema.fieldNames.toSet
val partitionFilters= filters.filter(f => f.references.forall(p => partitionColumns.contains(p)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space before =

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

* Convert Filters to Catalyst Expressions and joined by And. If convert success return an
* Non-Empty Option[Expression],or else return None.
*/
def convertToCatalystExpressions(filters: Array[Filter],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we encapsulate this conversion logic into its own class. I could see general use for this, beyond just partition pruning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me!

PR Tracker Board automation moved this from Ready for Review to Nearing Landing May 25, 2021
Copy link
Member

@garyli1019 garyli1019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand some parts of the source code as this is new domain to me. but on the whole(looked at the tests), LGTM. and def a good feature to add to hudi. thanks for your contribution :)

@garyli1019 garyli1019 merged commit dcd7c33 into apache:master May 29, 2021
PR Tracker Board automation moved this from Nearing Landing to Done May 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:major degraded perf; unable to move forward; potential bugs
Projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants