[HUDI-1879] Support Partition Prune For MergeOnRead Snapshot Table #2926

pengzhiwei2018 · 2021-05-08T11:09:24Z

What is the purpose of the pull request

Support partition pruning for merge-on-read snapshot queries. This issue was exposed by #2283, which reads Hudi tables as Spark DataSource tables.

Brief change log

Convert the filters in MergeOnReadSnapshotRelation#buildScan to Catalyst expressions and let HoodieFileIndex do the partition pruning.
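
As a rough illustration of what partition pruning saves a snapshot query, the sketch below drops non-matching partitions before any of their files are listed. `Partition`, `listFiles`, and the file names are hypothetical and are not Hudi APIs; only the partition paths mirror the ones used in this PR's test.

```scala
object PartitionPruneSketch {
  // Hypothetical model of a partition directory and the files under it.
  final case class Partition(path: String, files: Seq[String])

  // Only partitions accepted by the predicate contribute files to the scan.
  def listFiles(partitions: Seq[Partition], keep: String => Boolean): Seq[String] =
    partitions.filter(p => keep(p.path)).flatMap(_.files)

  // Illustrative data: base files and a log file, as a MOR table might hold.
  val partitions: Seq[Partition] = Seq(
    Partition("2021/03/01", Seq("a.parquet", "a.log")),
    Partition("2021/03/02", Seq("b.parquet")),
    Partition("2021/03/03", Seq("c.parquet"))
  )
}
```

Without pruning, every partition's base and log files would be scanned and the filter applied afterwards; with it, only the surviving partitions are read.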

Verify this pull request

Add TestConvertFilterToCatalystExpression to test the conversion of filters.
Add TestMORDataSource#testMORPartitionPrune to test partition pruning for MOR tables.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

codecov-commenter · 2021-05-08T11:16:33Z

Codecov Report

Merging #2926 (edf6c6b) into master (a5789c4) will increase coverage by 2.10%.
The diff coverage is 95.00%.

❗ Current head edf6c6b differs from pull request most recent head c4b07cb. Consider uploading reports for the commit c4b07cb to get more accurate results

@@             Coverage Diff              @@
##             master    #2926      +/-   ##
============================================
+ Coverage     53.00%   55.11%   +2.10%     
- Complexity     3743     3845     +102     
============================================
  Files           488      485       -3     
  Lines         23435    23512      +77     
  Branches       2500     2519      +19     
============================================
+ Hits          12422    12958     +536     
+ Misses         9913     9400     -513     
- Partials       1100     1154      +54

Flag	Coverage Δ
hudicli	`39.55% <ø> (+0.01%)`	⬆️
hudiclient	`∅ <ø> (∅)`
hudicommon	`50.31% <ø> (-0.35%)`	⬇️
hudiflink	`63.40% <ø> (+4.28%)`	⬆️
hudihadoopmr	`51.54% <ø> (+18.21%)`	⬆️
hudisparkdatasource	`74.30% <95.00%> (+0.96%)`	⬆️
hudisync	`46.44% <ø> (+0.33%)`	⬆️
huditimelineservice	`64.36% <ø> (ø)`
hudiutilities	`70.88% <ø> (+1.19%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
.../org/apache/hudi/MergeOnReadSnapshotRelation.scala	`89.62% <88.88%> (-0.38%)`	⬇️
.../main/scala/org/apache/hudi/HoodieSparkUtils.scala	`91.57% <95.38%> (+8.24%)`	⬆️
...c/main/scala/org/apache/hudi/HoodieFileIndex.scala	`81.01% <100.00%> (+1.92%)`	⬆️
.../main/java/org/apache/hudi/util/AvroConvertor.java	`0.00% <0.00%> (-82.36%)`	⬇️
.../hudi/table/format/mor/MergeOnReadInputFormat.java	`65.14% <0.00%> (-9.86%)`	⬇️
...he/hudi/common/bootstrap/index/BootstrapIndex.java	`94.11% <0.00%> (-5.89%)`	⬇️
...he/hudi/sink/partitioner/BucketAssignFunction.java	`77.88% <0.00%> (-5.84%)`	⬇️
...va/org/apache/hudi/sink/utils/HiveSyncContext.java	`91.89% <0.00%> (-5.26%)`	⬇️
...common/table/log/HoodieMergedLogRecordScanner.java	`82.35% <0.00%> (-5.15%)`	⬇️
.../java/org/apache/hudi/table/HoodieTableSource.java	`58.60% <0.00%> (-4.77%)`	⬇️
... and 65 more

garyli1019 · 2021-05-15T03:11:36Z

...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala

+  @ParameterizedTest
+  @ValueSource(booleans = Array(true, false))
+  def testMORPartitionPrune(partitionEncode: Boolean): Unit = {
+    val partitions = Array("2021/03/01", "2021/03/02", "2021/03/03", "2021/03/04", "2021/03/05")


Does Hive-style partitioning (year=2021/xxx) work as well?

pengzhiwei2018 · 2021-05-17

Yes, Hive-style partitioning is supported.

garyli1019 · 2021-05-19

Should we add a test for Hive style as well?

pengzhiwei2018 · 2021-05-20

Nice suggestion! Done!
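
For readers unfamiliar with the two layouts discussed above, here is a minimal sketch of how partition values could be recovered from each kind of path. It assumes three partition columns named `year`, `month`, and `day`, which is an assumption made for illustration; the helper names are hypothetical, and how Hudi actually derives the values is not shown in this thread.

```scala
object PartitionPathSketch {
  // Hive style: each path segment carries its own column name, e.g. "year=2021".
  def parseHiveStyle(path: String): Map[String, String] =
    path.split("/").toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(name, value) => Some(name -> value)
        case _                  => None // segment without "=" is ignored
      }
    }.toMap

  // Plain style: segments are positional, so column names must be supplied.
  def parsePositional(path: String, columns: Seq[String]): Map[String, String] =
    columns.zip(path.split("/").toSeq).toMap
}
```

Both calls below yield the same map, which is why the same partition filter can be evaluated against either layout once the values are extracted:

```scala
PartitionPathSketch.parseHiveStyle("year=2021/month=03/day=01")
PartitionPathSketch.parsePositional("2021/03/01", Seq("year", "month", "day"))
```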

vinothchandar

A few nits. LGTM overall. We can land once @umehrot2 also takes a pass and @garyli1019 is happy with the change.

vinothchandar · 2021-05-25T15:51:18Z

...spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala

+
+      // Get partition filter and convert to catalyst expression
+      val partitionColumns = hoodieFileIndex.partitionSchema.fieldNames.toSet
+      val partitionFilters= filters.filter(f => f.references.forall(p => partitionColumns.contains(p)))


nit: space before =
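
The selection rule in the hunk above (keep a filter only if every column it references is a partition column) can be tried out in isolation. The sketch below uses simplified stand-ins for Spark's data source filters, so `Filter`, `EqualTo`, `GreaterThan`, and `And` here are hypothetical case classes, not the real Spark types.

```scala
object PartitionFilterSketch {
  // Simplified stand-ins for data source filters (assumption: not the Spark classes).
  sealed trait Filter { def references: Set[String] }
  final case class EqualTo(attribute: String, value: Any) extends Filter {
    def references: Set[String] = Set(attribute)
  }
  final case class GreaterThan(attribute: String, value: Any) extends Filter {
    def references: Set[String] = Set(attribute)
  }
  final case class And(left: Filter, right: Filter) extends Filter {
    def references: Set[String] = left.references ++ right.references
  }

  // Keep only filters whose referenced columns are all partition columns.
  def partitionFilters(filters: Seq[Filter], partitionColumns: Set[String]): Seq[Filter] =
    filters.filter(_.references.forall(partitionColumns.contains))
}
```

A filter that mixes a partition column with a data column is dropped as a whole, because it cannot be decided from the partition path alone.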

vinothchandar · 2021-05-25T15:55:59Z

...spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala

+   * Convert Filters to Catalyst Expressions and joined by And. If convert success return an
+   * Non-Empty Option[Expression],or else return None.
+   */
+  def convertToCatalystExpressions(filters: Array[Filter],


Can we encapsulate this conversion logic into its own class? I could see general use for this beyond just partition pruning.

pengzhiwei2018 · 2021-05-26

Makes sense to me!
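
A standalone converter of the kind suggested above might look like the following sketch. It follows the doc comment in the hunk: convert each filter, join the results with And, and return an `Option`. The filter and expression types are hypothetical stand-ins rather than Spark's `Filter` or Catalyst's `Expression`, and returning `None` when any single filter fails to convert is an assumption of this sketch.

```scala
object FilterConversionSketch {
  // Hypothetical source-side filters.
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Any) extends Filter
  final case class Unsupported(description: String) extends Filter

  // Hypothetical target expression tree.
  sealed trait Expression
  final case class Attr(name: String) extends Expression
  final case class Lit(value: Any) extends Expression
  final case class Eq(left: Expression, right: Expression) extends Expression
  final case class Gt(left: Expression, right: Expression) extends Expression
  final case class AndExpr(left: Expression, right: Expression) extends Expression

  // Convert one filter; None signals that no equivalent expression exists.
  def convertOne(filter: Filter): Option[Expression] = filter match {
    case EqualTo(a, v)     => Some(Eq(Attr(a), Lit(v)))
    case GreaterThan(a, v) => Some(Gt(Attr(a), Lit(v)))
    case Unsupported(_)    => None
  }

  // Convert all filters and join them with And; None if the input is empty
  // or any filter is not convertible (all-or-nothing is an assumption here).
  def convertAll(filters: Seq[Filter]): Option[Expression] = {
    val converted = filters.map(convertOne)
    if (converted.isEmpty || converted.exists(_.isEmpty)) None
    else Some(converted.flatten.reduce[Expression](AndExpr(_, _)))
  }
}
```

Keeping the conversion free of any partition-specific logic is what makes it reusable outside partition pruning, which is the point of the review comment.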

garyli1019

LGTM!

nsivabalan

I don't understand some parts of the source code, as this is a new domain to me, but on the whole (I looked at the tests), LGTM. Definitely a good feature to add to Hudi. Thanks for your contribution :)

pengzhiwei2018 force-pushed the dev_1879_issue2 branch from 35852ae to 31e95ca Compare May 8, 2021 11:10

umehrot2 self-requested a review May 8, 2021 11:33

vinothchandar added this to Opened PRs in PR Tracker Board May 10, 2021

vinothchandar self-assigned this May 11, 2021

vinothchandar moved this from Opened PRs to Ready for Review in PR Tracker Board May 11, 2021

nsivabalan added the priority:major degraded perf; unable to move forward; potential bugs label May 11, 2021

garyli1019 reviewed May 15, 2021

pengzhiwei2018 force-pushed the dev_1879_issue2 branch from 7c720e7 to 5ec8b18 Compare May 20, 2021 06:46

vinothchandar approved these changes May 25, 2021

PR Tracker Board automation moved this from Ready for Review to Nearing Landing May 25, 2021

vinothchandar mentioned this pull request May 25, 2021

[HUDI-1491] Support partition pruning for MOR snapshot query #2378

Closed

5 tasks

pengzhiwei2018 force-pushed the dev_1879_issue2 branch from edf6c6b to 6736f3e Compare May 26, 2021 03:15

[HUDI-1879] Support Partition Prune For MergeOnRead Snapshot Table

c4b07cb

pengzhiwei2018 force-pushed the dev_1879_issue2 branch from 7765cce to c4b07cb Compare May 26, 2021 03:20

garyli1019 approved these changes May 29, 2021

nsivabalan approved these changes May 29, 2021

garyli1019 merged commit dcd7c33 into apache:master May 29, 2021

PR Tracker Board automation moved this from Nearing Landing to Done May 29, 2021


