[HUDI-1879] Support Partition Prune For MergeOnRead Snapshot Table #2926
35852ae to 31e95ca
Codecov Report
@@             Coverage Diff              @@
##             master    #2926      +/-   ##
============================================
+ Coverage     53.00%   55.11%   +2.10%
- Complexity     3743     3845     +102
============================================
  Files           488      485       -3
  Lines         23435    23512      +77
  Branches       2500     2519      +19
============================================
+ Hits          12422    12958     +536
+ Misses         9913     9400     -513
- Partials       1100     1154      +54
@ParameterizedTest
@ValueSource(booleans = Array(true, false))
def testMORPartitionPrune(partitionEncode: Boolean): Unit = {
  val partitions = Array("2021/03/01", "2021/03/02", "2021/03/03", "2021/03/04", "2021/03/05")
Does Hive-style partitioning (year=2021/xxx) work as well?
Yes, Hive-style partitioning is supported.
Should we add a test for Hive-style partitioning as well?
Nice suggestion! Done!
7c720e7 to 5ec8b18
A few nits. LGTM overall. We can land once @umehrot2 also takes a pass and @garyli1019 is happy with the change.
// Get partition filter and convert to catalyst expression
val partitionColumns = hoodieFileIndex.partitionSchema.fieldNames.toSet
val partitionFilters= filters.filter(f => f.references.forall(p => partitionColumns.contains(p)))
Nit: add a space before =
Done!
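The filter selection quoted above can be illustrated without a Spark dependency. This is only a minimal sketch: the Filter, EqualTo, GreaterThan and And types below are hypothetical stand-ins for Spark's data source filter classes, and partitionOnly is an illustrative name, not a method from this PR. The idea is that a filter is usable for partition pruning only if every column it references is a partition column.

```scala
object PartitionFilterSketch {
  // Hypothetical stand-ins for Spark's data source filters, so the sketch runs on its own.
  sealed trait Filter { def references: Array[String] }
  final case class EqualTo(attribute: String, value: Any) extends Filter {
    def references: Array[String] = Array(attribute)
  }
  final case class GreaterThan(attribute: String, value: Any) extends Filter {
    def references: Array[String] = Array(attribute)
  }
  final case class And(left: Filter, right: Filter) extends Filter {
    def references: Array[String] = left.references ++ right.references
  }

  // Keep only the filters whose referenced columns are all partition columns.
  // A filter that also touches a data column cannot be decided from the partition path alone.
  def partitionOnly(filters: Array[Filter], partitionColumns: Set[String]): Array[Filter] =
    filters.filter(f => f.references.forall(partitionColumns.contains))
}
```

With a single partition column dt, a filter on dt alone is kept, while a filter on price, or a conjunction that mixes dt and price, is dropped from the partition filter set.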
 * Convert Filters to Catalyst Expressions, joined by And. If the conversion succeeds, return a
 * non-empty Option[Expression]; otherwise return None.
 */
def convertToCatalystExpressions(filters: Array[Filter],
Can we encapsulate this conversion logic in its own class? I could see general use for this beyond just partition pruning.
Makes sense to me!
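The contract described in the doc comment above (all filters converted and joined by And, or None) can be sketched as a standalone helper. This is an illustration under stated assumptions, not the code from this PR: Filter, Expr and the case classes are hypothetical stand-ins for Spark's data source filters and Catalyst expressions, and convertOne and convertAll are made-up names. Returning None for an empty filter list is also an assumption of this sketch.

```scala
object FilterConversionSketch {
  // Hypothetical source-level filters (stand-ins for Spark's data source Filter classes).
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class Unsupported(description: String) extends Filter

  // Hypothetical expression tree (stand-in for Catalyst's Expression).
  sealed trait Expr
  final case class Equals(column: String, value: Any) extends Expr
  final case class AndExpr(left: Expr, right: Expr) extends Expr

  // Convert a single filter; None means the filter has no expression equivalent here.
  def convertOne(filter: Filter): Option[Expr] = filter match {
    case EqualTo(attribute, value) => Some(Equals(attribute, value))
    case Unsupported(_)            => None
  }

  // All-or-nothing conversion: if any filter fails to convert, the result is None.
  // Otherwise the converted expressions are joined left to right with AndExpr.
  // An empty input also yields None, because there is nothing to join.
  def convertAll(filters: Array[Filter]): Option[Expr] = {
    val converted = filters.toList.map(convertOne)
    if (converted.forall(_.isDefined))
      converted.flatten.reduceLeftOption[Expr]((l, r) => AndExpr(l, r))
    else None
  }
}
```

Keeping the conversion in its own object, as suggested in the review, means callers other than partition pruning can reuse it and only need to handle the None case.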
edf6c6b to 6736f3e
7765cce to c4b07cb
LGTM!
I don't understand some parts of the source code, as this is a new domain to me, but on the whole (I looked at the tests), LGTM. Definitely a good feature to add to Hudi. Thanks for your contribution :)
What is the purpose of the pull request
Support partition pruning for merge-on-read snapshot queries. This issue was exposed by #2283, which reads Hudi as a Spark DataSource table.
Brief change log
Convert the filters in MergeOnReadSnapshotRelation#buildScan to Catalyst Expressions and do the partition pruning via HoodieFileIndex
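To make the effect of the change concrete, here is a rough sketch of partition pruning over the partition paths used in the test. It is not how HoodieFileIndex is implemented: Expr, eval, partitionValue and prune are hypothetical names, the table is assumed to have a single partition column, and Hive-style paths are assumed to look like column=value.

```scala
object PartitionPruneSketch {
  // Hypothetical predicate over the value of a single partition column.
  sealed trait Expr
  final case class Equals(value: String) extends Expr
  final case class GreaterThanOrEqual(value: String) extends Expr
  final case class AndExpr(left: Expr, right: Expr) extends Expr

  // Evaluate the predicate against one partition value (plain string comparison).
  def eval(expr: Expr, partitionValue: String): Boolean = expr match {
    case Equals(v)             => partitionValue == v
    case GreaterThanOrEqual(v) => partitionValue.compareTo(v) >= 0
    case AndExpr(l, r)         => eval(l, partitionValue) && eval(r, partitionValue)
  }

  // Assumption: a Hive-style path carries a "column=" prefix; a plain path is the value itself.
  def partitionValue(path: String, column: String): String =
    path.stripPrefix(column + "=")

  // Keep only the partitions that satisfy the filter.
  // With no usable partition filter, every partition has to be scanned.
  def prune(paths: Seq[String], column: String, filter: Option[Expr]): Seq[String] =
    filter match {
      case Some(expr) => paths.filter(p => eval(expr, partitionValue(p, column)))
      case None       => paths
    }
}
```

For the five partitions 2021/03/01 through 2021/03/05, a filter of GreaterThanOrEqual("2021/03/04") keeps only the last two, so a snapshot query would list files in two partitions rather than five.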
Verify this pull request
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.