[HUDI-4364]: changes for integrating column stats index into presto-h…#6087

Draft
pratyakshsharma wants to merge 2 commits into apache:master from pratyakshsharma:hudi-4364

@pratyakshsharma
Contributor

…udi connector

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@pratyakshsharma pratyakshsharma marked this pull request as draft July 12, 2022 10:30
@pratyakshsharma pratyakshsharma requested a review from codope July 12, 2022 10:31
@codope codope self-assigned this Jul 12, 2022
@codope codope added reader-core area:query-engine Query engine integrations labels Jul 12, 2022
@pratyakshsharma
Contributor Author

@hudi-bot run azure

@xiarixiaoyao
Contributor

@pratyakshsharma nice work!
A little question:
Why don't we put this code into the Presto Hudi connector, so that we can reuse the related Presto classes directly?

@pratyakshsharma
Contributor Author

pratyakshsharma commented Jul 12, 2022

@xiarixiaoyao this is a good question, and something I have been thinking about too. The idea is to build a layer that helps integrate the column stats index with all Java-based engines such as Presto, Trino and Hive. This lays the foundation, since we need something like ranges or column domains to be able to filter files using min and max values. A few classes here are inspired by those in Presto, but they are not identical.
We could write this logic in Presto, but then similar work would have to be done for Trino as well. With this code in Hudi, we only need an adapter in Presto/Trino to call the API this PR exposes for filtering files.

That said, since this is just the beginning of this work, I am open to hearing others' thoughts on this.
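
For readers following the thread, here is a minimal sketch of the min/max file filtering idea described above. It is not the code in this PR and not a Hudi or Presto API: the class and method names (`ColumnStatsPruningSketch`, `Range`, `FileStats`, `filterFiles`) are hypothetical, and it assumes a single `long` column with a closed-interval predicate.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the PR, Hudi, or Presto. */
final class ColumnStatsPruningSketch {

    /** Closed interval [min, max] of values for one column. */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            if (min > max) {
                throw new IllegalArgumentException("min must not exceed max");
            }
            this.min = min;
            this.max = max;
        }

        /** Two closed intervals overlap iff each one starts no later than the other ends. */
        boolean overlaps(Range other) {
            return min <= other.max && other.min <= max;
        }
    }

    /** Min/max statistics recorded for one column of one data file. */
    static final class FileStats {
        final String fileName;
        final Range columnRange;

        FileStats(String fileName, Range columnRange) {
            this.fileName = fileName;
            this.columnRange = columnRange;
        }
    }

    /** Keeps only the files whose [min, max] overlaps the range requested by the query. */
    static List<String> filterFiles(List<FileStats> stats, Range queryRange) {
        List<String> kept = new ArrayList<>();
        for (FileStats s : stats) {
            if (s.columnRange.overlaps(queryRange)) {
                kept.add(s.fileName);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FileStats> stats = List.of(
                new FileStats("f1.parquet", new Range(0, 99)),
                new FileStats("f2.parquet", new Range(100, 199)),
                new FileStats("f3.parquet", new Range(150, 300)));
        // Predicate assumed for illustration: col BETWEEN 120 AND 160
        System.out.println(filterFiles(stats, new Range(120, 160)));
    }
}
```

Under these assumptions, an engine-side adapter would only need to translate its own predicate representation into a `Range`-like object and call the filtering method; a file is skipped only when its statistics prove it cannot contain a matching row.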

@codope
Member

codope commented Jul 12, 2022

cc @alexeykudinkin

@alexeykudinkin
Contributor

@pratyakshsharma thanks for taking the time to contribute this!

We definitely want to make sure that the code integrating w/ Presto/Trino/Hive is as reusable as possible, and i think we should start thinking about it upfront to avoid the churn of refactoring things back and forth. Given the scope of this integration as well as its impact, i think we'd def go for an RFC to make sure we solicit feedback from the community before we go too far w/ the implementation.

@pratyakshsharma
Contributor Author

pratyakshsharma commented Jul 14, 2022

@alexeykudinkin An epic is filed here - https://issues.apache.org/jira/browse/HUDI-4394.

Please note this draft PR is intended as a POC and would work well with Presto. We were actually planning to get this into the 0.12 release. If not, we can target it for 1.0.0.

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@alexeykudinkin
Contributor

Got it. Yeah, i don't think we'll be able to make it into 0.12 given that we're planning to do a code freeze next week.

And again, i don't think we can go ahead with a project of this size, scope and, more importantly, impact (it'll be affecting all forthcoming execution engines like Flink, Presto, Trino, Hive, etc) w/o an RFC.

@pratyakshsharma
Contributor Author

Agree with you on this. Let me draft an RFC and we can take it up from there.

@prasannarajaperumal prasannarajaperumal self-requested a review August 17, 2022 07:10
@prasannarajaperumal prasannarajaperumal self-assigned this Aug 17, 2022
@yihua yihua added the priority:blocker Production down; release blocker label Sep 13, 2022
@codope codope added priority:high Significant impact; potential bugs and removed priority:blocker Production down; release blocker labels Sep 16, 2022
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2024

Labels

  • area:query-engine (Query engine integrations)
  • big-needle-movers
  • priority:high (Significant impact; potential bugs)
  • size:L (PR with lines of changes in (300, 1000])

8 participants