[HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries #689
Conversation
Just made a first pass.
Force-pushed from 86d9b05 to d15eb53
InputPathHandler(Configuration conf, Path[] inputPaths, List<String> incrementalTables) throws IOException {
  this.conf = conf;
  tableMetaClientMap = new HashMap<>();
  nonIncrementalPaths = new ArrayList<>();
Are these new data structures we are introducing compared to the existing code? What is the implication of these when the number of paths is really large?
They cannot be compared to the existing code, because the existing code doesn't look into the input paths inside HoodieInputFormat; input paths are handled only inside FileInputFormat.
Implication of the new data structures:
The InputPathHandler is created once per listStatus() call. Within the InputPathHandler object, the three data structures (nonIncrementalPaths, incrementalPaths and groupedIncrementalPaths) split the total set of input paths among them. At most we can expect one entry per input path, in exactly one of these structures, so the memory footprint is on the order of the total number of input paths.
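To make the split concrete, here is a simplified sketch (not the actual Hudi code) of how such a handler could classify input paths. The real InputPathHandler works with Hadoop Path objects and resolves tables via HoodieTableMetaClient; plain Strings and a hypothetical path-to-table lookup map stand in here for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for InputPathHandler: each input path lands in exactly
// one structure, so total memory stays O(number of input paths).
public class InputPathClassifier {
  final List<String> nonIncrementalPaths = new ArrayList<>();
  // table name -> its input paths, for tables being queried incrementally
  final Map<String, List<String>> groupedIncrementalPaths = new HashMap<>();

  InputPathClassifier(String[] inputPaths, Set<String> incrementalTables,
                      Map<String, String> pathToTable /* hypothetical meta-client lookup */) {
    for (String path : inputPaths) {
      String table = pathToTable.get(path); // null => not a Hoodie path
      if (table != null && incrementalTables.contains(table)) {
        // group incremental paths by table, so each table's commit timeline
        // is consulted once for all of its paths
        groupedIncrementalPaths.computeIfAbsent(table, t -> new ArrayList<>()).add(path);
      } else {
        // non-Hoodie paths and snapshot-queried tables go through the
        // existing listing logic unchanged
        nonIncrementalPaths.add(path);
      }
    }
  }
}
```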
Okay, so for a table with 400K files performing a snapshot query, we can expect this to be large?
The job input paths refer to the partition paths, right? In that case, 400K files will map to a smaller number of partition paths?
Made 1 pass and left some comments.
Made a high-level pass. Looks good overall; I can approve once the pending comments are addressed.
@bhasudha I left 1 comment, but the rest looks good to me. This is a pretty significant change; could you come up with a test/rollout/rollback plan?
Will do!
Force-pushed from 27acc81 to 038a4e8
I am good with this per se, but this effectively rewrites hoodie-hadoop-mr :)
Can you test this in a production setting and share more results before merging?
Force-pushed from bdf3ad9 to 19dc57f
I was able to successfully cross-verify the query results between the current HoodieInputFormat and this new HoodieInputFormat for a few Uber production tables using Spark. I ran different snapshot queries on MOR tables with count(*), group-bys, joins, etc. The query latencies were also comparable. I can't test incremental queries yet without changing the jar in the Hive Metastore; I will be doing that next. My plan is to have that tested in staging and then gradually roll it out to production.
@bhasudha this looks good overall. We are currently stabilizing master. Will merge once we are in calmer waters.
Force-pushed from 19dc57f to 440c3fb
@bhasudha Could you please rebase to master and merge it, as it is ready?
Force-pushed from 2bdd118 to ade9272
I rebased to the latest master and verified the Hive queries in the Docker Demo using the new patch. All queries in the Demo work as expected, and incremental queries leverage the optimizations in this patch when hive.fetch.task.conversion is disabled (as desired). I was also able to run tests using spark.sql() against some of the production tables (both MOR and COW types). I used --conf spark.sql.hive.convertMetastoreParquet=false so that the Hive SerDe is used instead. Below is a flavor of the queries that I tested; the results match between the pre-fix and post-fix hudi-spark-bundle jars.

Snapshot queries:
- simple count
- non-hudi / hudi datasets join
- non-hudi / non-hudi datasets join
- hudi / hudi datasets join
- group by, order and rank

Incremental queries:
spark.sql("select name, count(*) from tableA where event_status = 'complete' and
Thanks for the update @bhasudha, and welcome back :) Will make a final pass and then merge.
@bhasudha just the one question on pom dependency. Let's resolve that and you can self-merge when ready
…remental queries on Hoodie

Summary:
- The InputPathHandler class classifies input paths into incremental, non-incremental, and non-Hoodie paths.
- Incremental queries leverage HoodieCommitMetadata to get the partitions that are affected, and list only those partitions as opposed to listing all partitions.
- listStatus() processes each category separately.
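The partition-pruning idea in the second summary bullet can be sketched as follows. This is a hypothetical, simplified stand-in (not Hudi's actual code): CommitMetadata below plays the role of HoodieCommitMetadata's mapping from commits to affected partition paths, and maxCommits mirrors the kind of incremental-consumption limit a consumer can set; timeline ordering and instant-time string comparison are assumptions for this sketch.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Instead of listing every partition of the table, collect only the partitions
// touched by commits after the consumer's last consumed instant.
public class IncrementalPartitionPruner {
  // Stand-in for HoodieCommitMetadata: one commit and the partitions it wrote to.
  record CommitMetadata(String instantTime, Set<String> affectedPartitions) {}

  // Returns partitions affected by up to maxCommits commits strictly after
  // lastInstant; only these partitions then need to be listed in listStatus().
  static Set<String> affectedPartitions(List<CommitMetadata> timeline,
                                        String lastInstant, int maxCommits) {
    Set<String> partitions = new TreeSet<>();
    int taken = 0;
    for (CommitMetadata commit : timeline) { // timeline assumed sorted by instant time
      if (commit.instantTime().compareTo(lastInstant) > 0 && taken < maxCommits) {
        partitions.addAll(commit.affectedPartitions());
        taken++;
      }
    }
    return partitions;
  }
}
```

For a table with many partitions but few recent writes, this keeps the incremental listStatus() cost proportional to the partitions actually changed, not the table size.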
Force-pushed from ade9272 to edc1788
Tests are passing and review comments are addressed. Merging this code in.