
[HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries #689

Merged
merged 1 commit into apache:master on Jan 8, 2020

Conversation

bhasudha (Contributor)

Summary:

  • listStatus() now classifies input paths into incremental, non-incremental, and non-Hoodie paths.
  • Each category of input paths is processed separately.
  • Incremental queries leverage HoodieCommitMetadata to find the partitions affected by the relevant commits and list only those partitions, as opposed to listing all partitions (see the sketch below).
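
To make the incremental optimization concrete, here is a minimal sketch of the idea, not the PR's exact code: given the commit metadata for the commits an incremental query needs to consume, the affected partitions can be collected directly, so only those partitions are listed. HoodieCommitMetadata and getPartitionToWriteStats() are real Hudi APIs; the surrounding class and the way the metadata list is obtained are illustrative assumptions.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hudi.common.model.HoodieCommitMetadata;

// Illustrative sketch: derive the partitions an incremental query must list from the
// commit metadata of the commits after the consumer's last commit time. How that list
// of HoodieCommitMetadata is fetched from the timeline is left out here.
public class IncrementalPartitionPruningSketch {

  static Set<String> affectedPartitions(List<HoodieCommitMetadata> commitsToConsume) {
    Set<String> partitions = new HashSet<>();
    for (HoodieCommitMetadata metadata : commitsToConsume) {
      // Each commit records, per partition, the files it wrote; the key set is exactly
      // the set of partitions touched by that commit.
      partitions.addAll(metadata.getPartitionToWriteStats().keySet());
    }
    return partitions;
  }
}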

@vinothchandar (Member) left a comment

Just made a first pass.

InputPathHandler(Configuration conf, Path[] inputPaths, List<String> incrementalTables) throws IOException {
  this.conf = conf;
  tableMetaClientMap = new HashMap<>();
  nonIncrementalPaths = new ArrayList<>();
Contributor

Are these new data structures we are introducing compared to the existing code? What is the implication of these when the number of paths is really large?

Contributor Author

They can't be compared to the existing code, because the existing code doesn't look into the input paths inside HoodieInputFormat; input paths are handled only inside FileInputFormat.

Implication of the new data structures:
The InputPathHandler is created once per listStatus() call. Within the InputPathHandler object, the three data structures (nonIncrementalPaths, incrementalPaths and groupedIncrementalPaths) split the total set of input paths among them; each input path appears as at most one entry in exactly one of these structures. So the memory overhead is on the order of the total number of input paths (see the sketch below).
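
As a rough illustration of that split (a sketch using the field names mentioned in this thread, with hypothetical helpers; not the PR's exact code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.Path;

// Sketch of the three-way classification done once per listStatus() call. Each input
// path lands in at most one bucket, so extra memory stays on the order of #input paths.
// isHoodiePath() and getTableName() are hypothetical stand-ins for the metadata lookups
// the real InputPathHandler does via HoodieTableMetaClient.
class InputPathClassificationSketch {

  final List<Path> nonIncrementalPaths = new ArrayList<>();   // non-Hoodie paths and snapshot-mode Hudi paths
  final List<Path> incrementalPaths = new ArrayList<>();      // all paths belonging to incremental tables
  final Map<String, List<Path>> groupedIncrementalPaths = new HashMap<>(); // the same paths, grouped per table

  void classify(Path[] inputPaths, List<String> incrementalTables) {
    for (Path path : inputPaths) {
      if (!isHoodiePath(path) || !incrementalTables.contains(getTableName(path))) {
        nonIncrementalPaths.add(path);
      } else {
        incrementalPaths.add(path);
        // Grouping lets the commit-metadata lookup happen once per table instead of once per path.
        groupedIncrementalPaths.computeIfAbsent(getTableName(path), t -> new ArrayList<>()).add(path);
      }
    }
  }

  // Hypothetical helpers for illustration only.
  private boolean isHoodiePath(Path path) { return true; }
  private String getTableName(Path path) { return path.getParent().getName(); }
}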

Contributor

Okay, so for a table with 400K files performing a snapshot query, we can expect this to be large?

Contributor Author

The job input paths refer to the partition paths, right? In that case, 400K files will map to a much smaller number of partition paths?

@n3nash (Contributor) commented Jun 3, 2019

Made 1 pass and left some comments.

@bvaradar (Contributor) left a comment

Made a high-level pass. Looks good overall; I can approve once the pending comments are addressed.

@n3nash n3nash self-requested a review June 14, 2019 21:55
@n3nash (Contributor) left a comment

@bhasudha I left 1 comment, but the rest looks good to me. This is a pretty significant change; could you come up with a test/rollout/rollback plan?

@bhasudha (Contributor Author)

> @bhasudha I left 1 comment, but the rest looks good to me. This is a pretty significant change; could you come up with a test/rollout/rollback plan?

Will do!

@bhasudha bhasudha force-pushed the speedup-incremental branch 2 times, most recently from 27acc81 to 038a4e8 Compare June 18, 2019 06:17
@vinothchandar (Member) left a comment

I am good with this per se. But this effectively rewrites hoodie-hadoop-mr :)
Can you test this in a production setting and share more results before merging?

NOTICE.txt (review comment: outdated, resolved)
@bhasudha bhasudha force-pushed the speedup-incremental branch 2 times, most recently from bdf3ad9 to 19dc57f Compare July 17, 2019 18:57
@bhasudha (Contributor Author)

I was able to successfully cross-verify the query results between the current HoodieInputFormat and this new HoodieInputFormat for a few Uber production tables using Spark. I ran different snapshot queries on MOR tables with count(*), group bys, joins, etc. The query latencies were also comparable.

I can't test incremental queries yet without changing the jar in the Hive Metastore; I will be doing that next. My plan is to have that tested in staging and then gradually roll it out to production.

@n3nash @vinothchandar ^^

@vinothchandar (Member)

@bhasudha this looks good overall. We are currently stabilizing master. Will merge once we are in calmer waters.

@vinothchandar vinothchandar self-assigned this Aug 1, 2019
@vinothchandar vinothchandar added the pr:wip Work in Progress/PRs label Sep 11, 2019
@leesf (Contributor) commented Dec 17, 2019

@bhasudha Could you please rebase to master and merge it as it is ready?

@vinothchandar vinothchandar changed the title [HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries [WIP] [HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries Dec 22, 2019
@bhasudha bhasudha force-pushed the speedup-incremental branch 2 times, most recently from 2bdd118 to ade9272 Compare December 31, 2019 20:32
@bhasudha (Contributor Author) commented Jan 6, 2020

I rebased onto the latest master and verified the Hive queries in the Docker Demo using the new patch. All queries in the Demo work as expected, and incremental queries leverage the optimizations in this patch when hive.fetch.task.conversion is disabled (as desired).

I was able to run tests using spark.sql() against some of the production tables (both MOR and COW types). I used --conf spark.sql.hive.convertMetastoreParquet=false so the Hive SerDe is used instead. Below is a flavor of the queries that I tested. The results match between the pre-fix and post-fix hudi-spark-bundle jars.

Snapshot queries

simple count:
spark.sql("select count(*) from tableA where datestr = '2019-12-10'").show()

non-Hudi and Hudi dataset join:
spark.sql("select m.col1 as colA, t.col2 as colB from table1 m left join table2 as t on t._row_key = m.col1 and t.datestr >= '2016-01-01' join table3 c on m.col4 = c.col5 where c.col6 = 'XYZ'").show()

non-Hudi and non-Hudi dataset join:
spark.sql("select o.col1, count(distinct e.col2) from tableA o join tableB e on o.id = e.id where to_date(e.col1) >= date_sub(current_date, 10) or to_date(e.col3) >= date_sub(current_date, 10) group by 1 order by 2 desc").show()

Hudi and Hudi dataset join:
spark.sql("select t.id, count(t.load) as total_count FROM tableT t LEFT JOIN tableO o on t.id = o.id AND o.datestr > '2019-12-28' AND NOT o.isactive WHERE t.datestr > '2019-12-28' AND NOT t.isactive group by 1 order by 1,2 desc").show()

group by, order and rank:
spark.sql("select * from ( select , rank() over ( partition by rg order by total_items desc ) as row_number from ( select rg, usr, count() as total_items from tableA where date(datestr) >= date('2019-10-11') and date(datestr) < date('2019-10-16') and event = 'complete' and SUBSTRING_INDEX(rg, '.',1) = 'adhoc' group by 1,2 order by 1, count(*) desc ) ) where row_number <= 1").show()

Incremental queries

spark.sql("select name, count(*) from tableA where event_status = 'complete' and _hoodie_commit_time > '20200101235440' group by 1").show()

@vinothchandar vinothchandar changed the title [WIP] [HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries [HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries Jan 7, 2020
@vinothchandar vinothchandar removed the pr:wip Work in Progress/PRs label Jan 7, 2020
@vinothchandar (Member)

Thanks for the update @bhasudha, and welcome back :) Will make a final pass and then merge.

@vinothchandar (Member) left a comment

@bhasudha just the one question on the pom dependency. Let's resolve that and you can self-merge when ready.

hudi-hadoop-mr/pom.xml (review comment: outdated, resolved)
…remental queries on Hoodie

    Summary:
    - InputPathHandler class classifies input paths into incremental, non-incremental and non-Hoodie paths.
    - Incremental queries leverage HoodieCommitMetadata to get the partitions that are affected and list only those partitions, as opposed to listing all partitions.
    - listStatus() processes each category separately.
@bhasudha (Contributor Author) commented Jan 8, 2020

Tests are passing and review comments are addressed. Merging this code in.

@bhasudha bhasudha merged commit d09eacd into apache:master Jan 8, 2020