
[HUDI-5138][WIP][DO NOT MERGE] Table Spec FileSliceReader implementation #7199

Closed
wants to merge 10 commits

Conversation

@codope (Member) commented Nov 14, 2022

Change Logs

Implements FileSliceReader for Hive as specified in #7080. This is still WIP. Please do not merge.

TODO

  • Models as specified in RFC-64
  • HoodieRecordMerger implementation for ArrayWritable
  • Adapt realtime record reader to new merging logic
  • Schema projection
  • Filter predicate pushdown
  • Refactor presto-hudi connector with this new reader
  • Testing

Impact

High

Risk level (write none, low, medium, or high below)

High

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope added the pr:wip (Work in Progress) and reader-core labels Nov 14, 2022
@codope changed the title from "[HUDI-5138][WIP][DO NOT MERGE] Table Spek FileSliceReader implementation" to "[HUDI-5138][WIP][DO NOT MERGE] Table Spec FileSliceReader implementation" Nov 15, 2022

public class HoodieTableId {

private final String databaseName;
Contributor commented:
Just food for thought: the URI scheme in table naming is usually three-level, "namespace.db.table". We could fold "namespace.db" into a single qualifier, or encode them as standalone concerns, depending on how we envision it being used.
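
The folding the comment suggests could be sketched as follows; the class and method names are illustrative, not the Hudi API:

```java
// Hypothetical sketch of the review suggestion: keep the three-level
// "namespace.db.table" form, but allow "namespace.db" to be folded into a
// single qualifier. Not the actual HoodieTableId implementation.
public class QualifiedTableId {
    private final String namespace;    // e.g. catalog name
    private final String databaseName;
    private final String tableName;

    public QualifiedTableId(String namespace, String databaseName, String tableName) {
        this.namespace = namespace;
        this.databaseName = databaseName;
        this.tableName = tableName;
    }

    /** "namespace.db" folded into one qualifier, as the comment proposes. */
    public String qualifier() {
        return namespace + "." + databaseName;
    }

    /** Full three-level name "namespace.db.table". */
    public String fullName() {
        return qualifier() + "." + tableName;
    }
}
```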


@Override
public FileSliceReader project(InternalSchema schema) {
// should we convert to avro Schema?
Contributor commented:

Yes, we can just convert it to an Avro schema and then use the projection you've already implemented.
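
The delegation pattern the comment describes can be sketched in isolation. The types here are simplified stand-ins: in Hudi the conversion would go through an internal-schema-to-Avro converter whose exact API is an assumption, and the point is only that the InternalSchema overload converts and then reuses the existing Avro-schema overload:

```java
public class ProjectionDelegation {

    /** Stand-in for Hudi's InternalSchema; here it just carries an Avro JSON string. */
    public static class InternalSchema {
        final String avroJson;
        public InternalSchema(String avroJson) { this.avroJson = avroJson; }
    }

    /** Placeholder for the internal-schema-to-Avro conversion (an assumption,
     *  not the real Hudi converter). */
    static String toAvroSchema(InternalSchema schema) {
        return schema.avroJson;
    }

    /** InternalSchema overload: convert, then delegate to the Avro-based overload. */
    public String project(InternalSchema schema) {
        return project(toAvroSchema(schema));
    }

    /** The already-implemented Avro-schema projection (simplified here). */
    public String project(String avroSchema) {
        return "projected:" + avroSchema;
    }
}
```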

public FileSliceReader project(Schema requiredSchema) {
String partitionFields = jobConf.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "");
List<String> partitioningFields =
partitionFields.length() > 0 ? Arrays.stream(partitionFields.split("/")).collect(Collectors.toList())
Contributor commented:

Would "/" be a partition-path field separator for Presto as well?

Member Author replied:

Yes, that's correct.
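
The split in the snippet above can be shown standalone; the helper name is hypothetical, but the logic mirrors the diff: Hive stores partition columns slash-separated under META_TABLE_PARTITION_COLUMNS, and an empty value yields no partition fields:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class PartitionFieldSplit {

    /** Splits a slash-separated partition-columns string into field names;
     *  an empty string (no partition columns) yields an empty list. */
    public static List<String> partitioningFields(String partitionFields) {
        return partitionFields.length() > 0
            ? Arrays.stream(partitionFields.split("/")).collect(Collectors.toList())
            : Collections.emptyList();
    }
}
```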

if (split instanceof RealtimeSplit) {
HoodieMergedLogRecordScanner logRecordScanner = getMergedLogRecordScanner((RealtimeSplit) split, jobConf, readerSchema);
while (baseRecordIterator.hasNext()) {
logRecordScanner.processNextRecord(baseRecordIterator.next());
Contributor commented:

I don't think we'd be able to load the whole file in memory. Instead, we should create a "merging iterator" that goes over the base file and applies updates on the fly (similar to how Spark's implementation does it).
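
A minimal sketch of the "merging iterator" the comment suggests, with simplified stand-in types rather than Hudi's HoodieRecord/ArrayWritable API: base-file records are streamed one at a time, and a log-side update for the same key wins over the base record, so nothing requires materializing the whole base file:

```java
import java.util.Iterator;
import java.util.Map;

/**
 * Hypothetical merging iterator: streams base-file records and applies
 * log-file updates on the fly, keyed by record key. Deletes and log-only
 * inserts are omitted from this sketch.
 */
public class MergingIterator<K, V> implements Iterator<V> {
    private final Iterator<Map.Entry<K, V>> baseRecords;
    private final Map<K, V> logUpdates; // record key -> latest merged log record

    public MergingIterator(Iterator<Map.Entry<K, V>> baseRecords, Map<K, V> logUpdates) {
        this.baseRecords = baseRecords;
        this.logUpdates = logUpdates;
    }

    @Override
    public boolean hasNext() {
        return baseRecords.hasNext();
    }

    @Override
    public V next() {
        Map.Entry<K, V> base = baseRecords.next();
        // An update in the log wins over the base record for the same key.
        return logUpdates.getOrDefault(base.getKey(), base.getValue());
    }
}
```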


import static org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMergedLogRecordScanner;

public class HoodieRealtimeMergedRecordReader extends HoodieHiveFileSliceReader
Contributor commented:

Can we use aggregation instead of extension here? I'd suggest we simply encapsulate the FileSliceReader as a field instead of extending it, unless we're planning to extend its functionality.
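
The aggregation the comment suggests could look roughly like this; the interface is a simplified stand-in for HoodieHiveFileSliceReader, not the actual Hudi class:

```java
/**
 * Hypothetical sketch of composition over inheritance: the realtime merged
 * reader holds the file-slice reader as a field and delegates to it, rather
 * than extending it and inheriting its implementation details.
 */
public class RealtimeMergedRecordReader {
    private final FileSliceReader fileSliceReader; // encapsulated, not extended

    public RealtimeMergedRecordReader(FileSliceReader fileSliceReader) {
        this.fileSliceReader = fileSliceReader;
    }

    public Object nextRecord() {
        // Delegate to the wrapped reader; merging logic can layer on top here.
        return fileSliceReader.next();
    }

    /** Stand-in for the file-slice reader being wrapped. */
    interface FileSliceReader {
        Object next();
    }
}
```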

@codope changed the base branch from release-feature-rfc46 to master December 19, 2022 06:24
@codope changed the base branch from master to release-feature-rfc46 December 19, 2022 06:25
@codope (Member Author) commented Dec 19, 2022

Closing this in favor of #7508. There were quite a few conflicts; it seemed better to fork off the latest master and apply my changes.

@codope closed this Dec 19, 2022