[HUDI-6801] Implement merging partial updates from log files for MOR tables #9883
yihua merged 23 commits into apache:master from
...ient/hudi-spark-client/src/main/scala/org/apache/hudi/BaseSparkInternalRowReaderContext.java
    Map<String, Object> meta = new HashMap<>();
    meta.put(INTERNAL_META_RECORD_KEY, getRecordKey(record, schema));
    meta.put(INTERNAL_META_SCHEMA, schema);
    meta.put(INTERNAL_META_IS_PARTIAL, isPartial);
I'm wondering whether we can represent the metadata as a POJO to make the interface more explicit and clear.
We can. We can take it up in a separate PR.
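As a rough illustration of the POJO idea raised above, the three map entries could become typed fields. This is only a sketch: `RecordMetadata` and its accessors are hypothetical names that do not exist in Hudi, and the schema is held as a plain `Object` stand-in rather than the real schema type.

```java
import java.util.Objects;

// Hypothetical typed replacement for the Map<String, Object> metadata.
// Not part of Hudi; the names are illustrative assumptions.
final class RecordMetadata {
    private final String recordKey;
    private final Object schema;   // stand-in for the real schema type
    private final boolean partial;

    RecordMetadata(String recordKey, Object schema, boolean partial) {
        this.recordKey = Objects.requireNonNull(recordKey, "recordKey");
        this.schema = Objects.requireNonNull(schema, "schema");
        this.partial = partial;
    }

    String getRecordKey() { return recordKey; }
    Object getSchema() { return schema; }
    boolean isPartial() { return partial; }
}

class RecordMetadataDemo {
    public static void main(String[] args) {
        // Callers read typed fields instead of casting map values.
        RecordMetadata meta = new RecordMetadata("key-1", "schema-placeholder", true);
        System.out.println(meta.getRecordKey() + " partial=" + meta.isPartial());
    }
}
```

Typed accessors remove the casts and string-keyed lookups that a `Map<String, Object>` requires, at the cost of one more class to maintain.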
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java
    String recordKey = readerContext.getRecordKey(baseRecord, baseFileSchema);
    Pair<Option<T>, Map<String, Object>> logRecordInfo = records.remove(recordKey);
    Map<String, Object> metadata = readerContext.generateMetadataForRecord(
        baseRecord, baseFileSchema, false);
Be careful about a possible performance regression from constructing metadata for every record.
We'll benchmark this. It is likely fine, given that the method does little work and the existing log reader also extracts some metadata for each record.
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java
      boolean isValid = data == null || data instanceof UnsafeRow
    -     || schema != null && (data instanceof HoodieInternalRow || SparkAdapterSupport$.MODULE$.sparkAdapter().isColumnarBatchRow(data));
    +     || schema != null && (data instanceof HoodieInternalRow
    +     || data instanceof GenericInternalRow
The original indentation is clearer.
The line is too long. I've revised the indentation to make it clearer.
danny0405 left a comment
+1, let's supplement with more tests if it is necessary.
…Reader to read parquet log blocks
…tables (apache#9883) This commit adds the logic of merging partial updates in the new file group reader with Spark record merger.
Change Logs
This PR adds the logic of merging partial updates in the new file group reader with Spark record merger.
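To make the idea of merging a partial update concrete, here is a minimal sketch. It assumes records can be modeled as field-name-to-value maps and that the newer log record has already been chosen to win over the older one; it does not reproduce the logic in `SparkRecordMergingUtils`, and `PartialUpdateSketch` and `mergePartial` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: records are modeled as field-name -> value maps,
// not as Spark InternalRow or Hudi record types.
final class PartialUpdateSketch {
    private PartialUpdateSketch() {}

    /**
     * Returns a new record that starts from the older full record and
     * overwrites only the fields carried by the newer partial update.
     */
    static Map<String, Object> mergePartial(Map<String, Object> older,
                                            Map<String, Object> newerPartial) {
        Map<String, Object> merged = new LinkedHashMap<>(older);
        merged.putAll(newerPartial);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> base = new LinkedHashMap<>();
        base.put("id", 1);
        base.put("name", "a");
        base.put("price", 10.0);

        // The partial update carries only the key and the changed column.
        Map<String, Object> update = new LinkedHashMap<>();
        update.put("id", 1);
        update.put("price", 12.5);

        System.out.println(mergePartial(base, update));
    }
}
```

Fields absent from the partial update (here `name`) keep their values from the older record, which is what lets a snapshot query return a complete row.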
Impact
Supports snapshot queries on file groups with partial updates.
Risk level
low
Documentation Update
N/A