
[HUDI-764] [HUDI-765] ORC reader and writer implementation #2793

Closed
wants to merge 4 commits

Conversation

TeRS-K
Contributor

@TeRS-K TeRS-K commented Apr 8, 2021

What is the purpose of the pull request

This pull request supports ORC storage in hudi.

Brief change log

In two separate commits:

  • Implemented HoodieOrcWriter
    • Added HoodieOrcConfigs
    • Added AvroOrcUtils, which writes Avro records to a VectorizedRowBatch
    • Used the orc-core:no-hive module (the no-hive variant is needed because spark-sql depends on the no-hive build of ORC, which makes Spark integration easier)
  • Implemented HoodieOrcReader
    • Reads Avro records from a VectorizedRowBatch
    • Implemented OrcReaderIterator
    • Implemented ORC utility functions
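The OrcReaderIterator mentioned above essentially exposes a batch-at-a-time reader through the plain java.util.Iterator contract. A stdlib-only sketch of that pattern, in which BatchSource and BatchIterator are hypothetical stand-ins for ORC's RecordReader and the PR's iterator, not the real APIs:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Queue;

// Hypothetical stand-in for ORC's RecordReader: hands back rows one batch at a time.
interface BatchSource<T> {
    /** Returns the next batch of rows, or an empty array once the file is exhausted. */
    T[] nextBatch();
}

// Sketch of the OrcReaderIterator idea: buffer one batch at a time and
// serve rows through hasNext()/next().
class BatchIterator<T> implements Iterator<T> {
    private final BatchSource<T> source;
    private final Queue<T> buffered = new ArrayDeque<>();
    private boolean exhausted = false;

    BatchIterator(BatchSource<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // Refill the buffer from the source until we have a row or hit end-of-file.
        while (buffered.isEmpty() && !exhausted) {
            T[] batch = source.nextBatch();
            if (batch.length == 0) {
                exhausted = true;
            } else {
                buffered.addAll(Arrays.asList(batch));
            }
        }
        return !buffered.isEmpty();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffered.poll();
    }
}
```

The buffering in hasNext() keeps next() simple and makes the iterator robust to callers that invoke hasNext() repeatedly.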

Verify this pull request

  • Added unit tests for
    • reader/writer creation
    • AvroOrcUtils
  • (local) Wrote a small tool that reads from ORC/Parquet files and writes to ORC/Parquet files, and verified that the records in the input and output files are identical using spark.read.orc/spark.read.parquet.
  • (local) Changed the HoodieTableConfig.DEFAULT_BASE_FILE_FORMAT to force the tests to run with ORC as the base format. Some changes need to be made, but I'm leaving it out of this PR to get some initial feedback on the reader/writer implementation first.
    For all tests to pass with ORC as the base file format:
    • Understand schema evolution in ORC (ref TestUpdateSchemaEvolution)
    • Add ORC support for places that have hardcoded ParquetReader or sqlContext.read().parquet()
    • Add ORC support for bootstrap op
    • Hive engine integration with ORC (implement HoodieOrcInputFormat, and more)
    • Spark engine integration with ORC (implement HoodieInternalRowOrcWriter, and more)
    • Add ORC support for HoodieSnapshotExporter
    • Implement HDFSOrcImporter
    • and possibly more.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@n3nash n3nash self-assigned this Apr 8, 2021
@n3nash
Contributor

n3nash commented Apr 8, 2021

@prashantwason Can you review this ?

@n3nash n3nash assigned n3nash and unassigned n3nash Apr 8, 2021
@TeRS-K
Contributor Author

TeRS-K commented Apr 8, 2021

The build is currently failing with the error ERROR: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit. This doesn't seem to be related to my change. How can I trigger a rebuild?

@yanghua
Contributor

yanghua commented Apr 9, 2021

How can I trigger a rebuild?

option 1: close and reopen the PR;
option 2: push an empty commit via the git command line, e.g. git commit --allow-empty -m "trigger CI" followed by git push

@TeRS-K TeRS-K closed this Apr 9, 2021
@TeRS-K TeRS-K reopened this Apr 9, 2021

public static final String ORC_FILE_MAX_BYTES = "hoodie.orc.max.file.size";
public static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024);
public static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
Contributor


Can you please add a comment explaining what the stripe size is used for?

public static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024);
public static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
public static final String DEFAULT_ORC_STRIPE_SIZE = String.valueOf(64 * 1024 * 1024);
public static final String ORC_BLOCK_SIZE = "hoodie.orc.block.size";
Contributor


Same for block size
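In the spirit of the reviewer's request, here is a hedged sketch of what such comments might look like. The constant names and default values are copied from the diff above; the explanatory comments reflect general ORC semantics and are my assumption, not necessarily the wording that landed in the PR:

```java
// Sketch only: constants from the diff, with illustrative comments on what
// each ORC knob controls in general (not the PR's final documentation).
class HoodieOrcConfigSketch {
    // Target maximum size of a single ORC base file before the writer rolls over.
    static final String ORC_FILE_MAX_BYTES = "hoodie.orc.max.file.size";
    static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024); // 120 MB

    // An ORC stripe is the unit of independent reading within a file; larger
    // stripes favor large sequential reads, smaller ones favor parallelism.
    static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
    static final String DEFAULT_ORC_STRIPE_SIZE = String.valueOf(64 * 1024 * 1024); // 64 MB

    // File system block size hint handed to the ORC writer, used when planning
    // stripe layout so stripes do not straddle block boundaries.
    static final String ORC_BLOCK_SIZE = "hoodie.orc.block.size";
}
```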


batch.size++;

if (batch.size == batch.getMaxSize()) {
Contributor


What is this batch size used for? Can you please add some comments?
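The check in the diff is the standard flush-when-full pattern for columnar batches: rows accumulate into a fixed-capacity batch and are written out in bulk when it fills. A minimal stdlib-only sketch of the idea, where RowBatch and BatchingWriter are hypothetical stand-ins for ORC's VectorizedRowBatch and the PR's writer, not the real APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for ORC's VectorizedRowBatch: a counter plus a fixed capacity.
class RowBatch {
    final int maxSize;
    int size = 0;
    RowBatch(int maxSize) { this.maxSize = maxSize; }
}

// Sketch of the flush-when-full pattern behind `batch.size == batch.getMaxSize()`.
class BatchingWriter {
    private final RowBatch batch;
    final List<Integer> flushSizes = new ArrayList<>(); // records each flush, for illustration

    BatchingWriter(int batchCapacity) { this.batch = new RowBatch(batchCapacity); }

    void writeRow() {
        batch.size++;            // column vector values for the row would be set here
        if (batch.size == batch.maxSize) {
            flush();             // hand the full batch to the underlying file writer
        }
    }

    void close() {
        if (batch.size > 0) {
            flush();             // don't lose the final, partially filled batch
        }
    }

    private void flush() {
        flushSizes.add(batch.size);
        batch.size = 0;          // reset so the batch's buffers are reused
    }
}
```

Resetting size to zero (rather than allocating a new batch) is what lets the column buffers be reused across flushes.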

import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class HoodieOrcReader<R extends IndexedRecord> implements HoodieFileReader {
Contributor


Add corresponding test class

Contributor Author


HoodieOrcReader calls methods from OrcUtils and OrcReaderIterator, so I think it would suffice to add unit tests to those two classes. Additionally, HoodieParquetReader is implicitly tested in many other tests that use the merge handle, and HoodieOrcReader can be tested the same way once we find a way to set the base file format for running unit tests. Does this sound reasonable?

/**
* Utility functions for ORC files.
*/
public class OrcUtils {
Contributor


Add corresponding test class to test all public methods

/**
* This class wraps an ORC reader and provides an iterator-based API to read from an ORC file.
*/
public class OrcReaderIterator<T> implements Iterator<T> {
Contributor


Corresponding test class

// TRIP_SCHEMA_PREFIX, EXTRA_TYPE_SCHEMA, MAP_TYPE_SCHEMA, FARE_NESTED_SCHEMA, TIP_NESTED_SCHEMA, TRIP_SCHEMA_SUFFIX
// The following types are tested:
// DATE, DECIMAL, LONG, INT, BYTES, ARRAY, RECORD, MAP, STRING, FLOAT, DOUBLE
TypeDescription orcSchema = TypeDescription.fromString("struct<"
Contributor


Is this testing all primitive types?

Contributor Author


This is testing all of the Avro types that can be converted to ORC. I'm not sure what a good test for this would be, since there are many types, as well as edge cases that are easier to discover with real-world datasets. Many of the unit tests use the reader/writer implicitly, which is largely how I've been testing the ORC reader/writer and catching bugs.
Open to suggestions on how to test AvroOrcUtils. :)
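For context, the conversion surface under discussion can be summarized as a type correspondence. The mapping below is an illustrative sketch of the commonly documented Avro-to-ORC pairing, not necessarily the exact mapping implemented in AvroOrcUtils:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative Avro -> ORC TypeDescription category mapping. This reflects the
// usual correspondence between the two type systems; the PR's AvroOrcUtils may
// differ in details (e.g. how unions and logical types are handled).
class AvroToOrcTypes {
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        // Primitive types
        MAPPING.put("boolean", "boolean");
        MAPPING.put("int", "int");
        MAPPING.put("long", "bigint");
        MAPPING.put("float", "float");
        MAPPING.put("double", "double");
        MAPPING.put("bytes", "binary");
        MAPPING.put("string", "string");
        // Logical types (Avro base type + logical annotation)
        MAPPING.put("date", "date");       // Avro int with the date logical type
        MAPPING.put("decimal", "decimal"); // Avro bytes/fixed with the decimal logical type
        // Complex types
        MAPPING.put("array", "array");
        MAPPING.put("map", "map");
        MAPPING.put("record", "struct");
    }
}
```

A table-driven unit test over such a mapping, plus round-trip tests on a handful of nested schemas, is one plausible way to cover AvroOrcUtils without enumerating every edge case by hand.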

Contributor

@n3nash n3nash left a comment


@TeRS-K Left some high level comments

Contributor

@leesf leesf left a comment


Would this not yet sync a Hudi ORC-format table to the Hive metastore?

}

@Override
public void writeAvroWithMetadata(R avroRecord, HoodieRecord record) throws IOException {
Member


Can this function be moved to the interface (default implementation in interface)? It does not look specific to ORC.

Contributor Author


I moved part of this function to the interface since there are still some differences per file format.

));
}

final long millis = time % 1000;
Member


This does not look correct. Time is in milliseconds so time % 1000 is not microseconds.

Contributor Author


Right, it seems we can remove this line and a couple of lines below it, as I don't see why millis should be accounted for in the nanos portion.
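For reference, a stdlib-only sketch of the conversion being discussed: splitting an epoch-millis value into whole seconds plus a nanosecond remainder. The helper names are hypothetical, not the PR's code:

```java
// Note that `time % 1000` yields leftover *milliseconds*, which must be scaled
// by 1_000_000 to become nanoseconds; floorDiv/floorMod keep pre-epoch
// (negative) timestamps correct, where plain `/` and `%` would not.
class TimestampSplit {
    static long seconds(long epochMillis) {
        return Math.floorDiv(epochMillis, 1000L);
    }

    static int nanosRemainder(long epochMillis) {
        return (int) Math.floorMod(epochMillis, 1000L) * 1_000_000;
    }
}
```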

@TeRS-K
Contributor Author

TeRS-K commented Apr 14, 2021

@leesf

would not sync hudi orc format table to hive metastore yet?

ORC is well integrated with Hive, so Hive already has OrcInputFormat, OrcOutputFormat, etc. With my latest change to the HoodieInputFormatUtils class, I was able to sync a Hudi ORC-format table to the Hive metastore (tested with DeltaStreamer).
However, we do still need to implement HoodieOrcInputFormat and HoodieRealtimeOrcInputFormat. I have done some work on that, but it's not tested yet.

@codecov-io

codecov-io commented Apr 14, 2021

Codecov Report

Merging #2793 (619d75b) into master (6786581) will decrease coverage by 0.99%.
The diff coverage is 8.67%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2793      +/-   ##
============================================
- Coverage     52.54%   51.55%   -1.00%     
- Complexity     3707     3729      +22     
============================================
  Files           485      489       +4     
  Lines         23171    23753     +582     
  Branches       2459     2554      +95     
============================================
+ Hits          12176    12246      +70     
- Misses         9923    10427     +504     
- Partials       1072     1080       +8     
Flag Coverage Δ Complexity Δ
hudicli 40.29% <ø> (ø) 0.00 <ø> (ø)
hudiclient ∅ <ø> (∅) 0.00 <ø> (ø)
hudicommon 48.61% <8.72%> (-2.07%) 0.00 <21.00> (ø)
hudiflink 56.54% <ø> (-0.04%) 0.00 <ø> (ø)
hudihadoopmr 33.27% <0.00%> (-0.18%) 0.00 <0.00> (ø)
hudisparkdatasource 71.33% <ø> (ø) 0.00 <ø> (ø)
hudisync 45.70% <ø> (ø) 0.00 <ø> (ø)
huditimelineservice 64.36% <ø> (ø) 0.00 <ø> (ø)
hudiutilities 69.74% <ø> (+0.01%) 0.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...org/apache/hudi/common/util/OrcReaderIterator.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...ain/java/org/apache/hudi/common/util/OrcUtils.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java 47.23% <0.00%> (-0.73%) 30.00 <0.00> (ø)
...java/org/apache/hudi/common/util/AvroOrcUtils.java 9.11% <9.11%> (ø) 19.00 <19.00> (?)
...va/org/apache/hudi/io/storage/HoodieOrcReader.java 25.00% <25.00%> (ø) 1.00 <1.00> (?)
...org/apache/hudi/common/model/HoodieFileFormat.java 100.00% <100.00%> (ø) 3.00 <0.00> (ø)
...pache/hudi/io/storage/HoodieFileReaderFactory.java 61.53% <100.00%> (+11.53%) 5.00 <1.00> (+2.00)
.../apache/hudi/sink/partitioner/BucketAssigners.java 33.33% <0.00%> (-16.67%) 2.00% <0.00%> (ø%)
...hadoop/realtime/RealtimeCompactedRecordReader.java 64.06% <0.00%> (-8.67%) 13.00% <0.00%> (+1.00%) ⬇️
.../java/org/apache/hudi/common/util/CommitUtils.java 40.47% <0.00%> (-3.12%) 6.00% <0.00%> (ø%)
... and 21 more

@vinothchandar vinothchandar added this to Ready For Review in PR Tracker Board Apr 15, 2021
@n3nash n3nash changed the title [HUDI-57] Support ORC Storage [HUDI-764] [HUDI-765] ORC reader and writer implementation Apr 15, 2021
@codecov-commenter

codecov-commenter commented Apr 22, 2021

Codecov Report

Merging #2793 (e787d8e) into master (6786581) will increase coverage by 17.14%.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #2793       +/-   ##
=============================================
+ Coverage     52.54%   69.68%   +17.14%     
+ Complexity     3707      373     -3334     
=============================================
  Files           485       54      -431     
  Lines         23171     1996    -21175     
  Branches       2459      236     -2223     
=============================================
- Hits          12176     1391    -10785     
+ Misses         9923      473     -9450     
+ Partials       1072      132      -940     
Flag Coverage Δ Complexity Δ
hudicli ? ?
hudiclient ∅ <ø> (∅) 0.00 <ø> (ø)
hudicommon ? ?
hudiflink ? ?
hudihadoopmr ? ?
hudisparkdatasource ? ?
hudisync ? ?
huditimelineservice ? ?
hudiutilities 69.68% <ø> (-0.04%) 373.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...org/apache/hudi/utilities/HoodieClusteringJob.java 62.50% <0.00%> (-2.72%) 9.00% <0.00%> (ø%)
...s/deltastreamer/HoodieMultiTableDeltaStreamer.java 78.39% <0.00%> (ø) 18.00% <0.00%> (ø%)
...n/java/org/apache/hudi/cli/commands/SparkMain.java
...di/sink/partitioner/delta/DeltaBucketAssigner.java
...3/internal/HoodieBulkInsertDataInternalWriter.java
...udi/spark3/internal/HoodieWriterCommitMessage.java
...penJ9MemoryLayoutSpecification64bitCompressed.java
...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java
.../common/table/timeline/HoodieArchivedTimeline.java
...g/apache/hudi/common/table/HoodieTableVersion.java
... and 421 more

@nsivabalan nsivabalan added the priority:minor everything else; usability gaps; questions; feature reqs label May 11, 2021
@n3nash
Contributor

n3nash commented Jun 10, 2021

Closing this in favor of -> #2999

@n3nash n3nash closed this Jun 10, 2021
PR Tracker Board automation moved this from Under Discussion PRs to Done Jun 10, 2021