
[HUDI-764] [HUDI-765] ORC reader and writer implementation #2793

Closed
wants to merge 4 commits

Conversation

TeRS-K
Contributor

@TeRS-K TeRS-K commented Apr 8, 2021

What is the purpose of the pull request

This pull request supports ORC storage in hudi.

Brief change log

In two separate commits:

  • Implemented HoodieOrcWriter
    • Added HoodieOrcConfigs
    • Added AvroOrcUtils, which writes Avro records to a VectorizedRowBatch
    • Used the orc-core:no-hive module (the no-hive variant is needed because spark-sql depends on the no-hive build of ORC, which makes Spark integration easier)
  • Implemented HoodieOrcReader
    • Reads Avro records from a VectorizedRowBatch
    • Implemented OrcReaderIterator
    • Implemented ORC utility functions
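The OrcReaderIterator mentioned above essentially exposes a batch-at-a-time reader through the plain java.util.Iterator contract. A stdlib-only sketch of that pattern, in which BatchSource and BatchIterator are hypothetical stand-ins for ORC's RecordReader and the PR's iterator, not the real APIs:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Queue;

// Hypothetical stand-in for ORC's RecordReader: hands back rows one batch at a time.
interface BatchSource<T> {
    /** Returns the next batch of rows, or an empty array once the file is exhausted. */
    T[] nextBatch();
}

// Sketch of the OrcReaderIterator idea: buffer one batch at a time and
// serve rows through hasNext()/next().
class BatchIterator<T> implements Iterator<T> {
    private final BatchSource<T> source;
    private final Queue<T> buffered = new ArrayDeque<>();
    private boolean exhausted = false;

    BatchIterator(BatchSource<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // Refill the buffer from the source until we have a row or hit end-of-file.
        while (buffered.isEmpty() && !exhausted) {
            T[] batch = source.nextBatch();
            if (batch.length == 0) {
                exhausted = true;
            } else {
                buffered.addAll(Arrays.asList(batch));
            }
        }
        return !buffered.isEmpty();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffered.poll();
    }
}
```

The buffering in hasNext() keeps next() simple and makes the iterator robust to callers that invoke hasNext() repeatedly.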

Verify this pull request

  • Added unit tests for
    • reader/writer creation
    • AvroOrcUtils
  • (local) Wrote a small tool that reads from ORC/Parquet files and writes to ORC/Parquet files, and verified that the records in the input and output files are identical using spark.read.orc/spark.read.parquet.
  • (local) Changed the HoodieTableConfig.DEFAULT_BASE_FILE_FORMAT to force the tests to run with ORC as the base format. Some changes need to be made, but I'm leaving it out of this PR to get some initial feedback on the reader/writer implementation first.
    For all tests to pass with ORC as the base file format:
    • Understand schema evolution in ORC (ref TestUpdateSchemaEvolution)
    • Add ORC support for places that have hardcoded ParquetReader or sqlContext.read().parquet()
    • Add ORC support for bootstrap op
    • Hive engine integration with ORC (implement HoodieOrcInputFormat, and more)
    • Spark engine integration with ORC (implement HoodieInternalRowOrcWriter, and more)
    • Add ORC support for HoodieSnapshotExporter
    • Implement HDFSOrcImporter
    • and possibly more.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@n3nash n3nash self-assigned this Apr 8, 2021
@n3nash
Contributor

n3nash commented Apr 8, 2021

@prashantwason Can you review this ?

@n3nash n3nash assigned n3nash and unassigned n3nash Apr 8, 2021
@TeRS-K
Contributor Author

TeRS-K commented Apr 8, 2021

The build is currently failing with the error ERROR: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit. This doesn't seem to be related to my change. How can I trigger a rebuild?

@yanghua
Contributor

yanghua commented Apr 9, 2021

How can I trigger a rebuild?

option 1: close and reopen the PR;
option 2: push an empty commit via the git command line, e.g. git commit --allow-empty -m "trigger CI" followed by git push

@TeRS-K TeRS-K closed this Apr 9, 2021
@TeRS-K TeRS-K reopened this Apr 9, 2021

public static final String ORC_FILE_MAX_BYTES = "hoodie.orc.max.file.size";
public static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024);
public static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
Contributor


Can you please add a comment explaining what the stripe size is used for?

public static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024);
public static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
public static final String DEFAULT_ORC_STRIPE_SIZE = String.valueOf(64 * 1024 * 1024);
public static final String ORC_BLOCK_SIZE = "hoodie.orc.block.size";
Contributor


Same for block size
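In the spirit of the reviewer's request, here is a hedged sketch of what such comments might look like. The constant names and default values are copied from the diff above; the explanatory comments reflect general ORC semantics and are my assumption, not necessarily the wording that landed in the PR:

```java
// Sketch only: constants from the diff, with illustrative comments on what
// each ORC knob controls in general (not the PR's final documentation).
class HoodieOrcConfigSketch {
    // Target maximum size of a single ORC base file before the writer rolls over.
    static final String ORC_FILE_MAX_BYTES = "hoodie.orc.max.file.size";
    static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024); // 120 MB

    // An ORC stripe is the unit of independent reading within a file; larger
    // stripes favor large sequential reads, smaller ones favor parallelism.
    static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
    static final String DEFAULT_ORC_STRIPE_SIZE = String.valueOf(64 * 1024 * 1024); // 64 MB

    // File system block size hint handed to the ORC writer, used when planning
    // stripe layout so stripes do not straddle block boundaries.
    static final String ORC_BLOCK_SIZE = "hoodie.orc.block.size";
}
```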


batch.size++;

if (batch.size == batch.getMaxSize()) {
Contributor


What is this batch size used for? Can you please add some comments?
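The check in the diff is the standard flush-when-full pattern for columnar batches: rows accumulate into a fixed-capacity batch and are written out in bulk when it fills. A minimal stdlib-only sketch of the idea, where RowBatch and BatchingWriter are hypothetical stand-ins for ORC's VectorizedRowBatch and the PR's writer, not the real APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for ORC's VectorizedRowBatch: a counter plus a fixed capacity.
class RowBatch {
    final int maxSize;
    int size = 0;
    RowBatch(int maxSize) { this.maxSize = maxSize; }
}

// Sketch of the flush-when-full pattern behind `batch.size == batch.getMaxSize()`.
class BatchingWriter {
    private final RowBatch batch;
    final List<Integer> flushSizes = new ArrayList<>(); // records each flush, for illustration

    BatchingWriter(int batchCapacity) { this.batch = new RowBatch(batchCapacity); }

    void writeRow() {
        batch.size++;            // column vector values for the row would be set here
        if (batch.size == batch.maxSize) {
            flush();             // hand the full batch to the underlying file writer
        }
    }

    void close() {
        if (batch.size > 0) {
            flush();             // don't lose the final, partially filled batch
        }
    }

    private void flush() {
        flushSizes.add(batch.size);
        batch.size = 0;          // reset so the batch's buffers are reused
    }
}
```

Resetting size to zero (rather than allocating a new batch) is what lets the column buffers be reused across flushes.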

import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class HoodieOrcReader<R extends IndexedRecord> implements HoodieFileReader {
Contributor


Add corresponding test class

Contributor Author


HoodieOrcReader calls methods from OrcUtils and OrcReaderIterator, so I think it would suffice to add unit tests to those two classes. Additionally, HoodieParquetReader is implicitly tested in many other tests that use the merge handle, and HoodieOrcReader can be tested the same way once we find a way to set the base file format for running unit tests. Does this sound reasonable?

/**
* Utility functions for ORC files.
*/
public class OrcUtils {
Contributor


Add corresponding test class to test all public methods

/**
* This class wraps an ORC reader and provides an iterator-based API to read from an ORC file.
*/
public class OrcReaderIterator<T> implements Iterator<T> {
Contributor


Corresponding test class

// TRIP_SCHEMA_PREFIX, EXTRA_TYPE_SCHEMA, MAP_TYPE_SCHEMA, FARE_NESTED_SCHEMA, TIP_NESTED_SCHEMA, TRIP_SCHEMA_SUFFIX
// The following types are tested:
// DATE, DECIMAL, LONG, INT, BYTES, ARRAY, RECORD, MAP, STRING, FLOAT, DOUBLE
TypeDescription orcSchema = TypeDescription.fromString("struct<"
Contributor


Is this testing all primitive types?

Contributor Author


This is testing all of the Avro types that can be converted to ORC. I'm not sure what a good test for this would be, since there are many types, as well as edge cases that are easier to discover with real-world datasets. Many of the unit tests use the reader/writer implicitly, which is largely how I've been testing the ORC reader/writer and catching bugs.
Open to suggestions on how to test AvroOrcUtils. :)
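For context, the conversion surface under discussion can be summarized as a type correspondence. The mapping below is an illustrative sketch of the commonly documented Avro-to-ORC pairing, not necessarily the exact mapping implemented in AvroOrcUtils:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative Avro -> ORC TypeDescription category mapping. This reflects the
// usual correspondence between the two type systems; the PR's AvroOrcUtils may
// differ in details (e.g. how unions and logical types are handled).
class AvroToOrcTypes {
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        // Primitive types
        MAPPING.put("boolean", "boolean");
        MAPPING.put("int", "int");
        MAPPING.put("long", "bigint");
        MAPPING.put("float", "float");
        MAPPING.put("double", "double");
        MAPPING.put("bytes", "binary");
        MAPPING.put("string", "string");
        // Logical types (Avro base type + logical annotation)
        MAPPING.put("date", "date");       // Avro int with the date logical type
        MAPPING.put("decimal", "decimal"); // Avro bytes/fixed with the decimal logical type
        // Complex types
        MAPPING.put("array", "array");
        MAPPING.put("map", "map");
        MAPPING.put("record", "struct");
    }
}
```

A table-driven unit test over such a mapping, plus round-trip tests on a handful of nested schemas, is one plausible way to cover AvroOrcUtils without enumerating every edge case by hand.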

Contributor

@n3nash n3nash left a comment


@TeRS-K Left some high level comments

Contributor

@leesf leesf left a comment


Would this not yet sync a Hudi ORC-format table to the Hive metastore?

}

@Override
public void writeAvroWithMetadata(R avroRecord, HoodieRecord record) throws IOException {
Member


Can this function be moved to the interface (default implementation in interface)? It does not look specific to ORC.

Contributor Author


I moved part of this function to the interface since there are still some differences per file format.

));
}

final long millis = time % 1000;
Member


This does not look correct. Time is in milliseconds so time % 1000 is not microseconds.

Contributor Author


Right, it seems we can remove this line and a couple of lines below it, as I don't see why millis should be accounted for in the nanos portion.
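For reference, a stdlib-only sketch of the conversion being discussed: splitting an epoch-millis value into whole seconds plus a nanosecond remainder. The helper names are hypothetical, not the PR's code:

```java
// Note that `time % 1000` yields leftover *milliseconds*, which must be scaled
// by 1_000_000 to become nanoseconds; floorDiv/floorMod keep pre-epoch
// (negative) timestamps correct, where plain `/` and `%` would not.
class TimestampSplit {
    static long seconds(long epochMillis) {
        return Math.floorDiv(epochMillis, 1000L);
    }

    static int nanosRemainder(long epochMillis) {
        return (int) Math.floorMod(epochMillis, 1000L) * 1_000_000;
    }
}
```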

@TeRS-K
Contributor Author

TeRS-K commented Apr 14, 2021

@leesf

would not sync hudi orc format table to hive metastore yet?

ORC is well integrated with Hive, so Hive already has OrcInputFormat, OrcOutputFormat, etc. With my latest change to the HoodieInputFormatUtils class, I was able to sync a Hudi ORC-format table to the Hive metastore (tested with DeltaStreamer).
However, we do still need to implement HoodieOrcInputFormat and HoodieRealtimeOrcInputFormat. I have done some work on that, but it's not tested yet.

@codecov-io

codecov-io commented Apr 14, 2021

Codecov Report

Merging #2793 (619d75b) into master (6786581) will decrease coverage by 0.99%.
The diff coverage is 8.67%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2793      +/-   ##
============================================
- Coverage     52.54%   51.55%   -1.00%     
- Complexity     3707     3729      +22     
============================================
  Files           485      489       +4     
  Lines         23171    23753     +582     
  Branches       2459     2554      +95     
============================================
+ Hits          12176    12246      +70     
- Misses         9923    10427     +504     
- Partials       1072     1080       +8     
Flag Coverage Δ Complexity Δ
hudicli 40.29% <ø> (ø) 0.00 <ø> (ø)
hudiclient ∅ <ø> (∅) 0.00 <ø> (ø)
hudicommon 48.61% <8.72%> (-2.07%) 0.00 <21.00> (ø)
hudiflink 56.54% <ø> (-0.04%) 0.00 <ø> (ø)
hudihadoopmr 33.27% <0.00%> (-0.18%) 0.00 <0.00> (ø)
hudisparkdatasource 71.33% <ø> (ø) 0.00 <ø> (ø)
hudisync 45.70% <ø> (ø) 0.00 <ø> (ø)
huditimelineservice 64.36% <ø> (ø) 0.00 <ø> (ø)
hudiutilities 69.74% <ø> (+0.01%) 0.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...org/apache/hudi/common/util/OrcReaderIterator.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...ain/java/org/apache/hudi/common/util/OrcUtils.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java 47.23% <0.00%> (-0.73%) 30.00 <0.00> (ø)
...java/org/apache/hudi/common/util/AvroOrcUtils.java 9.11% <9.11%> (ø) 19.00 <19.00> (?)
...va/org/apache/hudi/io/storage/HoodieOrcReader.java 25.00% <25.00%> (ø) 1.00 <1.00> (?)
...org/apache/hudi/common/model/HoodieFileFormat.java 100.00% <100.00%> (ø) 3.00 <0.00> (ø)
...pache/hudi/io/storage/HoodieFileReaderFactory.java 61.53% <100.00%> (+11.53%) 5.00 <1.00> (+2.00)
.../apache/hudi/sink/partitioner/BucketAssigners.java 33.33% <0.00%> (-16.67%) 2.00% <0.00%> (ø%)
...hadoop/realtime/RealtimeCompactedRecordReader.java 64.06% <0.00%> (-8.67%) 13.00% <0.00%> (+1.00%) ⬇️
.../java/org/apache/hudi/common/util/CommitUtils.java 40.47% <0.00%> (-3.12%) 6.00% <0.00%> (ø%)
... and 21 more

@vinothchandar vinothchandar added this to Ready For Review in PR Tracker Board Apr 15, 2021
@n3nash n3nash changed the title [HUDI-57] Support ORC Storage [HUDI-764] [HUDI-765] ORC reader and writer implementation Apr 15, 2021
@codecov-commenter

codecov-commenter commented Apr 22, 2021

Codecov Report

Merging #2793 (e787d8e) into master (6786581) will increase coverage by 17.14%.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #2793       +/-   ##
=============================================
+ Coverage     52.54%   69.68%   +17.14%     
+ Complexity     3707      373     -3334     
=============================================
  Files           485       54      -431     
  Lines         23171     1996    -21175     
  Branches       2459      236     -2223     
=============================================
- Hits          12176     1391    -10785     
+ Misses         9923      473     -9450     
+ Partials       1072      132      -940     
Flag Coverage Δ Complexity Δ
hudicli ? ?
hudiclient ∅ <ø> (∅) 0.00 <ø> (ø)
hudicommon ? ?
hudiflink ? ?
hudihadoopmr ? ?
hudisparkdatasource ? ?
hudisync ? ?
huditimelineservice ? ?
hudiutilities 69.68% <ø> (-0.04%) 373.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...org/apache/hudi/utilities/HoodieClusteringJob.java 62.50% <0.00%> (-2.72%) 9.00% <0.00%> (ø%)
...s/deltastreamer/HoodieMultiTableDeltaStreamer.java 78.39% <0.00%> (ø) 18.00% <0.00%> (ø%)
...n/java/org/apache/hudi/cli/commands/SparkMain.java
...di/sink/partitioner/delta/DeltaBucketAssigner.java
...3/internal/HoodieBulkInsertDataInternalWriter.java
...udi/spark3/internal/HoodieWriterCommitMessage.java
...penJ9MemoryLayoutSpecification64bitCompressed.java
...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java
.../common/table/timeline/HoodieArchivedTimeline.java
...g/apache/hudi/common/table/HoodieTableVersion.java
... and 421 more

@nsivabalan nsivabalan added the priority:minor everything else; usability gaps; questions; feature reqs label May 11, 2021
@n3nash
Contributor

n3nash commented Jun 10, 2021

Closing this in favor of -> #2999

@n3nash n3nash closed this Jun 10, 2021
PR Tracker Board automation moved this from Under Discussion PRs to Done Jun 10, 2021