[HUDI-764] [HUDI-765] ORC reader and writer implementation #2793
Conversation
@prashantwason Can you review this?
The build is currently failing with an error.
Option 1: close and reopen the PR.
public static final String ORC_FILE_MAX_BYTES = "hoodie.orc.max.file.size";
public static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024);
public static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
Can you please add comments on what the stripe size is used for?
public static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024);
public static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
public static final String DEFAULT_ORC_STRIPE_SIZE = String.valueOf(64 * 1024 * 1024);
public static final String ORC_BLOCK_SIZE = "hoodie.orc.block.size";
Same for block size
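For reference, the documented constants could look like the sketch below. The comment text reflects general ORC semantics and is not taken from the PR:

// Target size for ORC base files; the writer rolls over to a new file
// once this limit is reached.
public static final String ORC_FILE_MAX_BYTES = "hoodie.orc.max.file.size";
public static final String DEFAULT_ORC_FILE_MAX_BYTES = String.valueOf(120 * 1024 * 1024);

// Size of an ORC stripe, the unit of row data that can be read and
// processed independently, each with its own index data and footer.
public static final String ORC_STRIPE_SIZE = "hoodie.orc.stripe.size";
public static final String DEFAULT_ORC_STRIPE_SIZE = String.valueOf(64 * 1024 * 1024);

// File-system block size for the ORC file; ORC pads stripes so that they
// do not straddle block boundaries.
public static final String ORC_BLOCK_SIZE = "hoodie.orc.block.size";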
batch.size++;

if (batch.size == batch.getMaxSize()) {
What is this batch size used for? Can you please add some comments?
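For context, the snippet above follows ORC's standard vectorized writing pattern. A minimal sketch of that pattern, assuming a Writer and a VectorizedRowBatch whose column vectors have been populated at index batch.size (this is the general ORC idiom, not the PR's exact code):

import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.Writer;

public class OrcBatchingExample {
  // Rows accumulate in the in-memory batch; once it is full, it is flushed
  // to the ORC writer and reset so new rows can be appended.
  static void commitRow(Writer writer, VectorizedRowBatch batch) throws IOException {
    batch.size++; // one more populated row in the batch
    if (batch.size == batch.getMaxSize()) {
      writer.addRowBatch(batch); // write the full batch to the file
      batch.reset();             // clear the batch for subsequent rows
    }
  }
}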
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
public class HoodieOrcReader<R extends IndexedRecord> implements HoodieFileReader { |
Add a corresponding test class.
HoodieOrcReader calls methods from OrcUtils and OrcReaderIterator, so I think it would suffice to add unit tests for those two classes. Additionally, HoodieParquetReader is implicitly tested by many other tests that use the merge handle, and HoodieOrcReader can be tested the same way once we find a way to set the base file format when running unit tests. Does this sound reasonable?
/**
 * Utility functions for ORC files.
 */
public class OrcUtils {
Add a corresponding test class to cover all public methods.
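As a starting point, a self-contained write/read round trip of the kind such a test could build on. This exercises the raw ORC APIs rather than OrcUtils itself, since OrcUtils' actual method signatures are not shown in this thread:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcRoundTripExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/orc-round-trip-example.orc");
    TypeDescription schema = TypeDescription.fromString("struct<key:string>");

    // Write a single row.
    Writer writer = OrcFile.createWriter(path, OrcFile.writerOptions(conf).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector keys = (BytesColumnVector) batch.cols[0];
    keys.setVal(batch.size++, "row-1".getBytes(StandardCharsets.UTF_8));
    writer.addRowBatch(batch);
    writer.close();

    // Read it back and count the rows.
    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
    RecordReader rows = reader.rows();
    VectorizedRowBatch readBatch = reader.getSchema().createRowBatch();
    long count = 0;
    while (rows.nextBatch(readBatch)) {
      count += readBatch.size;
    }
    rows.close();
    System.out.println("rows read: " + count); // expect 1
  }
}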
/**
 * This class wraps an ORC reader and provides an iterator-based API to read from an ORC file.
 */
public class OrcReaderIterator<T> implements Iterator<T> {
Add a corresponding test class.
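For reference, a minimal sketch of the batch-backed iterator pattern such a class implements; the field layout and the row-conversion hook are assumptions here, not the PR's actual code:

import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.RecordReader;

public abstract class BatchRowIterator<T> implements Iterator<T> {
  private final RecordReader reader;
  private final VectorizedRowBatch batch;
  private int rowInBatch = 0;

  protected BatchRowIterator(RecordReader reader, VectorizedRowBatch batch) {
    this.reader = reader;
    this.batch = batch;
  }

  // Subclasses convert the row at the given index (e.g. to an Avro record).
  protected abstract T readRow(VectorizedRowBatch batch, int rowIndex);

  @Override
  public boolean hasNext() {
    try {
      if (rowInBatch < batch.size) {
        return true; // rows remain in the current batch
      }
      // Refill: nextBatch() returns false once the file is exhausted.
      rowInBatch = 0;
      return reader.nextBatch(batch);
    } catch (IOException e) {
      throw new RuntimeException("Failed to read next ORC batch", e);
    }
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return readRow(batch, rowInBatch++);
  }
}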
// TRIP_SCHEMA_PREFIX, EXTRA_TYPE_SCHEMA, MAP_TYPE_SCHEMA, FARE_NESTED_SCHEMA, TIP_NESTED_SCHEMA, TRIP_SCHEMA_SUFFIX
// The following types are tested:
// DATE, DECIMAL, LONG, INT, BYTES, ARRAY, RECORD, MAP, STRING, FLOAT, DOUBLE
TypeDescription orcSchema = TypeDescription.fromString("struct<"
Is this testing all primitive types?
This is testing all the Avro types that can be converted to ORC. I'm not sure what a good test for this would be: there are many types, as well as edge cases that are easier to discover with real-world datasets. Many of the unit tests use the reader/writer implicitly, which is largely how I've been testing the ORC reader/writer and catching bugs.
Open to suggestions on how to test AvroOrcUtils. :)
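For reference, a hypothetical schema string exercising the types listed above (field names are illustrative, not the PR's test schema):

TypeDescription orcSchema = TypeDescription.fromString("struct<"
    + "dateField:date,"
    + "decimalField:decimal(10,6),"
    + "longField:bigint,"
    + "intField:int,"
    + "bytesField:binary,"
    + "arrayField:array<string>,"
    + "recordField:struct<amount:double,currency:string>,"
    + "mapField:map<string,string>,"
    + "stringField:string,"
    + "floatField:float,"
    + "doubleField:double>");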
@TeRS-K Left some high-level comments.
Is it not yet possible to sync a Hudi ORC-format table to the Hive metastore?
}

@Override
public void writeAvroWithMetadata(R avroRecord, HoodieRecord record) throws IOException {
Can this function be moved to the interface (default implementation in interface)? It does not look specific to ORC.
I moved part of this function to the interface since there are still some differences per file format.
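A rough sketch of that split: the shared metadata path lives as a default method on the interface, while each format implements its own pieces. The interface name and prepRecordWithMetadata are hypothetical, not the PR's final code:

import java.io.IOException;

import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.model.HoodieRecord;

public interface HoodieFileWriterSketch<R extends IndexedRecord> {

  // Format-specific write path (Parquet, ORC, ...).
  void writeAvro(String recordKey, R record) throws IOException;

  // Hypothetical per-format hook: stamp Hudi metadata fields into the record.
  R prepRecordWithMetadata(R avroRecord, HoodieRecord record);

  // Shared, format-agnostic logic: prepare metadata, then delegate to the
  // format-specific writer.
  default void writeAvroWithMetadata(R avroRecord, HoodieRecord record) throws IOException {
    R prepped = prepRecordWithMetadata(avroRecord, record);
    writeAvro(record.getRecordKey(), prepped);
  }
}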
));
}

final long millis = time % 1000;
This does not look correct. Time is in milliseconds, so time % 1000 is not microseconds.
Right, it seems like we can remove this line and a couple of lines below it, as I don't see why millis should be accounted for in the nanos portion.
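For context, one unambiguous way to populate ORC's TimestampColumnVector from epoch milliseconds is to let java.sql.Timestamp do the millis/nanos bookkeeping, which sidesteps the double-counting concern above. The surrounding method is a hypothetical sketch:

import java.sql.Timestamp;

import org.apache.hadoop.hive.ql.exec.vector.TimestampColumnVector;

public class TimestampVectorExample {
  // new Timestamp(timeMillis) keeps the whole value in getTime() and exposes
  // the sub-second part through getNanos(), so vector.set(...) fills the
  // time[] and nanos[] arrays consistently.
  static void setFromEpochMillis(TimestampColumnVector vector, int row, long timeMillis) {
    vector.set(row, new Timestamp(timeMillis));
  }
}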
ORC is well integrated with Hive, so Hive already has OrcInputFormat, OrcOutputFormat, etc. With my latest change to the HoodieInputFormatUtils class, I was able to sync a Hudi ORC-format table to the Hive metastore (tested with DeltaStreamer).
Codecov Report
@@ Coverage Diff @@
## master #2793 +/- ##
============================================
- Coverage 52.54% 51.55% -1.00%
- Complexity 3707 3729 +22
============================================
Files 485 489 +4
Lines 23171 23753 +582
Branches 2459 2554 +95
============================================
+ Hits 12176 12246 +70
- Misses 9923 10427 +504
- Partials 1072 1080 +8
Flags with carried forward coverage won't be shown.
Codecov Report
@@ Coverage Diff @@
## master #2793 +/- ##
=============================================
+ Coverage 52.54% 69.68% +17.14%
+ Complexity 3707 373 -3334
=============================================
Files 485 54 -431
Lines 23171 1996 -21175
Branches 2459 236 -2223
=============================================
- Hits 12176 1391 -10785
+ Misses 9923 473 -9450
+ Partials 1072 132 -940
Flags with carried forward coverage won't be shown.
Closing this in favor of #2999.
What is the purpose of the pull request
This pull request adds support for ORC storage in Hudi.
Brief change log
In two separate commits:
(no-hive is needed because spark-sql uses the no-hive version of ORC, which makes Spark integration easier)
Verify this pull request
For all tests to pass with ORC as the base file format:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.