Re-implement ORCRecordReader #5267
Conversation
Jackie-Jiang
commented
Apr 17, 2020
- Support batch read
- Support most value types
- Support reading only the required fields
- Enhance tests to cover most value types
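The "read only the required fields" bullet maps to ORC's column-projection mechanism, where the reader is given a boolean include-mask over the file schema so unneeded columns are never decoded. A minimal plain-Java sketch of building such a mask (the `IncludeMask` class and `includeMask` helper are hypothetical illustrations, not the actual `ORCRecordReader` code):

```java
import java.util.*;

public class IncludeMask {
    // Hypothetical helper: build a boolean include-mask over the file's fields,
    // mirroring how an ORC reader selects only the columns a caller requires.
    static boolean[] includeMask(List<String> fileFields, Set<String> requiredFields) {
        boolean[] include = new boolean[fileFields.size()];
        for (int i = 0; i < fileFields.size(); i++) {
            include[i] = requiredFields.contains(fileFields.get(i));
        }
        return include;
    }

    public static void main(String[] args) {
        boolean[] mask = includeMask(
                Arrays.asList("userId", "name", "timestamp"),
                new HashSet<>(Arrays.asList("userId", "timestamp")));
        System.out.println(Arrays.toString(mask)); // [true, false, true]
    }
}
```

Columns whose mask entry is false can then be skipped entirely during decoding, which is also what makes the later fix ("ignore extra fields") natural: unselected columns should not be validated either.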
```java
@@ -95,25 +171,151 @@ public GenericRow next()
  @Override
  public GenericRow next(GenericRow reuse)
      throws IOException {
    _recordReader.nextBatch(_reusableVectorizedRowBatch);
    return _recordExtractor.extract(_reusableVectorizedRowBatch, reuse);
    int numFields = _orcFields.size();
```
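The diff hints at the batch-read pattern: `next()` serves rows out of a reusable batch and refills it from the underlying file only when the batch is exhausted. A self-contained sketch of that cursor logic, without the ORC dependency (`SimpleBatchReader` is a hypothetical stand-in for the record reader, and the iterator of `int[]` stands in for successive `nextBatch()` calls):

```java
import java.util.*;

public class SimpleBatchReader {
    private final Iterator<int[]> _batches;  // stand-in for ORC's nextBatch() source
    private int[] _currentBatch = new int[0];
    private int _nextIndex = 0;

    SimpleBatchReader(List<int[]> batches) {
        _batches = batches.iterator();
    }

    boolean hasNext() {
        return _nextIndex < _currentBatch.length || _batches.hasNext();
    }

    int next() {
        if (_nextIndex == _currentBatch.length) {  // batch exhausted: refill it
            _currentBatch = _batches.next();
            _nextIndex = 0;
        }
        return _currentBatch[_nextIndex++];
    }

    public static void main(String[] args) {
        SimpleBatchReader reader = new SimpleBatchReader(
                Arrays.asList(new int[]{1, 2}, new int[]{3}));
        StringBuilder out = new StringBuilder();
        while (reader.hasNext()) {
            out.append(reader.next());
        }
        System.out.println(out); // 123
    }
}
```

The batch object is reused across refills in the real reader, so the per-row cost is just an index increment plus the row extraction.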
Hive's ORC reader has a zero-copy option to boost performance when reading from HDFS -- https://issues.apache.org/jira/browse/HIVE-6347
Secondly, I wonder if we can do something about making this vectorized. Probably not, because we have to create a GenericRow and return it for every next() call anyway. In general, though, it is possible to read a single column vector at a time from the VectorizedRowBatch and avoid the repeated dynamic dispatch in the loop.
Say we do vectorized reads from each column vector of a single VectorizedRowBatch and then pick values from each cell to create a batch of GenericRows: would performance be any different? In that case, the next() API semantics could be changed to be bulk-based.
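The column-at-a-time idea in this comment can be sketched without the real ORC API. Below, plain Java arrays stand in for the typed column vectors of a VectorizedRowBatch, and maps stand in for GenericRow; the `ColumnarExtract` class and `toRows` method are illustrative names, not Pinot or ORC code. The point is that the type dispatch happens once per column rather than once per cell:

```java
import java.util.*;

public class ColumnarExtract {
    // Stand-in for extracting a whole VectorizedRowBatch into a batch of rows.
    // 'columns' holds one primitive/object array per column (column-major layout).
    static List<Map<String, Object>> toRows(String[] names, Object[] columns, int size) {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int r = 0; r < size; r++) {
            rows.add(new HashMap<>());
        }
        // Column-major iteration: one type check per column, then a tight
        // per-cell loop, instead of dispatching on the type for every cell.
        for (int c = 0; c < columns.length; c++) {
            if (columns[c] instanceof long[]) {
                long[] col = (long[]) columns[c];
                for (int r = 0; r < size; r++) rows.get(r).put(names[c], col[r]);
            } else if (columns[c] instanceof String[]) {
                String[] col = (String[]) columns[c];
                for (int r = 0; r < size; r++) rows.get(r).put(names[c], col[r]);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = toRows(
                new String[]{"id", "name"},
                new Object[]{new long[]{1, 2}, new String[]{"a", "b"}}, 2);
        System.out.println(rows.get(1).get("name")); // b
    }
}
```

As the comment notes, a bulk-returning API would be needed to expose this batch of rows to callers; a row-at-a-time next() still forces per-row materialization.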
The fact that we have to immediately materialize the data we read into rows loses any benefit we could have gotten from vectorization. Maybe if we did this in a bulk manner, it could change things.
This is not something worth optimizing - unless we can save hours. This code is not used in the query path.
Fix ORC Record reader to ignore extra fields. Fixes an issue introduced in PR apache#5267: we should not be validating the type of fields that we don't care about. Also cleaned up the messages and exceptions thrown so that we know which field is the problematic one.
Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>