Re-implement ORCRecordReader #5267
Conversation
Jackie-Jiang
commented
Apr 17, 2020
- Support batch read
- Support most value types
- Support reading only the required fields
- Enhance tests to cover most value types
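The "read only the required fields" bullet maps to ORC's column-projection mechanism, where the reader is given a boolean include-mask over the file schema so unneeded columns are never decoded. A minimal plain-Java sketch of building such a mask (the `IncludeMask` class and `includeMask` helper are hypothetical illustrations, not the actual `ORCRecordReader` code):

```java
import java.util.*;

public class IncludeMask {
    // Hypothetical helper: build a boolean include-mask over the file's fields,
    // mirroring how an ORC reader selects only the columns a caller requires.
    static boolean[] includeMask(List<String> fileFields, Set<String> requiredFields) {
        boolean[] include = new boolean[fileFields.size()];
        for (int i = 0; i < fileFields.size(); i++) {
            include[i] = requiredFields.contains(fileFields.get(i));
        }
        return include;
    }

    public static void main(String[] args) {
        boolean[] mask = includeMask(
                Arrays.asList("userId", "name", "timestamp"),
                new HashSet<>(Arrays.asList("userId", "timestamp")));
        System.out.println(Arrays.toString(mask)); // [true, false, true]
    }
}
```

Columns whose mask entry is false can then be skipped entirely during decoding, which is also what makes the later fix ("ignore extra fields") natural: unselected columns should not be validated either.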
```java
@@ -95,25 +171,151 @@ public GenericRow next()
  @Override
  public GenericRow next(GenericRow reuse)
      throws IOException {
    _recordReader.nextBatch(_reusableVectorizedRowBatch);
    return _recordExtractor.extract(_reusableVectorizedRowBatch, reuse);
    int numFields = _orcFields.size();
```
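The diff hints at the batch-read pattern: `next()` serves rows out of a reusable batch and refills it from the underlying file only when the batch is exhausted. A self-contained sketch of that cursor logic, without the ORC dependency (`SimpleBatchReader` is a hypothetical stand-in for the record reader, and the iterator of `int[]` stands in for successive `nextBatch()` calls):

```java
import java.util.*;

public class SimpleBatchReader {
    private final Iterator<int[]> _batches;  // stand-in for ORC's nextBatch() source
    private int[] _currentBatch = new int[0];
    private int _nextIndex = 0;

    SimpleBatchReader(List<int[]> batches) {
        _batches = batches.iterator();
    }

    boolean hasNext() {
        return _nextIndex < _currentBatch.length || _batches.hasNext();
    }

    int next() {
        if (_nextIndex == _currentBatch.length) {  // batch exhausted: refill it
            _currentBatch = _batches.next();
            _nextIndex = 0;
        }
        return _currentBatch[_nextIndex++];
    }

    public static void main(String[] args) {
        SimpleBatchReader reader = new SimpleBatchReader(
                Arrays.asList(new int[]{1, 2}, new int[]{3}));
        StringBuilder out = new StringBuilder();
        while (reader.hasNext()) {
            out.append(reader.next());
        }
        System.out.println(out); // 123
    }
}
```

The batch object is reused across refills in the real reader, so the per-row cost is just an index increment plus the row extraction.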
Hive's ORC reader has a zero-copy option to boost performance when reading from HDFS -- https://issues.apache.org/jira/browse/HIVE-6347
Secondly, I wonder if we can do something about making this vectorized. Probably not, because we have to create a GenericRow and return it for every next() call anyway. In general, though, it is possible to read a single column vector at a time from the VectorizedRowBatch and avoid the repeated dynamic dispatch in the loop.
Say we do vectorized reads from each column vector of a single VectorizedRowBatch and then pick values from each cell to create a batch of GenericRows: would performance be any different? In that case, the next() API semantics could be changed to be bulk-based.
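The column-at-a-time idea in this comment can be sketched without the real ORC API. Below, plain Java arrays stand in for the typed column vectors of a VectorizedRowBatch, and maps stand in for GenericRow; the `ColumnarExtract` class and `toRows` method are illustrative names, not Pinot or ORC code. The point is that the type dispatch happens once per column rather than once per cell:

```java
import java.util.*;

public class ColumnarExtract {
    // Stand-in for extracting a whole VectorizedRowBatch into a batch of rows.
    // 'columns' holds one primitive/object array per column (column-major layout).
    static List<Map<String, Object>> toRows(String[] names, Object[] columns, int size) {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int r = 0; r < size; r++) {
            rows.add(new HashMap<>());
        }
        // Column-major iteration: one type check per column, then a tight
        // per-cell loop, instead of dispatching on the type for every cell.
        for (int c = 0; c < columns.length; c++) {
            if (columns[c] instanceof long[]) {
                long[] col = (long[]) columns[c];
                for (int r = 0; r < size; r++) rows.get(r).put(names[c], col[r]);
            } else if (columns[c] instanceof String[]) {
                String[] col = (String[]) columns[c];
                for (int r = 0; r < size; r++) rows.get(r).put(names[c], col[r]);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = toRows(
                new String[]{"id", "name"},
                new Object[]{new long[]{1, 2}, new String[]{"a", "b"}}, 2);
        System.out.println(rows.get(1).get("name")); // b
    }
}
```

As the comment notes, a bulk-returning API would be needed to expose this batch of rows to callers; a row-at-a-time next() still forces per-row materialization.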
The fact that we have to immediately materialize the data we read into rows loses any benefit we could have gotten from vectorization. Maybe if we did this in a bulk manner, it could change things.
This is not something worth optimizing - unless we can save hours. This code is not used in the query path.
Fix ORC Record reader to ignore extra fields. Fixes an issue introduced in PR apache#5267: we should not be validating the type of fields that we don't care about. Also cleaned up the messages and exceptions thrown so that we know which field is the problematic one.
Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>