Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-implement ORCRecordReader #5267

Merged
merged 1 commit into from Apr 17, 2020

Conversation

Jackie-Jiang
Copy link
Contributor

  • Support batch read
  • Support most value types
  • Support read only required fields
  • Enhance tests to cover most value types

@@ -95,25 +171,151 @@ public GenericRow next()
@Override
public GenericRow next(GenericRow reuse)
throws IOException {
_recordReader.nextBatch(_reusableVectorizedRowBatch);
return _recordExtractor.extract(_reusableVectorizedRowBatch, reuse);
int numFields = _orcFields.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hive's ORC reader has a zero copy option to boost performance when reading from HDFS -- https://issues.apache.org/jira/browse/HIVE-6347

Secondly, I wonder if we can do something about making this vectorized. Probably not because we anyway have to create a GenericRow and return it for every next() call. However, in general it is possible to read a single column vector at a time from VectorizedRowBatch and avoid the repeated dynamic dispatch in the loop

Say if we do vectorized reads from each column vector from a single VectorizedRowBatch and then pick values from each cell to create a batch of GenericRow, would performance be any different? In that case, the next() API semantics can be changed to bulk based.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we have to materialize the data read immediately into row loses any benefit we could have got from vectorization. May be if we do this in bulk manner, it could change things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not something worth optimizing - unless we can save hours. This code is not used in the query path.

- Support batch read
- Support most value types
- Support read only required fields
- Enhance tests to cover most value types
@Jackie-Jiang Jackie-Jiang merged commit 0622d2c into apache:master Apr 17, 2020
@Jackie-Jiang Jackie-Jiang deleted the orc_record_reader branch April 17, 2020 22:08
mcvsubbu added a commit to mcvsubbu/incubator-pinot that referenced this pull request Jul 1, 2020
Fixing an issue introduced in PR apache#5267
We should not be validating the type of fields that we don't care about.
Cleaned up the messages and exceptions thrown so that we know which
field is the problematic one.
mcvsubbu added a commit that referenced this pull request Jul 13, 2020
* Fix ORC Record reader to ignore extra fields

Fixing an issue introduced in PR #5267
We should not be validating the type of fields that we don't care about.
Cleaned up the messages and exceptions thrown so that we know which
field is the problematic one.

* Update pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java

Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>

Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
mcvsubbu added a commit that referenced this pull request Jul 15, 2020
* Fix ORC Record reader to ignore extra fields

Fixing an issue introduced in PR #5267
We should not be validating the type of fields that we don't care about.
Cleaned up the messages and exceptions thrown so that we know which
field is the problematic one.

* Update pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java

Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>

Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants