Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-751: [C++] Implement Predicate Pushdown for C++ Reader #476

Closed
wants to merge 53 commits into from

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Feb 4, 2020

  1. Use RowReaderOptions to pass SearchArgument to enable PPD.
  2. Modify RowReaderImpl::startNextStripe to seek to next matched row group
    based on PPD.
  3. RowReaderImpl::next seeks to next matched row group based on PPD result.
  4. RowReaderImpl::seekToRow also jumps to the 1st matched row group after
    target row.

1. Use RowReaderOptions to pass SearchArgument to enable PPD.
2. Modify RowReaderImpl::startNextStripe to seek to next matched row group
   based on PPD.
3. RowReaderImpl::next seeks to next matched row group based on PPD result.
4. RowReaderImpl::seekToRow also jumps to the 1st matched row group after
   target row.
@wgtmac
Copy link
Member Author

wgtmac commented Feb 4, 2020

@stiga-huang Can you help review and verify this patch in Impala? Thanks!

@luksan47
Copy link
Contributor

luksan47 commented Mar 4, 2020

Hi, @wgtmac, I've started to implement predicate pushdown in Impala, will test the PR that way.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 5, 2020

@luksan47 cool! Let me know if you have any feedback.

@csringhofer
Copy link

I reviewed Norbert's implementation for Impala (https://gerrit.cloudera.org/#/c/15403/) and found some problems that are very hard to implement on the client side:
https://issues.apache.org/jira/browse/ORC-611
https://issues.apache.org/jira/browse/ORC-612

The issues are related to types that are generally problematic (TIMESTAMP, CHAR, VARCHAR). It is possible to disable predicate pushdown for this types, but in the long term it would be good to have a proper solution.

Can you look at the issues? I think that it would be much better to implement them in the ORC lib, because it is easy to create buggy implementations in clients.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 17, 2020

@csringhofer Thanks for letting me know! I have also noticed that I need to use lowerbound & upperbound in TimestampColumnStatistics to handle timezone difference. Will take a deeper look into these issues.

// it is guaranteed to be at start of a row group
currentRowInStripe = nextRowToRead;
if (currentRowInStripe < rowsInCurrentStripe) {
seekToRowGroup(static_cast<uint32_t>(currentRowInStripe / footer->rowindexstride()));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that there is a hidden performance regression here:
at the end of the call chain, e.g. ColumnReader::seekToRowGroup(), RleDecoderV2::seek(), ZlibDecompressionStream::seek(), the inputBuffer is reset to nullptr:

inputBuffer = nullptr;

This leads to calling ZlibDecompressionStream::readBuffer() and reading from the stream again, even if the whole stream was buffered previously to inputBuffer (which is usually the case in Impala, as we set the block size to 8 MB, but can also cause problems with the default 256KB).
In the worst case this can lead to reading the whole column O(num_of_row_groups) times.

The solution could be to make ZlibDecompressionStream/BlockDecompressionStream::seek() smarter and keep the inputBuffer if the new position (which is the header byte of the new compression block) is still in it.

A further improvement could be to make ZlibDecompressionStream/BlockDecompressionStream::Skip() smarter to avoid decompressing all the skipped blocks in this case (

// this is a stupid implementation for now.
), but this is not a regression so can go to another patch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Performance of seeking is another headache if you use large compression blocks. In our internal ORC setup, we have optmized this by finishing compression blocks at the end of every row group. In this way, seeking will never read compression blocks of other row groups.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a Jira (ORC-614) for the optimization of seek(), and we put together an implementation with Norbert, he will upload it soon as pull request. It seems to solve the regressions we experienced.

About finishing compression blocks at row group end: I actually created a ticket (IMPALA-8449) to do something similar in Parquet:
In Parquet we came from the other direction - pages are both the unit of indexing and compression in Parquet, but they are not aligned, so unlike ORC, index entries can point to different row numbers for every column. This decreases decreases the efficiency of combining filters for different columns + makes the reader implementation very complex.

luksan47 added a commit to luksan47/orc that referenced this pull request Mar 25, 2020
The current implementation of ZlibDecompressionStream::seek and
BlockDecompressionStream::seek resets the state of the decompressor
and the underlying file reader and throws away their buffers.

This commit introduces two optimizations which rely on reusing
the buffers that still contain useful data, and therefore reducing
the time spent reading/uncompressing the buffers again.

The first case is when the seeked position is already read from
the input stream, but has not been decompressed yet, ie. it's
not in the output stream.

The second case is when the seeked position is already read
and decompressed into the output stream.

Moved the common data of the decompression streams into a common
class.

Tests:
 - Run the ORC tests, and the Impala tests working on ORC tables.
 - The regression that apache#476 would cause is not present anymore.
luksan47 added a commit to luksan47/orc that referenced this pull request Mar 25, 2020
The current implementation of ZlibDecompressionStream::seek and
BlockDecompressionStream::seek resets the state of the decompressor
and the underlying file reader and throws away their buffers.

This commit introduces two optimizations which rely on reusing
the buffers that still contain useful data, and therefore reducing
the time spent reading/uncompressing the buffers again.

The first case is when the seeked position is already read from
the input stream, but has not been decompressed yet, ie. it's
not in the output stream.

The second case is when the seeked position is already read
and decompressed into the output stream.

Moved the common data of the decompression streams into a common
class.

Tests:
 - Run the ORC tests, and the Impala tests working on ORC tables.
 - The regression that apache#476 would cause is not present anymore.
luksan47 added a commit to luksan47/orc that referenced this pull request Mar 25, 2020
The current implementation of ZlibDecompressionStream::seek and
BlockDecompressionStream::seek resets the state of the decompressor
and the underlying file reader and throws away their buffers.

This commit introduces two optimizations which rely on reusing
the buffers that still contain useful data, and therefore reducing
the time spent reading/uncompressing the buffers again.

The first case is when the seeked position is already read from
the input stream, but has not been decompressed yet, ie. it's
not in the output stream.

The second case is when the seeked position is already read
and decompressed into the output stream.

Moved the common data of the decompression streams into a common
class.

Tests:
 - Run the ORC tests, and the Impala tests working on ORC tables.
 - The regression that apache#476 would cause is not present anymore.
luksan47 added a commit to luksan47/orc that referenced this pull request Mar 25, 2020
The current implementation of ZlibDecompressionStream::seek and
BlockDecompressionStream::seek resets the state of the decompressor
and the underlying file reader and throws away their buffers.

This commit introduces two optimizations which rely on reusing
the buffers that still contain useful data, and therefore reducing
the time spent reading/uncompressing the buffers again.

The first case is when the seeked position is already read
and decompressed into the output stream.

The second case is when the seeked position is already read from
the input stream, but has not been decompressed yet, ie. it's
not in the output stream.

Tests:
 - Run the ORC tests, and the Impala tests working on ORC tables.
 - The regression that apache#476 would cause is not present anymore.
1. add stats to reflect ppd effectiveness.
2. disable ppd for char and varchar types.
@wgtmac
Copy link
Member Author

wgtmac commented Sep 8, 2020

@csringhofer @luksan47 @stiga-huang I have changed this PR to reflect the feedback. The fix for timestamp is addressed by a separate PR: #543. Please review it again. Sorry for the long delay.

dongjoon-hyun and others added 12 commits December 10, 2020 15:30
### What changes were proposed in this pull request?

Consistent TypeDescription handling for quoted field names

### Why are the changes needed?

SARGs failing due to incorrect handling of quoted fieldNames

### How was this patch tested?

TestVectorOrcFile.testQuotedPredicatePushdown
### What changes were proposed in this pull request?

This PR updates the scan tool to print information about where the file is corrupted. It
* reads data by batches until there is a problem
* tries re-reading that batch column by column to find which column is corrupted
* figures out the next location that the reader can seek to

### Why are the changes needed?

It helps diagnose where (row & column) an ORC file is corrupted.

### How was this patch tested?

It was tested on ORC files that were corrupted by bad machines.
Signed-off-by: Owen O'Malley <omalley@apache.org>
…messages optional. (apache#583)

### What changes were proposed in this pull request?

- Change the stripe id to be 0 based, which seems more consistent.
- Change the end of file stripe id to be the number of stripes rather than -1.
- Limit the recovery row to the start of the next stripe.
- Add --verbose that prints exceptions.
- Fix the --help to give more information

### How was this patch tested?

Tested by hand on the example files.

Signed-off-by: Owen O'Malley <omalley@apache.org>
### What changes were proposed in this pull request?

Use 64-bit versions of fstat to support files bigger than 2GB in Windows.

### Why are the changes needed?

To support big files.

### How was this patch tested?

Manual
### What changes were proposed in this pull request?

This PR aims to only publish snapshots at apache orc repo by adding an if statement. 

### Why are the changes needed?

To save the resources at the downstream.

### How was this patch tested?

N/A
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

This PR updates links for Trino.

### Why are the changes needed?

PrestoSQL has been renamed to Trino. See: https://trino.io/blog/2020/12/27/announcing-trino.html

### How was this patch tested?
N/A
apache#588)

### What changes were proposed in this pull request?
RecordReaderImp should pass down the writer calendar info (writerUsedProlepticGregorian) when evaluating predicates to make sure column stats are properly deserialized (affects TimestampStatistics)


### Why are the changes needed?
Correct evaluation of predicates with Timestamps


### How was this patch tested?
TestRecordReaderImpl.testPredEvalTimestampStatsDiffWriter
### What changes were proposed in this pull request?
Special ConvertTreeReader for Boolean using StringGroupFromAnyIntegerTreeReader for String/Char/Varchar types

### Why are the changes needed?
Properly handle Boolean to String/Char/Varchar conversions

### How was this patch tested?
TestSchemaEvolution.testBooleanToStringEvolution
findepi and others added 4 commits February 7, 2021 23:00
### What changes were proposed in this pull request?

This PR aims to add a writer version for Trino.

### Why are the changes needed?

Trino writer is different from Presto writer.

### How was this patch tested?

N/A
Fixes apache#630

Signed-off-by: Owen O'Malley <omalley@apache.org>
Signed-off-by: Owen O'Malley <omalley@apache.org>
Fixes apache#640

Signed-off-by: Owen O'Malley <omalley@apache.org>
@wgtmac
Copy link
Member Author

wgtmac commented Feb 16, 2021

@dongjoon-hyun @pgaref This PR has been sleeping for a decade. Really appreciate it if you can help reviewing it once get the chance. Thanks!

Copy link
Contributor

@pgaref pgaref left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @wgtmac thanks for the reminder!
Just went through the code and left some comments -- I believe we should add some more Tests here to increase confidence. Few example include: testPredEvalWithBooleanStats, testPredEvalWithIntStats etc.

public void testPredEvalWithBooleanStats() throws Exception {

On another note, ORC-40 seems to be about building SearchArgument so we should move this to a new ticket -- open ORC-751 for this

c++/include/orc/Reader.hh Outdated Show resolved Hide resolved
c++/src/Reader.cc Show resolved Hide resolved
c++/src/Reader.cc Outdated Show resolved Hide resolved
c++/src/Reader.cc Outdated Show resolved Hide resolved
c++/src/sargs/SargsApplier.cc Show resolved Hide resolved
c++/test/TestPredicatePushdown.cc Show resolved Hide resolved
c++/test/TestPredicatePushdown.cc Show resolved Hide resolved
c++/test/TestPredicatePushdown.cc Show resolved Hide resolved
c++/src/Reader.cc Outdated Show resolved Hide resolved
@wgtmac
Copy link
Member Author

wgtmac commented Feb 17, 2021

Thanks @pgaref for the comments! I will fix them and then let you know.

@dongjoon-hyun
Copy link
Member

Thank you for pinging me, @wgtmac .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't compare the logic with Java reader yet, but I'm wondering if we keep the logic structure consistently in both readers.

omalley and others added 3 commits February 17, 2021 12:40
Fixes apache#642

Signed-off-by: Owen O'Malley <omalley@apache.org>
This enables the analyze profile. As part of the change, it also fixes the
real findbugs warnings and blocks some false positives.

Fixes apache#643

Signed-off-by: Owen O'Malley <omalley@apache.org>
### What changes were proposed in this pull request?
Broken link

### Why are the changes needed?
Fix master build status badge

### How was this patch tested?
Grip locally
autumnust and others added 3 commits February 23, 2021 10:47
Fixes apache#637

Signed-off-by: Owen O'Malley <oomalley@linkedin.com>
* Replaced Charset.forName with StandardCharsets
* Final on some variables
* Removed final on a private method
* Simplified some conditions
* Removed redundant casts
* Removed duplicate condition branches

* Removed the maven compiler config setting in bench
@dongjoon-hyun
Copy link
Member

Since #645 is merged, could you rebase this to the master once more, @wgtmac ?

@wgtmac
Copy link
Member Author

wgtmac commented Mar 1, 2021

Since #645 is merged, could you rebase this to the master once more, @wgtmac ?

@dongjoon-hyun I'm working on it. Will let you know once ready.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 2, 2021

Hey @wgtmac thanks for the reminder!
Just went through the code and left some comments -- I believe we should add some more Tests here to increase confidence. Few example include: testPredEvalWithBooleanStats, testPredEvalWithIntStats etc.

public void testPredEvalWithBooleanStats() throws Exception {

On another note, ORC-40 seems to be about building SearchArgument so we should move this to a new ticket -- open ORC-751 for this

I have covered these kinds of cases in the TestPredicateLeaf.cc in a earlier commit. I can add more progressively to gain more confidence.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Mar 2, 2021

Thank you for updating, @wgtmac . But, it looks a little weird to me~
Now this PR has 53 commits and 775 file changes.

enabled_merge_buttons:
merge: false
squash: true
rebase: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be here.

@wgtmac wgtmac changed the title ORC-40: [C++] Implement Predicate Pushdown for C++ Reader ORC-751: [C++] Implement Predicate Pushdown for C++ Reader Mar 3, 2021
@wgtmac wgtmac closed this Mar 3, 2021
dongjoon-hyun pushed a commit that referenced this pull request May 6, 2021
)

### What changes were proposed in this pull request?

The current implementation of ZlibDecompressionStream::seek and
BlockDecompressionStream::seek resets the state of the decompressor
and the underlying file reader and throws away their buffers.

### Why are the changes needed?

This commit introduces two optimizations which rely on reusing
the buffers that still contain useful data, and therefore reducing
the time spent reading/uncompressing the buffers again.

The first case is when the seeked position is already read
and decompressed into the output stream.

The second case is when the seeked position is already read from
the input stream, but has not been decompressed yet, ie. it's
not in the output stream.

### How was this patch tested?

Tests:
- Run the ORC tests, and the Impala tests working on ORC tables.
- The regression that #476 would cause is not present anymore.
@wgtmac wgtmac deleted the ORC-40 branch February 23, 2022 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet