Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-577 row-level filtering #475

Merged
merged 4 commits into from May 26, 2020
Merged

ORC-577 row-level filtering #475

merged 4 commits into from May 26, 2020

Conversation

pgaref
Copy link
Contributor

@pgaref pgaref commented Jan 30, 2020

To support row-level filtering functionality as part of the ORC Reader this PR adds an new Reader.option as:

Options setFilter(String columnName, Consumer<VectorizedRowBatch> filter)

The idea is to use a generic Consumer callback that can implement any kind of filtering logic that is completely independent of the rest of the row reading logic in ORC. As a result the we cut down on the total code dependency between ORC and the consumer frameworks.

 

The filter callback with have to set the selected and selectedSize values (that already exist) in the VectorizedRowBatch class. For instance the filter-example below will filter-out all the rows except the first one:

public static void intFirstRowFilter(VectorizedRowBatch batch) { LongColumnVector col1 = (LongColumnVector) batch.cols[0]; int newSize = 0; for (int row = 0; row <1024; ++row) { // Pass ony Valid key if (col1.vector[row] == 0) batch.selected[newSize++] = row; batch.selectedInUse = true; } batch.size = newSize;

The logic of the row-level filter is as follows TreeReader Logic: 

  1. First, read the give columnName(s)
  2. Evaluate the filter callback – filling the selected array and setting the selectedSize value as part of the VectorBatch
  3. For the remaining columns of the VectorBach read only the selected rows to reduce the number or read/decoded rows thus saving CPU cycles

java/core/src/java/org/apache/orc/Reader.java Outdated Show resolved Hide resolved
java/bench/core/pom.xml Outdated Show resolved Hide resolved
java/core/src/java/org/apache/orc/Reader.java Outdated Show resolved Hide resolved
@omalley
Copy link
Contributor

omalley commented Apr 17, 2020

Ok, I've updated this patch to go with my proposed changes with HIVE-23215 and pushed it to https://github.com/omalley/orc/tree/orc-577-v2 .

@pgaref
Copy link
Contributor Author

pgaref commented Apr 17, 2020

Ok, I've updated this patch to go with my proposed changes with HIVE-23215 and pushed it to https://github.com/omalley/orc/tree/orc-577-v2 .

Awesome, thanks @omalley ! Patch look good -- only comment is that I used a single instance of IOUtils per Reader on purpose to avoid false-sharing, for example when having multiple readers on the same machine-- so I think it makes sense to keep it

Other than that, does it make sense to create an ORC branch for the effort? or just port the changes here? I am pretty sure we will have to make more changes before merging

@omalley
Copy link
Contributor

omalley commented Apr 17, 2020

All of the calls were static methods on IOUtils, so it seemed better to remove them unless there is a huge performance impact.

@pgaref
Copy link
Contributor Author

pgaref commented Apr 20, 2020

All of the calls were static methods on IOUtils, so it seemed better to remove them unless there is a huge performance impact.

Let me dig deeper on this and I will create a follow-up depending on the findings.
Just updated the PR addressing the comments including the proposed changes plus some cleaning -- please let me know what you think.

Cheers

Panos Garefalakis added 2 commits May 17, 2020 18:26
Using recently released storage-api 2.7.2

Change-Id: I2de933944dbdf92c4a98ae4528fa2a1d22342071
Change-Id: I1a070d3284d29faf440c43dc56f46b0efddd5fe2
@pgaref
Copy link
Contributor Author

pgaref commented May 17, 2020

Consolidating ORC-577 work and using recently released storage-api 2.7.2 that introduced FilterContext as part of VRB

Change-Id: I039e0a87f981b3ceb690ed50f1bca0653d9e7f28
Change-Id: Ia5cbabdfd4acd0eff03955d564340346e077d561
@jcamachor jcamachor merged commit 6fac967 into apache:master May 26, 2020
jcamachor pushed a commit that referenced this pull request May 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants