ORC-577 row-level filtering #475
Ok, I've updated this patch to go with my proposed changes for HIVE-23215 and pushed it to https://github.com/omalley/orc/tree/orc-577-v2.
Awesome, thanks @omalley! The patch looks good. My only comment is that I used a single instance of IOUtils per Reader on purpose to avoid false sharing, for example when multiple readers run on the same machine, so I think it makes sense to keep it. Other than that, does it make sense to create an ORC branch for this effort, or should we just port the changes here? I am pretty sure we will have to make more changes before merging.
All of the calls were static methods on IOUtils, so it seemed better to remove them unless there is a huge performance impact.
Let me dig deeper on this and I will create a follow-up depending on the findings. Cheers
Using recently released storage-api 2.7.2 Change-Id: I2de933944dbdf92c4a98ae4528fa2a1d22342071
Change-Id: I1a070d3284d29faf440c43dc56f46b0efddd5fe2
Consolidating ORC-577 work and using the recently released storage-api 2.7.2, which introduced FilterContext as part of VRB
Change-Id: Ia5cbabdfd4acd0eff03955d564340346e077d561
To support row-level filtering as part of the ORC Reader, this PR adds a new Reader option:
Options setFilter(String columnName, Consumer<VectorizedRowBatch> filter)
The idea is to use a generic Consumer callback that can implement any kind of filtering logic, completely independent of the rest of the row-reading logic in ORC. As a result, we cut down on the code dependency between ORC and the consumer frameworks.
The filter callback has to set the selected, selectedInUse, and size fields (which already exist) in the VectorizedRowBatch class. For instance, the example filter below filters out all rows except the first one:
public static void intFirstRowFilter(VectorizedRowBatch batch) {
  LongColumnVector col1 = (LongColumnVector) batch.cols[0];
  int newSize = 0;
  for (int row = 0; row < batch.size; ++row) {
    // Pass only rows with a valid key
    if (col1.vector[row] == 0) {
      batch.selected[newSize++] = row;
    }
  }
  // Tell downstream readers to go through the selected[] indirection
  batch.selectedInUse = true;
  batch.size = newSize;
}
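To make the callback contract concrete without pulling in ORC or the storage-api, here is a minimal self-contained sketch. The `Batch` class is a hypothetical stand-in for VectorizedRowBatch that mirrors only the `selected`, `selectedInUse`, and `size` fields, and `selectedRows` is an illustrative helper, not part of any ORC API. It shows how a consumer might walk a batch after the filter callback has run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: Batch is a hypothetical stand-in for VectorizedRowBatch
// holding a single long column; it is not the real storage-api class.
final class RowFilterSketch {
  static final class Batch {
    final long[] col0;             // values of the single column
    final int[] selected;          // indexes of rows that passed the filter
    boolean selectedInUse = false; // true once selected[] is meaningful
    int size;                      // number of valid (or selected) rows

    Batch(long[] values) {
      this.col0 = values;
      this.selected = new int[values.length];
      this.size = values.length;
    }
  }

  // Filter callback: keep only the rows whose key equals 0.
  static void keepZeroKeys(Batch batch) {
    int newSize = 0;
    for (int row = 0; row < batch.size; ++row) {
      if (batch.col0[row] == 0) {
        batch.selected[newSize++] = row;
      }
    }
    batch.selectedInUse = true;
    batch.size = newSize;
  }

  // Hypothetical helper: apply the filter, then return the indexes of the
  // rows a consumer would visit, honoring the selected[] indirection.
  static List<Integer> selectedRows(Batch batch, Consumer<Batch> filter) {
    filter.accept(batch);
    List<Integer> rows = new ArrayList<>();
    for (int i = 0; i < batch.size; ++i) {
      rows.add(batch.selectedInUse ? batch.selected[i] : i);
    }
    return rows;
  }

  public static void main(String[] args) {
    Batch batch = new Batch(new long[] {5, 0, 7, 0});
    // Rows 1 and 3 hold the key 0, so this prints [1, 3]
    System.out.println(selectedRows(batch, RowFilterSketch::keepZeroKeys));
  }
}
```

Because the callback only touches `selected`, `selectedInUse`, and `size`, the column data itself is never copied or compacted; readers skip filtered rows purely through the index indirection.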
The logic of the row-level filter is as follows.
TreeReader Logic: