New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1404: [C++] Adding Page level Indexes #6807
PARQUET-1404: [C++] Adding Page level Indexes #6807
Conversation
…-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
See also: |
@@ -24,10 +24,15 @@ | |||
#include <vector> | |||
|
|||
#include "arrow/util/key_value_metadata.h" | |||
|
|||
#include <boost/any.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't add boost includes to any public headers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not using boost, I think the final commit doesn't have that header, if its there I will remote it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still having a hard time understanding the public APIs that are proposed. Let's focus for the moment on the header files and documenting how the new APIs work / what their arguments mean
@@ -0,0 +1,122 @@ | |||
source $IMPALA_HOME/bin/impala-config.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this file should be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will be removing after completing my tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not part of the file PR
// Write a batch of repetition levels, definition levels, and values to the | ||
// column. | ||
virtual void WriteBatchWithIndex(int64_t num_values, const int16_t* def_levels, | ||
const int16_t* rep_levels, const T* values) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this API do differently from WriteBatch
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to WriteBatch, but adding Column Index and Offset Index to the end of the of Parquet file, I will be updating the documentation
@@ -43,6 +43,7 @@ class PARQUET_EXPORT RowGroupReader { | |||
struct Contents { | |||
virtual ~Contents() {} | |||
virtual std::unique_ptr<PageReader> GetColumnPageReader(int i) = 0; | |||
virtual std::unique_ptr<PageReader> GetColumnPageReaderWithIndex(int i,void* predicate, int64_t& min_index, int predicate_Col, int64_t& row_index,Type::type type_num) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment explaining what this function does and what the arguments mean. The opaque void* predicate
needs to be explained, for example
Note we don't use mutable references for arguments in this project
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to add this function to utilize a predicate in scanning the column indices for skipping to the page that could contain the value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
predicate is the value that gets pushed down from the query engine, to be used to compared with column indices, and detect the page that could potentially contain the value of the predicate, I used void* for support for different parquet data types
@@ -56,8 +57,12 @@ class PARQUET_EXPORT RowGroupReader { | |||
// column. Ownership is shared with the RowGroupReader. | |||
std::shared_ptr<ColumnReader> Column(int i); | |||
|
|||
std::shared_ptr<ColumnReader> ColumnWithIndex(int i,void* predicate, int64_t& min_index, int predicate_col, int64_t& row_index,Type::type type_num); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this method do, and what do the arguments mean? What is void* predicate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
predicate is the value that gets pushed down from the query engine, to be used to compared with column indices, and detect the page that could potentially contain the value of the predicate, I used void* for support for different parquet data types
std::unique_ptr<PageReader> GetColumnPageReader(int i); | ||
|
||
std::unique_ptr<PageReader> GetColumnPageReaderWithIndex(int column_index, void* predicate, int64_t& min_index , int predicate_col, int64_t& row_index,Type::type type_num); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
predicate is the value that gets pushed down from the query engine, to be used to compared with column indices, and detect the page that could potentially contain the value of the predicate, I used void* for support for different parquet data types
I'm curious why. This sends me all comments etc. |
This pull request seems to lack encryption of column indexes and column offsets. For an example of how it is done in Java, see apache/parquet-mr#776 |
This is to help contributors who want to navigate easily back to the JIRA issue |
Looks like this pull requests is named after a wrong JIRA, https://issues.apache.org/jira/browse/PARQUET-1401 is about "RowGroup offset and total compressed size fields". Given the source branch, it should have been https://issues.apache.org/jira/browse/PARQUET-1404. |
The GitHub action uses the GitHub PR title to match with the JIRA, I just fixed the title to be PARQUET-1404 |
@a2un it seems this patch is still very much WIP, would it be alright to close the PR and you can re-open when you have something closer to review/merge-ready for us to review? CI builds will still run on your fork |
Okay, I just wanted a place to hold discussion and ask you questions on the project |
We can discuss more on the mailing list -- now that I know where the branch is, if you want me to take another look at the headers per the comments above let me know |
I added an issue https://issues.apache.org/jira/browse/ARROW-10158. |
Implementation as per the doc in Parquet-922
https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#