Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1404: [C++] Adding Page level Indexes #6807

Conversation

a2un
Copy link

@a2un a2un commented Apr 2, 2020

@github-actions
Copy link

github-actions bot commented Apr 2, 2020

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@wesm wesm changed the title Parquet-1404 Adding Page level Indexes PARQUET-1401: [C++] Adding Page level Indexes Apr 2, 2020
@@ -24,10 +24,15 @@
#include <vector>

#include "arrow/util/key_value_metadata.h"

#include <boost/any.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't add boost includes to any public headers

Copy link
Author

@a2un a2un Apr 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not using boost, I think the final commit doesn't have that header, if its there I will remote it

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still having a hard time understanding the public APIs that are proposed. Let's focus for the moment on the header files and documenting how the new APIs work / what their arguments mean

@@ -0,0 +1,122 @@
source $IMPALA_HOME/bin/impala-config.sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this file should be here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will be removing after completing my tests

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not part of the file PR

// Write a batch of repetition levels, definition levels, and values to the
// column.
virtual void WriteBatchWithIndex(int64_t num_values, const int16_t* def_levels,
const int16_t* rep_levels, const T* values) = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this API do differently from WriteBatch?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to WriteBatch, but adding Column Index and Offset Index to the end of the of Parquet file, I will be updating the documentation

@@ -43,6 +43,7 @@ class PARQUET_EXPORT RowGroupReader {
struct Contents {
virtual ~Contents() {}
virtual std::unique_ptr<PageReader> GetColumnPageReader(int i) = 0;
virtual std::unique_ptr<PageReader> GetColumnPageReaderWithIndex(int i,void* predicate, int64_t& min_index, int predicate_Col, int64_t& row_index,Type::type type_num) = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining what this function does and what the arguments mean. The opaque void* predicate needs to be explained, for example

Note we don't use mutable references for arguments in this project

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to add this function to utilize a predicate in scanning the column indices for skipping to the page that could contain the value

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

predicate is the value that gets pushed down from the query engine, to be used to compared with column indices, and detect the page that could potentially contain the value of the predicate, I used void* for support for different parquet data types

@@ -56,8 +57,12 @@ class PARQUET_EXPORT RowGroupReader {
// column. Ownership is shared with the RowGroupReader.
std::shared_ptr<ColumnReader> Column(int i);

std::shared_ptr<ColumnReader> ColumnWithIndex(int i,void* predicate, int64_t& min_index, int predicate_col, int64_t& row_index,Type::type type_num);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this method do, and what do the arguments mean? What is void* predicate?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

predicate is the value that gets pushed down from the query engine, to be used to compared with column indices, and detect the page that could potentially contain the value of the predicate, I used void* for support for different parquet data types

std::unique_ptr<PageReader> GetColumnPageReader(int i);

std::unique_ptr<PageReader> GetColumnPageReaderWithIndex(int column_index, void* predicate, int64_t& min_index , int predicate_col, int64_t& row_index,Type::type type_num);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

predicate is the value that gets pushed down from the query engine, to be used to compared with column indices, and detect the page that could potentially contain the value of the predicate, I used void* for support for different parquet data types

@github-actions
Copy link

github-actions bot commented Apr 2, 2020

@ggershinsky
Copy link
Contributor

ggershinsky commented Apr 6, 2020

github-actions bot commented 4 days ago
https://issues.apache.org/jira/browse/PARQUET-1401

I'm curious why. This sends me all comments etc.

@ggershinsky
Copy link
Contributor

This pull request seems to lack encryption of column indexes and column offsets. For an example of how it is done in Java, see apache/parquet-mr#776

@wesm
Copy link
Member

wesm commented Apr 6, 2020

I'm curious why. This sends me all comments etc.

This is to help contributors who want to navigate easily back to the JIRA issue

@ggershinsky
Copy link
Contributor

I'm curious why. This sends me all comments etc.

This is to help contributors who want to navigate easily back to the JIRA issue

Looks like this pull requests is named after a wrong JIRA, https://issues.apache.org/jira/browse/PARQUET-1401 is about "RowGroup offset and total compressed size fields". Given the source branch, it should have been https://issues.apache.org/jira/browse/PARQUET-1404.

@wesm wesm changed the title PARQUET-1401: [C++] Adding Page level Indexes PARQUET-1404: [C++] Adding Page level Indexes Apr 7, 2020
@wesm
Copy link
Member

wesm commented Apr 7, 2020

The GitHub action uses the GitHub PR title to match with the JIRA, I just fixed the title to be PARQUET-1404

@wesm
Copy link
Member

wesm commented Apr 7, 2020

@a2un it seems this patch is still very much WIP, would it be alright to close the PR and you can re-open when you have something closer to review/merge-ready for us to review? CI builds will still run on your fork

@github-actions
Copy link

github-actions bot commented Apr 7, 2020

@a2un
Copy link
Author

a2un commented Apr 7, 2020

@a2un it seems this patch is still very much WIP, would it be alright to close the PR and you can re-open when you have something closer to review/merge-ready for us to review? CI builds will still run on your fork

Okay, I just wanted a place to hold discussion and ask you questions on the project

@a2un a2un closed this Apr 7, 2020
@wesm
Copy link
Member

wesm commented Apr 8, 2020

We can discuss more on the mailing list -- now that I know where the branch is, if you want me to take another look at the headers per the comments above let me know

@malthe
Copy link

malthe commented Oct 2, 2020

I added an issue https://issues.apache.org/jira/browse/ARROW-10158.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants