PARQUET-1404: [C++] Adding Page level Indexes #6807

a2un · 2020-04-02T07:18:21Z

Implementation as per the doc in PARQUET-922:
https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#

…-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp

github-actions · 2020-04-02T07:31:44Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

wesm · 2020-04-02T17:08:41Z

cpp/src/parquet/metadata.h

@@ -24,10 +24,15 @@
 #include <vector>

 #include "arrow/util/key_value_metadata.h"
-
+#include <boost/any.hpp>


Please don't add boost includes to any public headers

I am not using boost. I think the final commit doesn't have that header; if it's there, I will remove it.

wesm

I'm still having a hard time understanding the public APIs that are proposed. Let's focus for the moment on the header files and documenting how the new APIs work / what their arguments mean

wesm · 2020-04-02T17:09:33Z

cpp/examples/parquet/low-level-api/run-imapala-queries.sh

@@ -0,0 +1,122 @@
+source $IMPALA_HOME/bin/impala-config.sh


I don't think this file should be here?

Yes, I will remove it after completing my tests.

It is not part of the final PR.

wesm · 2020-04-02T17:10:09Z

cpp/src/parquet/column_writer.h

+  // Write a batch of repetition levels, definition levels, and values to the
+  // column.
+  virtual void WriteBatchWithIndex(int64_t num_values, const int16_t* def_levels,
+                          const int16_t* rep_levels, const T* values) = 0;


What does this API do differently from WriteBatch?

It writes a batch like WriteBatch, but also adds the Column Index and Offset Index to the end of the Parquet file. I will update the documentation.
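
For illustration, a minimal sketch of how a writer might collect per-page statistics while values are appended, so that an index can be emitted once the column is finished. `PageIndexBuilder`, `PageSummary`, and the fixed rows-per-page rule are hypothetical and are not taken from this patch; real Parquet writers close pages by size and store min/max as encoded bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-page entry combining what the Parquet format keeps in the
// column index (min/max) and the offset index (first row of the page).
struct PageSummary {
  int64_t first_row_index;
  int64_t min_value;
  int64_t max_value;
};

// Hypothetical helper: tracks min/max per page as batches are appended.
class PageIndexBuilder {
 public:
  explicit PageIndexBuilder(int64_t rows_per_page) : rows_per_page_(rows_per_page) {
    assert(rows_per_page > 0);
  }

  // Record a batch of values, closing a page every rows_per_page_ rows.
  void Append(const int64_t* values, int64_t num_values) {
    for (int64_t i = 0; i < num_values; ++i) {
      if (rows_in_page_ == 0) {
        current_ = {next_row_, values[i], values[i]};  // first value of a new page
      } else {
        current_.min_value = std::min(current_.min_value, values[i]);
        current_.max_value = std::max(current_.max_value, values[i]);
      }
      ++next_row_;
      if (++rows_in_page_ == rows_per_page_) ClosePage();
    }
  }

  // Close a trailing partial page, if any, and return all page summaries.
  const std::vector<PageSummary>& Finish() {
    if (rows_in_page_ > 0) ClosePage();
    return pages_;
  }

 private:
  void ClosePage() {
    pages_.push_back(current_);
    rows_in_page_ = 0;
  }

  int64_t rows_per_page_;
  int64_t rows_in_page_ = 0;
  int64_t next_row_ = 0;
  PageSummary current_{0, 0, 0};
  std::vector<PageSummary> pages_;
};
```

A page may span several batches, so the builder keeps its state between `Append` calls rather than summarizing each batch on its own.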

wesm · 2020-04-02T17:11:10Z

cpp/src/parquet/file_reader.h

@@ -43,6 +43,7 @@ class PARQUET_EXPORT RowGroupReader {
  struct Contents {
    virtual ~Contents() {}
    virtual std::unique_ptr<PageReader> GetColumnPageReader(int i) = 0;
+    virtual std::unique_ptr<PageReader> GetColumnPageReaderWithIndex(int i,void* predicate, int64_t& min_index, int predicate_Col, int64_t& row_index,Type::type type_num) = 0;


Add a comment explaining what this function does and what the arguments mean. The opaque void* predicate needs to be explained, for example

Note we don't use mutable references for arguments in this project

I added this function to use a predicate when scanning the column indices, so that the reader can skip to the page that could contain the value.

predicate is the value that gets pushed down from the query engine. It is compared against the column indices to detect the page that could potentially contain it. I used void* to support different Parquet data types.
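
The page lookup described here can be sketched roughly as a binary search over per-page min/max values. This is only an illustration under stated assumptions: `PageStats` and `FindCandidatePage` are hypothetical names, the values are plain `int64_t`, and the pages are assumed to be sorted ascending with non-overlapping ranges. It is not the code from this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for one page's entry in a column index.
struct PageStats {
  int64_t min_value;
  int64_t max_value;
};

// Returns the index of the only page whose [min, max] range can contain
// `value`, or -1 if no page can. Assumes pages are sorted ascending by value.
int64_t FindCandidatePage(const std::vector<PageStats>& pages, int64_t value) {
  assert(std::is_sorted(pages.begin(), pages.end(),
                        [](const PageStats& a, const PageStats& b) {
                          return a.max_value < b.max_value;
                        }));
  // First page whose max is not below the value.
  auto it = std::lower_bound(
      pages.begin(), pages.end(), value,
      [](const PageStats& page, int64_t v) { return page.max_value < v; });
  // The value may also fall into a gap between two pages.
  if (it == pages.end() || it->min_value > value) return -1;
  return static_cast<int64_t>(it - pages.begin());
}
```

Returning the page index instead of writing through a reference argument keeps the sketch in line with the review note above about mutable references.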

wesm · 2020-04-02T17:11:34Z

cpp/src/parquet/file_reader.h

@@ -56,8 +57,12 @@ class PARQUET_EXPORT RowGroupReader {
  // column. Ownership is shared with the RowGroupReader.
  std::shared_ptr<ColumnReader> Column(int i);

+  std::shared_ptr<ColumnReader> ColumnWithIndex(int i,void* predicate, int64_t& min_index, int predicate_col, int64_t& row_index,Type::type type_num);


What does this method do, and what do the arguments mean? What is void* predicate?

predicate is the value that gets pushed down from the query engine. It is compared against the column indices to detect the page that could potentially contain it. I used void* to support different Parquet data types.
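
To make the `void*` plus type tag idea concrete, here is a small sketch of how an untyped predicate might be interpreted according to a physical type before it is compared with a page's min/max. `PhysicalType`, `InRange`, and `PageMayContain` are hypothetical stand-ins (the real tag in the proposed signature is `Type::type`), and only three numeric types are covered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Type::type argument in the proposed API.
enum class PhysicalType { INT32, INT64, DOUBLE };

// Interprets all three pointers as T and checks min <= predicate <= max.
template <typename T>
bool InRange(const void* predicate, const void* page_min, const void* page_max) {
  const T v = *static_cast<const T*>(predicate);
  return *static_cast<const T*>(page_min) <= v &&
         v <= *static_cast<const T*>(page_max);
}

// Dispatches on the type tag; the caller must pass pointers to values of the
// matching type, which the compiler cannot verify.
bool PageMayContain(PhysicalType type, const void* predicate,
                    const void* page_min, const void* page_max) {
  assert(predicate != nullptr && page_min != nullptr && page_max != nullptr);
  switch (type) {
    case PhysicalType::INT32:
      return InRange<int32_t>(predicate, page_min, page_max);
    case PhysicalType::INT64:
      return InRange<int64_t>(predicate, page_min, page_max);
    case PhysicalType::DOUBLE:
      return InRange<double>(predicate, page_min, page_max);
  }
  return false;
}
```

The sketch also shows the cost of this design: a mismatch between the tag and the pointed-to value is undefined behavior, which is one reason an opaque `void*` argument needs careful documentation.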

wesm · 2020-04-02T17:11:39Z

cpp/src/parquet/file_reader.h

  std::unique_ptr<PageReader> GetColumnPageReader(int i);

+  std::unique_ptr<PageReader> GetColumnPageReaderWithIndex(int column_index, void* predicate, int64_t& min_index , int predicate_col, int64_t& row_index,Type::type type_num);


Same question

predicate is the value that gets pushed down from the query engine. It is compared against the column indices to detect the page that could potentially contain it. I used void* to support different Parquet data types.
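
As a rough illustration of what arguments such as `min_index` and `row_index` could carry, the sketch below maps a row number to a page using offset-index style entries and returns results through pointer out-parameters rather than mutable references. The `PageLocation` fields follow the Parquet format's struct of the same name; `LocateRow` and its parameters are hypothetical and do not come from this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Fields mirror PageLocation in the Parquet format's offset index.
struct PageLocation {
  int64_t offset;                // file offset of the page
  int32_t compressed_page_size;  // size of the page in bytes
  int64_t first_row_index;       // row number of the page's first row
};

// Finds the page that holds `row`. On success, writes the page's position
// to *page_index and the row's position inside that page to *row_in_page.
// Returns false if `row` precedes the first page. The total row count is
// not known here, so rows past the end map to the last page.
bool LocateRow(const std::vector<PageLocation>& pages, int64_t row,
               int64_t* page_index, int64_t* row_in_page) {
  assert(page_index != nullptr && row_in_page != nullptr);
  // First page that starts after `row`; the page before it holds the row.
  auto it = std::upper_bound(
      pages.begin(), pages.end(), row,
      [](int64_t r, const PageLocation& p) { return r < p.first_row_index; });
  if (it == pages.begin()) return false;
  --it;
  *page_index = static_cast<int64_t>(it - pages.begin());
  *row_in_page = row - it->first_row_index;
  return true;
}
```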

github-actions · 2020-04-02T17:16:43Z

https://issues.apache.org/jira/browse/PARQUET-1401

ggershinsky · 2020-04-06T13:56:11Z

github-actions bot commented 4 days ago
https://issues.apache.org/jira/browse/PARQUET-1401

I'm curious why. This sends me all comments etc.

ggershinsky · 2020-04-06T13:59:14Z

This pull request seems to lack encryption of column indexes and column offsets. For an example of how it is done in Java, see apache/parquet-mr#776

wesm · 2020-04-06T22:05:52Z

I'm curious why. This sends me all comments etc.

This is to help contributors who want to navigate easily back to the JIRA issue

ggershinsky · 2020-04-07T05:27:09Z

I'm curious why. This sends me all comments etc.

This is to help contributors who want to navigate easily back to the JIRA issue

Looks like this pull request is named after the wrong JIRA: https://issues.apache.org/jira/browse/PARQUET-1401 is about "RowGroup offset and total compressed size fields". Given the source branch, it should have been https://issues.apache.org/jira/browse/PARQUET-1404.

wesm · 2020-04-07T20:23:37Z

The GitHub action uses the GitHub PR title to match with the JIRA. I just fixed the title to be PARQUET-1404.

wesm · 2020-04-07T20:25:31Z

@a2un it seems this patch is still very much WIP, would it be alright to close the PR and you can re-open when you have something closer to review/merge-ready for us to review? CI builds will still run on your fork

github-actions · 2020-04-07T20:31:46Z

https://issues.apache.org/jira/browse/PARQUET-1404

a2un · 2020-04-07T23:46:53Z

@a2un it seems this patch is still very much WIP, would it be alright to close the PR and you can re-open when you have something closer to review/merge-ready for us to review? CI builds will still run on your fork

Okay, I just wanted a place to hold discussion and ask you questions on the project

wesm · 2020-04-08T03:32:38Z

We can discuss more on the mailing list -- now that I know where the branch is, if you want me to take another look at the headers per the comments above let me know

malthe · 2020-10-02T11:35:25Z

I added an issue https://issues.apache.org/jira/browse/ARROW-10158.

Arun Balajieee and others added 30 commits August 9, 2019 15:20

parquet reader changes

56d11dc

Merge branch 'master' of https://github.com/apache/arrow into PARQUET…

05e10ad

…-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp

parquet metadata

4c2c1dc

deserialize

7b5fd9f

calls

8412b9a

return type of read

0131ce1

populate index

c241f56

changes

8f584c5

offset changes

5aa4a9b

changes

95c47c1

print rows of the page;first value is the match value

650ab19

if cond

f7ffb74

can print page by comparing with page min

6b5081c

lower bound

4ae3e60

setup for file offset

10d3ed0

default setup

c0a9bb1

default setup

90a158f

no index

dcdaf94

added one page check

b2788eb

low level api

7c8477c

low level api

f52c54b

low level api

5102ffd

low level api

cf4c147

low level api

9dd006f

changed binary search

e33dd0d

added for 2 columns

77403b1

2-3 page handler; single col

55f0a6e

sorting columns

15c0676

generic reader

c0ee60a

sorting columns

5f0c779

a2un added 10 commits March 2, 2020 03:39

fixed bsearch;order by parquet file;skip pages

03bef46

order by needs limit if older impala

c4f99d2

tests

272f118

experiments

20660f0

use binary search

77931bb

string

22d3f67

make filesize 1 gig

e9e9b6a

print range

65ad001

range

bdcb004

ranges

9b6d6f1

wesm changed the title ~~Parquet-1404 Adding Page level Indexes~~ PARQUET-1401: [C++] Adding Page level Indexes Apr 2, 2020

wesm reviewed Apr 2, 2020

a2un added 3 commits April 5, 2020 03:29

time

f0823e9

time

1febdf9

time

61cb4a7

wesm changed the title ~~PARQUET-1401: [C++] Adding Page level Indexes~~ PARQUET-1404: [C++] Adding Page level Indexes Apr 7, 2020

a2un closed this Apr 7, 2020


