PARQUET-2310: implementation status #34

Draft · wants to merge 1 commit into base: production
Conversation

@alippai (Author) commented Jun 20, 2023

Moved from: apache/arrow#36027

I can only agree with @westonpace:

> Although...to play devil's advocate...it might be odd when a feature is available in the parquet reader, but not yet exposed in the query component. For example, there is some row skipping and bloom filters in the C++ parquet reader, but we haven't integrated those into the datasets layer yet.

The original goal was not to copy https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features, but something quite orthogonal to it: showing which high-level APIs are available to the end user.

What I mean is that, based on the document, I should be able to tell whether I can manipulate Parquet files without decoding some values, or whether I have to convert them to actual objects / an Apache Arrow table first. This is the information missing from all the docs, and I'd like to maintain something which reflects the public API functionality.

Many of you were surprised by the inclusion of xy metadata; what I was referring to was the availability of https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.metadata or https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.metadata
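As a minimal pyarrow sketch of what this availability means in practice (the path and keys below are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Key-value metadata attached at the field and schema level on the Arrow side
schema = pa.schema(
    [pa.field("x", pa.int64(), metadata={"unit": "seconds"})],
    metadata={"origin": "example"},
)
table = pa.table({"x": [1, 2, 3]}, schema=schema)
pq.write_table(table, "example.parquet")

# Both levels are recoverable after a round trip through Parquet
read_back = pq.read_table("example.parquet")
print(read_back.schema.metadata)             # schema-level key-value metadata
print(read_back.schema.field("x").metadata)  # field-level metadata
```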

Something similar happens with the Column Index. It's available internally, but if it's not used when adding filters to the Parquet reader, then it's useful info for the Apache Arrow and Parquet developers, but not for the library users.

This is a limitation of the pyarrow API, and it might be temporary or by design, but making it transparent was my intention.

It's absolutely OK if parquet-mr doesn't have a dataset API using the Parquet features, and it's OK for pyarrow not to expose the raw objects / byte arrays (ever). Different ecosystems have different abstractions, and there doesn't have to be a "correct" or "standard" one. This is actually a nice thing: DataFusion, pyarrow, Acero, Iceberg etc. all have different communities complementing each other.

Let me know if this is too vague or doesn't belong in the standard docs (or help me achieve a concise summary reflecting the above).


### Physical types

+-------------------------------------------+-------+--------+--------+-------+-------+
Member:

Is this standard Markdown markup for tables? GitHub doesn't seem to recognize it (see what happens here if you ask to "view" the file).

Author (@alippai):

I'll check and convert it to the correct format. It was copy-pasted from arrow-site.

### Physical types

+-------------------------------------------+-------+--------+--------+-------+-------+
| Data type | C++ | Python | Java | Go | Rust |
Member:

I think this should be a bit more specific than simply listing programming languages, as there might be several C++ or Java implementations? Replace them with their actual names.

Also, I don't think Python needs to be included here, since PyArrow is a binding around Arrow C++ (unless this is talking about another implementation, such as fastparquet)?

Author (@alippai), Jun 20, 2023:

Agreed on replacing with actual names.

I wasn't sure whether pyarrow features currently (or historically) match Arrow C++. Likely it can be omitted from some tables. I could imagine e.g. int96/nanos/decimal (fixed array) capabilities or compression options differing from C++, and at the high-level API the difference was already confusing for me.
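For instance, at the pyarrow high-level API these differences show up as write-time options; a rough sketch with placeholder data and path:

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table(
    {"ts": pa.array([datetime.datetime(2023, 6, 20)], type=pa.timestamp("ns"))}
)

# Timestamp representation and compression codec are write-time choices
pq.write_table(
    table,
    "timestamps.parquet",                  # placeholder path
    use_deprecated_int96_timestamps=True,  # store timestamps as the legacy INT96 physical type
    compression="zstd",                    # codec for all columns (a dict allows per-column codecs)
)
```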

Member:

Yeah, PyArrow might not expose everything. But that's more of a PyArrow problem and I'm not sure it's worth exposing here. cc @jorisvandenbossche for opinions.

Member:

Looking at the list here, I think most of the features that C++ supports are also exposed in pyarrow.
Although I don't know if writing a type like UUID (that doesn't exist in Arrow) is possible with the direct Parquet C++ API? (That's not possible through the Arrow-based writer, I suppose, and so also not in pyarrow.)

If we only keep the C++ column in the table, we should also add somewhere a note that this implementation is also exposed through Python, R, GLib, .. bindings.
(another reason to not include Python, because then why not also include R, etc)

Author (@alippai):

Actually, the UUID is a good counter-example indeed :)

Member (@pitrou), Jun 21, 2023:

We don't support writing UUID from Arrow data indeed. But it probably can be done directly through the Parquet C++ APIs (not PyArrow).

+----------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| External column data | | | | | |
Member:

What is this?


Member:

Might want to add explicit notes to non-obvious items so that the user doesn't have to search for an explanation on the Internet :-)

Author (@alippai):

Will do

+----------------------------------------------+-------+--------+--------+-------+-------+
| External column data | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup Sorting column | | | | | |
Member:

Also, what does this mean?


Member:

Hmm, the term "sorting column" is really unfortunate (I find it very confusing: it's not an actual column, just a piece of metadata).

@gszadovszky Thoughts?

+----------------------------------------------+-------+--------+--------+-------+-------+
| Read / Write page metadata and data (2) | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using projection pushdown | | | | | |
Member:

When I think "projection pushdown" I think "column selection". Is that what is being discussed here? I guess I wouldn't associate it with "page pruning" in my mind.

Author (@alippai):

Yes, that's column selection. I think this is the academic term; the Apache Arrow blog also refers to it under the same name: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/#projection-pushdown
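In pyarrow this is simply the columns argument of the reader; a minimal sketch with a placeholder path:

```python
import pyarrow.parquet as pq

# Only the column chunks for "a" and "b" are read and decoded;
# the pages of all other columns are skipped.
table = pq.read_table("data.parquet", columns=["a", "b"])
```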

Member:

It's a bit confusing though, because it's a different kind of "page pruning" from the ones below (row-based pruning using statistics or bloom filters).

It's actually not so much "pruning" as "not reading all the data at once, just the columns the user is interested in". I assume all reasonable Parquet implementations do that already; does it need to be mentioned at all?

Author (@alippai):

When I read the word "pruning" in a DB context I always assume "skipping large chunks", but I 100% agree that using the same term for both can be confusing.
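For contrast, a minimal pyarrow sketch of the row-based kind (placeholder path and predicate); whether this skips at the row-group, page or bloom-filter level is exactly the kind of detail the table should make transparent:

```python
import pyarrow.parquet as pq

# Predicate pushed down to the reader: chunks whose min/max statistics
# cannot satisfy the filter can be skipped without decoding their pages.
table = pq.read_table("data.parquet", filters=[("a", ">", 100)])
```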

Comment on lines +155 to +157
| Hive-style partitioning | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column | | | | | |
Member:

I think these two features could probably be left out. I can see how the others are high-level API features but they are still very much "exposing parquet capabilities". I think hive-style partitioning is completely unrelated to parquet the format though.

Author (@alippai):

The number of systems which support Parquet but not Hive-style partitioning is very limited. As I explained in the intro, my goal was to emphasize the different capabilities and levels of abstraction of the different implementations. If we are strict about what's in the Parquet format and ignore all the high-level capabilities, then e.g. pyarrow matches about 0% of the format, as it can't access, manipulate or create Parquet without the intermediate Arrow format (unlike in Java and others).
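A minimal sketch of both features at the pyarrow dataset level (directory layout and column name are placeholders):

```python
import pyarrow.dataset as ds

# Hive-style directory layout, e.g. data/year=2023/part-0.parquet
dataset = ds.dataset("data/", format="parquet", partitioning="hive")

# Partition pruning: files under non-matching year=... directories are
# never opened, based on the directory names alone.
table = dataset.to_table(filter=ds.field("year") == 2023)
```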

Author (@alippai):

(I don't feel very strongly about this, but also we already have doxygen/sphinx and other API docs which are really good, so... yeah, maybe it's better to have it in a blogpost comparing / demoing the features)

Member:

Since this is in a "high level APIs" section I think it's reasonable to include, and informative for the user.

Author (@alippai):

We can rename it to “integrations” to express this better. Would that be more suitable?

Comment on lines +148 to +150
High level data API-s for parquet feature usage
===============================================

Member:

Need to convert this to Markdown.

| Page pruning using bloom filter | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+

* \(1) Ability to construct RowGroup objects from existing RowGroups or raw values (eg. without decoding or decompressing lower level data when read)
Member:

I'm honestly not sure what this means. Reassemble a new Parquet file from existing undecoded column chunks?

Author (@alippai):

Yes, that was my intention. AFAIK both the Rust and Java solutions support it to some degree.
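For comparison, what pyarrow exposes today is row group metadata and decoded per-row-group reads rather than byte-level reassembly; a minimal sketch with a placeholder path:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")

# Row group / column chunk metadata, available without decoding any values
meta = pf.metadata
print(meta.num_row_groups)
print(meta.row_group(0).column(0).statistics)

# Reading a single row group decodes it into an Arrow table; copying
# undecoded row groups into a new file is not exposed at this level.
first_row_group = pf.read_row_group(0)
```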

+----------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup Sorting column | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Read / Write RowGroup metadata and data (1) | | | | | |
Member:

It seems a bit gratuitous to camel-case "RowGroup". Most of this file is English prose, so how about "row group"?

Comment on lines +131 to +133
| xxHash Bloom filters | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| bloom filter length | | | | | |
Member:

Suggested change (before):

| xxHash Bloom filters | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| bloom filter length | | | | | |

Suggested change (after):

| xxHash-based bloom filters | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Bloom filter length | | | | | |

Member:

Also, not sure what "bloom filter length" is supposed to denote here?


@pitrou (Member) commented Jun 21, 2023

@alippai Perhaps you'd like to start filling in the table for Rust?

@alippai (Author) commented Jun 21, 2023

Thanks @pitrou for the comments. Yes, I'll address them and fill in the table to the best of my knowledge, using online resources.

@alamb (Contributor) commented May 7, 2024

I am very interested in helping this post along. Is there any interest in pursuing this PR or would it be better to make a new one?

> Thanks @pitrou for the comments. Yes, I'll address them and fill the table best to my knowledge and online resources

I can certainly fill the one in for Rust

@alippai (Author) commented May 7, 2024

I’ll open a new draft later this week

@alamb (Contributor) commented May 11, 2024

FYI #53 is a related conversation. Once that PR merges, perhaps there will be a more natural location for this chart.
