PARQUET-2310: implementation status #34
Conversation
### Physical types

+-------------------------------------------+-------+--------+--------+-------+-------+
Is this standard Markdown markup for tables? GitHub doesn't seem to recognize it (see what happens here if you ask to "view" the file).
I’ll check and convert it to the correct format. It was copy-pasted from arrow-site.
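For reference, the RST-style grid table header quoted in this thread could be converted to a GitHub-flavored Markdown pipe table along these lines (cells left empty, as in the draft):

```markdown
| Data type | C++ | Python | Java | Go | Rust |
|-----------|-----|--------|------|----|------|
|           |     |        |      |    |      |
```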
### Physical types

+-------------------------------------------+-------+--------+--------+-------+-------+
| Data type | C++ | Python | Java | Go | Rust |
I think this should be a bit more specific than simply listing programming languages, as there might be several C++ or Java implementations? Replace them with their actual names.
Also, I don't think Python needs to be included here, since PyArrow is a binding around Arrow C++ (unless this is talking about another implementation, such as fastparquet)?
Agreed on replacing with actual names.
I wasn’t sure whether pyarrow features currently (or historically) match Arrow C++. Likely it can be omitted from some tables. I could imagine e.g. int96/nanos/decimal (fixed array) capabilities or compression options differing from C++, and at the high-level API the difference was already confusing for me.
Yeah, PyArrow might not expose everything. But that's more of a PyArrow problem and I'm not sure it's worth exposing here. cc @jorisvandenbossche for opinions.
Looking at the list here, I think most of the features that C++ supports are also exposed in pyarrow.
Although I don't know if writing a type like UUID (that doesn't exist in arrow) is possible with the direct Parquet C++ API? (that's not possible through the Arrow based writer, I suppose, and so also not in pyarrow)
If we only keep the C++ column in the table, we should also add somewhere a note that this implementation is also exposed through Python, R, GLib, .. bindings.
(another reason to not include Python, because then why not also include R, etc)
Actually, the UUID is a good counterexample indeed :)
We don't support writing UUID from Arrow data indeed. But it probably can be done directly through the Parquet C++ APIs (not PyArrow).
+----------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| External column data | | | | | |
What is this?
Might want to add explicit notes to non-obvious items so that the user doesn't have to search for an explanation on the Internet :-)
Will do
+----------------------------------------------+-------+--------+--------+-------+-------+
| External column data | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup Sorting column | | | | | |
Also, what does this mean?
Hmm, the term "sorting column" is really unfortunate (I find it very confusing: it's not an actual column, just a piece of metadata).
@gszadovszky Thoughts?
+----------------------------------------------+-------+--------+--------+-------+-------+
| Read / Write page metadata and data (2) | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using projection pushdown | | | | | |
When I think "projection pushdown" I think "column selection". Is that what is being discussed here? I guess I wouldn't associate it with "page pruning" in my mind.
Yes, that's column selection. I think this is the academic term; the Apache Arrow blog also refers to it under the same name: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/#projection-pushdown
It's a bit confusing though, because it's a different kind of "page pruning" from the below (row-based pruning using statistics or bloom filters).
It's actually not so much "pruning" as "not reading all data at once, just the columns the user is interested in". I assume all reasonable Parquet implementations already do that; does it need to be mentioned at all?
When I read the word “pruning” in a DB context, I always assume “skipping large chunks”, but I 100% agree that using the same term for both can be confusing.
| Hive-style partitioning | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column | | | | | |
I think these two features could probably be left out. I can see how the others are high-level API features but they are still very much "exposing parquet capabilities". I think hive-style partitioning is completely unrelated to parquet the format though.
The number of systems which support Parquet but not hive-style partitioning is very limited. As I explained in the intro, my goal was to emphasize the different capabilities and levels of abstraction of the different implementations. If we are strict about what's in the Parquet format and ignore all the high-level capabilities, then e.g. pyarrow matches about 0% of the format, as it can't access, manipulate or create Parquet without the intermediate Arrow format (unlike in Java and others).
(I don't feel very strongly about this, but also we already have doxygen/sphinx and other API docs which are really good, so... yeah, maybe it's better to have it in a blogpost comparing / demoing the features)
Since this is in a "high level APIs" section I think it's reasonable to include, and informative for the user.
We can rename it to “integrations” to express this better. Would that be more suitable?
High level data API-s for parquet feature usage
===============================================
Need to convert this to Markdown.
| Page pruning using bloom filter | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+

* \(1) Ability to construct RowGroup objects from existing RowGroups or raw values (eg. without decoding or decompressing lower level data when read)
I'm honestly not sure what this means. Reassemble a new Parquet file from existing undecoded column chunks?
Yes, that was my intention. AFAIK both the Rust and Java solutions support it to some degree.
+----------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup Sorting column | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Read / Write RowGroup metadata and data (1) | | | | | |
It seems a bit gratuitous to camel-case "RowGroup". Most of this file is English prose, so how about "row group"?
| xxHash Bloom filters | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| bloom filter length | | | | | |
Suggested change:
- | xxHash Bloom filters | | | | | |
- +-------------------------------------------+-------+--------+--------+-------+-------+
- | bloom filter length | | | | | |
+ | xxHash-based bloom filters | | | | | |
+ +-------------------------------------------+-------+--------+--------+-------+-------+
+ | Bloom filter length | | | | | |
Also, not sure what "bloom filter length" is supposed to denote here?
@alippai Perhaps you'd like to start filling in the table for Rust?
Thanks @pitrou for the comments. Yes, I’ll address them and fill in the table to the best of my knowledge and online resources.
I am very interested in helping this post along. Is there any interest in pursuing this PR or would it be better to make a new one?
I can certainly fill in the one for Rust.
I’ll open a new draft later this week.
FYI #53 is a related conversation. Once that PR merges, perhaps there will be a more natural location for this chart.
Moved from: apache/arrow#36027
I can only agree with @westonpace:
The original goal was not copying https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features, but quite orthogonal to this: showing which high level APIs are available for the end-user.
What I mean by this is that, based on the document, I should be able to know whether I can manipulate Parquet files without decoding some values, or whether I have to convert them to actual objects / an Apache Arrow table first. This is the info that was missing from all the docs, and I'd like to maintain something which reflects the public API functionalities.
Many of you were surprised by the inclusion of certain metadata; what I was referring to was the availability of https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.metadata or https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.metadata
Something similar happens with the Column Index. It's available internally, but if it's not used when adding filters to the Parquet reader, that's useful info for Apache Arrow and Parquet developers, but not for library users.
This is a limitation of the pyarrow API and this might be temporary or by design, but making this transparent was my intention.
It's absolutely OK if parquet-mr doesn't have a dataset API using the parquet features and it's OK for pyarrow not exposing the raw objects / byte arrays (ever). Different ecosystems have different abstractions and there doesn't have to be a "correct" or "standard" one. This is actually a nice thing - DataFusion, pyarrow, Acero, Iceberg etc have all different communities complementing each other.
Let me know if this is too vague or doesn't belong in the standard docs (or help me achieve a concise summary reflecting the above).