PARQUET-2310: implementation status #34

Draft · wants to merge 1 commit into base: production
Conversation

@alippai (Author) commented Jun 20, 2023

Moved from: apache/arrow#36027

I can only agree with @westonpace:

> Although...to play devil's advocate...it might be odd when a feature is available in the parquet reader, but not yet exposed in the query component. For example, there is some row skipping and bloom filters in the C++ parquet reader, but we haven't integrated those into the datasets layer yet.

The original goal was not to copy https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features, but something quite orthogonal to it: showing which high-level APIs are available to the end user.

What I mean is that, based on the document, I should be able to tell whether I can manipulate Parquet files without decoding some values, or whether I have to convert them to actual objects / an Apache Arrow table first. This is the information missing from all the docs, and I'd like to maintain something which reflects the public API functionality.

Many of you were surprised by the inclusion of xy metadata; what I was referring to was the availability of https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.metadata or https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.metadata
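As a minimal pyarrow sketch of what this availability means in practice (the path and keys below are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Key-value metadata attached at the field and schema level on the Arrow side
schema = pa.schema(
    [pa.field("x", pa.int64(), metadata={"unit": "seconds"})],
    metadata={"origin": "example"},
)
table = pa.table({"x": [1, 2, 3]}, schema=schema)
pq.write_table(table, "example.parquet")

# Both levels are recoverable after a round trip through Parquet
read_back = pq.read_table("example.parquet")
print(read_back.schema.metadata)             # schema-level key-value metadata
print(read_back.schema.field("x").metadata)  # field-level metadata
```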

Something similar happens with the Column Index. It's available internally, but if it's not used when adding filters to the Parquet reader, then it's useful info for the Apache Arrow and Parquet developers, but not for the library users.

This is a limitation of the pyarrow API, and it might be temporary or by design, but making it transparent was my intention.

It's absolutely OK if parquet-mr doesn't have a dataset API using the Parquet features, and it's OK for pyarrow not to expose the raw objects / byte arrays (ever). Different ecosystems have different abstractions, and there doesn't have to be a "correct" or "standard" one. This is actually a nice thing: DataFusion, pyarrow, Acero, Iceberg etc. all have different communities complementing each other.

Let me know if this is too vague or doesn't belong in the standard docs (or help me achieve a concise summary reflecting the above).


### Physical types

+-------------------------------------------+-------+--------+--------+-------+-------+
Member:

Is this standard Markdown markup for tables? GitHub doesn't seem to recognize it (see what happens here if you ask to "view" the file).

Author (@alippai):

I'll check and convert it to the correct format. It was copy-pasted from arrow-site.

### Physical types

+-------------------------------------------+-------+--------+--------+-------+-------+
| Data type | C++ | Python | Java | Go | Rust |
Member:

I think this should be a bit more specific than simply listing programming languages, as there might be several C++ or Java implementations? Replace them with their actual names.

Also, I don't think Python needs to be included here, since PyArrow is a binding around Arrow C++ (unless this is talking about another implementation, such as fastparquet)?

Author (@alippai), Jun 20, 2023:

Agreed on replacing with actual names.

I wasn't sure whether pyarrow features currently (or historically) match Arrow C++. Likely it can be omitted from some tables. I could imagine e.g. int96/nanos/decimal (fixed array) capabilities or compression options differing from C++, and at the high-level API the difference was already confusing for me.
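For instance, at the pyarrow high-level API these differences show up as write-time options; a rough sketch with placeholder data and path:

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table(
    {"ts": pa.array([datetime.datetime(2023, 6, 20)], type=pa.timestamp("ns"))}
)

# Timestamp representation and compression codec are write-time choices
pq.write_table(
    table,
    "timestamps.parquet",                  # placeholder path
    use_deprecated_int96_timestamps=True,  # store timestamps as the legacy INT96 physical type
    compression="zstd",                    # codec for all columns (a dict allows per-column codecs)
)
```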

Member:

Yeah, PyArrow might not expose everything. But that's more of a PyArrow problem and I'm not sure it's worth exposing here. cc @jorisvandenbossche for opinions.

Member:

Looking at the list here, I think most of the features that C++ supports are also exposed in pyarrow.
Although I don't know if writing a type like UUID (that doesn't exist in Arrow) is possible with the direct Parquet C++ API? (That's not possible through the Arrow-based writer, I suppose, and so also not in pyarrow.)

If we only keep the C++ column in the table, we should also add somewhere a note that this implementation is also exposed through Python, R, GLib, .. bindings.
(another reason to not include Python, because then why not also include R, etc)

Author (@alippai):

Actually, the UUID is a good counter-example indeed :)

Member (@pitrou), Jun 21, 2023:

We don't support writing UUID from Arrow data indeed. But it probably can be done directly through the Parquet C++ APIs (not PyArrow).

+----------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| External column data | | | | | |
Member:

What is this?


Member:

Might want to add explicit notes to non-obvious items so that the user doesn't have to search for an explanation on the Internet :-)

Author (@alippai):

Will do

+----------------------------------------------+-------+--------+--------+-------+-------+
| External column data | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup Sorting column | | | | | |
Member:

Also, what does this mean?


Member:

Hmm, the term "sorting column" is really unfortunate (I find it very confusing: it's not an actual column, just a piece of metadata).

@gszadovszky Thoughts?

+----------------------------------------------+-------+--------+--------+-------+-------+
| Read / Write page metadata and data (2) | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Page pruning using projection pushdown | | | | | |
Member:

When I think "projection pushdown" I think "column selection". Is that what is being discussed here? I guess I wouldn't associate it with "page pruning" in my mind.

Author (@alippai):

Yes, that's column selection. I think this is the academic term; the Apache Arrow blog also refers to it under the same name: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/#projection-pushdown
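In pyarrow this is simply the columns argument of the reader; a minimal sketch with a placeholder path:

```python
import pyarrow.parquet as pq

# Only the column chunks for "a" and "b" are read and decoded;
# the pages of all other columns are skipped.
table = pq.read_table("data.parquet", columns=["a", "b"])
```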

Member:

It's a bit confusing though, because it's a different kind of "page pruning" from the ones below (row-based pruning using statistics or bloom filters).

It's actually not so much "pruning" as "not reading all the data at once, just the columns the user is interested in". I assume all reasonable Parquet implementations do that already; does it need to be mentioned at all?

Author (@alippai):

When I read the word "pruning" in a DB context I always assume "skipping large chunks", but I 100% agree that using the same term for both can be confusing.
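For contrast, a minimal pyarrow sketch of the row-based kind (placeholder path and predicate); whether this skips at the row-group, page or bloom-filter level is exactly the kind of detail the table should make transparent:

```python
import pyarrow.parquet as pq

# Predicate pushed down to the reader: chunks whose min/max statistics
# cannot satisfy the filter can be skipped without decoding their pages.
table = pq.read_table("data.parquet", filters=[("a", ">", 100)])
```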

Comment on lines +155 to +157
| Hive-style partitioning | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Partition pruning on the partition column | | | | | |
Member:

I think these two features could probably be left out. I can see how the others are high-level API features but they are still very much "exposing parquet capabilities". I think hive-style partitioning is completely unrelated to parquet the format though.

Author (@alippai):

The number of systems which support Parquet but not Hive-style partitioning is very limited. As I explained in the intro, my goal was to emphasize the different capabilities and levels of abstraction of the different implementations. If we are strict about what's in the Parquet format and ignore all the high-level capabilities, then e.g. pyarrow matches about 0% of the format, as it can't access, manipulate or create Parquet without the intermediate Arrow format (unlike in Java and others).
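A minimal sketch of both features at the pyarrow dataset level (directory layout and column name are placeholders):

```python
import pyarrow.dataset as ds

# Hive-style directory layout, e.g. data/year=2023/part-0.parquet
dataset = ds.dataset("data/", format="parquet", partitioning="hive")

# Partition pruning: files under non-matching year=... directories are
# never opened, based on the directory names alone.
table = dataset.to_table(filter=ds.field("year") == 2023)
```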

Author (@alippai):

(I don't feel very strongly about this, but also we already have doxygen/sphinx and other API docs which are really good, so... yeah, maybe it's better to have it in a blogpost comparing / demoing the features)

Member:

Since this is in a "high level APIs" section I think it's reasonable to include, and informative for the user.

Author (@alippai):

We can rename it to “integrations” to express this better. Would that be more suitable?

Comment on lines +148 to +150
High level data API-s for parquet feature usage
===============================================

Member:

Need to convert this to Markdown.

| Page pruning using bloom filter | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+

* \(1) Ability to construct RowGroup objects from existing RowGroups or raw values (eg. without decoding or decompressing lower level data when read)
Member:

I'm honestly not sure what this means. Reassemble a new Parquet file from existing undecoded column chunks?

Author (@alippai):

Yes, that was my intention. AFAIK both the Rust and Java solutions support it to some degree.
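For comparison, what pyarrow exposes today is row group metadata and decoded per-row-group reads rather than byte-level reassembly; a minimal sketch with a placeholder path:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")

# Row group / column chunk metadata, available without decoding any values
meta = pf.metadata
print(meta.num_row_groups)
print(meta.row_group(0).column(0).statistics)

# Reading a single row group decodes it into an Arrow table; copying
# undecoded row groups into a new file is not exposed at this level.
first_row_group = pf.read_row_group(0)
```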

+----------------------------------------------+-------+--------+--------+-------+-------+
| RowGroup Sorting column | | | | | |
+----------------------------------------------+-------+--------+--------+-------+-------+
| Read / Write RowGroup metadata and data (1) | | | | | |
Member:

It seems a bit gratuitous to camel-case "RowGroup". Most of this file is English prose, so how about "row group"?

Comment on lines +131 to +133
| xxHash Bloom filters | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| bloom filter length | | | | | |
Member:

Suggested change (before):

| xxHash Bloom filters | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| bloom filter length | | | | | |

Suggested change (after):

| xxHash-based bloom filters | | | | | |
+-------------------------------------------+-------+--------+--------+-------+-------+
| Bloom filter length | | | | | |

Member:

Also, not sure what "bloom filter length" is supposed to denote here?


@pitrou (Member) commented Jun 21, 2023

@alippai Perhaps you'd like to start filling in the table for Rust?

@alippai (Author) commented Jun 21, 2023

Thanks @pitrou for the comments. Yes, I'll address them and fill in the table to the best of my knowledge, using online resources.

@alamb (Contributor) commented May 7, 2024

I am very interested in helping this post along. Is there any interest in pursuing this PR or would it be better to make a new one?

> Thanks @pitrou for the comments. Yes, I'll address them and fill the table best to my knowledge and online resources

I can certainly fill the one in for Rust

@alippai (Author) commented May 7, 2024

I’ll open a new draft later this week

@alamb (Contributor) commented May 11, 2024

FYI #53 is a related conversation. Once that PR merges, perhaps there will be a more natural location for this chart.
