
Prototype parquet-integration-testing integration tests #5956

Draft
wants to merge 5 commits into base: master

Conversation

@alamb (Contributor) commented Jun 25, 2024

THIS IS A WIP, to show what such an integration suite might look like

Which issue does this PR close?

To be filed

Rationale for this change

Modeled after the arrow-testing integration tests, for example: https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration

Related:

What changes are included in this PR?

The high-level proposal is that each implementation provides a driver (a rough sketch follows below) that:

  1. reads files from the parquet-testing repo
  2. creates a JSON file with appropriately formatted contents
  3. compares the result to "known good" files that will be checked in (maybe to parquet-format??)

I imagine JSON files for both data and metadata

The exact format of the JSON files is totally TBD -- this PR just makes whatever is convenient for arrow-rs
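
For illustration only, such a driver might look roughly like the following for arrow-rs, using the `parquet` and `arrow-json` crates. The paths, the expected-file location, and the use of the `arrow-json` writer are assumptions for the sketch; the actual JSON format in this PR is different and, as noted, TBD.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. read a file from the parquet-testing repo (path is illustrative)
    let file = File::open("parquet-testing/data/alltypes_plain.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    // 2. render the decoded rows as JSON (here via the `arrow-json` writer;
    //    the real driver would emit whatever format we standardize on)
    let mut writer = arrow_json::ArrayWriter::new(Vec::new());
    for batch in reader {
        writer.write(&batch?)?;
    }
    writer.finish()?;
    let actual = String::from_utf8(writer.into_inner())?;

    // 3. compare against the checked-in "known good" output (location TBD)
    let expected = std::fs::read_to_string("expected/alltypes_plain.json")?;
    assert_eq!(actual, expected);
    Ok(())
}
```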

Are there any user-facing changes?

@alamb (Contributor Author) commented on the new expected-output JSON file (@@ -0,0 +1,285 @@):
The basic idea would be to check in files like this for all the various files in parquet-testing, and each parquet implementation could write a driver that produced the equivalent JSON and checked it against the expected output.

@pitrou (Member) commented Jun 25, 2024

I don't know, I think the Arrow integration testing procedure is relatively fine for Arrow (even though it doesn't allow us to exercise everything, such as dictionary deltas or compression) but won't scale for Parquet, which has a ton of different options.

@alamb (Contributor Author) commented Jun 26, 2024

> I don't know, I think the Arrow integration testing procedure is relatively fine for Arrow (even though it doesn't allow us to exercise everything, such as dictionary deltas or compression) but won't scale for Parquet, which has a ton of different options.

Thanks @pitrou -- I agree there are many options for parquet, though I don't understand why that would prevent us from integration testing 🤔

While the "expected data" portion of such a test suite will be substantial, the actual code to make parquet-rs write this JSON format was quite small (and I expect it would be relatively straightforward for other implementations as well)

And conveniently, we already have a reasonable test corpus of files with various features in parquet-testing which this approach would simply re-use

@pitrou (Member) commented Jun 26, 2024

Well, I would have to see a JSON format that covers a substantial number of Parquet features before I can be convinced :-)

I'll note that the proposed JSON format features a file_offset property that doesn't sound like it should be fixed by the integration tests. On the contrary, each implementation should use its own heuristics when writing files (column placement, row group size, etc.).

@pitrou (Member) commented Jun 26, 2024

My alternative proposal would be a directory tree with pre-generated integration files, something like:

parquet-integration
|- all_types.plain.uncompressed
|  |- README.md   # textual description of this integration scenario
|  |- parquet-java_1.0.pq  # file generated by parquet-java 1.0 for said scenario
|  |- parquet-java_2.5.pq  # file generated by parquet-java 2.5
|  |- parquet-cpp_16.0.1.pq  # file generated by parquet-cpp 16.0.1
|- all_types.dictionary.uncompressed
| ...

... which allows us to have many different scenarios without the scaling problem of having all implementations run within the same CI job.

The textual README.md could of course be supplemented by a machine-readable JSON format if there's a reasonable way to cover all expected variations with it.
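
For comparison, consuming such a layout could be fairly mechanical. Below is a rough sketch (not part of either proposal) of how arrow-rs might cross-check one scenario directory; the directory name and `.pq` extension are taken from the tree above, and the idea that the logically decoded contents of every implementation's file must agree regardless of physical layout is an assumption of the sketch.

```rust
use std::fs::{read_dir, File};
use std::path::Path;

use arrow::compute::concat_batches;
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

/// Decode one implementation's file into a single RecordBatch so that
/// differing row group / batch boundaries do not affect the comparison.
fn read_scenario_file(path: &Path) -> Result<RecordBatch, Box<dyn std::error::Error>> {
    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?.build()?;
    let batches: Vec<RecordBatch> = reader.collect::<Result<_, _>>()?;
    // assumes at least one batch; empty scenarios would need extra handling
    let schema = batches[0].schema();
    Ok(concat_batches(&schema, &batches)?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let scenario = Path::new("parquet-integration/all_types.plain.uncompressed");
    let mut baseline: Option<RecordBatch> = None;
    for entry in read_dir(scenario)? {
        let path = entry?.path();
        // skip README.md and anything else that is not a generated parquet file
        if path.extension().and_then(|e| e.to_str()) != Some("pq") {
            continue;
        }
        let batch = read_scenario_file(&path)?;
        match &baseline {
            None => baseline = Some(batch),
            // writers may lay files out differently, but the logical contents must agree
            Some(expected) => assert_eq!(expected, &batch, "mismatch in {path:?}"),
        }
    }
    Ok(())
}
```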

@alamb (Contributor Author) commented Jun 26, 2024

> My alternative proposal would be a directory tree with pre-generated integration files, something like:

Thank you @pitrou -- I filed apache/parquet-format#441 to discuss this idea further

@@ -0,0 +1,285 @@
{
  "filename": "alltypes_plain.parquet",
  "rows": [

Review comment by a Contributor:

We probably need to formalize the expected output for each physical/logical type.

The other thing I was thinking is that it might pay to consider the trade-offs between a row-oriented format here and a column-oriented format that is closer to what is actually written in Parquet (i.e. rep/def levels and values). Both might be useful in some situations. For instance, I've seen ill-formed parquet files in the wild because of inconsistent rep/def levels, so sanity checks at that level make sense.

A row-based format would certainly help with cases I've seen of non-conformant nested logical types like Lists/Maps.
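
Purely to illustrate the two shapes being contrasted here (the field names and the serde_json rendering below are made up for the example, not a format proposal):

```rust
use serde_json::json;

fn main() {
    // row-oriented: one JSON object per reassembled row -- easy to eyeball and
    // good for catching non-conformant nested logical types such as LIST/MAP
    let row_oriented = json!([
        { "id": 1, "tags": ["a", "b"] },
        { "id": 2, "tags": null }
    ]);

    // column-oriented: closer to what is physically written -- repetition and
    // definition levels plus flattened values, so inconsistent level streams
    // can be checked directly
    let column_oriented = json!({
        "column": "tags.list.element",
        "rep_levels": [0, 1, 0],
        "def_levels": [3, 3, 0],
        "values": ["a", "b"]
    });

    println!("{row_oriented:#}");
    println!("{column_oriented:#}");
}
```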

@alamb (Contributor Author) replied:

I agree that if we want to move forward with this approach we should spend time formalizing and documenting what the expected format means
