
Prototype parquet-integration-testing integration tests #5956

Draft
wants to merge 5 commits into base: master

Conversation

@alamb (Contributor) commented Jun 25, 2024

THIS IS A WIP, to show what such an integration suite might look like

Which issue does this PR close?

To be filed

Rationale for this change

Modeled after the arrow-testing integration tests, for example: https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration

Related:

What changes are included in this PR?

The high-level proposal is that each implementation provides a driver (a rough sketch follows below) that:

  1. reads files from the parquet-testing repo
  2. creates a JSON file with appropriately formatted contents
  3. compares the result to "known good" files that will be checked in (maybe to parquet-format??)

I imagine JSON files for both data and metadata

The exact format of the JSON files is totally TBD -- this PR just makes whatever is convenient for arrow-rs
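
For illustration only, such a driver might look roughly like the following for arrow-rs, using the `parquet` and `arrow-json` crates. The paths, the expected-file location, and the use of the `arrow-json` writer are assumptions for the sketch; the actual JSON format in this PR is different and, as noted, TBD.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. read a file from the parquet-testing repo (path is illustrative)
    let file = File::open("parquet-testing/data/alltypes_plain.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    // 2. render the decoded rows as JSON (here via the `arrow-json` writer;
    //    the real driver would emit whatever format we standardize on)
    let mut writer = arrow_json::ArrayWriter::new(Vec::new());
    for batch in reader {
        writer.write(&batch?)?;
    }
    writer.finish()?;
    let actual = String::from_utf8(writer.into_inner())?;

    // 3. compare against the checked-in "known good" output (location TBD)
    let expected = std::fs::read_to_string("expected/alltypes_plain.json")?;
    assert_eq!(actual, expected);
    Ok(())
}
```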

Are there any user-facing changes?

@alamb (Contributor Author) commented on the new expected-output JSON file (@@ -0,0 +1,285 @@):
The basic idea would be to check in files like this for all the various files in parquet-testing, and each parquet implementation could write a driver that produced the equivalent JSON and checked it against the expected output.

@pitrou (Member) commented Jun 25, 2024

I don't know, I think the Arrow integration testing procedure is relatively fine for Arrow (even though it doesn't allow us to exercise everything, such as dictionary deltas or compression) but won't scale for Parquet, which has a ton of different options.

@alamb (Contributor Author) commented Jun 26, 2024

> I don't know, I think the Arrow integration testing procedure is relatively fine for Arrow (even though it doesn't allow us to exercise everything, such as dictionary deltas or compression) but won't scale for Parquet, which has a ton of different options.

Thanks @pitrou -- I agree there are many options for parquet, though I don't understand why that would prevent us from integration testing 🤔

While the "expected data" portion of such a test suite will be substantial, the actual code to make parquet-rs write this JSON format was quite small (and I expect it would be relatively straightforward for other implementations as well)

And conveniently, we already have a reasonable test corpus of files with various features in parquet-testing which this approach would simply re-use

@pitrou (Member) commented Jun 26, 2024

Well, I would have to see a JSON format that covers a substantial number of Parquet features before I can be convinced :-)

I'll note that the proposed JSON format features a file_offset property that doesn't sound like it should be fixed by the integration tests. On the contrary, each implementation should use its own heuristics when writing files (column placement, row group size, etc.).

@pitrou (Member) commented Jun 26, 2024

My alternative proposal would be a directory tree with pre-generated integration files, something like:

parquet-integration
|- all_types.plain.uncompressed
|  |- README.md   # textual description of this integration scenario
|  |- parquet-java_1.0.pq  # file generated by parquet-java 1.0 for said scenario
|  |- parquet-java_2.5.pq  # file generated by parquet-java 2.5
|  |- parquet-cpp_16.0.1.pq  # file generated by parquet-cpp 16.0.1
|- all_types.dictionary.uncompressed
| ...

... which allows us to have many different scenarios without the scaling problem of having all implementations run within the same CI job.

The textual README.md could of course be supplemented by a machine-readable JSON format if there's a reasonable way to cover all expected variations with it.
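
For comparison, consuming such a layout could be fairly mechanical. Below is a rough sketch (not part of either proposal) of how arrow-rs might cross-check one scenario directory; the directory name and `.pq` extension are taken from the tree above, and the idea that the logically decoded contents of every implementation's file must agree regardless of physical layout is an assumption of the sketch.

```rust
use std::fs::{read_dir, File};
use std::path::Path;

use arrow::compute::concat_batches;
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

/// Decode one implementation's file into a single RecordBatch so that
/// differing row group / batch boundaries do not affect the comparison.
fn read_scenario_file(path: &Path) -> Result<RecordBatch, Box<dyn std::error::Error>> {
    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?.build()?;
    let batches: Vec<RecordBatch> = reader.collect::<Result<_, _>>()?;
    // assumes at least one batch; empty scenarios would need extra handling
    let schema = batches[0].schema();
    Ok(concat_batches(&schema, &batches)?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let scenario = Path::new("parquet-integration/all_types.plain.uncompressed");
    let mut baseline: Option<RecordBatch> = None;
    for entry in read_dir(scenario)? {
        let path = entry?.path();
        // skip README.md and anything else that is not a generated parquet file
        if path.extension().and_then(|e| e.to_str()) != Some("pq") {
            continue;
        }
        let batch = read_scenario_file(&path)?;
        match &baseline {
            None => baseline = Some(batch),
            // writers may lay files out differently, but the logical contents must agree
            Some(expected) => assert_eq!(expected, &batch, "mismatch in {path:?}"),
        }
    }
    Ok(())
}
```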

@alamb (Contributor Author) commented Jun 26, 2024

> My alternative proposal would be a directory tree with pre-generated integration files, something like:

Thank you @pitrou -- I filed apache/parquet-format#441 to discuss this idea further

@@ -0,0 +1,285 @@
{
  "filename": "alltypes_plain.parquet",
  "rows": [

Review comment by a Contributor:

We probably need to formalize the expected output for each physical/logical type.

The other thing I was thinking is that it might pay to consider the trade-offs between a row-oriented format here and a column-oriented format that is closer to what is actually written in Parquet (i.e. rep/def levels and values). Both might be useful in some situations. For instance, I've seen ill-formed parquet files in the wild because of inconsistent rep/def levels, so sanity checks at that level make sense.

A row-based format would certainly help with cases I've seen of non-conformant nested logical types like Lists/Maps.
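
Purely to illustrate the two shapes being contrasted here (the field names and the serde_json rendering below are made up for the example, not a format proposal):

```rust
use serde_json::json;

fn main() {
    // row-oriented: one JSON object per reassembled row -- easy to eyeball and
    // good for catching non-conformant nested logical types such as LIST/MAP
    let row_oriented = json!([
        { "id": 1, "tags": ["a", "b"] },
        { "id": 2, "tags": null }
    ]);

    // column-oriented: closer to what is physically written -- repetition and
    // definition levels plus flattened values, so inconsistent level streams
    // can be checked directly
    let column_oriented = json!({
        "column": "tags.list.element",
        "rep_levels": [0, 1, 0],
        "def_levels": [3, 3, 0],
        "values": ["a", "b"]
    });

    println!("{row_oriented:#}");
    println!("{column_oriented:#}");
}
```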

@alamb (Contributor Author) replied:

I agree that if we want to move forward with this approach we should spend time formalizing and documenting what the expected format means
