[Parquet] Add exhaustive integration testing with all possible Parquet types #37943

Closed
danepitkin opened this issue Sep 28, 2023 · 2 comments · Fixed by #38249

@danepitkin
Member

Describe the enhancement requested

Arrow and Parquet do not have exhaustive integration testing for all possible Parquet data types.

For example, it would be useful to have a single, simple sample Parquet file with only 1 or 2 rows of data that covers as much of the type feature space as possible. This would also be useful for testing backwards compatibility across versions, e.g. to help catch issues like these[1].

The arrow testing data currently lives in a separate repo[2].

We should:

  • Put together a directory, list, or repo of Parquet files that covers the cross section of features, types, and encodings well enough to serve as a good test suite
  • Create the infrastructure for actually testing against them, e.g. Parquet reader tests
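
One way to keep track of how much of the type space a set of sample files covers is a small manifest check. The following is only a minimal sketch: the manifest, its file names, and the `missing_types` helper are hypothetical and do not correspond to real files in arrow-testing or to any Arrow API. It lists the physical types defined by the Parquet format and reports which ones no sample file exercises yet.

```python
# Physical types defined by the Parquet format.
PHYSICAL_TYPES = {
    "BOOLEAN", "INT32", "INT64", "INT96",
    "FLOAT", "DOUBLE", "BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY",
}

# Hypothetical manifest: sample file name -> physical types its columns use.
# The file names are illustrative only, not real files in arrow-testing.
MANIFEST = {
    "all_numeric.parquet": {"INT32", "INT64", "FLOAT", "DOUBLE"},
    "strings_and_flags.parquet": {"BYTE_ARRAY", "BOOLEAN"},
}


def missing_types(manifest, required=PHYSICAL_TYPES):
    """Return the required types that no file in the manifest covers, sorted."""
    covered = set().union(*manifest.values())
    return sorted(required - covered)


if __name__ == "__main__":
    # Lists the physical types that still need a sample file.
    print(missing_types(MANIFEST))
```

The same idea could be extended to logical types or encodings, keeping in mind that not every encoding is valid for every physical type, so a full cross product would overstate what needs covering.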

[1] https://lists.apache.org/thread/4sw2vfmdx60kl2psolwvch8h2297zdkb
[2] https://github.com/apache/arrow-testing/tree/47f7b56b25683202c1fd957668e13f2abafc0f12

Component(s)

Parquet

@mapleFU
Member

mapleFU commented Sep 29, 2023

Hi Dane, I'd like to do this with fuzzing, but there are still lots of types that we cannot support :-(

@jorisvandenbossche
Member

An older issue about this: #22325

jduo added a commit to jduo/arrow that referenced this issue Oct 12, 2023
Add a reference file with all supported types and corresponding test case
to validate that the Dataset API generates this consistently.
lidavidm pushed a commit that referenced this issue Oct 20, 2023
### Rationale for this change
Validate the types the Dataset APIs support when generating Parquet files.

### What changes are included in this PR?
Add a reference file with all supported types and corresponding test case to validate that the Dataset API generates this consistently.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* Closes: #37943

Authored-by: James Duong <duong.james@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
@lidavidm lidavidm added this to the 15.0.0 milestone Oct 20, 2023