ARROW-9027: [Python][Testing] Split parquet tests into multiple files + clean-up #8816

arw2019 · 2020-12-02T00:57:35Z

Only relocation - none of the tests are touched.
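A relocation-only claim like this can be checked mechanically. The following is a minimal sketch, not part of the PR: it assumes the old monolithic module and the new split modules are available as source text, and the helper names (`collect_tests`, `compare_relocation`) are hypothetical. It compares top-level test functions by their AST dump, which ignores line numbers and comments, so a test that was only moved compares equal.

```python
import ast


def collect_tests(source):
    """Map each top-level test function name to a dump of its AST.

    ast.dump() omits line/column attributes by default, so a function
    that merely moved to another position or file yields the same dump.
    """
    tree = ast.parse(source)
    return {
        node.name: ast.dump(node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
    }


def compare_relocation(old_source, new_sources):
    """Return (missing, changed) test names after a split.

    old_source: text of the original monolithic test module.
    new_sources: iterable of texts of the modules it was split into.
    """
    old = collect_tests(old_source)
    new = {}
    for src in new_sources:
        new.update(collect_tests(src))
    missing = sorted(set(old) - set(new))
    changed = sorted(n for n in old if n in new and old[n] != new[n])
    return missing, changed


# Tiny illustrative inputs (hypothetical, not taken from the PR).
OLD = '''
def test_a():
    assert 1 + 1 == 2

def test_b():
    assert "x".upper() == "X"
'''
NEW_1 = '''
def test_a():
    assert 1 + 1 == 2
'''
NEW_2 = '''
def test_b():
    assert "x".upper() == "x"
'''

missing, changed = compare_relocation(OLD, [NEW_1, NEW_2])
print("missing:", missing)
print("changed:", changed)
```

Here `test_b` is reported as changed because its body was edited during the move, and nothing is reported as missing. Tests defined as methods on classes are not covered by this sketch.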

github-actions · 2020-12-02T01:12:08Z

https://issues.apache.org/jira/browse/ARROW-9027

jorisvandenbossche

Thanks a lot for looking into this!

A recently merged PR (#8704) added a parquet test, so please make sure your rebase picked it up correctly.
We probably shouldn't let this take too long, to avoid further conflicts.

test_basic.py is still a big chunk; I'm wondering if we can split it further. There are some tests specific to the ParquetFile API that could be split off (though I don't think those are a big chunk either).
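To judge which of the split files is still "a big chunk", one option is to count the test functions per module. This is a rough sketch under assumptions: the function name `count_tests` and the sample sources are hypothetical, and only top-level `test_*` functions are counted.

```python
import ast


def count_tests(sources):
    """Return {module name: number of top-level test functions}, largest first.

    sources: mapping of module name to its source text.
    Test methods inside classes are not counted in this sketch.
    """
    counts = {}
    for name, source in sources.items():
        tree = ast.parse(source)
        counts[name] = sum(
            isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
            for node in tree.body
        )
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))


# Hypothetical module contents, for illustration only.
sample = {
    "test_writer.py": "def test_w():\n    pass\n",
    "test_basic.py": (
        "def test_a():\n    pass\n"
        "def test_b():\n    pass\n"
        "def helper():\n    pass\n"
    ),
}
result = count_tests(sample)
print(result)
```

For a real checkout, the mapping could be built with `pathlib`, for example `{p.name: p.read_text() for p in Path("python/pyarrow/tests/parquet").glob("test_*.py")}`.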

python/pyarrow/tests/parquet/common.py

jorisvandenbossche · 2020-12-04T11:18:38Z

python/pyarrow/tests/parquet/test_multifile_ds.py

+    LocalFileSystem._get_instance(),
+    fs.LocalFileSystem(),
+])
+def test_parquet_writer_filesystem_local(tempdir, filesystem):


This test and the ones below are ParquetWriter-related tests that are not directly related to multi-file datasets, so they can probably be moved elsewhere (either to test_basic or to a separate file).

I separated them out (into test_writer.py)

python/pyarrow/tests/parquet/test_multifile_ds.py

arw2019 · 2020-12-05T06:59:44Z

python/pyarrow/tests/parquet/test_pandas.py

@@ -0,0 +1,759 @@
+# Licensed to the Apache Software Foundation (ASF) under one


These aren't pandas-dependent tests; rather, they're the tests that exercise interop with pandas data structures.

arw2019 · 2020-12-08T05:57:35Z

This is ready for re-review (modulo pyarrow/tests/test_orc.py failing on the "Python / AMD64 MacOS 10.15 Python 3" job, for a reason I haven't figured out yet).

jorisvandenbossche · 2020-12-10T13:21:58Z

Sorry, we merged another PR which added a test to to_parquet.py: #8861. Can you check this is included correctly?

jorisvandenbossche

A few more comments (and thanks again for working on this one, it's not the most rewarding issue ;))

python/pyarrow/tests/parquet/common.py

python/pyarrow/tests/parquet/test_metadata.py

python/pyarrow/tests/parquet/test_basic.py

jorisvandenbossche · 2020-12-10T13:47:59Z

python/pyarrow/tests/parquet/test_basic.py

+
+@parametrize_legacy_dataset
+@pytest.mark.pandas
+def test_filter_before_validate_schema(tempdir, use_legacy_dataset):


This one can be moved to test_dataset.py, I think, since it tests a dataset-specific feature (the test uses the read_table function, but that dispatches to ParquetDataset).

Moved it to test_dataset

arw2019 · 2020-12-11T17:04:28Z

> Sorry, we merged another PR which added a test to to_parquet.py: #8861. Can you check this is included correctly?

Yes, on rebasing (it's in test_metadata.py)

jorisvandenbossche · 2020-12-21T14:32:41Z

@arw2019 I updated this a bit further and will merge now. Thanks!

With our workflow policy of rebasing and force-pushing, it was basically impossible to review your additional changes (not your fault, to be clear! It's just a workflow the GitHub interface is not made for).
So I went through the files locally and made a few additional changes.

@jorisvandenbossche

… + clean-up

Only relocation - none of the tests are touched.

cc @jorisvandenbossche

Closes apache#8816 from arw2019/ARROW-9027-test_parquet

Lead-authored-by: Andrew Wieteska <andrew.r.wieteska@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

github-actions bot added the Component: Python label Dec 2, 2020

arw2019 force-pushed the ARROW-9027-test_parquet branch 4 times, most recently from 31f997a to 77b89ae Compare December 2, 2020 04:22

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 2, 2020

jorisvandenbossche reviewed Dec 4, 2020

arw2019 force-pushed the ARROW-9027-test_parquet branch from bab48bd to 1b421cf Compare December 4, 2020 16:15

github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 4, 2020

arw2019 marked this pull request as draft December 4, 2020 18:14

arw2019 force-pushed the ARROW-9027-test_parquet branch from 05f73a5 to b006b26 Compare December 5, 2020 06:55

arw2019 marked this pull request as ready for review December 5, 2020 06:57

arw2019 commented Dec 5, 2020

arw2019 force-pushed the ARROW-9027-test_parquet branch from b006b26 to 26fa390 Compare December 7, 2020 07:11

jorisvandenbossche mentioned this pull request Dec 7, 2020

ARROW-10146: [Python] Fix parquet FileMetadata.to_dict in case statistics is not set #8861

Closed

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 7, 2020

arw2019 force-pushed the ARROW-9027-test_parquet branch from 26fa390 to 532f33f Compare December 7, 2020 22:26

github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 7, 2020

arw2019 force-pushed the ARROW-9027-test_parquet branch from c2cff61 to 93f198e Compare December 9, 2020 07:38

jorisvandenbossche reviewed Dec 10, 2020

jorisvandenbossche mentioned this pull request Dec 10, 2020

ARROW-10574: [Python][Parquet] Allow collections for 'in' / 'not in' filter (in addition to sets) #8672

Closed

arw2019 force-pushed the ARROW-9027-test_parquet branch from 93f198e to 8534c7d Compare December 11, 2020 17:03

arw2019 force-pushed the ARROW-9027-test_parquet branch from d18cef7 to e12fb86 Compare December 11, 2020 19:13

arw2019 added 3 commits December 21, 2020 11:50

ARROW-9027: [Python][Testing] Split Parquet tests into multiple files…

edf6c6d

… + clean-up

rename test_multifile_ds -> test_dataset

d4ad93c

add test from ARROW-10644

09bab84

arw2019 and others added 8 commits December 21, 2020 11:50

split test_basic more

e210cff

formatting

2369ece

formatting

4bf23c1

formatting

f313528

more splits

a4b0a95

rebase error

b8983ac

ARROW-9027: [Python][Testing] Split parquet tests into multiple files…

c309fb0

… + clean-up

fix ORC basedir

4af670a

jorisvandenbossche force-pushed the ARROW-9027-test_parquet branch from e12fb86 to 4af670a Compare December 21, 2020 10:56

jorisvandenbossche added 3 commits December 21, 2020 12:32

move from common to test_dataset + move some additional tests to test…

14eb1ef

…_writer

also separate specific ParquetFile tests + remove test_empty

f496cf4

split off data type specific tests + combine with dictionary tests

0b57506

jorisvandenbossche closed this in c751295 Dec 21, 2020

asfimport mentioned this pull request Dec 21, 2020

[Python] Split in multiple files + clean-up pyarrow.parquet tests #25144

Closed
