ARROW-5480: [Python] Add unit test asserting specifically that pandas.Categorical roundtrips to Parquet format without special options #5110

wesm · 2019-08-16T18:54:04Z

This only works for string types for the moment. Once ARROW-6277 is addressed we can expand to other types.

wesm · 2019-08-16T18:54:54Z

@jorisvandenbossche @jreback the pandas test suite will probably need some changes or expansions now that Categorical (for strings at least) can be faithfully round-tripped using DataFrame.to_parquet and pd.read_parquet?

jorisvandenbossche · 2019-08-16T19:03:49Z

@wesm indeed, created pandas-dev/pandas#27955 to track that.

jorisvandenbossche · 2019-08-16T19:05:33Z

python/pyarrow/tests/test_parquet.py

+    buf = BytesIO()
+    df.to_parquet(buf)
+
+    # This reads back object, but I expected category


this comment reads a bit unclear to me (it now reads back categorical?)

wesm: Sorry, I copy-pasted this from the bug report.

jorisvandenbossche · 2019-08-16T19:10:19Z

@wesm to what extent is this "fully" faithful for corner cases? (if not, might need to mention that as caveats in the pandas docs)

For example for a categorical with values in the "categories" which are not present in the data, is this preserved on reading back? (I suppose we use the categories when creating a DictionaryArray, but are its dictionary's values exactly preserved in the parquet roundtrip?)
Or the exact order of the categories in an ordered categorical (if it is not lexicographically sorted) ?

wesm · 2019-08-16T19:40:59Z

The category values will be exactly preserved whether or not they occur in the data. I can expand the unit test to exhibit this if it helps
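
To illustrate why an unobserved category can survive at all, here is a minimal pure-Python sketch of dictionary encoding, the general idea behind both pandas.Categorical and Arrow's DictionaryArray. The category list is stored separately from the integer codes, so a category does not have to occur in the data to be kept, and its position in the list is not re-sorted. The helpers `encode` and `decode` are hypothetical names for this illustration and are not pyarrow's implementation; `-1` for a missing value follows the pandas codes convention.

```python
from typing import List, Optional, Sequence, Tuple


def encode(values: Sequence[Optional[str]],
           categories: Sequence[str]) -> Tuple[List[int], List[str]]:
    """Dictionary-encode values against an explicit category list.

    Missing values (None) get the code -1, as in pandas.Categorical.codes.
    """
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [-1 if v is None else position[v] for v in values]
    # The dictionary is the full category list, in its original order,
    # regardless of which categories actually occur in the data.
    return codes, list(categories)


def decode(codes: Sequence[int],
           dictionary: Sequence[str]) -> List[Optional[str]]:
    """Map codes back to values; -1 becomes None."""
    return [None if c == -1 else dictionary[c] for c in codes]


# "bar" never occurs in the data, and the categories are not sorted.
categories = ["foo", "bar", "baz"]
values = ["baz", "foo", "foo", None, "baz"]

codes, dictionary = encode(values, categories)
roundtripped = decode(codes, dictionary)
```

As long as both the codes and the dictionary are written out and read back unchanged, the unobserved value and the category order are recoverable.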

wesm · 2019-08-16T19:45:39Z

Done

wesm · 2019-08-19T14:46:47Z

Rebased

jorisvandenbossche · 2019-08-19T14:59:24Z

python/pyarrow/tests/test_parquet.py

@@ -3015,7 +3015,6 @@ def test_dictionary_array_automatically_read():
    assert result.schema.metadata is None


-@pytest.mark.pandas


in the rest of the file, test functions using pandas are marked as such?

wesm: Yes, this is a rebase artifact; fixing.

jorisvandenbossche · 2019-08-19T15:00:45Z

> I can expand the unit test to exhibit this if it helps

Updated test looks good, thanks for the clarification!

wesm · 2019-08-19T17:58:13Z

Travis: https://travis-ci.org/wesm/arrow/builds/573882738

codecov-io · 2019-08-19T18:10:49Z

Codecov Report

Merging #5110 into master will decrease coverage by 22.6%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master    #5110       +/-   ##
===========================================
- Coverage   87.62%   65.02%   -22.61%     
===========================================
  Files        1014      495      -519     
  Lines      145828    67082    -78746     
  Branches     1437        0     -1437     
===========================================
- Hits       127788    43619    -84169     
- Misses      17678    23463     +5785     
+ Partials      362        0      -362

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/pyarrow/tests/test_parquet.py | `96.42% <100%> (+0.02%)` | ⬆️ |
| cpp/src/arrow/util/memory.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/gandiva/date_utils.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/util/memory.cc | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/filesystem/util-internal.cc | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/gandiva/decimal_type_util.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/compute/logical_type.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/parquet/hasher.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/gandiva/basic_decimal_scalar.h | `0% <0%> (-100%)` | ⬇️ |
| cpp/src/arrow/compute/kernels/boolean.cc | `0% <0%> (-100%)` | ⬇️ |
... and 762 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36bd667...a161b24.

jorisvandenbossche mentioned this pull request Aug 16, 2019

DOC/TST: update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support pandas-dev/pandas#27955

Closed

jorisvandenbossche reviewed Aug 16, 2019

wesm added 3 commits August 19, 2019 09:46

Add unit test for ARROW-5480

620b3b8

Improve unit test for out-of-order values, nulls, unobserved category values

9e98404

Don't use pandas's Parquet functions since they don't work in CI for some reason

f1f8082

wesm force-pushed the ARROW-5480 branch from 49574d4 to f1f8082 Compare August 19, 2019 14:46

jorisvandenbossche reviewed Aug 19, 2019

Add missing pandas marks

a161b24

wesm closed this in c4b8cb6 Aug 19, 2019

wesm deleted the ARROW-5480 branch August 19, 2019 18:02

galuhsahid mentioned this pull request Aug 19, 2019

DOC/TST: Update the parquet (pyarrow >= 0.15) docs and tests regarding Categorical support pandas-dev/pandas#28018

Merged

5 tasks

asfimport mentioned this pull request Apr 15, 2022

[Python] Pandas categorical type doesn't survive a round-trip through parquet #21930

Closed
