Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5480: [Python] Add unit test asserting specifically that pandas.Categorical roundtrips to Parquet format without special options #5110

Closed
wants to merge 4 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Aug 16, 2019

This only works for string types for the moment. Once ARROW-6277 is addressed we can expand to other types.

@wesm
Copy link
Member Author

wesm commented Aug 16, 2019

@jorisvandenbossche @jreback the pandas test suite will probably need some changes or expansions now that Categorical (for strings at least) can be faithfully roundtripped using to_parquet and pd.read_parquet?

@jorisvandenbossche
Copy link
Member

@wesm indeed, created pandas-dev/pandas#27955 to track that.

buf = BytesIO()
df.to_parquet(buf)

# This reads back object, but I expected category
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment reads a bit unclear to me (it now reads back categorical?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I copy-pasted this from the bug report

@jorisvandenbossche
Copy link
Member

@wesm to what extent is this "fully" faithful for corner cases? (if not, might need to mention that as caveats in the pandas docs)

For example for a categorical with values in the "categories" which are not present in the data, is this preserved on reading back? (I suppose we use the categories when creating a DictionaryArray, but are its dictionary's values exactly preserved in the parquet roundtrip?)
Or the exact order of the categories in an ordered categorical (if it is not lexicographically sorted) ?

@wesm
Copy link
Member Author

wesm commented Aug 16, 2019

The category values will be exactly preserved whether or not they occur in the data. I can expand the unit test to exhibit this if it helps

@wesm
Copy link
Member Author

wesm commented Aug 16, 2019

Done

@wesm
Copy link
Member Author

wesm commented Aug 19, 2019

Rebased

@@ -3015,7 +3015,6 @@ def test_dictionary_array_automatically_read():
assert result.schema.metadata is None


@pytest.mark.pandas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the rest of the file, test functions using pandas are marked as such?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a rebase artifact, fixing

@jorisvandenbossche
Copy link
Member

I can expand the unit test to exhibit this if it helps

Updated test looks good, thanks for the clarification!

@wesm
Copy link
Member Author

wesm commented Aug 19, 2019

@wesm wesm closed this in c4b8cb6 Aug 19, 2019
@wesm wesm deleted the ARROW-5480 branch August 19, 2019 18:02
@codecov-io
Copy link

Codecov Report

Merging #5110 into master will decrease coverage by 22.6%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #5110       +/-   ##
===========================================
- Coverage   87.62%   65.02%   -22.61%     
===========================================
  Files        1014      495      -519     
  Lines      145828    67082    -78746     
  Branches     1437        0     -1437     
===========================================
- Hits       127788    43619    -84169     
- Misses      17678    23463     +5785     
+ Partials      362        0      -362
Impacted Files Coverage Δ
python/pyarrow/tests/test_parquet.py 96.42% <100%> (+0.02%) ⬆️
cpp/src/arrow/util/memory.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/date_utils.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/memory.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/filesystem/util-internal.cc 0% <0%> (-100%) ⬇️
cpp/src/gandiva/decimal_type_util.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/logical_type.h 0% <0%> (-100%) ⬇️
cpp/src/parquet/hasher.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/basic_decimal_scalar.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/kernels/boolean.cc 0% <0%> (-100%) ⬇️
... and 762 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36bd667...a161b24. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants