ARROW-5480: [Python] Add unit test asserting specifically that pandas.Categorical roundtrips to Parquet format without special options #5110
@jorisvandenbossche @jreback the pandas test suite will probably need some changes or expansions now that Categorical (for strings at least) can be faithfully roundtripped using
@wesm indeed, created pandas-dev/pandas#27955 to track that.
python/pyarrow/tests/test_parquet.py
Outdated
buf = BytesIO()
df.to_parquet(buf)

# This reads back object, but I expected category
this comment reads a bit unclear to me (it now reads back categorical?)
Sorry, I copy-pasted this from the bug report.
@wesm to what extent is this "fully" faithful for corner cases? (if not, might need to mention that as caveats in the pandas docs) For example for a categorical with values in the "categories" which are not present in the data, is this preserved on reading back? (I suppose we use the categories when creating a DictionaryArray, but are its dictionary's values exactly preserved in the parquet roundtrip?)
The category values will be exactly preserved whether or not they occur in the data. I can expand the unit test to exhibit this if it helps.
Done
Rebased
python/pyarrow/tests/test_parquet.py
Outdated
@@ -3015,7 +3015,6 @@ def test_dictionary_array_automatically_read():
    assert result.schema.metadata is None


@pytest.mark.pandas
in the rest of the file, test functions using pandas are marked as such?
Yes, this is a rebase artifact, fixing
Updated test looks good, thanks for the clarification!
Codecov Report
@@             Coverage Diff              @@
##           master    #5110       +/-   ##
===========================================
- Coverage   87.62%   65.02%   -22.61%
===========================================
  Files        1014      495      -519
  Lines      145828    67082    -78746
  Branches     1437        0     -1437
===========================================
- Hits       127788    43619    -84169
- Misses      17678    23463     +5785
+ Partials      362        0      -362
This only works for string types for the moment. Once ARROW-6277 is addressed we can expand to other types.