Add support to Write nullable pandas types (Int*|UInt*|boolean|string) #525
Conversation
Force-pushed from 3263486 to 0fb5e7e.
Force-pushed from 0fb5e7e to ee2ddf9 (add support for `string` type, bump requirements, fix bug on matching type).
Thanks for the PR!
Thank you. I'm looking into it. What is strange is that the test also fails on my machine even though I'm on the master branch. Maybe the failure is dependent on another library.
Right, probably dependent on other package versions.
The test failed as a result of a behaviour change between pandas versions 1.1.13 & 1.1.14. In the function, the data is first grouped by the partitioning columns, and the dataframe created as a result differs between the two versions. In 1.1.13, … In 1.1.14, … This then changes the behaviour of the … The latest version of this PR resolves this issue by replacing the … Currently all tests are passing.
Force-pushed from b724734 to 28c73e4.
@martindurant please let me know when you get a chance to review this PR. All tests are passing, and it solves interop issues with the latest pandas version.
This is excellent, thank you, and sorry to keep you waiting. We can discuss separately about possible timing for implementing the reading side, perhaps as an option at first. |
Following the discussion on PR-483, I have added support to write pandas nullable types. The behaviour of reading columns having null values is unchanged, i.e. `boolean` & `integer` columns with `null` values get upcast to the appropriately sized `float`, and `string` columns with `null` values will get read as `object`. All existing code will continue to work as-is. Inferring of optional types on read was omitted on purpose, but if needed, the numpy types can be converted to their fancy optional counterparts like below:
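As a hedged sketch of such a conversion using plain pandas (the column names and values here are illustrative, not from the PR's test suite):

```python
import numpy as np
import pandas as pd

# Columns as they come back from a round trip: integers with nulls
# upcast to float, strings read back as object.
df = pd.DataFrame({
    "counts": np.array([1.0, np.nan, 3.0]),             # was an integer column
    "labels": np.array(["a", None, "c"], dtype=object)  # was a string column
})

# Explicit conversion to the nullable ("optional") pandas dtypes:
df["counts"] = df["counts"].astype("Int64")
df["labels"] = df["labels"].astype("string")

# Or let pandas infer the best nullable dtypes in one call:
inferred = df.convert_dtypes()
```

`astype("Int64")` only succeeds when every non-null float value is integral, so it is safe for columns that were integers before the upcast.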
This PR will solve the issue `ValueError: Don't know how to convert data type: Int64`.
It will also allow using fastparquet to generate loading files in ETL processes where the files' target is a database. Postgres & Redshift will refuse to load from a parquet file if a column is declared as an integer in the database but stored as a float in the parquet file.
While working on this PR, I also encountered & resolved a bug where unsigned integer columns in a parquet file with some null values would not be converted properly when using `out.to_pandas()`. To explore this bug, please generate a test output (i.e. run the code snippet above) using the branch `haleemur:feat/write-optional-types` and then attempt reading the test output using the master branch.
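For context, a short sketch (plain pandas, illustrative values) of why nullable unsigned dtypes matter here: numpy's unsigned integer arrays cannot hold a missing value, so introducing a null silently upcasts the whole column to float unless the nullable `UInt*` dtypes are used.

```python
import numpy as np
import pandas as pd

# A numpy-backed unsigned column cannot represent a null: reindexing in
# a missing row forces the whole column to float64.
upcast = pd.Series([1, 255], dtype=np.uint8).reindex([0, 1, 2])

# The nullable UInt8 dtype keeps unsigned integer semantics alongside the null.
nullable = pd.Series([1, None, 255], dtype="UInt8")
```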