[Python] from_pandas errors when schemas are used with lower resolution timestamps #20520
Comments
David Lee / @davlee1972: The conversion from pandas nanoseconds to whatever timestamp resolution is declared with pa.timestamp() in the schema object worked fine in 0.11.0. Having to pass in coerce_timestamps, allow_truncated_timestamps, and safe is pretty messy.
David Lee / @davlee1972: The problem is more or less summarized here; there are a lot of gotchas at each step:

- json.loads() works fine.
- pandas.DataFrame() is a problem if every record doesn't contain the same columns.
- Using pandas.DataFrame.reindex() to add the missing columns fills them with NaN values, and adding NaN values force-changes a column's dtype from INT64 to FLOAT64.
- NaNs are a problem to begin with, because converting them to Parquet yields zeros instead of nulls.
- pandas.DataFrame.reindex(fill_value=None) doesn't work, because passing in None is the same as calling reindex() without the parameter.
- The only way to replace NaNs with None is pandas.DataFrame.where(); after replacing the NaNs you can change the column's dtype from FLOAT64 back to INT64.

It's a lot of hoops to jump through just to preserve an original JSON INT as a Parquet INT. Maybe the best solution is to create a pyarrow.Table.from_pydict() function that builds an Arrow table from a Python dictionary. We have this gap alongside pyarrow.Table.to_pydict(), pyarrow.Table.to_pandas(), and pyarrow.Table.from_pandas().
When passing a schema object to from_pandas, a casting error occurs if the schema uses a lower-resolution timestamp than the DataFrame's datetime64[ns] columns. Do we need to also add the "coerce_timestamps" and "allow_truncated_timestamps" parameters found in write_table() to from_pandas()?
Error:
pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1532015191753713000', 'Conversion failed for column modified with type datetime64[ns]')
Code:
Reporter: David Lee / @davlee1972
Note: This issue was originally created as ARROW-3907. Please see the migration documentation for further details.