
[Python] from_pandas errors when schemas are used with lower resolution timestamps #20520

Closed
asfimport opened this issue Nov 29, 2018 · 6 comments

When passing a schema object to from_pandas, a resolution error occurs if the schema uses a lower-resolution timestamp. Do we also need to add the "coerce_timestamps" and "allow_truncated_timestamps" parameters found in write_table() to from_pandas()?

Error:

pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1532015191753713000', 'Conversion failed for column modified with type datetime64[ns]')

Code:

import pyarrow as pa

# df is the source DataFrame; its 'modified' column is datetime64[ns]
processed_schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('modified', pa.timestamp('ms')),
    pa.field('records', pa.int32())
])

pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)

Reporter: David Lee / @davlee1972

Note: This issue was originally created as ARROW-3907. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Does passing safe=False to Table.from_pandas do the trick?
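For reference, a minimal sketch of that suggestion; the DataFrame here is made up, while the schema mirrors the one from the report:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'Id': ['a'],
    'modified': pd.to_datetime(['2018-07-19 15:06:31.753713']),
    'records': [1],
})

processed_schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('modified', pa.timestamp('ms')),
    pa.field('records', pa.int32()),
])

# safe=False permits the lossy ns -> ms cast instead of raising ArrowInvalid
table = pa.Table.from_pandas(df, schema=processed_schema,
                             preserve_index=False, safe=False)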


David Lee / @davlee1972:
Passing in safe=False works, but it is pretty hacky. Another problem also pops up with ParquetWriter.write_table(); I'll open a separate ticket for that one.

The conversion from pandas nanoseconds to whatever timestamp resolution is declared using pa.timestamp() in the schema object worked fine in 0.11.0.

Having to pass in coerce_timestamps, allow_truncated_timestamps, and safe is pretty messy.
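For context, the write_table() options being juggled look roughly like this (a sketch; the table variable and 'out.parquet' are placeholders):

import pyarrow.parquet as pq

# coerce_timestamps downcasts to the given unit; allow_truncated_timestamps
# suppresses the error when sub-millisecond precision is dropped
pq.write_table(table, 'out.parquet',
               coerce_timestamps='ms',
               allow_truncated_timestamps=True)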

 


Wes McKinney / @wesm:
How is it hacky? We can always add allow_truncated_timestamps as an option to Table.from_pandas, but somehow you have to opt in to the lossy conversion.


David Lee / @davlee1972:
Closing for now. I'm not convinced safe is the best solution to address timestamp resolution. If a schema is used, it should be clear that the intent is to convert pandas nanoseconds to a lower resolution. I think the same can be said for other types of conversions, like floats to ints.


Wes McKinney / @wesm:
ETL can be a messy business. If you have ideas about improving the APIs for schema coercion / casting, I'd be interested to discuss more.


David Lee / @davlee1972:
Yeah, I'm trying to figure out the best way to preserve INTs when converting JSON to Parquet.

The problem is more or less summarized here:
https://pandas.pydata.org/pandas-docs/stable/gotchas.html

There are a lot of gotchas with each step.

json.loads() works fine.

pandas.DataFrame() is a problem if every record doesn't contain the same columns.

Using pandas.DataFrame.reindex() to add missing columns adds a bunch of NaN values.

Adding NaN values force-changes a column's dtype from INT64 to FLOAT64.

NaNs are a problem to begin with, because if you convert them to Parquet you end up with zeros instead of nulls.

Running pandas.DataFrame.reindex(fill_value=None) doesn't work, because passing in None is the same as calling pandas.DataFrame.reindex() without any parameters.

The only way to replace NaNs with None is with pandas.DataFrame.where().

After replacing the NaNs you can then change the dtype of the column from FLOAT64 back to INT64.

It's basically a lot of hoops to jump through to preserve your original JSON INT as a Parquet INT (see the sketch below).
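For illustration, a minimal sketch of those hoops, using hypothetical column names; it assumes the where() trick upcasts the column to object and replaces NaN with None, which may vary across pandas versions:

import json

import pandas as pd
import pyarrow as pa

# two JSON records; the second one is missing the 'records' field
records = [json.loads('{"Id": "a", "records": 1}'),
           json.loads('{"Id": "b"}')]

df = pd.DataFrame(records)               # 'records' becomes float64 with a NaN
df = df.where(pd.notnull(df), None)      # NaN -> None (column upcast to object)

# turn the surviving floats back into Python ints, leaving missing values alone
df['records'] = df['records'].map(lambda v: None if pd.isna(v) else int(v))

schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('records', pa.int64()),
])

# ints plus None in an object column convert to a nullable Arrow int64 column
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)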

Maybe the best solution is to create a pyarrow.Table.from_pydict() function to build an Arrow table directly from a Python dictionary. We already have pyarrow.Table.to_pydict(), pyarrow.Table.to_pandas(), and pyarrow.Table.from_pandas(), so from_pydict() is the missing counterpart.
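A sketch of how such a from_pydict() might be used (the column data here is hypothetical):

import pyarrow as pa

# plain Python lists keep ints as ints and represent missing values as None
columns = {
    'Id': ['a', 'b'],
    'records': [1, None],
}

schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('records', pa.int32()),
])

# None becomes a null in the int32 column; no float64 detour through pandas
table = pa.Table.from_pydict(columns, schema=schema)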

asfimport added this to the 0.11.1 milestone Jan 11, 2023