
[Python] from_pandas errors when schemas are used with lower resolution timestamps #20520

Closed
asfimport opened this issue Nov 29, 2018 · 6 comments

When passing a schema object to from_pandas, a resolution error occurs if the schema uses a lower-resolution timestamp. Do we also need to add the "coerce_timestamps" and "allow_truncated_timestamps" parameters found in write_table() to from_pandas()?

Error:

pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1532015191753713000', 'Conversion failed for column modified with type datetime64[ns]')

Code:

import pyarrow as pa

# df is the source DataFrame; its 'modified' column is datetime64[ns]
processed_schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('modified', pa.timestamp('ms')),
    pa.field('records', pa.int32())
])

pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)

Reporter: David Lee / @davlee1972

Note: This issue was originally created as ARROW-3907. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Does passing safe=False to Table.from_pandas do the trick?
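For reference, a minimal sketch of that suggestion; the DataFrame here is made up, while the schema mirrors the one from the report:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'Id': ['a'],
    'modified': pd.to_datetime(['2018-07-19 15:06:31.753713']),
    'records': [1],
})

processed_schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('modified', pa.timestamp('ms')),
    pa.field('records', pa.int32()),
])

# safe=False permits the lossy ns -> ms cast instead of raising ArrowInvalid
table = pa.Table.from_pandas(df, schema=processed_schema,
                             preserve_index=False, safe=False)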


David Lee / @davlee1972:
Passing in safe=False works, but it is pretty hacky. Another problem also pops up with ParquetWriter.write_table(); I'll open a separate ticket for that one.

The conversion from pandas nanoseconds to whatever timestamp resolution is declared using pa.timestamp() in the schema object worked fine in 0.11.0.

Having to pass in coerce_timestamps, allow_truncated_timestamps, and safe is pretty messy.
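For context, the write_table() options being juggled look roughly like this (a sketch; the table variable and 'out.parquet' are placeholders):

import pyarrow.parquet as pq

# coerce_timestamps downcasts to the given unit; allow_truncated_timestamps
# suppresses the error when sub-millisecond precision is dropped
pq.write_table(table, 'out.parquet',
               coerce_timestamps='ms',
               allow_truncated_timestamps=True)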

 


Wes McKinney / @wesm:
How is it hacky? We can always add allow_truncated_timestamps as an option to Table.from_pandas, but somehow you have to opt in to the lossy conversion.


David Lee / @davlee1972:
Closing for now. I'm not convinced safe is the best solution to address timestamp resolution. If a schema is used, it should be clear that the intent is to convert pandas nanoseconds to a lower resolution. I think the same can be said for other types of conversions, like floats to ints.


Wes McKinney / @wesm:
ETL can be a messy business. If you have ideas about improving the APIs for schema coercion / casting, I'd be interested to discuss more.


David Lee / @davlee1972:
Yeah, I'm trying to figure out the best way to preserve INTs when converting JSON to Parquet.

The problem is more or less summarized here:
https://pandas.pydata.org/pandas-docs/stable/gotchas.html

There are a lot of gotchas with each step.

json.loads() works fine.

pandas.DataFrame() is a problem if every record doesn't contain the same columns.

Using pandas.DataFrame.reindex() to add missing columns adds a bunch of NaN values.

Adding NaN values force-changes a column's dtype from INT64 to FLOAT64.

NaNs are a problem to begin with, because if you convert them to Parquet you end up with zeros instead of nulls.

Running pandas.DataFrame.reindex(fill_value=None) doesn't work, because passing in None is the same as calling pandas.DataFrame.reindex() without any parameters.

The only way to replace NaNs with None is with pandas.DataFrame.where().

After replacing the NaNs you can then change the dtype of the column from FLOAT64 back to INT64.

It's basically a lot of hoops to jump through to preserve your original JSON INT as a Parquet INT (see the sketch below).
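For illustration, a minimal sketch of those hoops, using hypothetical column names; it assumes the where() trick upcasts the column to object and replaces NaN with None, which may vary across pandas versions:

import json

import pandas as pd
import pyarrow as pa

# two JSON records; the second one is missing the 'records' field
records = [json.loads('{"Id": "a", "records": 1}'),
           json.loads('{"Id": "b"}')]

df = pd.DataFrame(records)               # 'records' becomes float64 with a NaN
df = df.where(pd.notnull(df), None)      # NaN -> None (column upcast to object)

# turn the surviving floats back into Python ints, leaving missing values alone
df['records'] = df['records'].map(lambda v: None if pd.isna(v) else int(v))

schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('records', pa.int64()),
])

# ints plus None in an object column convert to a nullable Arrow int64 column
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)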

Maybe the best solution is to create a pyarrow.Table.from_pydict() function to build an Arrow table directly from a Python dictionary. We already have pyarrow.Table.to_pydict(), pyarrow.Table.to_pandas(), and pyarrow.Table.from_pandas(), so from_pydict() is the missing counterpart.
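A sketch of how such a from_pydict() might be used (the column data here is hypothetical):

import pyarrow as pa

# plain Python lists keep ints as ints and represent missing values as None
columns = {
    'Id': ['a', 'b'],
    'records': [1, None],
}

schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('records', pa.int32()),
])

# None becomes a null in the int32 column; no float64 detour through pandas
table = pa.Table.from_pydict(columns, schema=schema)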

asfimport added this to the 0.11.1 milestone Jan 11, 2023