
[Python] Support converting to non-nano datetime64 for pandas >= 2.0 #33321

Closed · asfimport opened this issue Oct 21, 2022 · 3 comments · Fixed by #35656
asfimport (Collaborator) commented Oct 21, 2022

Pandas is adding the capability to store non-nanosecond datetime64 data. At the moment, however, we always convert to nanoseconds, regardless of the timestamp resolution of the Arrow table (and regardless of the pandas metadata).

Using the development version of pandas:

```
In [1]: df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})

In [2]: df.dtypes
Out[2]:
col    datetime64[s]
dtype: object

In [3]: table = pa.table(df)

In [4]: table.schema
Out[4]:
col: timestamp[s]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 423

In [6]: table.to_pandas().dtypes
Out[6]:
col    datetime64[ns]
dtype: object
```

This is because we have a `coerce_temporal_nanoseconds` conversion option, which we hardcode to True for top-level columns (and to False for nested data).

When users have pandas >= 2.0, we should support converting while preserving the resolution. We should certainly do so if the pandas metadata indicates which resolution was originally used (to ensure a correct roundtrip).
We could (and at some point should) also do this by default when there is no pandas metadata, but perhaps only later, depending on how stable this new feature is in pandas: it is potentially a breaking change for our users, e.g. when using pyarrow to read a Parquet file.
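Concretely, the requested behavior would look like the following minimal sketch (assuming pandas >= 2.0 and a pyarrow version that implements this issue; at the time of writing, the last line still prints `datetime64[ns]`):

```python
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"col": np.array(["2012-01-01", "2012-01-02"], dtype="datetime64[s]")}
)
table = pa.table(df)  # col: timestamp[s]

# Desired: the second resolution survives the conversion back to pandas,
# instead of being coerced to datetime64[ns].
print(table.to_pandas().dtypes)  # col    datetime64[s]
```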

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-18124. Please see the migration documentation for further details.

@danepitkin (Member) commented:

> When users have pandas >= 2.0, we should support converting while preserving the resolution. We should certainly do so if the pandas metadata indicates which resolution was originally used (to ensure a correct roundtrip).
> We could (and at some point should) also do this by default when there is no pandas metadata, but perhaps only later, depending on how stable this new feature is in pandas: it is potentially a breaking change for our users, e.g. when using pyarrow to read a Parquet file.

Can you provide more detail about this paragraph? Does "pandas metadata" mean the `pyarrow/array.pxi::PandasOptions` class? When would pandas metadata not be present, and why would that break Parquet reading?

@jorisvandenbossche (Member) commented:

The "pandas metadata" is custom metadata that we store in the pyarrow schema whenever the data is created from a pandas.DataFrame:

```
>>> df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
>>> df
         col
0 2012-01-01
1 2012-01-02
2 2012-01-03
>>> table = pa.table(df)
>>> table.schema
col: timestamp[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
# easier way to access this (and converted to a dict)
>>> table.schema.pandas_metadata
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 3,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'col',
   'field_name': 'col',
   'pandas_type': 'datetime',
   'numpy_type': 'datetime64[ns]',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '13.0.0.dev106+gfbe5f641d'},
 'pandas_version': '2.1.0.dev0+484.g7187e67500'}
```
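Since `pandas_metadata` is a plain dict, the originally stored dtype can be read back out directly; a minimal sketch (the column name "col" matches the example above):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
table = pa.table(df)

# The 'pandas' schema metadata, parsed into a dict
meta = table.schema.pandas_metadata
# Look up the entry for our column and read the dtype it had in pandas
col_meta = next(c for c in meta["columns"] if c["name"] == "col")
print(col_meta["numpy_type"])  # 'datetime64[ns]'
```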

So this indicates that the original data in the pandas.DataFrame had the "datetime64[ns]" dtype. In this case that matches the Arrow type, but after, for example, a roundtrip through Parquet, this may no longer be the case:

```
>>> pq.write_table(table, "test.parquet")
>>> table2 = pq.read_table("test.parquet")
>>> table2.schema
col: timestamp[us]                 # <--- now us instead of ns
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
>>> table2.schema.pandas_metadata
{ ...
 'columns': [{'name': 'col',
   'field_name': 'col',
   'pandas_type': 'datetime',
   'numpy_type': 'datetime64[ns]',   # <--- but this still indicates ns
   'metadata': None}],
...
```

So the question here is what table2.to_pandas() should do: use the microsecond resolution of the data, or the nanosecond resolution of the metadata?

(Note that this is also a consequence of the default Parquet format version we write not yet supporting nanoseconds; we should probably bump that default, and then nanoseconds would be preserved in the Parquet roundtrip.)
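Illustrating that point, a minimal sketch (assuming `pq.write_table` accepts `version="2.6"`, the Parquet format version that added nanosecond timestamps):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
table = pa.table(df)

# Requesting Parquet format version 2.6 keeps the nanosecond unit on write,
# instead of downcasting it to microseconds.
pq.write_table(table, "test_ns.parquet", version="2.6")
print(pq.read_table("test_ns.parquet").schema.field("col").type)  # timestamp[ns]
```

With the newer format version, the roundtripped data and the pandas metadata in the schema stay consistent.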

Now, I am not sure it would be easy to use the information in the pandas metadata to influence the conversion, as we typically only use the metadata after converting the actual data, to finalize the resulting pandas DataFrame (e.g. set the index, cast the column names, ...).
And I am also not fully sure it would be desirable to follow the pandas metadata, since that would involve an extra conversion step. Moreover, effectively all existing pandas metadata (e.g. in already-written Parquet files) will say the data is nanoseconds, since until recently that was the only resolution pandas supported.

@danepitkin (Member) commented:

Excellent write-up and example! I'd prefer supporting nanoseconds in Parquet and not converting based on the metadata, since it's not guaranteed to be reliable. Filed #35746 to track Parquet support for ns.

galipremsagar added a commit to rapidsai/cudf that referenced this issue May 31, 2023
This PR fixes parquet pytest failures, mostly working around two upstream issues:

1. pandas-dev/pandas#53345
2. apache/arrow#33321

This fixes 20 pytest failures:
This PR:
```
= 231 failed, 95767 passed, 2045 skipped, 764 xfailed, 300 xpassed in 426.65s (0:07:06) =
```
On `pandas_2.0_feature_branch`:
```
= 251 failed, 95747 passed, 2045 skipped, 764 xfailed, 300 xpassed in 433.50s (0:07:13) =
```
danepitkin added commits to danepitkin/arrow that referenced this issue (Jun 5 to Jul 6, 2023)
@raulcd raulcd modified the milestones: 13.0.0, 14.0.0 Jul 7, 2023
jorisvandenbossche added a commit that referenced this issue Jul 7, 2023
…as >= 2.0 (#35656)

Do not coerce temporal types to nanosecond when pandas >= 2.0 is imported, since pandas now supports s/ms/us time units.

This PR adds support for the following Arrow -> Pandas conversions, which previously all defaulted to `datetime64[ns]` or `datetime64[ns, <TZ>]`:
```
date32 -> datetime64[ms]
date64 -> datetime64[ms]
timestamp[s] -> datetime64[s]
timestamp[ms] -> datetime64[ms]
timestamp[us] -> datetime64[us]
timestamp[s, <TZ>] -> datetime64[s, <TZ>]
timestamp[ms, <TZ>] -> datetime64[ms, <TZ>]
timestamp[us, <TZ>] -> datetime64[us, <TZ>]
```
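For backwards compatibility, the old coercion can still be requested explicitly. A minimal sketch, assuming the exposed legacy option is the `coerce_temporal_nanoseconds` argument to `to_pandas()` (the conversion option named earlier in this issue):

```python
import pyarrow as pa

# Integer values are interpreted as seconds since the epoch
table = pa.table({"col": pa.array([0, 1], type=pa.timestamp("s"))})

# New default with pandas >= 2.0: the unit is preserved
print(table.to_pandas().dtypes)                                  # datetime64[s]
# Legacy behavior: coerce back to nanoseconds
print(table.to_pandas(coerce_temporal_nanoseconds=True).dtypes)  # datetime64[ns]
```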
### Rationale for this change

Pandas 2.0 introduces proper support for non-nanosecond temporal types (second, millisecond, and microsecond resolutions).

### Are these changes tested?

Yes. Pytests added and updated.

### Are there any user-facing changes?

Yes, arrow-to-pandas default conversion behavior will change when users have pandas >= 2.0, but a legacy option is exposed to provide backwards compatibility.
* Closes: #33321

Lead-authored-by: Dane Pitkin <dane@voltrondata.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche modified the milestones: 14.0.0, 13.0.0 Jul 7, 2023
jorisvandenbossche added a commit that referenced this issue Jul 10, 2023
…36586)

See https://github.com/apache/arrow/pull/35656/files#r1257540254

#36542 fixed the no-pandas build by actually not having pandas installed, but some PRs merged at the same time introduced new failures for this build.
* Closes: #36541

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
jorisvandenbossche pushed a commit that referenced this issue Jul 13, 2023
…ges in arrow->pandas conversion (#36630)

### Rationale for this change

Due to the changes on #33321 a dask test started failing.

### What changes are included in this PR?

Skip the test in the meantime

### Are these changes tested?

Yes, with crossbow

### Are there any user-facing changes?

No
* Closes: #36629

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
raulcd added a commit that referenced this issue Jul 13, 2023
…ges in arrow->pandas conversion (#36630) (same description as above)
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this issue Jul 20, 2023
…d changes in arrow->pandas conversion (apache#36630) (same description as above)
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…d changes in arrow->pandas conversion (apache#36630) (same description as above)
@amoeba amoeba added the Breaking Change Includes a breaking change to the API label Nov 22, 2023