Cannot compare tz-naive and tz-aware timestamps on concat #6925

sdementen · 2020-12-03T14:32:53Z

What happened:

When concatenating two dask dataframes whose indices have dtype=datetime64[ns, UTC], I get a TypeError: Cannot compare tz-naive and tz-aware timestamps. One of the dask dataframes was created with dd.from_pandas and the other with dd.read_parquet.

What you expected to happen:

A happy concatenation ;-)

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import pandas

filename = "outfile.parq"

# create a DF with tz=UTC and write to parquet file
df = pandas.DataFrame(
    columns=["A"], index=pandas.date_range("2018", "2019", freq="MS", tz="UTC"), data=1.0
)
ddf = dd.from_pandas(df, npartitions=1)
ddf.to_parquet(filename)

# read back this dask df into rddf
rddf = dd.read_parquet(filename)

# print indices
print(ddf.index)
print(rddf.index)


# attempt concatenation of dask df from parquet and new dask df
nddf = dd.concat([rddf, ddf])
# this raises the following traceback
# Traceback (most recent call last):
#   File ".../tst_parquet_tz.py", line 22, in <module>
#     nddf = dd.concat([rddf, ddf])
#   File "...\site-packages\dask\dataframe\multi.py", line 1108, in concat
#     if all(
#   File "...\site-packages\dask\dataframe\multi.py", line 1109, in <genexpr>
#     dfs[i].divisions[-1] < dfs[i + 1].divisions[0]
#   File "pandas\_libs\tslibs\timestamps.pyx", line 281, in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__
#   File "pandas\_libs\tslibs\timestamps.pyx", line 295, in pandas._libs.tslibs.timestamps._Timestamp._assert_tzawareness_compat
# TypeError: Cannot compare tz-naive and tz-aware timestamps

Anything else we need to know?:

When printing both indices, we see that the dtype of both is datetime64[ns, UTC], yet the representation of the index coming from read-parquet shows tz-naive dates.

Dask Index Structure:
npartitions=1
2018-01-01 00:00:00+00:00    datetime64[ns, UTC]
2019-01-01 00:00:00+00:00                    ...
dtype: datetime64[ns, UTC]
Dask Name: from_pandas, 2 tasks
Dask Index Structure:
npartitions=1
2018-01-01    datetime64[ns, UTC]
2019-01-01                    ...
dtype: datetime64[ns, UTC]
Dask Name: read-parquet, 2 tasks

Environment:

Dask version: 2.30.0
Python version: 3.8
Operating System: win10
Install method (conda, pip, source): pip


jsignell · 2020-12-04T22:05:31Z

Thanks for raising this. I was able to reproduce using fastparquet, but found that it works as expected when using pyarrow (engine="pyarrow").

It seems fastparquet isn't storing the timezone information properly in the metadata file, but it isn't clear to me whether this is a bug in the dask code base or in fastparquet. Either way, there is an issue with the divisions (the index values that mark the edges of the partitions). As a workaround you can do rddf = rddf.clear_divisions(), or you can use pyarrow.
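To illustrate why mismatched divisions break the concat, here is a minimal sketch using only the standard library. It is an analogy, not dask's actual code: `divisions_are_ordered` is a hypothetical helper that mimics the comparison shown in the traceback (`dfs[i].divisions[-1] < dfs[i + 1].divisions[0]`), and plain `datetime` objects stand in for the pandas timestamps stored in the divisions.

```python
from datetime import datetime, timezone


def divisions_are_ordered(all_divisions):
    """Hypothetical stand-in for the check in the traceback: the last
    division of each frame must sort before the first of the next one."""
    return all(
        all_divisions[i][-1] < all_divisions[i + 1][0]
        for i in range(len(all_divisions) - 1)
    )


# Divisions as from_pandas would keep them: tz-aware (UTC).
aware = (
    datetime(2018, 1, 1, tzinfo=timezone.utc),
    datetime(2019, 1, 1, tzinfo=timezone.utc),
)
# Divisions as they appear after the parquet round trip: tz-naive.
naive = (datetime(2018, 1, 1), datetime(2019, 1, 1))

# Mixing naive and aware values cannot be ordered, so the check raises.
try:
    divisions_are_ordered([naive, aware])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# With consistent tz-awareness the comparison succeeds and simply
# reports that these two identical ranges are not strictly ordered.
print(divisions_are_ordered([aware, aware]))
```

Clearing the divisions sidesteps the problem because there are no longer any boundary values to compare.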

Ping @martindurant for fastparquet expertise.

martindurant · 2020-12-04T22:18:38Z

Will look...

I'll mention that parquet does not store time zones; this information only goes into the pandas metadata (i.e., only for data that originally came from pandas), so it is quite likely that this hasn't been implemented in fastparquet at all.

jsignell · 2020-12-07T19:20:13Z

More about fastparquet and timezones in dask/fastparquet#532

sdementen · 2020-12-10T09:51:28Z

If I can get some guidance on this topic, I can try a PR. I have no view of what happens before or after, nor of the divisions/partitions... So it may be a bit tough for me.

martindurant · 2020-12-10T21:34:44Z

Actually, it may be worth trying again against fastparquet master, because previously a failure to set the timezone was being ignored, and that particular failure should no longer happen. However, it might need more work to apply the same thing to the min/max values.

sdementen · 2020-12-16T13:53:59Z

I have checked with master for dask and fastparquet and the issue is still there.

sdementen mentioned this issue Dec 3, 2020

Append error: TypeError: Cannot compare tz-naive and tz-aware timestamps ranaroussi/pystore#35

Open

sdementen mentioned this issue Dec 9, 2020

uninitialized array leads to failure in tz_localize ("UserWarning: Inferring time-zone from CET in column __null_dask_index__ failed, using time-zone-agnostic") dask/fastparquet#532

Closed

sdementen pushed a commit to sdementen/dask that referenced this issue Dec 16, 2020

fix dask#6925

3329a41

- add rountrip test for df with tz - fix statistics on col with tz

martindurant closed this as completed in d9169b4 Dec 17, 2020
