
Cannot compare tz-naive and tz-aware timestamps on concat #6925

Closed
sdementen opened this issue Dec 3, 2020 · 6 comments

@sdementen
Contributor

What happened:

When concatenating two dask dataframes with indices of dtype=datetime64[ns, UTC], I get a TypeError: Cannot compare tz-naive and tz-aware timestamps. One of the dask dataframes was created with dd.from_pandas and the other with dd.read_parquet.

What you expected to happen:

A happy concatenation ;-)

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import pandas

filename = "outfile.parq"

# create a DF with tz=UTC and write to parquet file
df = pandas.DataFrame(
    columns=["A"], index=pandas.date_range("2018", "2019", freq="MS", tz="UTC"), data=1.0
)
ddf = dd.from_pandas(df, npartitions=1)
ddf.to_parquet(filename)

# read back this dask df into rddf
rddf = dd.read_parquet(filename)

# print indices
print(ddf.index)
print(rddf.index)


# attempt concatenation of dask df from parquet and new dask df
nddf = dd.concat([rddf, ddf])
# this raises the following traceback
# Traceback (most recent call last):
#   File ".../tst_parquet_tz.py", line 22, in <module>
#     nddf = dd.concat([rddf, ddf])
#   File "...\site-packages\dask\dataframe\multi.py", line 1108, in concat
#     if all(
#   File "...\site-packages\dask\dataframe\multi.py", line 1109, in <genexpr>
#     dfs[i].divisions[-1] < dfs[i + 1].divisions[0]
#   File "pandas\_libs\tslibs\timestamps.pyx", line 281, in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__
#   File "pandas\_libs\tslibs\timestamps.pyx", line 295, in pandas._libs.tslibs.timestamps._Timestamp._assert_tzawareness_compat
# TypeError: Cannot compare tz-naive and tz-aware timestamps
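
The failing comparison itself can be reproduced in plain pandas. A minimal sketch of what dd.concat runs into when it checks that the divisions are ordered:

import pandas

# dd.concat checks whether the frames' divisions are ordered, which here
# means comparing a tz-naive Timestamp against a tz-aware one:
pandas.Timestamp("2018-01-01") < pandas.Timestamp("2018-01-01", tz="UTC")
# TypeError: Cannot compare tz-naive and tz-aware timestamps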

Anything else we need to know?:

When printing both indices, we see that the dtype of both indices is datetime64[ns, UTC], yet the representation of the index coming from read-parquet shows tz-naive dates.

Dask Index Structure:
npartitions=1
2018-01-01 00:00:00+00:00    datetime64[ns, UTC]
2019-01-01 00:00:00+00:00                    ...
dtype: datetime64[ns, UTC]
Dask Name: from_pandas, 2 tasks
Dask Index Structure:
npartitions=1
2018-01-01    datetime64[ns, UTC]
2019-01-01                    ...
dtype: datetime64[ns, UTC]
Dask Name: read-parquet, 2 tasks
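
The same mismatch shows up on the divisions themselves, which is exactly what dd.concat compares (see the traceback above). A sketch continuing the example; the printed values are approximate:

# divisions are the known partition boundaries that dd.concat compares
print(ddf.divisions)   # roughly (Timestamp('2018-01-01 00:00:00+0000', tz='UTC'), Timestamp('2019-01-01 00:00:00+0000', tz='UTC'))
print(rddf.divisions)  # roughly (Timestamp('2018-01-01 00:00:00'), Timestamp('2019-01-01 00:00:00')) -- tz-naive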

Environment:

Dask version: 2.30.0
Python version: 3.8
Operating System: win10
Install method (conda, pip, source): pip

@jsignell
Member

jsignell commented Dec 4, 2020

Thanks for raising this. I was able to reproduce using fastparquet, but found that it works as expected when using pyarrow (engine="pyarrow").

It seems fastparquet isn't storing the timezone information properly in the metadata file, but it isn't clear to me whether this is a bug in the dask code base or whether it lives in fastparquet. Either way there is an issue with the divisions (the known boundaries of the partitions). As a workaround you can do rddf = rddf.clear_divisions(), or you can use pyarrow; both are sketched below.
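
Both workarounds, as a sketch continuing the example from the report:

# Workaround 1: drop the divisions read from the parquet metadata; with
# unknown divisions, dd.concat skips the failing boundary comparison.
rddf = dd.read_parquet(filename).clear_divisions()
nddf = dd.concat([rddf, ddf])

# Workaround 2: read with the pyarrow engine, which round-trips the timezone.
rddf = dd.read_parquet(filename, engine="pyarrow")
nddf = dd.concat([rddf, ddf])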

Ping @martindurant for fastparquet expertise.

@martindurant
Member

Will look...

I'll mention that parquet does not store time zones; this information only goes into the pandas metadata (i.e., only for data that came from pandas originally), so it is quite likely that this hasn't been implemented in fastparquet at all.
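
To illustrate where the timezone lives: a sketch that inspects the "pandas" key-value metadata stored alongside the parquet schema, assuming pyarrow is installed and that dask wrote a part file named part.0.parquet inside the output directory (both are assumptions, not part of the report):

import json
import pyarrow.parquet as pq

# The parquet schema itself carries no timezone; the "pandas" key-value
# metadata (written only for data that came from pandas) is where a writer
# would record it.
schema = pq.read_schema("outfile.parq/part.0.parquet")  # part-file name is an assumption
meta = json.loads(schema.metadata[b"pandas"])
for col in meta["columns"]:
    print(col["name"], col["pandas_type"], col.get("metadata"))
# A correctly written tz-aware index should show pandas_type 'datetimetz'
# with metadata {'timezone': 'UTC'}.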

@jsignell
Member

jsignell commented Dec 7, 2020

More about fastparquet and timezones in dask/fastparquet#532

@sdementen
Contributor Author

If I can get some guidance on this topic, I can try a PR. I have no view on what is happening before/after, nor on the divisions/partitions... so it may be a bit tough for me.

@martindurant
Member

Actually, it may be worth trying again against fastparquet master, because previously a failure to set the timezone was being silently ignored, and that particular failure should no longer happen. However, it might need more work to apply the same fix to the min/max values.

@sdementen
Contributor Author

I have checked against master for both dask and fastparquet, and the issue is still there.

sdementen pushed a commit to sdementen/dask that referenced this issue Dec 16, 2020
- add roundtrip test for df with tz
- fix statistics on col with tz