Previously working time series resampling breaks in new version of Dask #11115

Closed
andersglindstrom opened this issue May 10, 2024 · 3 comments · Fixed by dask/dask-expr#1063

@andersglindstrom

Describe the issue:

Resampling over a time series used to work with dask==2024.2.1. However, with dask==2024.5.0 and dask-expr==1.1.0, it raises a TypeError.

Minimal Complete Verifiable Example:

import dask.dataframe as dd

df = dd.DataFrame.from_dict({
    "Date":["11/26/2017", "11/26/2017"],
    "Time":["17:00:00.067", "17:00:00.102"],
    "Volume": [403, 3]
}, npartitions=1)

df["Timestamp"] = dd.to_datetime(df.Date) + dd.to_timedelta(df.Time)

df.to_parquet("test.parquet", write_metadata_file=True)
df = dd.read_parquet("test.parquet", index="Timestamp", calculate_divisions=True)

# This resampling breaks
minutely_volume = df[["Volume"]].resample("min").sum().compute()
print(minutely_volume)

Anything else we need to know?:
With old version:

<Deprecation warning about deprecated Dask version...>
Timestamp
2017-11-26 17:00:00     406
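
As a cross-check of the expected result, the same aggregation can be sketched in plain pandas, without Dask or the Parquet round trip. This is only an illustration under that assumption: the names `pdf` and `expected` are introduced here and are not part of the original report.

```python
import pandas as pd

# Same data as in the example above, built directly as a pandas DataFrame.
pdf = pd.DataFrame({
    "Date": ["11/26/2017", "11/26/2017"],
    "Time": ["17:00:00.067", "17:00:00.102"],
    "Volume": [403, 3],
})

# Combine the date and the time-of-day offset into a single timestamp.
pdf["Timestamp"] = (
    pd.to_datetime(pdf["Date"], format="%m/%d/%Y") + pd.to_timedelta(pdf["Time"])
)

# Resample to one-minute bins and sum the volume in each bin.
expected = pdf.set_index("Timestamp")[["Volume"]].resample("min").sum()
print(expected)
```

Both rows fall into the 17:00 bin, so the single resulting row should carry the summed volume of 406, matching the output from the old Dask version.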

With new version:

Traceback (most recent call last):
  File "/home/anders/src/daskbug/daskbug.py", line 14, in <module>
    minutely_volume = df[["Volume"]].resample("min").sum().compute()
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_collection.py", line 475, in compute
    out = out.optimize(fuse=fuse)
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_collection.py", line 590, in optimize
    return new_collection(self.expr.optimize(fuse=fuse))
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_expr.py", line 94, in optimize
    return optimize(self, **kwargs)
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_expr.py", line 3028, in optimize
    return optimize_until(expr, stage)
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_expr.py", line 2989, in optimize_until
    expr = expr.lower_completely()
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_core.py", line 436, in lower_completely
    new = expr.lower_once()
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_core.py", line 393, in lower_once
    out = expr._lower()
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_resample.py", line 74, in _lower
    self.frame, new_divisions=self._resample_divisions[0], force=True
  File "/usr/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask_expr/_resample.py", line 64, in _resample_divisions
    return _resample_bin_and_out_divs(
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/dask/dataframe/tseries/resample.py", line 64, in _resample_bin_and_out_divs
    temp = divs.resample(rule, closed=closed, label="left").count()
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/pandas/core/generic.py", line 9771, in resample
    return get_resampler(
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/pandas/core/resample.py", line 2050, in get_resampler
    return tg._get_resampler(obj, kind=kind)
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/pandas/core/resample.py", line 2229, in _get_resampler
    _, ax, _ = self._set_grouper(obj, gpr_index=None)
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/pandas/core/resample.py", line 2529, in _set_grouper
    obj, ax, indexer = super()._set_grouper(obj, sort, gpr_index=gpr_index)
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/pandas/core/groupby/grouper.py", line 403, in _set_grouper
    indexer = self._indexer_deprecated = ax.array.argsort(
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/pandas/core/arrays/base.py", line 848, in argsort
    return nargsort(
  File "/home/anders/src/daskbug/venv2/lib/python3.10/site-packages/pandas/core/sorting.py", line 439, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'Timestamp' and 'int'
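
The last frames show pandas sorting the index of the divisions series and failing on a comparison between a `Timestamp` and an `int`. The snippet below only reproduces that kind of comparison error in isolation. It does not identify the root cause inside dask-expr, which this thread does not describe; that an integer ended up among the divisions is an inference from the error message alone.

```python
import pandas as pd

# A timestamp like the ones in the example, compared against a plain integer.
ts = pd.Timestamp("2017-11-26 17:00:00.067")

message = None
try:
    ts < 0  # Timestamp and int have no defined ordering
except TypeError as exc:
    message = str(exc)

print(message)
```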

Environment:

  • Dask version: 2024.5.0
  • Dask-expr version: 1.1.0
  • Python version: 3.10.12
  • Operating System: Ubuntu 22.04.3 LTS
  • Install method (conda, pip, source): pip
@github-actions bot added the needs triage label on May 10, 2024
@andersglindstrom
Author

Here are two requirements files, each the output of pip freeze in one of the two environments: the one that works and the one that doesn't.
pip_freeze_working.txt
pip_freeze_broken.txt
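
For a quicker comparison than a full pip freeze, the versions of the relevant packages can be printed from inside each environment. This is a minimal sketch using the standard library; the helper name `installed_versions` and the choice of packages to query are assumptions made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a mapping of package name to installed version (None if absent)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment.
            found[name] = None
    return found


versions = installed_versions(["dask", "dask-expr", "pandas"])
for name, ver in versions.items():
    print(f"{name}: {ver}")
```

Running this in both virtual environments shows at a glance which of the three packages differ.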

@phofl added the dataframe and dask-expr labels and removed the needs triage label on May 14, 2024
@phofl
Collaborator

phofl commented May 14, 2024

thanks for the report, I'll prepare a fix

@andersglindstrom
Author

Thanks for fixing 👍
