Can not process datasets created by the older version of Dask #11160
Comments
What pyarrow version did you use to create the dataset, and which one are you using to read it? |
@fjetter, thanks for the quick reply. We installed pyarrow The important observation is that the TypeError disappears if I take only part of the dataset as follows: ddf.loc[:100000] However, disabling the
(I've added a note to the description) |
@fjetter , in case of I've added the debug code on all nodes as follows:
Here is an output:
|
@fjetter, the problem is that I'm shuffling using the existing column. Dask should provide a proper error message in this case. I've added reproduction steps. Please see:
|
Thanks for chasing this down! I agree this should raise; I opened #11174 to address this. |
@fjetter , the problem isn't strictly related to |
indeed, I transferred the issue to |
@fjetter , after spending some time testing the solution, I've found that there are two separate problems: Problem 1 Problem 2 |
Describe the issue:
After upgrading Dask from 2023.9.3 to the latest version 2024.5.2 or 2024.4.1, we cannot load the existing parquet files created by the previous version.
I'm getting an error during the to_parquet operation (when dask-expr is enabled):
Disabling dask-expr leads to a different error during the repartition operation:
The dataset consists of 3 parquet files (e.g., dataset.parquet/part.X.parquet) with the following Dask dtypes:
Pandas dtypes:
The index is named __null_dask_index__.
The important observation is that the TypeError disappears if I take only part of the dataset as follows:
However, disabling dask-expr still leads to an error:
Minimal Complete Verifiable Example:
Anything else we need to know?:
Environment:
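Since the thread turns on which package versions wrote and read the dataset, the version details can be collected with a short script. This is only a minimal sketch using the Python standard library; the helper name installed_versions and the package list are assumptions chosen for illustration, not part of the original report.

```python
from importlib import metadata

# Packages mentioned in this thread; the exact list is an assumption.
PACKAGES = ("dask", "dask-expr", "pyarrow", "pandas")


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is absent from this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running the same script on the client and on each worker node should make any version mismatch between them easy to spot.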