Dask dataframe from large pandas dataframe cannot be computed on cluster #10644
Comments
Indeed, we are deserializing the dataframe on the scheduler. This goes back to a change that was done in 2023.4.0 (dask/distributed#7564), where we started to deserialize everything on the scheduler side. This was mostly done to reduce code complexity.
We typically strongly discourage sending such large dataframes to the scheduler; instead, use a dask API to read the data rather than a pandas API (for example read_parquet, see https://docs.dask.org/en/stable/dataframe-api.html#create-dataframes for a list). from_pandas is mostly used for examples and demos, but I recommend using a dask-native API instead.
The equivalent for your toy example would be
import dask.array as da
import dask.dataframe as dd
dd.from_array(da.random.random((550000, 500)))
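For real data, the same idea means reading the file directly with the dask API instead of materializing it with pandas first. A minimal sketch of the two patterns; the parquet path and npartitions value are hypothetical:
import dask.dataframe as dd

# Anti-pattern: the full frame is built in local memory and then
# shipped to the scheduler when the graph is submitted.
# import pandas as pd
# pdf = pd.read_parquet("data.parquet")
# ddf = dd.from_pandas(pdf, npartitions=100)

# Recommended: dask reads the data lazily, and each worker loads
# only its own partitions, so nothing large touches the scheduler.
ddf = dd.read_parquet("data.parquet")
ddf.mean().compute()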
Thanks, I'll rework my code to follow the best practices.
It looks like something goes wrong when computing a dask dataframe made from a large pandas dataframe on a cluster (it cannot be deserialized). On my machine it happens whenever the pandas dataframe is larger than 2 GB, regardless of what npartitions is set to. It only happens when computing on a cluster; computing locally without a cluster works fine. From the source code it looks like the entire dataframe is being deserialized on the scheduler for some reason, but I could be wrong here. This starts to occur in 2023.4.0; 2023.3.2 doesn't have this problem.
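A minimal reproduction sketch, assuming a LocalCluster; the shape matches the toy example discussed above, and the npartitions value and mean() computation are arbitrary choices:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

client = Client(LocalCluster())

# 550000 x 500 float64 values is roughly 2.2 GB in memory,
# i.e. above the ~2 GB threshold where the failure appears.
pdf = pd.DataFrame(np.random.random((550000, 500)))
ddf = dd.from_pandas(pdf, npartitions=10)

# Fails on the cluster from 2023.4.0 onward; computing without a
# cluster works fine.
ddf.mean().compute()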
Error message: