Backend library dispatching for IO in Dask-Array and Dask-DataFrame #9475
Apologies, I wasn't able to get to a detailed review on this today. I have a few non-blocking comments on the .rst doc added here but, as they're non-blocking, I'd prefer to just submit a follow-up PR for folks to review. @rjzamora @ian-r-rose, unless I'm needed for something, I'll defer to you two as to when this PR is good to merge.
I think the "public" component of this PR is ready to go, and the internal pieces can be cleaned up if/when needed. However, I'm a bit biased, so I'll gladly defer to the opinions of @ian-r-rose and @wence- :)
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
It looks like we're seeing conflicts with #9283, which was just merged, and adds an entrypoint section. If you merge
Ah-ha! I was getting very confused. Makes sense.
Note: This branch includes the necessary changes to merge #9038 with this PR.
Nice! That was quick
I'm happy with where this stands now, thanks for all your work on this @rjzamora!
Thanks for the tremendous help here @ian-r-rose @wence- @jrbourbeau! Please do ping me if this causes any unexpected problems :)
This PR depends on dask/dask#9475 (**Now Merged**)

After dask#9475, external libraries are now able to implement (and expose) their own `DataFrameBackendEntrypoint` definitions to specify custom creation functions for DataFrame collections. This PR introduces the `CudfBackendEntrypoint` class to create `dask_cudf.DataFrame` collections using the `dask.dataframe` API.

By installing `dask_cudf` with this entrypoint definition in place, you get the following behavior in `dask.dataframe`:

```python
import dask.dataframe as dd
import dask

# Tell Dask that you want to create DataFrame collections
# with the "cudf" backend (for supported creation functions).
# This can also be used in a context, or set in a yaml file
dask.config.set({"dataframe.backend": "cudf"})

ddf = dd.from_dict({"a": range(10)}, npartitions=2)
type(ddf)  # dask_cudf.core.DataFrame
```

Note that the code snippet above does not require an explicit import of `cudf` or `dask_cudf`.

The following creation functions will support backend dispatching after dask#9475:

- `from_dict`
- `read_parquet`
- `read_json`
- `read_orc`
- `read_csv`
- `read_hdf`

See also: dask/design-docs#1

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11920
This PR serves as a reference implementation for (what is expected to be) "Dask Design Proposal 1": dask/design-docs#1
Required cudf branch for Dask-DataFrame dispatching to cudf: https://github.com/rjzamora/cudf/tree/backend-class