Backend library dispatching for IO in Dask-Array and Dask-DataFrame #9475

Merged
merged 74 commits into from
Oct 14, 2022

Conversation

rjzamora
Member

@rjzamora rjzamora commented Sep 8, 2022

This PR serves as a reference implementation for (what is expected to be) "Dask Design Proposal 1": dask/design-docs#1

Required cudf branch for Dask-DataFrame dispatching to cudf: https://github.com/rjzamora/cudf/tree/backend-class

@rjzamora added the dataframe, array, io, and enhancement (Improve existing functionality or make things work better) labels Sep 8, 2022
@github-actions bot added the dispatch (Related to `Dispatch` extension objects) label Sep 8, 2022
Member

@jrbourbeau jrbourbeau left a comment

Apologies, I wasn't able to get to a detailed review on this today. I have a few non-blocking comments on the .rst doc added here but, as they're non-blocking, I'd prefer to just submit a follow-up PR for folks to review. @rjzamora @ian-r-rose, unless I'm needed for something, I'll defer to you two as to when this PR is good to merge.

@rjzamora
Member Author

> @rjzamora @ian-r-rose, unless I'm needed for something, I'll defer to you two as to when this PR is good to merge.

I think the "public" component of this PR is ready to go, and the internal pieces can be cleaned up if/when needed. However, I'm a bit biased, so I'll gladly defer to the opinions of @ian-r-rose and @wence- :)

rjzamora and others added 4 commits October 14, 2022 12:08
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
@jrbourbeau
Member

It looks like we're seeing conflicts with #9283, which was just merged, and adds an entrypoint section. If you merge main, and then add the array backend option to the entrypoint section (which should now already exist in setup.cfg), I think all the formatting checks should pass.
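
For context, an entrypoint section is how an installed package advertises plugins that other libraries can discover at runtime. A minimal sketch of that discovery step, using only the standard library, might look like the following. The group name `dask.array.backends` is an assumption (it is not stated in this thread), and `list_entry_points` is a hypothetical helper.

```python
from importlib.metadata import entry_points


def list_entry_points(group):
    """Return the sorted names of installed entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the returned object supports select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Assumed group name; prints an empty list if nothing is registered under it.
print(list_entry_points("dask.array.backends"))
```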

@rjzamora
Member Author

> It looks like we're seeing conflicts with #9283, which was just merged, and adds an entrypoint section

Ah-ha! I was getting very confused. Makes sense.

@rjzamora
Member Author

Note: This branch includes the necessary changes to merge #9038 with this PR.

@ian-r-rose
Collaborator

> Note: This branch includes the necessary changes to merge #9038 with this PR.

Nice! That was quick

Collaborator

@ian-r-rose ian-r-rose left a comment

I'm happy with where this stands now, thanks for all your work on this @rjzamora!

@rjzamora rjzamora merged commit c4d35f5 into dask:main Oct 14, 2022
@rjzamora
Member Author

Thanks for the tremendous help here @ian-r-rose @wence- @jrbourbeau! Please do ping me if this causes any unexpected problems :)

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Oct 20, 2022
This PR depends on dask/dask#9475 (**Now Merged**)

After dask#9475, external libraries are now able to implement (and expose) their own `DataFrameBackendEntrypoint` definitions to specify custom creation functions for DataFrame collections. This PR introduces the `CudfBackendEntrypoint` class to create `dask_cudf.DataFrame` collections using the `dask.dataframe` API. By installing `dask_cudf` with this entrypoint definition in place, you get the following behavior in `dask.dataframe`:

```python
import dask.dataframe as dd
import dask

# Tell Dask that you want to create DataFrame collections
# with the "cudf" backend (for supported creation functions).
# This can also be used in a context, or set in a yaml file
dask.config.set({"dataframe.backend": "cudf"})

ddf = dd.from_dict({"a": range(10)}, npartitions=2)
type(ddf)  # dask_cudf.core.DataFrame
```

Note that the code snippet above does not require an explicit import of `cudf` or `dask_cudf`. The following creation functions will support backend dispatching after dask#9475:

- `from_dict`
- `read_parquet`
- `read_json`
- `read_orc`
- `read_csv`
- `read_hdf`
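
The comment in the snippet above notes that the backend option can also be set within a context. A minimal sketch of that form uses `dask.config.set` as a context manager. `backend_inside_context` is a hypothetical helper, and the import is guarded only so the sketch can be skipped where Dask is not installed.

```python
try:
    import dask
except ImportError:  # Dask not installed; the sketch becomes a no-op.
    dask = None


def backend_inside_context(name):
    """Return the "dataframe.backend" value seen inside a temporary override."""
    if dask is None:
        return None
    # The override applies only inside the `with` block; the previous
    # configuration is restored on exit.
    with dask.config.set({"dataframe.backend": name}):
        return dask.config.get("dataframe.backend")


print(backend_inside_context("pandas"))
```

Scoping the option this way keeps the choice of backend local to one block of collection-creation code instead of changing it for the whole process.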

See also: dask/design-docs#1
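
For intuition only, the dispatching idea described above can be sketched without any Dask dependency. This is a toy model under stated assumptions, not Dask's implementation: `ToyBackendEntrypoint`, `ListBackend`, `TupleBackend`, `BACKENDS`, and `CONFIG` are all hypothetical names invented for illustration.

```python
class ToyBackendEntrypoint:
    """Hypothetical base class: one method per dispatchable creation function."""

    def from_dict(self, data):
        raise NotImplementedError


class ListBackend(ToyBackendEntrypoint):
    def from_dict(self, data):
        # Store each column as a plain list.
        return {key: list(values) for key, values in data.items()}


class TupleBackend(ToyBackendEntrypoint):
    def from_dict(self, data):
        # Store each column as an immutable tuple.
        return {key: tuple(values) for key, values in data.items()}


# Hypothetical registry and config, filled in by hand for this sketch.
BACKENDS = {"list": ListBackend(), "tuple": TupleBackend()}
CONFIG = {"dataframe.backend": "list"}


def from_dict(data):
    """Dispatch to whichever backend the config currently names."""
    name = CONFIG["dataframe.backend"]
    try:
        backend = BACKENDS[name]
    except KeyError:
        raise ValueError(f"No backend registered under {name!r}") from None
    return backend.from_dict(data)


print(type(from_dict({"a": range(3)})["a"]).__name__)  # list
CONFIG["dataframe.backend"] = "tuple"
print(type(from_dict({"a": range(3)})["a"]).__name__)  # tuple
```

The caller never imports a backend class directly; changing the config value is enough to change which implementation builds the result.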

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11920
@rjzamora rjzamora deleted the backend-class branch October 31, 2022 21:22
Labels
array dataframe dispatch Related to `Dispatch` extension objects documentation Improve or add to documentation enhancement Improve existing functionality or make things work better io
6 participants