Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP - Add tests for UCX and cuDF serialization #2746

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mrocklin
Copy link
Member

@mrocklin mrocklin commented Jun 4, 2019

Currently we don't robustly serialize cudf dataframes or series. This causes issues when trying to communciate them in dask workloads. So far this PR adds a few failing tests around cudf serialization and a larger full dask + cudf workload.

The first thing to do here is probably to extend the serialization functions in distributed/protocol/cudf.py to include cudf.Series as well as less standard cudf issues like text data and missing values.

@kkraus14 any suggestions you may have on serializing cudf dataframes would be very welcome.

@rjzamora this task may interest you

@kkraus14
Copy link
Member

kkraus14 commented Jun 4, 2019

@kkraus14 any suggestions you may have on serializing cudf dataframes would be very welcome.

For cudf.Series at its core is 2 Numba devicearrays, one for the data, one for the index, and then an optional 3rd Numba devicearray if there's nulls. The rest of the the info of things like the dtype, null_count, etc. can be shipped in the header pretty easily. It looks like you've handled it minus the index in https://github.com/dask/distributed/blob/master/distributed/protocol/cudf.py#L22 already.

For cudf.DataFrame it looks like there's a bug where the definition should be deserialize_cudf_dataframe (https://github.com/dask/distributed/blob/master/distributed/protocol/cudf.py#L50) but otherwise looks pretty good to me.

@kkraus14
Copy link
Member

kkraus14 commented Jun 4, 2019

As far as text data goes, NVStrings has functions to convert the data into Arrow format which would break down into a data buffer, an offsets buffer, and a null buffer (if needed), which you should be able to ship around via your Numba devicearray code. I'm not exactly clear on how we'd plumb that through to the distributed protocol here since it would return those by copy instead of reference, but I imagine it's doable.

@rjzamora
Copy link
Member

rjzamora commented Jun 5, 2019

I'm currently looking into this... As far as I can tell, cudf.Series is a relatively simple extension (it is easy to get those cases to pass for the new test).

I'm a bit more confused by the strings - @kkraus14, can you say a bit more about the correct way to convert the nvstrings data into numba-serializable buffers? Should I be using to_offsets(values, offsets) ?

@mrocklin
Copy link
Member Author

mrocklin commented Jun 6, 2019

(it is easy to get those cases to pass for the new test).

@rjzamora if you have code for this could I ask you to publish it as a small PR? This would unblock me on some join work.

@rjzamora
Copy link
Member

rjzamora commented Jun 6, 2019

Sorry - I got delayed on this and also broke some of my progress from yesterday. I submitted a rough PR in case it helps.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 6, 2019

@rjzamora
Copy link
Member

The cudf component of this PR should be addressed by rapidsai/cudf#1947. However, maybe we still want to incorperate UCX/cuDF serialization testing?

@mrocklin
Copy link
Member Author

That sounds reasonable to me

Base automatically changed from master to main March 8, 2021 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants