Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP - Add tests for UCX and cuDF serialization #2746

Open
wants to merge 3 commits into
base: master
from

Conversation

@mrocklin
Copy link
Member

commented Jun 4, 2019

Currently we don't robustly serialize cudf dataframes or series. This causes issues when trying to communciate them in dask workloads. So far this PR adds a few failing tests around cudf serialization and a larger full dask + cudf workload.

The first thing to do here is probably to extend the serialization functions in distributed/protocol/cudf.py to include cudf.Series as well as less standard cudf issues like text data and missing values.

@kkraus14 any suggestions you may have on serializing cudf dataframes would be very welcome.

@rjzamora this task may interest you

@kkraus14

This comment has been minimized.

Copy link

commented Jun 4, 2019

@kkraus14 any suggestions you may have on serializing cudf dataframes would be very welcome.

For cudf.Series at its core is 2 Numba devicearrays, one for the data, one for the index, and then an optional 3rd Numba devicearray if there's nulls. The rest of the the info of things like the dtype, null_count, etc. can be shipped in the header pretty easily. It looks like you've handled it minus the index in https://github.com/dask/distributed/blob/master/distributed/protocol/cudf.py#L22 already.

For cudf.DataFrame it looks like there's a bug where the definition should be deserialize_cudf_dataframe (https://github.com/dask/distributed/blob/master/distributed/protocol/cudf.py#L50) but otherwise looks pretty good to me.

@kkraus14

This comment has been minimized.

Copy link

commented Jun 4, 2019

As far as text data goes, NVStrings has functions to convert the data into Arrow format which would break down into a data buffer, an offsets buffer, and a null buffer (if needed), which you should be able to ship around via your Numba devicearray code. I'm not exactly clear on how we'd plumb that through to the distributed protocol here since it would return those by copy instead of reference, but I imagine it's doable.

@rjzamora

This comment has been minimized.

Copy link

commented Jun 5, 2019

I'm currently looking into this... As far as I can tell, cudf.Series is a relatively simple extension (it is easy to get those cases to pass for the new test).

I'm a bit more confused by the strings - @kkraus14, can you say a bit more about the correct way to convert the nvstrings data into numba-serializable buffers? Should I be using to_offsets(values, offsets) ?

@mrocklin

This comment has been minimized.

Copy link
Member Author

commented Jun 6, 2019

(it is easy to get those cases to pass for the new test).

@rjzamora if you have code for this could I ask you to publish it as a small PR? This would unblock me on some join work.

@rjzamora

This comment has been minimized.

Copy link

commented Jun 6, 2019

Sorry - I got delayed on this and also broke some of my progress from yesterday. I submitted a rough PR in case it helps.

@mrocklin

This comment has been minimized.

@rjzamora

This comment has been minimized.

Copy link

commented Jun 24, 2019

The cudf component of this PR should be addressed by rapidsai/cudf#1947. However, maybe we still want to incorperate UCX/cuDF serialization testing?

@mrocklin

This comment has been minimized.

Copy link
Member Author

commented Jun 24, 2019

That sounds reasonable to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.