WIP - Add tests for UCX and cuDF serialization #2746

mrocklin · 2019-06-04T00:32:38Z

Currently we don't robustly serialize cudf dataframes or series. This causes issues when trying to communicate them in dask workloads. So far this PR adds a few failing tests around cudf serialization and a larger full dask + cudf workload.

The first thing to do here is probably to extend the serialization functions in distributed/protocol/cudf.py to include cudf.Series as well as less standard cudf issues like text data and missing values.

@kkraus14 any suggestions you may have on serializing cudf dataframes would be very welcome.

@rjzamora this task may interest you

kkraus14 · 2019-06-04T01:28:58Z

@kkraus14 any suggestions you may have on serializing cudf dataframes would be very welcome.

A cudf.Series is at its core two Numba devicearrays, one for the data and one for the index, plus an optional third Numba devicearray if there are nulls. The rest of the info, such as the dtype, null_count, etc., can be shipped in the header pretty easily. It looks like you've already handled it, minus the index, in https://github.com/dask/distributed/blob/master/distributed/protocol/cudf.py#L22.
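To make the header/buffers split concrete, here is a minimal sketch of the layout described above. It is not the code in distributed/protocol/cudf.py: the function names `serialize_series` and `deserialize_series` are hypothetical, standard-library `array` objects stand in for Numba devicearrays, and the null mask uses one byte per row for simplicity, which is an assumption rather than cudf's actual mask format.

```python
from array import array


def serialize_series(data, index, null_mask=None, name=None):
    """Split a Series-like object into a small header plus raw buffers.

    `data` and `index` are array.array objects standing in for device arrays.
    `null_mask` is an optional bytearray with one byte per row (1 = valid).
    """
    frames = [data.tobytes(), index.tobytes()]
    header = {
        "name": name,
        "dtype": data.typecode,
        "index_dtype": index.typecode,
        "has_mask": null_mask is not None,
        "null_count": 0,
    }
    if null_mask is not None:
        # Metadata such as null_count travels in the header, not in a buffer.
        header["null_count"] = len(null_mask) - sum(null_mask)
        frames.append(bytes(null_mask))
    return header, frames


def deserialize_series(header, frames):
    """Rebuild (data, index, null_mask) from the header and buffers."""
    data = array(header["dtype"])
    data.frombytes(frames[0])
    index = array(header["index_dtype"])
    index.frombytes(frames[1])
    null_mask = bytearray(frames[2]) if header["has_mask"] else None
    return data, index, null_mask
```

The third frame is only present when the header says so, which mirrors the "optional third array if there are nulls" point: the receiver reads the header first and then knows how many buffers to expect.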

For cudf.DataFrame it looks like there's a bug where the definition should be deserialize_cudf_dataframe (https://github.com/dask/distributed/blob/master/distributed/protocol/cudf.py#L50), but otherwise it looks pretty good to me.

kkraus14 · 2019-06-04T01:46:27Z

As far as text data goes, NVStrings has functions to convert the data into Arrow format which would break down into a data buffer, an offsets buffer, and a null buffer (if needed), which you should be able to ship around via your Numba devicearray code. I'm not exactly clear on how we'd plumb that through to the distributed protocol here since it would return those by copy instead of reference, but I imagine it's doable.
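For reference, the Arrow string layout mentioned above can be sketched in plain Python. This only illustrates the three buffers (data, offsets, validity bitmap) on the host; it does not use NVStrings, and the helper names `to_arrow_string_buffers` and `from_arrow_string_buffers` are hypothetical. It assumes `array("i")` is 32 bits wide, which holds on common platforms.

```python
from array import array


def to_arrow_string_buffers(values):
    """Encode a list of str/None into (data, offsets, null_bitmap).

    data:        all UTF-8 bytes concatenated
    offsets:     n + 1 integers; string i is data[offsets[i]:offsets[i + 1]]
    null_bitmap: one bit per row, least-significant bit first, 1 = valid;
                 None when there are no nulls
    """
    data = bytearray()
    offsets = array("i", [0])
    bitmap = bytearray((len(values) + 7) // 8)
    has_nulls = False
    for i, value in enumerate(values):
        if value is None:
            has_nulls = True  # null rows contribute no bytes to the data buffer
        else:
            data += value.encode("utf-8")
            bitmap[i // 8] |= 1 << (i % 8)
        offsets.append(len(data))
    return bytes(data), offsets, (bytes(bitmap) if has_nulls else None)


def from_arrow_string_buffers(data, offsets, bitmap):
    """Decode the three buffers back into a list of str/None."""
    out = []
    for i in range(len(offsets) - 1):
        if bitmap is not None and not (bitmap[i // 8] >> (i % 8)) & 1:
            out.append(None)
        else:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
    return out
```

Each of the three buffers is a flat block of bytes, so in principle each could be shipped as its own frame in the same way as a numeric column.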

rjzamora · 2019-06-05T21:37:20Z

I'm currently looking into this... As far as I can tell, cudf.Series is a relatively simple extension (it is easy to get those cases to pass for the new test).

I'm a bit more confused by the strings. @kkraus14, can you say a bit more about the correct way to convert the nvstrings data into numba-serializable buffers? Should I be using to_offsets(values, offsets)?

mrocklin · 2019-06-06T18:42:37Z

(it is easy to get those cases to pass for the new test).

@rjzamora if you have code for this could I ask you to publish it as a small PR? This would unblock me on some join work.

rjzamora · 2019-06-06T19:18:36Z

Sorry - I got delayed on this and also broke some of my progress from yesterday. I submitted a rough PR in case it helps.

mrocklin · 2019-06-06T19:44:26Z

I recently discovered that cudf already has some code for this here:

https://github.com/rapidsai/cudf/blob/7312d4aa413eafdad28b67efccbc3dc15d9b5e27/python/cudf/dataframe/dataframe.py#L122-L135

https://github.com/rapidsai/cudf/blob/7312d4aa413eafdad28b67efccbc3dc15d9b5e27/python/cudf/dataframe/series.py#L102-L111

I wonder if we can do this work there.

rjzamora · 2019-06-24T14:17:08Z

The cudf component of this PR should be addressed by rapidsai/cudf#1947. However, maybe we still want to incorporate UCX/cuDF serialization testing?

mrocklin · 2019-06-24T14:17:46Z

That sounds reasonable to me

mrocklin added 3 commits June 3, 2019 17:29

Add tests for cudf serialization

93f714e

Add test for cudf with ucx

61d6c35

Remove unnecessary is_cuda entry from header

30145ab

rjzamora mentioned this pull request Jun 6, 2019

[WIP] Some progress for cudf serialization mrocklin/distributed#4

Closed

Base automatically changed from master to main March 8, 2021 19:03

mrocklin requested a review from fjetter as a code owner January 23, 2024 10:57
