WIP - Add tests for UCX and cuDF serialization #2746
As far as text data goes, NVStrings has functions to convert the data into Arrow format which would break down into a data buffer, an offsets buffer, and a null buffer (if needed), which you should be able to ship around via your Numba devicearray code. I'm not exactly clear on how we'd plumb that through to the distributed protocol here since it would return those by copy instead of reference, but I imagine it's doable.
I'm currently looking into this... As far as I can tell, I'm a bit more confused by the strings - @kkraus14, can you say a bit more about the correct way to convert the nvstrings data into numba-serializable buffers? Should I be using
@rjzamora if you have code for this could I ask you to publish it as a small PR? This would unblock me on some join work.
Sorry - I got delayed on this and also broke some of my progress from yesterday. I submitted a rough PR in case it helps.
I recently discovered that cudf already has some code for this here: I wonder if we can do this work there.
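For readers unfamiliar with the layout mentioned above, here is a minimal host-side sketch of how a list of strings breaks down into a data buffer, an offsets buffer, and an optional null (validity) buffer. It uses only the Python standard library and does not touch NVStrings, Numba, or the GPU; the function names `to_arrow_buffers` and `from_arrow_buffers` are hypothetical and exist only for illustration.

```python
import struct
from typing import List, Optional, Tuple


def to_arrow_buffers(
    values: List[Optional[str]],
) -> Tuple[bytes, bytes, Optional[bytes]]:
    """Pack strings into (data, offsets, validity), Arrow string layout.

    Hypothetical helper for illustration; not an NVStrings or cudf API.
    """
    data = bytearray()
    offsets = [0]  # n + 1 int32 offsets into the data buffer
    validity = bytearray((len(values) + 7) // 8)  # one bit per value
    has_nulls = False
    for i, value in enumerate(values):
        if value is None:
            has_nulls = True  # bit stays 0, offset does not advance
        else:
            data += value.encode("utf-8")
            validity[i // 8] |= 1 << (i % 8)  # least-significant bit first
        offsets.append(len(data))
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    # The null buffer is only needed when at least one value is missing.
    return bytes(data), offsets_buf, (bytes(validity) if has_nulls else None)


def from_arrow_buffers(
    data: bytes, offsets_buf: bytes, validity: Optional[bytes]
) -> List[Optional[str]]:
    """Rebuild the list of strings from the three buffers."""
    n = len(offsets_buf) // 4 - 1
    offsets = struct.unpack(f"<{n + 1}i", offsets_buf)
    out: List[Optional[str]] = []
    for i in range(n):
        valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        if valid:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out
```

Each of the three results is a flat byte buffer, which is what makes this layout convenient to ship as separate frames.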
The cudf component of this PR should be addressed by rapidsai/cudf#1947. However, maybe we still want to incorporate UCX/cuDF serialization testing?
That sounds reasonable to me
Currently we don't robustly serialize cudf dataframes or series. This causes issues when trying to communicate them in dask workloads. So far this PR adds a few failing tests around cudf serialization and a larger full dask + cudf workload.
The first thing to do here is probably to extend the serialization functions in distributed/protocol/cudf.py to include cudf.Series as well as less standard cudf cases like text data and missing values.
@kkraus14 any suggestions you may have on serializing cudf dataframes would be very welcome.
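As a rough illustration of what "including cudf.Series and missing values" could involve, the sketch below follows the general header-plus-frames shape that serializers in distributed use: a small dictionary of metadata and a list of byte buffers. This is not the code in distributed/protocol/cudf.py and it does not use cudf; `serialize_column` and `deserialize_column` are hypothetical names, and plain `bytes` stand in for device buffers.

```python
from typing import Any, Dict, List, Optional, Tuple


def serialize_column(
    name: str, dtype: str, data: bytes, mask: Optional[bytes] = None
) -> Tuple[Dict[str, Any], List[bytes]]:
    """Split a series-like column into a metadata header and raw frames.

    Hypothetical helper: the null mask travels as an optional second frame.
    """
    frames = [data]
    if mask is not None:
        frames.append(mask)
    header = {
        "name": name,
        "dtype": dtype,
        "has_mask": mask is not None,
        "lengths": [len(frame) for frame in frames],
    }
    return header, frames


def deserialize_column(
    header: Dict[str, Any], frames: List[bytes]
) -> Tuple[str, str, bytes, Optional[bytes]]:
    """Rebuild (name, dtype, data, mask) from the header and frames."""
    data = bytes(frames[0])
    mask = bytes(frames[1]) if header["has_mask"] else None
    return header["name"], header["dtype"], data, mask
```

A round-trip test for such a pair would compare the output of `deserialize_column(*serialize_column(...))` with the inputs, once with a mask and once without, which mirrors the kind of failing tests this PR adds.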
@rjzamora this task may interest you