Generalize `dd.to_datetime` for GPU-backed collections, introduce `get_meta_library` utility #9881
Conversation
I think this utility makes sense @charlesbluca - thanks!

My only preference would be to replace the term `serial` with `meta` throughout. I think this would make it clearer that we are extracting the module name for the collection's current "meta" property. This naming would also be more consistent with the existing `meta_lib_from_array` dispatch.

Side note: I agree with your decision to use a utility instead of a dispatch function (like `meta_lib_from_array`). The same logic already works for cudf, pandas, cupy, and numpy, so there is really no reason to over-engineer a dispatch-based solution for this.
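For concreteness, here is a minimal sketch of what such a utility might look like. The name matches the one proposed above, but the lookup strategy (inspecting `_meta` and importing the top-level package of its type) is an assumption for illustration, not necessarily dask's actual implementation:

```python
import importlib


def get_meta_library(like):
    """Return the top-level library module backing an object.

    Hypothetical sketch: for a Dask collection, inspect its ``_meta``
    attribute, then import the top-level package that defines the
    meta's type (``pandas`` for a pandas-backed Series, ``cudf`` for a
    GPU-backed one, and similarly ``numpy``/``cupy`` for arrays).
    """
    if hasattr(like, "_meta"):
        like = like._meta
    return importlib.import_module(type(like).__module__.split(".")[0])
```

Because the same attribute lookup works uniformly across pandas, cudf, numpy, and cupy objects, a plain function suffices here and no per-type dispatch registration is needed.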
+1, this seems reasonable to me. I can think of other places where this utility would be useful. I also agree with @rjzamora about the name change to something like `get_meta_library`.
Thanks @charlesbluca!
Pinging @rjzamora @jrbourbeau if you get a chance to re-review.
Thanks @charlesbluca! Apologies for the delayed response.
Overall this looks good. I have one minor question about adding the `rapidsai-nightly` channel, but otherwise this PR could be merged as is.
A couple of non-blocking comments:
- Have you confirmed the changes here close "Generalize `dd.to_datetime` to support cuDF-backed Series" #9880?
- It might be good to add a test `test_get_meta_library` that covers the usual `numpy`/`pandas`-backed cases. It might be worth adding a second test that covers `cupy`/`cudf`-backed collections.
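A sketch of what the suggested CPU-path test might look like. The helper is inlined here as an assumption so the snippet is self-contained; in the actual PR the utility would be imported from dask, and a second GPU-marked test could exercise `cupy`/`cudf` the same way:

```python
import importlib

import numpy as np
import pandas as pd


def get_meta_library(like):
    # Assumed behavior of the proposed utility: import the top-level
    # package that defines the object's (or its meta's) type.
    if hasattr(like, "_meta"):
        like = like._meta
    return importlib.import_module(type(like).__module__.split(".")[0])


def test_get_meta_library():
    # The usual numpy/pandas-backed cases suggested above.
    assert get_meta_library(pd.Series([1])) is pd
    assert get_meta_library(pd.DataFrame({"a": [1]})) is pd
    assert get_meta_library(np.array([1])) is np
```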
Agree with @jrbourbeau that we should test that this resolves the motivating issue. Otherwise, this looks good. Only had a nit-pick suggestion.
Looks like the gpuCI failures here are due to differences in datetime handling between cuDF and pandas - going to see what can be done upstream before making any more updates.
Thanks @charlesbluca. Overall this looks good. Just had a couple of final questions about the `to_datetime`-specific changes.
dask/dataframe/core.py (outdated)
```python
if not isinstance(arg, _Frame):
    arg_meta = arg.head(2)
else:
    if is_series_like(arg):
        arg_meta = (
            meta_series_constructor(arg)(["2000/1/1"])
            if arg.dtype.kind == "O"
            else arg._meta_nonempty
        )
    else:
        arg_meta = arg._meta_nonempty
        assign_dict = dict()
        for col in arg_meta.columns:
            if arg_meta[col].dtype.kind == "O":
                assign_dict[col] = "2000" if col == "year" else "1"
            elif col == "year" and is_numeric_dtype(arg_meta[col]):
                assign_dict[col] = 2000
        arg_meta = arg_meta.assign(**assign_dict)
```
This looks complicated -- why is this change needed?
`cudf.to_datetime` has somewhat nuanced handling of datetime precision compared to pandas, which made it so the only way to get the correct meta here was by directly calling `xd.to_datetime` on some subset of the data. This gets a little hairy when using `_meta_nonempty`, which generates data that would error in `to_datetime` (string columns containing non-integers, int columns containing out-of-range years, etc.), which this code attempts to resolve.
As I'm writing this, I wonder if maybe a simpler solution here would be to disable dtype checking on the Dask object in the tests? Even with this code, there are still some cases I can think of where the dtypes of the meta and the computed frame wouldn't match up (at this point, we don't know if any of the columns contain timezone-aware dates), and I'd imagine we'd prefer keeping the original meta construction which, though inaccurate in a few cases, isn't likely to introduce a breaking change.
Do you have a preference here?
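To illustrate the problem described above, here is a pandas-only sketch of why nonempty filler data breaks `to_datetime`, and how swapping in parseable placeholders (as the `assign_dict` logic in the snippet does) avoids it. The column names and placeholder values mirror that snippet; the filler string `"foo"` is an assumption standing in for whatever `_meta_nonempty` generates:

```python
import pandas as pd

# Filler data like the string "foo" in an object column cannot be
# parsed as a datetime component, so computing meta by calling
# to_datetime on it raises a ValueError.
meta = pd.DataFrame({"year": ["foo"], "month": ["foo"], "day": ["foo"]})
try:
    pd.to_datetime(meta)
    parse_failed = False
except ValueError:
    parse_failed = True

# Parseable placeholders let to_datetime assemble the columns into a
# datetime Series, yielding a usable meta dtype.
fixed = meta.assign(year="2000", month="1", day="1")
result = pd.to_datetime(fixed)
```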
Ended up going back to the original meta construction and loosening dtype checks on GPU tests - do we think it's worth emitting a warning here if we run into that case?
@charlesbluca This has been sitting for a while, but it looks like all the reviews were 👍. Is there any reason not to merge it?
Think things here should be good to merge, cc @jrbourbeau in case you wanted to take another look following the simplifications to meta construction?
Thanks for the ping @j-bennet. I agree with @charlesbluca, this looks good to go 👍
Title changed from "dd.to_datetime for GPU-backed collections, introduce get_serial_module utility" to "dd.to_datetime for GPU-backed collections, introduce get_meta_library utility"
This PR should unblock some datetime-related work going on in dask-contrib/dask-sql#984, but hopefully it can also be a place to host a wider discussion around what a utility API for getting the top-level module of a Dask collection's partition type should look like.
cc @rjzamora @sarahyurick
- Closes "Generalize `dd.to_datetime` to support cuDF-backed Series" #9880
- `pre-commit run --all-files`