
Generalize dd.to_datetime for GPU-backed collections, introduce get_meta_library utility #9881

Merged: 16 commits into dask:main on May 23, 2023

Conversation

charlesbluca (Member):

This PR should unblock some datetime related work going on in dask-contrib/dask-sql#984, but also hopefully can be a place to host a wider discussion around what a utility API to get the top-level module of a Dask collection's partition type should look like.

cc @rjzamora @sarahyurick

rjzamora (Member) left a comment:

I think this utility makes sense @charlesbluca - Thanks!

My only preference would be to replace the term serial with meta throughout. I think this would make it clearer that we are extracting the module name for the collection's current "meta" property. This naming would also be more consistent with the existing meta_lib_from_array dispatch.

Side note: I agree with your decision to use a utility instead of a dispatch function (like meta_lib_from_array). The same logic already works for cudf, pandas, cupy and numpy. Therefore, there is really no reason to over-engineer a dispatch-based solution for this.
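For context, a minimal sketch of what such a utility could look like (the function body and example below are illustrative, not the merged implementation):

```python
from importlib import import_module

import numpy as np


def get_meta_library(obj):
    # If obj is a Dask collection, inspect its meta (an empty object of
    # the partition type) rather than the collection wrapper itself.
    if hasattr(obj, "_meta"):
        obj = obj._meta
    # The top-level library is the first component of the type's module
    # path, e.g. "pandas" for pandas.Series or "cudf" for cudf.Series.
    return import_module(type(obj).__module__.split(".")[0])


print(get_meta_library(np.array([1, 2])).__name__)  # numpy
```

Because the same module-path logic already holds for pandas, cudf, numpy, and cupy, a plain function like this avoids the extra registration machinery a dispatch would require.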

jrbourbeau (Member) left a comment:

+1, this seems reasonable to me. I can think of other places where this utility would be useful. I also agree with @rjzamora about the name change to something like get_meta_library.

jrbourbeau (Member) left a comment:

Thanks @charlesbluca!

Review comments on continuous_integration/environment-3.9.yaml, dask/utils.py, and dask/dataframe/core.py (resolved)
charlesbluca marked this pull request as ready for review on February 24, 2023, 14:49
charlesbluca (Member, Author):

Pinging @rjzamora @jrbourbeau if you get a chance to re-review.

jrbourbeau (Member) left a comment:

Thanks @charlesbluca! Apologies for the delayed response.

Overall this looks good. I have one minor question about adding the rapidsai-nightly channel, but otherwise this PR could be merged as is.

A couple of non-blocking comments:

  1. Have you confirmed the changes here close #9880 (Generalize dd.to_datetime to support cuDF-backed Series)? It might be good to add a test.
  2. test_get_meta_library covers the usual numpy/pandas-backed cases. It might be worth adding a second test that covers cupy/cudf-backed collections.

Review comment on continuous_integration/environment-3.9.yaml (resolved)
rjzamora (Member) left a comment:

Agree with @jrbourbeau that we should test that this resolves the motivating issue. Otherwise, this looks good. Only had a nit-pick suggestion.

Review comments on dask/dataframe/core.py and dask/dataframe/reshape.py (resolved)
charlesbluca (Member, Author):

Looks like the gpuCI failures here are due to differences in datetime handling between cuDF and pandas - going to see what can be done upstream before making any more updates

jrbourbeau (Member) left a comment:

Thanks @charlesbluca. Overall this looks good. Just had a couple of final questions about the to_datetime-specific changes.

Comment on lines 8060 to 8077 of dask/dataframe/core.py:

```python
if not isinstance(arg, _Frame):
    arg_meta = arg.head(2)
else:
    if is_series_like(arg):
        arg_meta = (
            meta_series_constructor(arg)(["2000/1/1"])
            if arg.dtype.kind == "O"
            else arg._meta_nonempty
        )
    else:
        arg_meta = arg._meta_nonempty
        assign_dict = dict()
        for col in arg_meta.columns:
            if arg_meta[col].dtype.kind == "O":
                assign_dict[col] = "2000" if col == "year" else "1"
            elif col == "year" and is_numeric_dtype(arg_meta[col]):
                assign_dict[col] = 2000
        arg_meta = arg_meta.assign(**assign_dict)
```
Member:

This looks complicated -- why is this change needed?

charlesbluca (Member, Author):

cudf.to_datetime handles datetime precision somewhat differently from pandas, so the only way to get the correct meta here was to call xd.to_datetime directly on some subset of the data. This gets a little hairy when using _meta_nonempty, which generates data that would error in to_datetime (string columns containing non-integers, int columns containing out-of-range years, etc.), which this code attempts to resolve.

As I'm writing this, I wonder if maybe a simpler solution here would be to disable dtype checking on the Dask object in the tests? Even with this code, there are still some cases I can think of where the dtypes of the meta and the computed frame wouldn't match up (at this point, we don't know if any of the columns contain timezone-aware dates), and I'd imagine we'd prefer keeping the original meta construction which, though inaccurate in a few cases, isn't likely to introduce a breaking change.

Do you have a preference here?
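To illustrate the failure mode being worked around, here is a hypothetical pandas-only sketch (cuDF behaves analogously but with its own precision rules); the filler values are illustrative:

```python
import pandas as pd

# A nonempty sample of a year/month/day frame typically holds filler
# strings that to_datetime cannot assemble into dates.
sample = pd.DataFrame({"year": ["a", "b"], "month": ["a", "b"], "day": ["a", "b"]})

# Patch object columns with parseable values before probing the backend,
# mirroring the assign_dict logic quoted in the diff above.
patched = sample.assign(
    **{
        col: ("2000" if col == "year" else "1")
        for col in sample.columns
        if sample[col].dtype.kind == "O"
    }
)

# Calling the backend directly yields the real result dtype for meta,
# e.g. a datetime64 resolution, rather than a hard-coded guess.
print(pd.to_datetime(patched).dtype)
```

Calling `pd.to_datetime(sample)` without the patch would raise, since "a"/"b" cannot be coerced to the numeric year/month/day components pandas assembles from.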

charlesbluca (Member, Author):

Ended up going back to the original meta construction and loosening dtype checks on GPU tests - do we think it's worth emitting a warning here if we run into that case?

Review comment on dask/dataframe/core.py (resolved)
j-bennet (Contributor):

@charlesbluca This has been sitting for a while, but it looks like all the reviews were 👍 . Is there any reason not to merge it?

charlesbluca (Member, Author):

Think things here should be good to merge, cc @jrbourbeau in case you wanted to take another look following the simplifications to meta construction.

jrbourbeau (Member) left a comment:

Thanks for the ping @j-bennet. I agree with @charlesbluca, this looks good to go 👍

jrbourbeau changed the title from "Generalize dd.to_datetime for GPU-backed collections, introduce get_serial_module utility" to "Generalize dd.to_datetime for GPU-backed collections, introduce get_meta_library utility" on May 23, 2023
jrbourbeau merged commit fff0e81 into dask:main on May 23, 2023. 28 checks passed.
Linked issue that may be closed by this pull request: Generalize dd.to_datetime to support cuDF-backed Series (#9880)

4 participants