Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose literal and quote #3275

Closed
wants to merge 3 commits into from
Closed

Conversation

jakirkham
Copy link
Member

Includes literal and quote in the public API following discussion in this SO question.

cc @mrocklin

@jakirkham
Copy link
Member Author

We'd like to get your thoughts on this @jcrist. Does this seem like a reasonable way to approach the problem?

@jcrist
Copy link
Member

jcrist commented Mar 12, 2018

I don't feel comfortable exposing this as top-level api for a few reasons:

  • quote would be the first top-level function for directly creating tasks in the tuple syntax. This to me feels more apt to remain in dask.core.
  • literal is a class that was designed specifically to work with quote and nothing else. This was not originally meant to be public facing. There are no methods for accessing the value, as it was designed to be used in a task such that the literal is the callable (leading to slightly smaller graphs).

I'd be fine if you used them in your own code, as I doubt they'll change, but I wouldn't add them to the top-level namespace.


For your specific case, I'd first ask why you want access to a future in the delayed function (as Matt pointed out, this is a bit unusual). If there's not a better way of handling this, then I'd add a literal=True (default to False) flag to delayed such that wrapped values are the literal result with no conversion done (potentially a better kwarg name).

@delayed
def foo(future, *args):
    ...

foo(delayed(future, literal=True), *args)

This seems more consistent, keeps our api surface low, and doesn't expose internal bits at the task level in the top-level namespace.

@jakirkham
Copy link
Member Author

Was thinking about this functionality in the context of this comment. The strategy there is not limited to HDF5 files and could be just as applicable to addressing issue ( #3255 ). IOW really any situation where writing to a file cannot be done safely in parallel and we would like to open the file only once in a task to handle this writing. Perhaps there are other strategies that would still allow a single task to write to a file. Would be interested in your input in those contexts as well. :)

@jcrist
Copy link
Member

jcrist commented Mar 12, 2018

Interesting. I recommend opening an issue about this specific issue (especially if you can think of a few more usecases besides hdf5 for support) and we can work for a resolution there. As for this specific fix, I'm going to say no for now (although we can re-open later if needed).

@mrocklin
Copy link
Member

People have asked for similar things in the past. This would also be relevant to passing dask arrays or dask dataframes into delayed functions. This does come up sometimes in advanced workloads by advanced users. It does come up more often by less advanced users who don't quite know what they're doing though. I think that having a literal callable in the API is valid but that exposing it is dangerous.

Doing this as the keyword level seems less good to me for two reasons:

  1. It's difficult to mix-and-match
  2. I think it will be more visible to users

Honestly the status quo of using dask.core.literal might not be a bad compromise.

@jcrist
Copy link
Member

jcrist commented Mar 12, 2018

People have asked for similar things in the past. This would also be relevant to passing dask arrays or dask dataframes into delayed functions. This does come up sometimes in advanced workloads by advanced users.

Ok, I can see how this might be nice to have.

Doing this as the keyword level seems less good to me for two reasons:

  1. It's difficult to mix-and-match
  2. I think it will be more visible to users

I'm not sure I agree with that. Given the usecase that I want to pass some object to a delayed function in a way that dask that does nothing to it, both ways look equivalent:

# Using literal
foo(literal(a), b)

# using delayed
foo(delayed(a, literal=True), b)

However, the second option has a few things going for it:

  • User doesn't need to know anything about the literal class to get the value. In the second option, foo would recieve a at compute time, while the literal option would pass literal(a), and foo would need to extract a from the literal directly.
  • delayed exposes the normal keywords for token naming, etc..., which may be asked about later if they want to control the name of these tasks in the graph. No need to bring in another object with these semantics when they're already implemented for delayed.
  • The use of delayed already matches with its usecase - users are used to the delayed function, and would know where to look. A second function might not be so apparent. In this respect I think I disagree directly with your point 2.

It does come up more often by less advanced users who don't quite know what they're doing though.

I also think this is a bit of an odd thing to do, so making it not apparent in the top-level api might help force people to rethink their problem a bit before diving in.

@mrocklin
Copy link
Member

That makes sense to me. I think I was thinking of the following:

foo(a, b, literal=True)

I was not considering using delayed to wrap the object directly. This seems fine to me.

@jcrist
Copy link
Member

jcrist commented Mar 12, 2018

Just thinking about keyword names, perhaps we might want to deprecate the boolean traverse kwarg in delayed, and instead switch to a categorical keyword (still not sure on a good name). We're going from having two options:

  • Convert everything (traverse=True, default)
  • Only convert direct dask objects, not collections that might contain them (traverse=False)

to three:

  • Convert everything
  • Only convert direct dask objects, not collections that might contain them
  • Convert nothing, even direct dask objects (could be traverse=None, but that seems wrong).

Seems like these are all exclusive options, so one keyword makes more sense than 2.

@jakirkham
Copy link
Member Author

jakirkham commented Mar 12, 2018

So your foo(delayed(a, literal=True), b) proposal looks nice to me and generally agree with the arguments for it. Though would suggest calling it something different instead of delayed for two reasons. First it will aid readability. Second it avoids these two use cases being attached to the same function if user needs start pushing them in different directions. So maybe something like foo(delayed_coll(a), b).

@jcrist
Copy link
Member

jcrist commented Mar 14, 2018

Though would suggest calling it something different instead of delayed for two reasons. First it will aid readability. Second it avoids these two use cases being attached to the same function if user needs start pushing them in different directions. So maybe something like foo(delayed_coll(a), b).

Normally I would agree with you that a function should do only one thing, and other things should have other names. However, this is outweighed in my mind by the following benefits:

  • Users are already used to turning objects into delayed objects with delayed. This would be the first place they would look for such an option.
  • Since we already support the traverse flag (which is only slightly different than what you want), this seems like a natural extension.
  • More function names require the user to keep more functions in their head. We also have to think of a clear name (delayed_coll is not clear to me).

@jakirkham
Copy link
Member Author

Honestly this was just a style point. No strong feelings personally.

So are we ok with using traverse? Are we setting it to None? Something else?

@jcrist
Copy link
Member

jcrist commented Mar 20, 2018

So are we ok with using traverse? Are we setting it to None? Something else?

Thinking about this some more, I think a boolean flag is fine but the existing semantics of traverse are slightly off. My preference would be to deprecate traverse in favor of the following:

  • wrap=True (default), matches the existing default. Collections are searched for dask objects, and direct dask collections are wrapped and converted to delayed values.
  • wrap=False. Everything is treated as a literal value, even direct dask objects. This is only slightly different from traverse=False, which would wrap direct dask objects but not search collections.

My thought is that the times where you want traverse=False to convert existing dask objects but not search collections are few, and can be accomplished with a small bit of external logic:

# this is equivalent to `traverse=False`, using the proposed changes above
x2 = delayed(x, wrap=is_dask_collection(x))

The traverse flag was originally added to prevent expensive traversal of collections when the user new there were no dask objects inside, which is still accomplished by wrap=False, but has simpler semantics than a categorical flag.

I'm not set on the name wrap, but I think the boolean options proposed above make the most sense. Thoughts?

Edit: could also see literal being a decent kwarg name, with literal=False as the default. Also cc @TomAugspurger, who has a good eye for api.

@jakirkham
Copy link
Member Author

So I vaguely remember you mentioning something about thoughts you had on this last week, @jcrist, but don't recall whether we discussed them. Does this ring a bell? Do you remember what they were?

@jsignell
Copy link
Member

I am closing this since it is pretty stale at this point. Please reopen if you are planning on continuing the work.

@jsignell jsignell closed this Jul 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants