
WIP: Allow collections_to_dsk to be overridden #3196

Closed
wants to merge 1 commit into from

Conversation

jakirkham
Member

To make it easier for users to hook into the pre-submission process, allow users to provide their own `collections_to_dsk` function.

@jcrist
Member

jcrist commented Feb 22, 2018

Why do you need this? Without strong motivation I'm fairly against adding this as it blurs the boundary between dask internals and user provided methods. I'd wonder if what you're trying to accomplish can better be served via a different hook.

@jakirkham
Member Author

Admittedly, this probably has some issues, as it is a bit exploratory at this stage.

My use case is very similar to the one raised in issue ( dask/distributed#1384 ): namely, a desire to have a persistent cache that works both within an analytical session and across analytical sessions. One of the suggestions there was to override `collections_to_dsk` in a `Client` subclass. While that may work, it feels like a pretty heavy solution to me. So it would be nice if there were some sort of hook that didn't involve adding another `Client` into the mix (or subclassing an existing one). Maybe some way of registering a callback instead, or something else along those lines?
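To illustrate the callback-registry idea mentioned above, here is a minimal sketch in plain Python. All names (`presubmit_hooks`, `register_presubmit_hook`, the stand-in `collections_to_dsk`, and the `cache` dict) are hypothetical, not existing dask API; task graphs are modeled as plain dicts.

```python
# Sketch of a callback-registry hook for pre-submission graph rewriting.
# Every name here is illustrative -- this is NOT existing dask API.

presubmit_hooks = []


def register_presubmit_hook(func):
    """Register a callable (dsk, keys) -> dsk applied before submission."""
    presubmit_hooks.append(func)
    return func


def collections_to_dsk(dsk, keys):
    """Stand-in for the real graph-building step: run all registered hooks."""
    for hook in presubmit_hooks:
        dsk = hook(dsk, keys)
    return dsk


# Example hook: replace tasks whose keys are already cached with the
# cached literal values, so they are not recomputed.
cache = {"x": 42}


@register_presubmit_hook
def use_cache(dsk, keys):
    return {k: cache.get(k, task) for k, task in dsk.items()}
```

Under this scheme, users would only ever call `register_presubmit_hook`, with no need to subclass or swap out the `Client`.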

If you have ideas on different ways to hook in, would be interested to hear. :)

@jcrist
Member

jcrist commented Feb 22, 2018

> If you have ideas on different ways to hook in, would be interested to hear. :)

I'll give it some thought. To clarify, when you say persistent cache do you mean:

  • Keys that already exist on the cluster, either persisted or added by a different user?
  • Keys that have been stored to some persistent (disk backed?) cache to be used later?
  • Something else entirely?

Do you want this to work just with the distributed scheduler or on all schedulers?

@jakirkham
Member Author

Thanks.

Both. To clarify: if we submitted the jobs during this session, that's case 1. If this is an old session we are reviving, then it's case 2. There's also the chance that, within the same session, we've had to expire some keys due to memory issues and would then also have to pull results from disk.

Personally, I'm primarily interested in the distributed scheduler. In practice it would be used with dask-drmaa. However, if the solution is geared toward some form of distributed scheduler, I'm hopeful it would work with different flavors of distributed schedulers pretty easily.

@jakirkham
Member Author

FWIW, I have also been playing with a `MutableMapping` approach. In this case, one explicitly stores things in the cache under a specified key and can retrieve them later with that key. This is designed to trigger computation and storage to disk immediately. I've tried as much as possible to make things proceed asynchronously. The downsides are that it triggers computation immediately and requires manually grabbing cached values. There's also some trickery involved in clearing contents from the cache without invalidating existing references in Dask objects to the data in the current session.
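A toy sketch of that `MutableMapping` idea, using only the standard library (each key is pickled to its own file in a directory). The class name `DiskCache` is hypothetical; a real version per the comment above would submit computation asynchronously on `__setitem__` and gather results on `__getitem__`, rather than storing plain values.

```python
import os
import pickle
import tempfile
from collections.abc import MutableMapping


class DiskCache(MutableMapping):
    """Toy persistent cache: each key is pickled to its own file.

    Hypothetical sketch -- a real version would trigger (async)
    computation on store and fetch results from the cluster on load.
    """

    def __init__(self, directory=None):
        self.directory = directory or tempfile.mkdtemp(prefix="dask-cache-")

    def _path(self, key):
        return os.path.join(self.directory, str(key) + ".pkl")

    def __setitem__(self, key, value):
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            raise KeyError(key)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        for name in os.listdir(self.directory):
            if name.endswith(".pkl"):
                yield name[: -len(".pkl")]

    def __len__(self):
        return sum(1 for _ in self)
```

Because entries live on disk, the same directory can be reopened in a later session, which covers the cross-session half of the use case.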

@jakirkham jakirkham changed the title Allow collections_to_dsk to be overridden WIP: Allow collections_to_dsk to be overridden Feb 27, 2018
@jakirkham
Member Author

Have you had a chance to give this more thought @jcrist?

@jakirkham
Member Author

Think this can be better addressed by the "optimizations" hook in `dask.config`.
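For reference, a graph optimization in dask is just a callable from `(dsk, keys)` to a new graph. The sketch below shows one that substitutes already-computed results (the `seen_results` dict is purely illustrative); the `dask.config.set(optimizations=[...])` registration shown in the comment is my recollection of the dask API of that era, so treat it as an assumption.

```python
# A graph "optimization" is just a callable (dsk, keys) -> dsk.
# This one short-circuits tasks whose results we already hold.
# `seen_results` is illustrative, not part of dask.

seen_results = {"x": 10}


def reuse_cached(dsk, keys):
    # Replace each cached task with its stored literal value.
    return {k: seen_results.get(k, task) for k, task in dsk.items()}


# With dask installed, this could then (per the comment above) be
# registered globally, e.g.:
#
#   import dask
#   dask.config.set(optimizations=[reuse_cached])
#
# after which compute/persist would apply it before submission.
```

This achieves the caching hook without subclassing `Client` or touching `collections_to_dsk` directly.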

@jakirkham jakirkham closed this Jun 25, 2018
@jakirkham jakirkham deleted the set_opt_collections_to_dsk branch June 25, 2018 00:25