I took a look at what it would take to integrate Dask with Optuna, a hyper-parameter optimization library from the good folks at Preferred Networks (the same people that make CuPy).
Here is a tiny example that works, but slowly. It's informative.
Example
First, create a study
optuna create-study --study-name "distributed-example" --storage "sqlite:///example.db"
Python code
import time

import optuna

def objective(trial):
    x = trial.suggest_uniform('x', -10, 10)
    time.sleep(0.050)
    return (x - 2) ** 2

from dask.distributed import Client, wait

client = Client()

def f():
    study = optuna.load_study(study_name='distributed-example',
                              storage='sqlite:///example.db')
    study.optimize(objective, n_trials=100)

futures = [client.submit(f, pure=False) for _ in range(100)]
wait(futures)

# watch progress on the dashboard at client.dashboard_link
Problems
- I had to load the study inside each task (study objects don't seem to be easy to serialize)
- I had to create the study from the command line (though there is probably a way to do this from Python that I didn't find)
- The SQLite storage backend is slow (more below) and accounts for pretty much all of the runtime
- I had to manually create futures, set pure=False, and so on, which might not be immediately obvious to new users
Storage
One thing we could do here is create our own storage backend for Optuna. This would either place the information in a Worker (probably with an Actor) or on the scheduler (similar to how we handle Lock/Variable/Queue/...). The Storage API isn't trivial, but is intended to be subclassed. My preference is to put this on the scheduler. Code here: https://github.com/optuna/optuna/tree/master/optuna/storages
This isn't quite as good as a long-term database (presumably it's nice to look at old runs) but it would be fast and easy for users.
Ideal user experience
How would we replace the futures stuff? Perhaps something like this could be a target API:
# import optuna
import dask_optuna
study = dask_optuna.load_study(study_name='distributed-example')
study.optimize(objective, n_trials=10000)
With greater integration with Optuna we could imagine something else, perhaps like the following:
import optuna
study = optuna.load_study(study_name='distributed-example', use_dask=True)
study.optimize(objective, n_trials=10000)
This is the pattern we see in projects like TPOT. It makes the feature a bit more discoverable, but requires integration with an upstream library, which may not be ideal as a first step.
Who cares?
I came to this mostly from a technology perspective. These two libraries seem to both have some traction, and complement each other nicely. However, I don't know this space well enough to know if this is valuable, or if there are users who would find it interesting. I'd love to learn more about this.