Environments and multi-task persistent state #85
cc @seibert
I am working on getting the CUDA support in numba to work with distributed. Numba lazily initializes the CUDA driver, which solves the setup issue. However, there are times when I want to execute some teardown function so that the CUDA profiler works properly. Some way to control environment setup/teardown would be useful.
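One possible workaround today (a sketch, not an endorsed API: `Client.run` is real, but `stop_cuda_profiler` is a hypothetical helper built on numba) is to broadcast a teardown call to every worker before shutdown:

```python
from dask.distributed import Client

def stop_cuda_profiler():
    # hypothetical teardown so the CUDA profiler flushes its output;
    # numba.cuda.close() tears down the current thread's CUDA context
    from numba import cuda
    cuda.close()
    return True

client = Client('tcp://scheduler:8786')  # assumed scheduler address
client.run(stop_cuda_profiler)           # executes once on every worker
```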
I'd want something similar. You talked about how you can trick distributed into keeping the data on the workers and not destroying it. How can I do that? By keeping a reference to a future containing the data?
Under normal operation you would just scatter data to the network:

```python
[future] = e.scatter([data])
x = e.submit(func, future, other_args)
```

And then you would trust the system to share the data around as needed to perform computations. If you really want to enforce that the data go everywhere you would add the `broadcast` keyword:

```python
[future] = e.scatter([data], broadcast=True)
```

But this is generally slower than just trusting the network to move the data around as necessary. You'll probably want to update from master to get the benefits of worker-to-worker work stealing from #229.
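For anyone reproducing this today, here is a minimal self-contained sketch of the same pattern with the current `dask.distributed` `Client` (the `e` in this thread is the old `Executor` name for the same object); `slow_sum` and the data are made-up stand-ins:

```python
from dask.distributed import Client

client = Client()  # spins up a local cluster for demonstration

data = list(range(1_000_000))            # a large object we want to send once
[data_future] = client.scatter([data])   # ship it to the cluster up front

def slow_sum(data, offset):
    return sum(data) + offset

# every task reuses the scattered copy instead of reserializing `data`
futures = [client.submit(slow_sum, data_future, i) for i in range(8)]
results = client.gather(futures)
print(results[:3])
```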
I have a specific scenario where I basically want to distribute a resource that can't easily be divided, so I just want to distribute it to everyone at the beginning and then tell the workers to take a chunk from it for each task.
The scheduler will keep all data pointed to by a future. Scatter not supporting dictionaries during broadcast was a bug; it has been fixed in #230. Thanks for reporting! Sending a custom Python object is fine; sending an iterator probably isn't. The object will have to be cloud-pickleable, and it's assumed that your functions won't mutate it. So rather than sending an iterator I would send an iterable.
Thanks, I will try it with the master version. I'm sending an iterator, but it doesn't matter how the workers modify it; it's basically immutable from their view (they can select a specific index from it, so it's more like a list anyway). I also tried constructing the graph explicitly instead of using scatter. After "profiling" it with wireshark, I noticed that for every task the whole dataset was sent over the network again.
If you want to use graphs explicitly you would need to modify your graph to point to the `"data"` key rather than to the data itself.

Before:

```python
dask = {
    "data": data,
    "task1": (fn, data, arg1),
    "task2": (fn, data, arg2),
    ...
    "result": ["task1", "task2", ...]
}
```

After:

```python
dask = {
    "data": data,
    "task1": (fn, "data", arg1),
    "task2": (fn, "data", arg2),
    ...
    "result": ["task1", "task2", ...]
}
```

But really I would just do the following:

```python
[data_future] = e.scatter([data])
tasks = [e.submit(fn, data_future, arg) for arg in args]
results = e.gather(tasks)
```
When I do this:

```python
[data_future] = e.scatter([data])
tasks = [e.submit(fn, data_future, arg) for arg in args]
results = e.gather(tasks)
```

the result returned from the gather is not correct. If I launch the exact same code, but without the scatter (passing the data directly), everything works as expected.
This seems like a separate issue? Perhaps you should raise a new one. Additionally, it would be useful to have more information about your data and function. If you can, it'd be great to have an example that I can run locally to reproduce the error.
I'll investigate further, and if I can make an MWE and the issue persists, I will create a separate issue. Thanks.
OK, here is a proposal. An environment has a few components: a name, a condition function that decides whether a given worker belongs to the environment, and optional setup and teardown functions for per-worker state.

Examples

A simple case might be that we want all nodes with a decent amount of available memory:

```python
def has_high_memory():
    import psutil
    return psutil.virtual_memory().total > 30e9

e.register_environment(name='high-memory', condition=has_high_memory)
future = e.submit(func, *args, workers='high-memory')
```

Example 2: Database access

```python
def has_database_access():
    try:
        connect_to_database()
        return True
    except Exception:
        return False

def setup():
    conn = connect_to_database()
    return conn

def teardown(state):
    state.disconnect()

e.register_environment('database', condition=has_database_access,
                       setup=setup, teardown=teardown)

def get_data(query):
    conn = get_environment_state('database')
    ...

e.submit(get_data, query, workers='database')
```

I imagine that this could also be a good place for a class.

How this would work

We would run these functions whenever a worker connected and keep workers within a group based on whether or not they passed the check. This allows a decent amount of control on the user's part to choose what kinds of nodes they're looking for. It does run everywhere though, so the condition should be fairly fast and unlikely to crash things.

Questions
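As a side note for later readers: `register_environment` was never merged (see the comments below), but the condition half of this proposal can be approximated with `Client.run`, which executes a function on every worker and returns results keyed by worker address. A sketch, with `func`/`args` as placeholders and an assumed scheduler address:

```python
from dask.distributed import Client

def has_high_memory():
    import psutil
    return psutil.virtual_memory().total > 30e9

client = Client('tcp://scheduler:8786')  # assumed scheduler address

# evaluate the predicate once on every worker: {worker_address: bool}
checks = client.run(has_high_memory)
high_memory_workers = [addr for addr, ok in checks.items() if ok]

# pin a task to the workers that passed the check
future = client.submit(func, *args, workers=high_memory_workers)
```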
This could probably start with just the conditions, and add in setup/teardown later. I know many people who would appreciate having worker groups determined by running a predicate on every worker. I don't know as many people who need worker-level setup/teardown. Broxtronix is the exception, but I expect that he has moved on.
Is the condition executed by every worker thread?
As presented here it would be evaluated by every worker process. For the purposes of deciding where something can run, all threads in a worker process are considered equivalent.
So, if I run a worker with multiple threads, the condition is evaluated just once per worker process?
Correct.
Would setup/teardown be different? It would be nice if they were executed at the per-thread level; for instance, every thread gets a different GPU in a multi-GPU context. Would it also be good for every thread to get a different DB connection? Anyway, I like the proposal. We can start experimenting with just the condition function.
It is less convenient, though still doable, to add setup per thread, mostly because threads can, to a certain extent, come and go. I would want to do this after setup-per-process, which would come after condition-per-process.
Whether a worker can connect to a database can be evaluated at startup time. Mesos uses attributes for this purpose, and Marathon defines constraints for task placement. I can imagine a similar syntax here.
I have a chunk of data that should live on each worker between task executions. That chunk of data, which is a dictionary of pandas DataFrames, is operated on by each task, but the operations vary depending on the other task parameters. I am trying the scatter() trick you outlined above, but running this on one 36-node machine, where my workers are spawned locally for testing, copying the data between workers is essentially interminable: it has been taking 25 minutes to scatter this data chunk, which I think is about 980MB. I am new to Dask, but I am pretty sure that the ability to pin this data to each worker once, instead of recomputing it every time my task function runs, would be a big advantage.
@gminorcoles your question is not related to this issue. If you have further questions or comments please raise them on StackOverflow or in a separate GitHub issue.
@mrocklin - has the solution you proposed for "Example 2: Database access" been implemented or pursued at this stage? The reason I ask is that I am looking at integrating the Cassandra python driver into dask.distributed at the worker level. The database connection pool will only form a connection to the local Cassandra instance and needs to stay persistently available and accessible to the worker. Or is there another way that people are efficiently handling database connections in dask.distributed?
@thompson42 Environments were implemented in #505 but never merged, so the short answer to your question is "no". However, there is renewed activity on rebuilding the workers in #704, which might swallow the environment work in the near future.
@mrocklin - wow, quick response, thanks. So is there currently a way to handle database connections other than creating and destroying them on each job?
That would be the simplest way, yes. You could also use globals if you do it cleverly, perhaps by attaching a value to a module:

```python
def select(query):
    import dask
    try:
        conn = dask.conn          # reuse a connection cached on the module
    except AttributeError:
        conn = dask.conn = connect('my-database')  # first call on this worker
    results = conn.select(query)
    return results
```

There is also the
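For what it's worth, this works because a module object is a per-process singleton, so each worker process creates `dask.conn` exactly once and reuses it across tasks. Usage with the thread's `e` executor might look like this (with `connect` above standing in for a real database driver, and `queries` a hypothetical list of query strings):

```python
# each worker runs `select` with its own cached dask.conn
futures = [e.submit(select, q) for q in queries]
results = e.gather(futures)
```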
Any idea what happened to this? Was this solved in a later release or is it still an open problem?
@thompson42 - how did you solve this for your use case?
I frequently use GPU clusters, and for these workloads I find that I need to do some relatively expensive setup or teardown operations to establish a valid GPU computing context on each worker before I start the distributed processing of my data. There are various ways to "trick" `distributed` workers into maintaining a stateful execution environment across multiple jobs, probably the best of which is to use a global variable in the Python interpreter itself. However, this solution (and others I'm aware of) does not provide particularly fine-grained control over the distributed execution environment. It would be very useful if `distributed` could provide some way to explicitly manage execution "environments" in which jobs could be explicitly run.

I'm imagining something along these lines: I would make a call that causes each worker to run `my_setup_function()` and then return a reference to its copy of the new environment. The scheduler would record these and send back a single reference to the collection of environments on the workers, `my_new_env` in this example. Subsequent calls to run distributed functions could then specify the environment in which they would like to be run.

The scheduler could keep track of the `setup()` and `teardown()` functions associated with each environment. Then, if a new worker comes online and is asked to run a function in an environment that it has not yet set up, it could request the necessary initialization routine from the scheduler and run that first before running any jobs.

This is a somewhat rough sketch of what would be desirable here, and I'm curious to start a discussion to see if there are other users out there that might also want a feature like this. In particular, are there others using distributed to manage a cluster of GPU nodes? How do you manage a cluster-wide execution context?
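To make the description above concrete, the imagined calls might look like the following. This is purely hypothetical: `setup_environment`, `teardown_environment`, and the `environment=` keyword do not exist in `distributed`; this only reconstructs the API the comment describes, with `process_chunk` and `chunk` as placeholder names:

```python
# hypothetical API -- none of these methods exist in distributed
my_new_env = e.setup_environment(my_setup_function)  # run setup on every worker

# run a function inside the prepared environment
future = e.submit(process_chunk, chunk, environment=my_new_env)

# tear the environment down on all workers when finished
e.teardown_environment(my_new_env)
```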