Add DAG scheduling (e.g. Dagger.jl) to training of learning networks #72
I'd be happy to take this one on; I have very recent experience working with Dagger's scheduler (and plan to keep improving on it). Before I dive in, how would you like things to work dependency-wise? Would MLJ.jl directly depend on Dagger, or would you want to take the Requires.jl approach (or something else)?
That would be awesome. Re your question: is there a reason why Dagger should not be a dependency? I don't have any prior experience using it, so maybe this is a dumb question.
Not really; it's pretty lightweight and loads reasonably quickly. I guess you'd want it as a direct dep since you're already doing multiprocessing in MLJ. I'll get started on this today and post a PR sometime this week once things start working 😄
For some orientation, I would suggest the following: the Learning Networks docs, especially the last section, "The learning network API". Note that both NodalMachine objects and Node objects have linear dependency "tapes" (which presumably become DAGs in your reimplementation).

There are two simple examples of learning networks there. These are also spread across two notebooks in /examples, namely tour.ipynb and ames.ipynb. The second example blends two predictors, which should train in parallel in your implementation (but don't currently).

Beware of the fact that learning network training is "smart", in the sense that components of a network are only retrained if upstream changes make this necessary. This should be clear enough from the docs. This doesn't directly affect you, I guess, but it could confuse you if you did not know about it.

The code you will be messing with lives primarily in src/networks.jl, but note that the NodalMachine ... There's a lot to get your head around here. (I sometimes forget myself how all this works!) Also, designing meaningful tests could take a bit of work.

Many thanks for having a go at this. It certainly would be very cool if you can get this to work.

Tip: To get a recursively defined "developers' dump" of any MLJ object at the REPL, just type "@more" after defining the object. The numbers you see in REPL representations of objects (as in ...).

Some components of a learning network (eg, ones wrapped as TunedModels with a cv strategy) will have the option of running in a Distributed mode. Is this compatible with the Dagger scheduling?
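For readers without the notebooks to hand, the two-predictor blending pattern looks roughly like this in today's MLJ syntax (a sketch only; the toy data and the use of two ConstantRegressors are stand-ins for the docs' actual models):

```julia
using MLJ

# toy data: a NamedTuple of vectors is a valid Tables.jl table
X = (x1 = rand(100), x2 = rand(100))
y = rand(100)

Xs = source(X)
ys = source(y)

mach1 = machine(ConstantRegressor(), Xs, ys)
mach2 = machine(ConstantRegressor(), Xs, ys)

# blend the two predictions; fitting `yhat` fits both machines, whose
# trainings are mutually independent — the parallelization opportunity
yhat = 0.5 * predict_mean(mach1, Xs) + 0.5 * predict_mean(mach2, Xs)

fit!(yhat)
yhat()   # blended predictions at the source data
```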
Thanks for the links to the Learning Networks docs and ...

Quick primer on Dagger computation: Dagger has a ...

Back to MLJ: the current tape semantics don't allow one to determine the inputs to a given DAG node, only the order of execution. I'm going to leave this mechanism alone for now and just recursively compute the DAG at each call to fit!.

I think that we shouldn't have problems creating tests, but I haven't yet looked into what's currently in place; I'll raise a flag if I need help constructing a good test suite.
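To make the primer concrete, here is a minimal example of Dagger's delayed API (the interface current around the time of this thread; newer Dagger also offers Dagger.@spawn):

```julia
using Dagger

# build a small DAG lazily; nothing runs until the result is collected
a = delayed(+)(1, 2)    # node computing 1 + 2
b = delayed(*)(a, 3)    # depends on `a`
c = delayed(-)(a, 1)    # also depends on `a`; independent of `b`
d = delayed(+)(b, c)    # joins the two branches

collect(d)              # schedules and executes the DAG; returns 11
```

Because `b` and `c` share no dependency path, the scheduler is free to execute them in parallel — exactly the property wanted for independent machine fits in a learning network.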
Can you point me to an example of where this is done? I don't think it should be a problem to mix raw Distributed with Dagger DAGs; you can use as little or as much of Dagger as you want, and if we want to just turn it off for all or part of a learning network, I can make that happen.

I've got a few things going on this week, but am already working on the integration of Dagger right now; hopefully I can have a PR posted by tomorrow so that you can see what changes are required to make this work.
Thanks for the explanations!

Fitting an ... And ... And external models often have an option to run in parallel, or wrap compiled code that runs on multiple processors, eg XGBoost.jl.

Warning and apology: I have just detected a flaw in the logic for fitting nodal machines. Sometimes a call to fit a machine is calling ...
Okay, I've had to make a few changes to resolve #146 (and, to a lesser extent, #147). The issues give a few details, but re-reading the updated documentation should also help. Key differences are: ...
I've worked quite a bit adding tests, to fill the hole that existed before and then some. It may be instructive to run over the tests in ...

Perhaps worth remarking that the new NodalMachine fields ...
When called for the first time, fit! attempts to call ... A machine ... Note that a nodal machine obtains its training data by calling its ...
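The truncated passage above appears to describe fit!'s first-call versus refit behaviour. For a concrete illustration of the "smart" retraining mentioned earlier, here is a hedged example using MLJ's built-in ensemble wrapper (treat the details as assumptions; in particular, older MLJ versions used the keyword `atom` in place of `model`):

```julia
using MLJ

X, y = make_regression(100, 2)   # synthetic toy data

ens  = EnsembleModel(model = ConstantRegressor(), n = 5)
mach = machine(ens, X, y)

fit!(mach)    # first call: trains all 5 atoms from scratch
fit!(mach)    # nothing changed: training is skipped
ens.n = 10    # enlarge the ensemble...
fit!(mach)    # ...smart update: only the 5 new atoms are trained
```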
Let me have another go at describing the scheduler requirements, as unambiguously as possible. I'm guessing the key part is describing the edges of the dependency DAG, which is called "Crucial bit" below. For brevity, a "machine" is either a ...

Reduction of the scheduling problem to specification of a DAG

Let ... The DAG ... Note that multiple nodes of ...

Aside: The present ...

The DAG specification

Crucial bit: Let ...

Labels: If a node ...
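For a flavour of how a scheduler might consume such a DAG specification: given edges mapping each node to the nodes it must wait on, a topological sort yields a valid training order, and nodes that don't lie on each other's dependency paths can be fit in parallel. A generic sketch (the dictionary representation is illustrative, not MLJ's actual structure):

```julia
# `deps` maps each node to the nodes it must wait on
function training_order(deps::Dict)
    order, seen = Any[], Set{Any}()
    function visit(n)
        n in seen && return
        push!(seen, n)
        foreach(visit, get(deps, n, ()))
        push!(order, n)   # emit a node only after all its dependencies
    end
    foreach(visit, keys(deps))
    return order
end

# toy DAG: d waits on b and c, which both wait on a
deps = Dict(:d => [:b, :c], :b => [:a], :c => [:a])
training_order(deps)   # a valid order, e.g. [:a, :b, :c, :d]
```

Here `:b` and `:c` are independent, so a Dagger-style scheduler could run them concurrently.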
@jpsamaroo I have pushed a branch called ... The main change is that I have abandoned the "machine" tapes in favour of "node" tapes. This is because: ...
That's totally fine; merge it whenever you deem it most appropriate. I'll rebase and re-adjust my code once I get back to working on my PR a bit more, and I'll check the diff to get an idea of how the various bits and pieces have moved around.

I've already gotten a somewhat working prototype for using Dagger for training on just the master process, and in the process I didn't have to touch the tapes at all. So if your branch is just moving things around at the tape level, it shouldn't be too terribly impactful to the code I've already written. (And even if it is very impactful, I'm happy to take up the challenge of rebasing on top of your changes 😄.) Thanks for the heads-up!
I've been thinking on this a lot since I started #151, and I'm starting to come to the conclusion that I might be implementing parallel computing support a bit backwards, both in terms of difficulty level and in terms of utility to the end user. While just dumping the MLJ DAG into Dagger sounds trivial, it's rife with complications around data sharing and synchronization. It also isn't guaranteed to actually be faster than running on the head node; in fact, I bet my PR will be a regression in performance in almost every case. There's also the mental factor to consider: this is a difficult PR that's been pretty stressful for me to wrap my head around, and the end goal feels like it keeps stepping away from me every time I take a step towards it.

However, I don't want to just drop the work I've done and leave MLJ without the ability to make good use of Julia's excellent multiprocessing (and new multithreading) utilities. Instead, I think it would be better to tackle some lower-hanging fruit first, and gradually work back up to the work in #151 once those pieces are working well (if that ends up still being desired).

So far, the functionality that seems obviously ripe for parallel exploitation is tuning, and internal fitting or evaluation of the model itself. Tuning feels pretty straightforward ("MT" means multithreaded): just copy the model to another worker (MT: just copy it), set the appropriate parameters, fit, and send back the measurement result (MT: no need to serialize anything, just return the result); see the sketch below. Model parallelization is maybe less straightforward, but will probably also involve less work on MLJ's part, because it's only an API change: we only need to tell the model whether or not it's permitted/expected to run in parallel, and pass in information about how it should parallelize (workers or threads, and how many of each?).

I'd like to know especially what you think @ablaom, but if anyone else has thoughts or suggestions for additional parallelization opportunities, I'm all ears 😄
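A minimal sketch of that tuning loop over workers, with all helper names hypothetical (a real version would copy the model, set its hyperparameters, call fit!, and return the measurement):

```julia
using Distributed
addprocs(4)   # assumed worker count

@everywhere begin
    # hypothetical stand-in for "copy model, set params, fit, measure";
    # here it just scores how close `lambda` is to an arbitrary optimum
    build_and_eval(params) = (sleep(0.1); abs(params.lambda - 0.3))
end

grid = [(lambda = l,) for l in 0.0:0.1:1.0]   # hypothetical 1-D grid
results = pmap(build_and_eval, grid)          # one evaluation per worker at a time
best = grid[argmin(results)]                  # (lambda = 0.3,)
```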
@jpsamaroo Thanks for the update. Appreciate all the hard work put in so far. Would certainly appreciate contributions around MT and parallelisation elsewhere in the package. Here's where we are at with parallelisation (aka distributed computing). At present there is no MT implementation in MLJ.

models: Generally, parallelism and MT at the model level would be the responsibility of those who implement the model algorithms, generally happening in external packages. I guess one issue is how the model "meta-algorithms" in MLJ play with the parallel/MT stuff happening in model training.

tuning: Tuning comes in different strategies (...

resampling: Every tuning strategy repeatedly evaluates a model's performance with different hyperparameter values. One gets an estimate of performance using ...
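For reference, a resampling-based performance estimate in MLJ looks roughly like this (a hedged sketch: the data generator, model choice, and the acceleration option follow more recent MLJ versions and should be treated as assumptions):

```julia
using MLJ

X, y = make_regression(200, 3)        # synthetic toy data
mach = machine(ConstantRegressor(), X, y)

# estimate performance by 6-fold cross-validation; the folds are
# independent, so they can be distributed over workers
evaluate!(mach,
          resampling   = CV(nfolds = 6),
          measure      = rms,
          operation    = predict_mean,     # ConstantRegressor is probabilistic
          acceleration = CPUProcesses())   # distribute folds, if workers exist
```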
Update: A revamp of the composition code means that learning networks are now trained "asynchronously" using tasks; see https://github.com/alan-turing-institute/MLJBase.jl/blob/6eb7eab3ffaded8a0c74b3d9782d22943c7b5311/src/composition/learning_networks/nodes.jl#L198. It should now be far easier to make this training multi-threaded, for example.
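The pattern behind that change, in miniature: each independent unit of training runs in its own task, so independent fits can proceed concurrently. A standalone sketch of the idea (stand-in closures, not the linked implementation):

```julia
# stand-ins for two independent machine fits
fits = [() -> (sleep(0.5); :machine1_fitted),
        () -> (sleep(0.5); :machine2_fitted)]

# one task per fit; with @async the tasks interleave on a single thread,
# and swapping in Threads.@spawn would run them on separate threads
tasks = [@async f() for f in fits]
fetch.(tasks)   # [:machine1_fitted, :machine2_fitted]
```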
Currently each node and machine in a learning network has a simple linear "tape" to track dependencies on machines in the network. I had in mind to replace these tapes with directed acyclic graphs, which (hopefully) makes scheduling amenable to Dagger.jl or similar.
A thorough understanding of the learning network interface at src/networks.jl will be needed. If someone has experience with scheduling, I could provide guidance, but this is probably not a small project.