What? To train large DNNs on GPUs with limited memory, the model must be split across multiple devices - Model Parallelism. Similarly, training time can be reduced by distributing parallel branches of the model across the devices.
Why? Currently, this splitting is done manually and is largely based on heuristics, as we demonstrate here (Section 1.2).
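To illustrate, here is a minimal sketch of what manual model parallelism looks like in plain PyTorch: the user decides by hand which layers live on which GPU and moves activations between devices (the module, layer sizes, and two-way split below are illustrative only, not taken from Baechi):

```python
import torch
import torch.nn as nn

# Manual model parallelism: each part of the network is pinned to a GPU,
# and activations are moved between devices by hand.
class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations cross the GPU boundary

model = TwoGPUNet()
out = model(torch.randn(32, 1024))  # requires two visible GPUs
```

Deciding these split points by hand, and keeping the result within each GPU's memory budget, is exactly the part that becomes tedious and error-prone for larger models.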
How? In Baechi, we adopt an algorithmic approach to the placement problem for running DNN training graphs on a small cluster of memory-constrained devices. Baechi-PyTorch automatically and optimally splits the model, given the number of GPU devices and their memory capacities.
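As a rough illustration of the shape of the placement problem (this toy greedy placer is not Baechi's algorithm, and the operator names, memory footprints, and capacities are made up):

```python
# Toy memory-aware placement sketch -- NOT Baechi's algorithm, only an
# illustration of the inputs (operators with memory footprints, devices
# with capacities) and the output (an operator-to-device assignment).
ops = {"conv1": 900, "conv2": 700, "branch_a": 500, "branch_b": 500, "fc": 300}  # MB
capacity = {"cuda:0": 1500, "cuda:1": 1500}  # MB per GPU

placement, free = {}, dict(capacity)
for name, mem in sorted(ops.items(), key=lambda kv: -kv[1]):
    device = max(free, key=free.get)  # device with the most free memory
    if free[device] < mem:
        raise RuntimeError(f"{name} ({mem} MB) does not fit on any device")
    placement[name] = device
    free[device] -= mem

print(placement)  # e.g. {'conv1': 'cuda:0', 'conv2': 'cuda:1', ...}
```

Baechi additionally accounts for the structure and timing of the training graph so that the resulting placement is fast to train, not just memory-feasible; see the design and usage documentation linked below.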
Please find the design and usage information for Baechi-PyTorch here: link
The TensorFlow implementation of Baechi can be found here: Baechi
The corresponding paper was presented at SoCC 2020.
A draft of the extended version of the Baechi paper is here (currently under review).
For any queries, suggestions, etc., please feel free to reach out at cshetty2@illinois.edu