I have been reading a bit about XLA's infeed/outfeed ops and how they can be used to move data between the host and the device while a computation is running, which presents an opportunity for what I think would be significant performance gains.
JAX currently holds MLPerf training records on a large TPU cluster. Looking at their ResNet implementation you can see how their training loop makes use of XLA ops (loops and infeeds) to speed up training. Infeeds let you run multiple training steps within a single compiled computation, rather than re-dispatching the computation for every batch. Infeeds accept input shapes and tokens, which are used to enforce an ordering between operations across replicas/partitions. Adding this feature would allow us to do something similar for whatever NN library we decide to implement, and would also give users the flexibility to speed up their own custom training loops.
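As a rough illustration of the pattern (plain Elixir standing in for the device-side loop — none of these names are a real Nx API), the device runs a loop that consumes batches from the infeed while threading a token through each step to enforce ordering:

```elixir
defmodule InfeedSketch do
  # `batches` stands in for the device-side infeed queue; the `token` mirrors
  # how XLA threads tokens through infeed ops to order steps. A real
  # implementation would compile this loop once and feed batches from the host.
  def run(params, batches, step_fun) do
    Enum.reduce(batches, {params, 0}, fn batch, {p, token} ->
      # each step consumes the next infed batch plus the previous token
      {step_fun.(p, batch), token + 1}
    end)
  end
end

step = fn acc, batch -> acc + Enum.sum(batch) end
InfeedSketch.run(0, [[1, 2], [3, 4]], step)
# => {10, 2}
```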
In the same sense that you can pass data to a device during a computation, you can receive data from a still-running computation using outfeeds. An outfeed accepts a shape as well, and an outfeed receiver then handles the data coming off the device. The Python XLA client implements its outfeed receiver in C++, but reading the implementation notes, Elixir seems like a perfect fit for handling everything the Python outfeed receiver is trying to do in C++.
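For a sense of why Elixir fits: a minimal sketch of an outfeed receiver as an ordinary Elixir process. The device is simulated here with plain message sends — a real receiver would instead block on the backend's outfeed call per core — and all names are illustrative, not an actual Nx API:

```elixir
defmodule OutfeedReceiver do
  # One receiver process per device core: it forwards each tensor pulled off
  # the outfeed to a consumer process. Backpressure, supervision, and fan-out
  # then come for free from OTP.
  def start(consumer) do
    spawn_link(fn -> loop(consumer) end)
  end

  defp loop(consumer) do
    receive do
      {:outfeed, core_id, data} ->
        send(consumer, {:received, core_id, data})
        loop(consumer)

      :stop ->
        :ok
    end
  end
end

receiver = OutfeedReceiver.start(self())
send(receiver, {:outfeed, 0, [1.0, 2.0]})

receive do
  {:received, core_id, data} -> IO.inspect({core_id, data})
end
# prints {0, [1.0, 2.0]}

send(receiver, :stop)
```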
Reading about infeeds/outfeeds in the context of TPUs, it seems that TPU workloads are almost ALWAYS infeed/outfeed bound, so taking advantage of TPUs in "coreless" mode is really important for performance. A TPU running in coreless mode is basically just using the TPU host's CPU. The TPU host has 300GB of memory, which can be used for preprocessing/transformations in the data pipeline. It seems the most efficient way to train with a TPU would be to:
Write a training loop that makes use of infeeds/outfeeds to run the neural network itself
Implement an input pipeline that takes advantage of the TPU host's CPU for transformations. These transformations can be defn-compiled functions when needed, or just plain Elixir for IO work. An additional advantage we have is that it should be very straightforward to run these transformations in parallel and feed multiple TPU cores. A single TPU pod has 2048 TPU cores, so they are massively parallel, and Elixir is the perfect language for handling this.
In the same respect, we need an outfeed receiver that can handle a massively parallel training pipeline.
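For the host-side transformations in step 2 above, a minimal sketch of how plain Elixir parallelizes a preprocessing stage across CPU cores. Task.async_stream is the real standard-library call; the doubling transform is just a stand-in for decode/augment/normalize work:

```elixir
batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

preprocessed =
  batches
  |> Task.async_stream(
    fn batch ->
      # stand-in for a real transformation (decode, augment, normalize, ...)
      Enum.map(batch, &(&1 * 2))
    end,
    # one transform per scheduler thread; ordered results so batches can be
    # fed to device cores in sequence
    max_concurrency: System.schedulers_online(),
    ordered: true
  )
  |> Enum.map(fn {:ok, batch} -> batch end)

# preprocessed == [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
```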
One of the big questions is how we implement something like this in a backend-agnostic way. Having an Nx.infeed or an Nx.outfeed wouldn't really make sense; I think these probably tie in best with Nx.device.