
Implement Infeed/Outfeed Ops #127

Closed
seanmor5 opened this issue Jan 4, 2021 · 1 comment
Labels
kind:feature New feature or request

Comments

seanmor5 commented Jan 4, 2021

I have been reading a bit about XLA's Infeed/Outfeed ops and how they can be used to send data between host and device while a computation is still running, which presents an opportunity for what I think would be significant performance increases.

JAX currently holds MLPerf training records on a large TPU cluster. Looking at their ResNet implementation, you can see how their training loop makes use of XLA ops (loops and infeeds) to speed up training. Infeeds let you run multiple training steps inside a single computation, feeding in a new batch each step, without relaunching the computation from the host. Infeeds accept an input shape and a token; tokens enforce an ordering between operations across replicas/partitions. Adding this feature would allow us to do something similar for whatever NN library we decide to implement, and would also give users the flexibility to speed up their own custom training loops.
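
For illustration, here is a rough sketch of what an infeed-driven loop could look like if we exposed XLA's builder API from Elixir. The `XLA.Builder`/`XLA.Op` module names and every signature below are hypothetical placeholders for whatever bindings we settle on, modeled loosely on XLA's C++ builder:

```elixir
# Hypothetical sketch only: XLA.Op and all signatures below are placeholder
# names for XLA's builder API, not an existing Nx/EXLA surface.
defmodule InfeedLoop do
  def build(builder, batch_shape, steps) do
    # Tokens thread an explicit ordering through side-effecting ops, so XLA
    # cannot reorder infeeds across replicas/partitions.
    token = XLA.Op.create_token(builder)
    init = XLA.Op.tuple(builder, [XLA.Op.constant(builder, 0), token])

    # Condition: keep looping while step < steps.
    cond_fn = fn state ->
      step = XLA.Op.get_tuple_element(state, 0)
      XLA.Op.less(step, XLA.Op.constant(builder, steps))
    end

    # Body: pull the next batch from the host via infeed, run one train step.
    body_fn = fn state ->
      step = XLA.Op.get_tuple_element(state, 0)
      token = XLA.Op.get_tuple_element(state, 1)
      {batch, token} = XLA.Op.infeed(token, batch_shape)
      _updated_params = train_step(batch)
      XLA.Op.tuple(builder, [XLA.Op.add(step, XLA.Op.constant(builder, 1)), token])
    end

    # The whole multi-step loop compiles to a single XLA computation, so the
    # host dispatches once and then just streams batches through the infeed.
    XLA.Op.while(cond_fn, body_fn, init)
  end

  # Placeholder for the real forward/backward pass.
  defp train_step(batch), do: batch
end
```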

In the same sense that you can pass data to a device during a computation, you can receive data from a still-running computation using outfeeds. An outfeed accepts a shape as well, and an outfeed receiver then handles the data it emits. The Python XLA client implements its outfeed receiver in C++, but reading the implementation notes, it seems like Elixir is a perfect fit for handling everything the Python outfeed receiver does in C++.
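
To make that concrete, here is a minimal sketch of an outfeed receiver as a plain GenServer. It assumes a hypothetical blocking `XLA.Client.transfer_from_outfeed/2` binding (the one piece the Python client had to write in C++); everything else is ordinary OTP:

```elixir
defmodule OutfeedReceiver do
  @moduledoc """
  Sketch: one process per device that blocks on the outfeed and forwards each
  received buffer to a subscriber process. `XLA.Client.transfer_from_outfeed/2`
  is a hypothetical NIF (it would need to be a dirty NIF, since it blocks).
  """
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{device: opts[:device], shape: opts[:shape], subscriber: opts[:subscriber]}
    # Kick off the receive loop once the process is up.
    {:ok, state, {:continue, :receive}}
  end

  @impl true
  def handle_continue(:receive, state) do
    # Hypothetical blocking call: waits until the running computation executes
    # its next outfeed op, then returns the transferred buffer.
    buffer = XLA.Client.transfer_from_outfeed(state.device, state.shape)
    send(state.subscriber, {:outfeed, state.device, buffer})
    {:noreply, state, {:continue, :receive}}
  end
end
```

Because each receiver is just a process, scaling to many devices is a matter of starting one per device under a supervisor.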

Reading about infeeds/outfeeds in the context of TPUs, it seems TPU workloads are almost ALWAYS infeed/outfeed bound, so taking advantage of TPUs in "coreless" mode is really important for performance. A TPU running in coreless mode is basically just using the TPU host's CPU, which has 300 GB of memory that can be used for preprocessing/transformations in the data pipeline. It seems the most efficient way to train with a TPU would be to:

  1. Write a training loop that uses infeeds/outfeeds for actually processing the neural network.
  2. Implement an input pipeline that takes advantage of the TPU host's CPU to do transformations. These transformations can be defn-compiled functions when needed, or just plain Elixir for IO work. An additional advantage we have is that it should be very straightforward to run these transformations in parallel and feed multiple TPU cores (see the pipeline sketch after this list). A single TPU Pod has 2048 TPU cores, so training is massively parallel, and Elixir is the perfect language for handling this.
  3. In the same respect, implement an outfeed receiver that can handle a massively parallel training pipeline.
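
As a rough illustration of points 2 and 3, the host side could fan preprocessed batches out to many cores with nothing but the standard library. Only `Task.async_stream` is real here; `XLA.Client.transfer_to_infeed/2` is again a hypothetical binding:

```elixir
defmodule InputPipeline do
  # Sketch: stream examples from storage, preprocess them concurrently on the
  # TPU host's CPU, and round-robin the resulting batches across device
  # infeeds. transfer_to_infeed/2 is a hypothetical NIF.
  def feed(example_stream, devices, preprocess_fun, batch_size) do
    example_stream
    |> Task.async_stream(preprocess_fun, max_concurrency: System.schedulers_online())
    |> Stream.map(fn {:ok, example} -> example end)
    |> Stream.chunk_every(batch_size)
    |> Stream.zip(Stream.cycle(devices))
    |> Enum.each(fn {batch, device} ->
      XLA.Client.transfer_to_infeed(device, batch)
    end)
  end
end
```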

One of the big questions is how we implement something like this so that it's backend-agnostic. Having an Nx.infeed or an Nx.outfeed wouldn't really make sense. I think these probably tie in best with Nx.device.
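
One possible shape, purely as a strawman: an optional behaviour a device/backend could implement if it supports streaming, so EXLA opts in and other backends simply don't. All names and signatures below are placeholders for discussion:

```elixir
defmodule Nx.Device.Feed do
  @moduledoc """
  Strawman only: an optional behaviour for devices that support streaming data
  into or out of a running computation. Nothing here is a settled proposal.
  """

  # Push a batch of tensors into the device's infeed queue.
  @callback infeed(device :: term(), tensors :: [Nx.Tensor.t()]) ::
              :ok | {:error, term()}

  # Block until the running computation outfeeds a value of the given shape.
  @callback outfeed(device :: term(), shape :: tuple()) ::
              {:ok, Nx.Tensor.t()} | {:error, term()}
end
```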

seanmor5 linked a pull request Jan 8, 2021 that will close this issue
josevalim added the kind:feature label Jan 23, 2021
josevalim mentioned this issue Jan 27, 2021
josevalim commented:
I will open up a new issue that defines the overall roadmap for this.
