Fetching contributors…
Cannot retrieve contributors at this time
34 lines (24 sloc) 6.85 KB

Amazon DSSTNE: Q&A

What is DSSTNE?

Deep Scalable Sparse Tensor Network Engine, (DSSTNE), pronounced “Destiny”, is an Amazon developed library for building Deep Learning (DL) machine learning (ML) models for recommendations. DSSTNE significantly outperforms current open source DL libraries on certain tasks where training data is sparse (where almost all the values are zero), common in recommendation problems.

Why is Amazon open sourcing DSSTNE?

We are releasing DSSTNE as open source software so that the promise of deep learning can extend beyond speech and language understanding and object recognition to other areas such as search and recommendations. We hope that researchers around the world can collaborate to improve it. But more importantly, we hope that it spurs innovation in many more areas.

Why did Amazon write this library?

Every day, hundreds of millions of customers shop on Amazon. We help them discover the right product from an immense catalog of products. Making good recommendations requires neural networks. Even a simple 3-layer autoencoder with an input layer of hundreds of million nodes (one for every product), a hidden layer of 1000 nodes, and an output layer mirroring the input layer has upwards of a trillion parameters to learn. Once again, this is hard to solve with today’s hardware. Reducing the size of the network by limiting it to a single product category and to users in the United States also pushes the boundaries of present day GPUs. For example, the weight matrices of a 3-layer autoencoder with 8 million nodes in the input and output layer and 256 nodes in the hidden layer consumes 8GB of memory in single precision arithmetic. Even training such a network using open source software with shopping data from tens of millions of users would take weeks on the fastest GPUs available on the market. A few weeks into this endeavor we realized that we couldn’t make meaningful progress without writing software that parallelized backpropagation by distributing the computation across multiple GPUs. So we built our own deep learning package.

What makes DSSTNE different/differentiated from other deep learning libraries/offerings?

DSSTNE has been built from the ground up to support use cases where training data is sparse, while other libraries such as Caffe, TensorFlow, Theano and Torch have a larger feature set and network support. DSSTNE resembles Caffe in spirit, emphasizing performance for production applications. DSSTNE is much faster than any other DL package (2.1x compared to TensorFlow in 1 g2.8xlarge) for problems involving sparse data, which includes recommendations problems and many natural language understanding (NLU) tasks. DSSTNE is also significantly better than other packages at using multiple GPUs in a single server. DSSTNE can automatically distribute each computation across all available GPUs, speeding up all computations, and also enabling larger models than other packages can build without heroic effort. Practically, this means being able to build recommendations systems that can model ten million unique products instead of hundreds of thousands, or NLU tasks with very large vocabularies. For problems of this size, other packages would need to revert to CPU computation for the sparse data, decreasing performance by about an order of magnitude. Also, DSSTNE’s network definition language is much simpler than Caffe’s, as it would require only 33 lines of code to express the popular AlexNet image recognition model, whereas Caffe’s language requires over 300 lines of code. DSSTNE does not yet support the convolutional layers needed for image processing, and has only limited support for recurrent layers needed for many NLU and speech recognition tasks, but both of these features are on its immediate roadmap.

What are the new and innovative capabilities?

  1. 2.1x Speedup vs TensorFlow in a g2.8xlarge for MovieLens recommendations.
  2. Multi-GPU capability of model-parallelism, increasing the speed of training by automatically distributing the computation between multiple interconnected GPUs in the same host. DSSTNE makes it easy to take advantage of servers with multiple GPUs by automatically distributing the computations across GPUs. Scaling out deep neural network training jobs is important because these jobs can take a very long time to run, limiting experimentation. The most common pattern is to split the training data across GPUs, have each GPU train its own model on a part of the data and then keep those models in sync somehow. This pattern, called “data-parallel training”, generally requires making trade-offs between speed and accuracy. DSSTNE instead uses “model-parallel training”, where each layer of the network is split across the available GPUs so each operation just runs faster. Model-parallel training is harder to implement, but it doesn’t come with the same speed/accuracy trade-offs of data-parallel training. Other neural network libraries offer a version of model-parallel training where each layer or operation can be assigned to a different GPU. While this approach works for some problems, it doesn’t work in situations where the weight matrices just don’t fit in the memory of a single GPU – like the recommendations problem at Amazon.

What is the license associated with DSSTNE? Restrictions? Attributions? Requirements?

DSSTNE is licensed with the business-friendly Apache 2.0 license

Can DSSTNE be run on multiple instances working on the same data set?

Yes, it can be run on cluster with multiple instances (like EMR). If a MPI cluster is set up, the framework detects the lack of Peer-to-Peer connectivity between the GPUs in play and automagically switches to MPI system memory collectives. That being said, please ensure that all your nodes are in the same placement group connected by Elastic File System, so that you can share the data across the instances

Is there a way to run this on multiple instances working on the same data set? like EMR?

Yes, if you set up an MPI cluster, it should work. The framework detects the lack of P2P connectivity between the GPUs in play and automagically switches to MPI system memory collectives (distributed enabled from day 1). However, at 10 Gb/s between g2.8xlarge instances plus ~2-3 GB/s upload/download because virtualization, you will need a rather large layer to get efficient scaling. That said, if you just want to run large models, it should just work. It has, of course, never been tried beyond a single instance to my knowledge. What will be interesting down the road is running this on RDMA-enabled servers that allow inter-system P2P copies. That will need a few more lines of code, but it also ought to work. If you are using Multiple Instance please ensure that all your nodes are in the same placement group. I would try Multiple EC2 instances on the same Placement group connected by Elastic File System so that you can share the data across the instances