The most common practice is to train deep learning models on GPUs or TPUs. However, this strategy does not effectively utilize the extensive CPU and memory resources available on the server. We introduce a generic deep learning framework for heterogeneous CPU+GPU architectures that maximizes the convergence rate and resource utilization simultaneously. We design two heterogeneous asynchronous stochastic gradient descent (SGD) algorithms. The first algorithm – CPU+GPU Hogbatch – combines small batches on the CPU with large batches on the GPU in order to maximize the utilization of both resources. The second algorithm – Adaptive Hogbatch – assigns batches whose size evolves continuously based on the relative speed of the CPU and the GPU. See our arXiv paper for more details.
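As a rough illustration of the Hogbatch pattern, the sketch below runs asynchronous SGD workers that update a shared model Hogwild-style, i.e., without locks: several workers consume small batches (standing in for CPU threads) while one worker consumes large batches (standing in for the GPU). It uses `std::thread` rather than raw pthreads for brevity, trains least-squares regression on synthetic data, and all constants, names, and the intentionally racy model update are illustrative only, not the paper's implementation.

```cpp
// Minimal sketch of the CPU+GPU Hogbatch pattern on a single machine.
// The "GPU" is simulated by a thread that consumes large batches while
// several "CPU" threads consume small ones; all workers update a shared
// model Hogwild-style, without locks (the data race is intentional and
// benign for SGD). Constants and the least-squares task are illustrative.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

constexpr int DIM = 16;          // model dimensionality
constexpr int N = 100000;        // number of synthetic training examples
constexpr int CPU_BATCH = 16;    // small batch for CPU workers
constexpr int GPU_BATCH = 1024;  // large batch for the simulated GPU worker
constexpr double LR = 0.01;      // learning rate

std::vector<double> w(DIM, 0.0);     // shared model, updated without locks
std::vector<std::vector<double>> X;  // features
std::vector<double> y;               // labels
std::atomic<long> updates{0};        // total model updates across workers

// One asynchronous SGD worker: sample a batch, compute the squared-loss
// gradient, and write it into the shared model with no synchronization.
void worker(int batch, long steps, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, N - 1);
    std::vector<double> grad(DIM);
    for (long s = 0; s < steps; ++s) {
        std::fill(grad.begin(), grad.end(), 0.0);
        for (int b = 0; b < batch; ++b) {
            int i = pick(rng);
            double pred = 0.0;
            for (int d = 0; d < DIM; ++d) pred += w[d] * X[i][d];
            double err = pred - y[i];
            for (int d = 0; d < DIM; ++d) grad[d] += err * X[i][d];
        }
        for (int d = 0; d < DIM; ++d) w[d] -= LR * grad[d] / batch;
        ++updates;
    }
}

int main() {
    // Synthetic linear data: y = <x, w*> with w* = (1, ..., 1).
    std::mt19937 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);
    X.assign(N, std::vector<double>(DIM));
    y.assign(N, 0.0);
    for (int i = 0; i < N; ++i)
        for (int d = 0; d < DIM; ++d) {
            X[i][d] = gauss(rng);
            y[i] += X[i][d];
        }

    // Four small-batch "CPU" workers run alongside one large-batch
    // "GPU" worker; all race on the same model vector.
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back(worker, CPU_BATCH, 20000L, 100u + t);
    pool.emplace_back(worker, GPU_BATCH, 2000L, 7u);
    for (auto& th : pool) th.join();

    std::printf("model updates: %ld, w[0] (should approach 1): %f\n",
                updates.load(), w[0]);
    return 0;
}
```

In the actual framework the large-batch worker would drive cuBLAS kernels on the GPU; the point here is only the concurrent small-batch/large-batch update pattern on one shared model.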
- The heterogeneous Hogbatch algorithms outperform the CPU-only and GPU-only solutions – including TensorFlow, which runs only on the GPU – in time to convergence by large margins.
- Hogwild on CPU has the best statistical efficiency. Nonetheless, the Adaptive CPU+GPU algorithm achieves similar performance on all the datasets.
- The heterogeneous algorithms provide consistent performance across two computing architectures with different numbers and types of GPUs. The batch size threshold controls the difference between CPU+GPU and Adaptive in both the number of model updates and resource utilization, which have a direct impact on the convergence of the loss function (see the batch-sizing sketch after this list).
- With few exceptions, CPU+GPU is superior for low-dimensional datasets, while Adaptive is better for sparse, high-dimensional data.
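For intuition, here is a hypothetical sketch of the kind of adaptive batch-sizing rule Adaptive Hogbatch embodies: a fixed per-round batch budget is split between CPU and GPU in proportion to their measured throughput, clamped by the batch size threshold. The function name, signature, and constants are assumptions for illustration, not the paper's exact policy.

```cpp
// Hypothetical adaptive batch-sizing rule: split a fixed per-round batch
// budget between CPU and GPU in proportion to their measured throughput
// (examples per second), clamped by the batch size threshold so neither
// resource starves. Assumes total >= 2 * threshold. Illustrative only.
#include <algorithm>

struct BatchSplit { int cpu; int gpu; };

BatchSplit rebalance(double cpu_rate, double gpu_rate, int total, int threshold) {
    double share = cpu_rate / (cpu_rate + gpu_rate);  // CPU's throughput share
    int cpu = std::max(threshold, static_cast<int>(share * total));
    cpu = std::min(cpu, total - threshold);  // leave the GPU at least `threshold`
    return {cpu, total - cpu};
}
```

A larger threshold forces the split toward an even division, which changes both how many model updates each side performs and how busy each resource stays.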
The implementation uses:
- C/C++ with the pthreads library
- OpenMP 3.7.0-3, Intel MKL 2.187
- CUDA 10.0, cuBLAS 10.2.1.243-1
- TensorFlow 1.13.1
- The threads communicate using our custom asynchronous message queue.
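The queue in the repository is a custom implementation; the minimal sketch below only illustrates the general asynchronous producer/consumer pattern, built here on a mutex and a condition variable. The class name and interface are placeholders.

```cpp
// Rough illustration of an asynchronous message queue between threads,
// built on a mutex and a condition variable. The class name and interface
// are placeholders; the repository's queue is a custom implementation.
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class AsyncQueue {
public:
    // Non-blocking send: enqueue a message and wake one waiting receiver.
    void send(T msg) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();
    }

    // Blocking receive: wait until a message is available, then dequeue it.
    T receive() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T msg = std::move(q_.front());
        q_.pop();
        return msg;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<T> q_;
};
```

In this pattern, a consumer thread blocks in `receive()` until a producer posts a message with `send()`, so workers exchange work items without busy-waiting.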
The datasets covtype, w8a, and real-sim can be downloaded from link; delicious can be downloaded from link.