# Feature Highlight: Multi-GPU Training
A `device` in Minerva is an abstraction of a computing resource. Normally, a device is the CPU, GPU#0, GPU#1, and so on. But in Minerva's design, a device can go beyond a single computing resource: for example, one could write a device for the hybrid use of CPU and GPU, or even one spanning multiple machines. Although we currently only use a device to represent a single GPU or the CPU, the abstraction is easy to extend in the future.
Minerva exposes the following interfaces for creating and switching devices.

In C++:

```cpp
uint64_t MinervaSystem::CreateCpuDevice();
uint64_t MinervaSystem::CreateGpuDevice(int which);
void MinervaSystem::SetDevice(uint64_t devid);
```

In Python:

```python
devid = owl.create_cpu_device()
devid = owl.create_gpu_device(gpuid)
owl.set_device(devid)
```
The `create_xxx_device` functions return an internal unique id representing the device. The `set_device` function tells Minerva that all following computations should be executed on the given device, until another `set_device` call. For example,
```python
gpu0 = owl.create_gpu_device(0)
gpu1 = owl.create_gpu_device(1)
owl.set_device(gpu0)
x = owl.zeros([100, 200])
owl.set_device(gpu1)
y = owl.zeros([100, 200])
```
so `x` is created on GPU#0 while `y` is created on GPU#1.
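To make the create/set pattern concrete, here is a minimal pure-Python sketch of the bookkeeping such an API might do. `DeviceRegistry` and its methods are hypothetical stand-ins for illustration only, not part of Minerva's API.

```python
import itertools

class DeviceRegistry:
    """Toy stand-in for a framework's device bookkeeping (illustration only)."""
    def __init__(self):
        self._ids = itertools.count()
        self._devices = {}   # devid -> description
        self.current = None  # devid used for subsequent operations

    def create_device(self, desc):
        # Return an internal unique id representing the device.
        devid = next(self._ids)
        self._devices[devid] = desc
        return devid

    def set_device(self, devid):
        # All "computations" after this call would target `devid`.
        self.current = devid

    def describe_current(self):
        return self._devices[self.current]

registry = DeviceRegistry()
gpu0 = registry.create_device("GPU#0")
gpu1 = registry.create_device("GPU#1")

registry.set_device(gpu0)
print(registry.describe_current())  # GPU#0
registry.set_device(gpu1)
print(registry.describe_current())  # GPU#1
```

The real system would of course attach allocators and execution streams to each id; the point here is only the id-based create/switch protocol.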
## Run code on multiple GPUs
Let us look at the example again,
```python
gpu0 = owl.create_gpu_device(0)
gpu1 = owl.create_gpu_device(1)
owl.set_device(gpu0)
x = owl.zeros([100, 200])
owl.set_device(gpu1)
y = owl.zeros([100, 200])
z = x + y
```
We now understand that the first `zeros` and the second `zeros` will be executed on different cards. But it seems that the two cards are used one after another rather than simultaneously. How could we utilize multiple GPUs at the same time?
The answer is Lazy Evaluation.
Recall that in Feature-Highlight: Dataflow engine, we introduced how Minerva parallelizes code using lazy evaluation and a dataflow engine. In short, if two operations are independent, they will be executed at the same time. Therefore, in the above example, not only are `x` and `y` created on two different cards, they are also created concurrently.
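The scheduling effect can be sketched without Minerva. Below, two independent "operations" are submitted to a thread pool, standing in for a dataflow engine running independent graph nodes concurrently; `make_zeros` and the sleep-based timing are assumptions for illustration, not Minerva code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def make_zeros(shape):
    """Pretend device work: sleep, then build a zero matrix."""
    time.sleep(0.2)  # stands in for real kernel execution time
    return [[0.0] * shape[1] for _ in range(shape[0])]

start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    # No data dependency between the two ops, so the engine
    # is free to run them at the same time.
    fx = pool.submit(make_zeros, (100, 200))  # "on GPU#0"
    fy = pool.submit(make_zeros, (100, 200))  # "on GPU#1"
    x, y = fx.result(), fy.result()
elapsed = time.time() - start

# The two independent ops overlap, so total time is ~0.2s, not ~0.4s.
print("elapsed: %.2fs" % elapsed)
```

A dependent operation like `z = x + y` would instead wait for both futures, exactly as a dataflow node waits for its inputs.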
Also note that for `z = x + y`, `x` is on GPU#0 while `y` and `z` are on GPU#1. How is the data transferred? In fact, Minerva handles the data transmission transparently, so you do not need to worry about it.
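One simple way a framework can make cross-device operands "just work" is to insert a copy whenever the devices of two operands differ. The toy sketch below (hypothetical `Tensor`, `copy_to`, and `add`; not Minerva's actual implementation) illustrates the idea.

```python
class Tensor:
    """Toy tensor tagged with a device name (illustration only)."""
    def __init__(self, data, device):
        self.data = data
        self.device = device

def copy_to(t, device):
    # Stand-in for a device-to-device transfer.
    return Tensor(list(t.data), device)

def add(a, b):
    # If the operands live on different devices, transparently move `a`
    # to `b`'s device first, the way a dataflow engine might insert a
    # copy node into the execution graph.
    if a.device != b.device:
        a = copy_to(a, b.device)
    return Tensor([u + v for u, v in zip(a.data, b.data)], b.device)

x = Tensor([1.0, 2.0], "GPU#0")
y = Tensor([3.0, 4.0], "GPU#1")
z = add(x, y)
print(z.data, z.device)  # [4.0, 6.0] GPU#1
```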
## Training a Neural Network Using Multiple GPUs
Using the concepts above, it is very easy to apply data parallelism to train a neural network on multiple GPUs.
Data parallelism works by first dispatching partitions of each mini-batch to different training units (e.g., one GPU each); each unit trains on its part of the mini-batch with the same weights; the gradients generated during training are then accumulated, the weights are updated, and the next mini-batch begins. Such a paradigm of parallelism is called Bulk Synchronous Parallel. We will soon show how to express such parallelism in several lines of code using Minerva's API.
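The BSP step just described can be sketched in plain Python for a toy 1-D linear model: split the mini-batch into shards, compute a gradient per shard (this is what runs on separate GPUs in the real setting), average the gradients at the barrier, and apply one synchronized update. All names here are illustrative, not Minerva API.

```python
def grad_on_unit(w, shard):
    """dL/dw for L = mean((w*x - y)^2) over this unit's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def bsp_step(w, minibatch, num_units, lr=0.1):
    # 1) Dispatch partitions of the mini-batch to the units.
    size = len(minibatch) // num_units
    shards = [minibatch[i * size:(i + 1) * size] for i in range(num_units)]
    # 2) Each unit computes a gradient with the same weights
    #    (sequential here; concurrent on real GPUs).
    grads = [grad_on_unit(w, s) for s in shards]
    # 3) Accumulate at the synchronization barrier and update once.
    grad = sum(grads) / num_units
    return w - lr * grad

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # true w = 3
w = 0.0
for _ in range(50):
    w = bsp_step(w, data, num_units=2)
print(round(w, 3))  # → 3.0
```

Because every unit starts each step from identical weights and updates happen only at the barrier, the result matches single-unit training on the full mini-batch.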
Suppose we have a single-card training algorithm as follows (pseudo-code):
```python
train_set = load_train_set(minibatch_size=256)
for epoch in range(MAX_EPOCH):
    for mbidx in range(len(train_set)):
        (data, label) = train_set[mbidx]
        grad = ff_and_bp(data, label)
        update(grad)
```
Given the training algorithm structure above, we could convert it to use data parallelism as follows:
```python
gpus = [owl.create_gpu_device(i) for i in range(num_gpu)]
train_set = load_train_set(minibatch_size=256 // num_gpu)
for epoch in range(MAX_EPOCH):
    for mbidx in range(0, len(train_set), num_gpu):
        grads = []
        for i in range(num_gpu):
            owl.set_device(gpus[i])  # calculate each gradient on a different GPU
            (data, label) = train_set[mbidx + i]
            grad_each = ff_and_bp(data, label)
            grads.append(grad_each)
        owl.set_device(gpus[0])  # (optional) choose GPU#0 for the update
        grad = accumulate(grads)
        update(grad)
```
In the above example, each GPU takes charge of forward and backward propagation on one small slice of the mini-batch. The gradients are accumulated and the weights are updated on GPU#0. As before, you do not need to worry about data transmission among the different GPUs; it is handled automatically for you.
In fact, we implement almost the same logic for all our multi-GPU training algorithms. You can find them here: