Feature Highlight: Multi-GPU Training

Minjie Wang edited this page Apr 21, 2015 · 4 revisions


Device

In Minerva, a device is an abstraction for any computing resource. Typically a device is the CPU, GPU#0, GPU#1, and so on. By design, however, a device is not limited to a single physical resource: one could implement a device that uses CPU and GPU together, or even spans multiple machines. Although we currently only use devices to represent a single GPU or CPU, the abstraction is easy to extend in the future.

Create/switch Devices

Minerva exposes the following interfaces for creating and switching devices. In C++:

uint64_t MinervaSystem::CreateCpuDevice();
uint64_t MinervaSystem::CreateGpuDevice(int which);
void MinervaSystem::SetDevice(uint64_t devid);

In Python:

devid = owl.create_cpu_device()
devid = owl.create_gpu_device(gpuid)
owl.set_device(devid)
  • Creation: The create_xxx_device functions return an internal unique id that represents the device.
  • Switch: The set_device function tells Minerva that all following computations should be executed on the given device until another set_device call. For example,

    gpu0 = owl.create_gpu_device(0)
    gpu1 = owl.create_gpu_device(1)
    owl.set_device(gpu0)
    x = owl.zeros([100, 200])
    owl.set_device(gpu1)
    y = owl.zeros([100, 200])

    Here, x is created on GPU#0 while y is created on GPU#1.
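The semantics of create/set_device can be pictured with a toy sketch in plain Python (this is an illustration of the idea, not Minerva's implementation): creation returns an opaque id, and set_device changes a global "current device" that tags every array created afterwards.

```python
# Toy sketch (not Minerva code) of create/set_device semantics:
# a registry of devices, a global "current device", and arrays that
# remember the device that was active when they were created.

_devices = []        # registry of device descriptions
_current = None      # id of the active device

def create_cpu_device():
    _devices.append("cpu")
    return len(_devices) - 1

def create_gpu_device(which):
    _devices.append("gpu#%d" % which)
    return len(_devices) - 1

def set_device(devid):
    global _current
    _current = devid

def zeros(shape):
    # Each "array" is tagged with the device active at creation time.
    return {"shape": shape, "device": _devices[_current]}

gpu0 = create_gpu_device(0)
gpu1 = create_gpu_device(1)
set_device(gpu0)
x = zeros([100, 200])
set_device(gpu1)
y = zeros([100, 200])
print(x["device"], y["device"])  # gpu#0 gpu#1
```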

Running code on multiple GPUs

Let us look at the example again:

gpu0 = owl.create_gpu_device(0)
gpu1 = owl.create_gpu_device(1)
owl.set_device(gpu0)
x = owl.zeros([100, 200])
owl.set_device(gpu1)
y = owl.zeros([100, 200])
z = x + y

We now understand that the first zeros and the second zeros are executed on different cards. But it seems the two cards are used one after another rather than simultaneously. How can we utilize multiple GPUs at the same time?

The answer is Lazy Evaluation.

Recall that in Feature-Highlight: Dataflow engine, we introduced how Minerva parallelizes code using lazy evaluation and a dataflow engine. In short, if two operations are independent, they are executed at the same time. Therefore, in the above example, x and y are not only created on two cards, but also created concurrently.
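The behavior can be pictured with a toy dataflow engine in plain Python (a drastic simplification, not Minerva's actual engine): each operation records its dependencies, and every operation whose inputs are ready is launched in the same parallel wave.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataflow sketch (not Minerva's engine): ops with all dependencies
# satisfied are independent of each other and run together on threads.

class Op:
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, list(deps)
        self.result = None

def run_graph(ops):
    pending = list(ops)
    waves = []                       # execution waves, for inspection
    done = set()
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [op for op in pending
                     if all(d in done for d in op.deps)]
            # Every op in `ready` is independent: run the wave in parallel.
            results = pool.map(
                lambda o: o.fn(*[d.result for d in o.deps]), ready)
            for op, res in zip(ready, results):
                op.result = res
            waves.append([op.name for op in ready])
            done.update(ready)
            pending = [op for op in pending if op not in done]
    return waves

# x and y have no dependencies, so they form one concurrent wave;
# z depends on both and runs in the next wave.
x = Op("x", lambda: [0] * 4)
y = Op("y", lambda: [0] * 4)
z = Op("z", lambda a, b: [i + j for i, j in zip(a, b)], deps=[x, y])
waves = run_graph([x, y, z])
print(waves)  # [['x', 'y'], ['z']]
```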

Also note that for z = x + y, x is on GPU#0 while y and z are on GPU#1, so how is the data transferred? In fact, Minerva handles the data transmission transparently, so you do not need to worry about it.
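A hypothetical sketch of what "transparent transmission" means (plain Python, not Minerva code): when operands live on different devices, the engine copies one of them to the target device before computing, so user code never performs the copy itself.

```python
# Hypothetical illustration (not Minerva's implementation): an add that
# first moves its left operand to the right operand's device, the way a
# cudaMemcpyPeer-style copy would move data between GPUs.

def to_device(arr, dev):
    if arr["device"] == dev:
        return arr
    # stand-in for a device-to-device copy
    return {"data": list(arr["data"]), "device": dev}

def add(a, b):
    target = b["device"]              # result lives where b lives
    a = to_device(a, target)          # implicit transfer, hidden from user
    return {"data": [u + v for u, v in zip(a["data"], b["data"])],
            "device": target}

x = {"data": [1, 2, 3], "device": "gpu#0"}
y = {"data": [4, 5, 6], "device": "gpu#1"}
z = add(x, y)
print(z)  # {'data': [5, 7, 9], 'device': 'gpu#1'}
```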

Training Neural Network using Multiple GPUs

Using the concepts above, it is easy to apply data parallelism to train a neural network on multiple GPUs.

In data parallelism, each mini-batch is first partitioned and dispatched to different training units (e.g., one GPU per partition); each unit trains on its part of the mini-batch with the same weights; the gradients generated during training are then accumulated, the weights are updated, and the next mini-batch begins. This paradigm of parallelism is called Bulk Synchronous Parallel. We will now show how to express such parallelism in several lines of code using Minerva's API.
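Why accumulating per-partition gradients is valid can be checked with a tiny numeric sketch (plain Python, not Minerva code): for a loss that sums over examples, the gradient over the full mini-batch equals the sum of the gradients over its partitions.

```python
# Numeric check (plain Python): for a sum-of-squared-errors loss of a
# 1-D linear model y = w * x, the gradient over the whole mini-batch
# equals the sum of the per-partition gradients.

def grad(w, batch):
    # d/dw sum((w*x - y)^2) = sum(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in batch)

w = 0.5
minibatch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 6.0)]

# Split the mini-batch across two "GPUs" and accumulate their gradients.
part0, part1 = minibatch[:2], minibatch[2:]
accumulated = grad(w, part0) + grad(w, part1)

print(accumulated == grad(w, minibatch))  # True
```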

Suppose we have a single-card training algorithm as follows (pseudo-code):

train_set = load_train_set(minibatch_size=256)
for epoch in range(MAX_EPOCH):
  for mbidx in range(len(train_set)):
    (data, label) = train_set[mbidx]
    grad = ff_and_bp(data, label)
    update(grad)

We can convert the training algorithm above to use data parallelism as follows:

gpus = [owl.create_gpu_device(i) for i in range(num_gpu)]
train_set = load_train_set(minibatch_size=256 // num_gpu)  # each unit gets a slice
for epoch in range(MAX_EPOCH):
  for mbidx in range(0, len(train_set), num_gpu):
    grads = []
    for i in range(num_gpu):
      owl.set_device(gpus[i])  # calculate each gradient on different GPU
      (data, label) = train_set[mbidx + i]
      grad_each = ff_and_bp(data, label)
      grads.append(grad_each)
    owl.set_device(gpus[0])    # (optional) choose GPU#0 for update
    grad = accumulate(grads)
    update(grad)

In the above example, each GPU is in charge of the forward and backward propagation of one small slice of the mini-batch. The gradients are accumulated and the weights are updated on GPU#0. Again, you do not need to worry about the data transmission among different GPUs; it is handled automatically for you.
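The loop structure above can be exercised end-to-end with stand-ins for ff_and_bp, accumulate, and update (plain Python; the names mirror the pseudo-code and are not real Minerva calls). The "model" is a single weight w trained by gradient descent on data generated from y = 2 * x.

```python
# Toy end-to-end run of the data-parallel loop (plain Python; names
# mirror the pseudo-code, not the real Minerva API).

num_gpu = 2
w = 0.0
lr = 0.01
# Each "mini-mini batch" is a list of (x, y) pairs sampled from y = 2*x.
train_set = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]

def ff_and_bp(data):
    # gradient of sum((w*x - y)^2) over this partition
    return sum(2 * (w * x - y) * x for x, y in data)

def accumulate(grads):
    return sum(grads)

for epoch in range(200):
    for mbidx in range(0, len(train_set), num_gpu):
        grads = []
        for i in range(num_gpu):       # one partition per "GPU"
            grads.append(ff_and_bp(train_set[mbidx + i]))
        grad = accumulate(grads)       # BSP: synchronize before the update
        w -= lr * grad                 # update(grad)

print(round(w, 3))  # converges to 2.0
```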

In fact, we implement almost the same logic in all our multi-GPU training algorithms. You can find them here: