Feature Highlight: Dataflow engine

Minjie Wang edited this page Apr 20, 2015 · 17 revisions

Motivation

Minerva's design goal is to offer users more flexibility while preserving runtime efficiency. Therefore, Minerva provides a numpy-like NArray interface so users can write (hopefully) any kind of algorithm they wish. We directly map these NArray operators to efficient CPU and GPU kernels, which meets the basic speed requirement just as many other tools do. But this is far from perfect. In fact, there is a lot of parallelism within the algorithm structure which a tool could exploit to further speed up the algorithm.

Example of Parallelism

Consider the back propagation of a multi-layer perceptron below (written with Minerva's owl package; the complete example is in mnist_mlp.py).

# bp
s2 = self.w2.trans() * s3
# grad
gw2 = s3 * a2.trans() / num_samples
gb2 = s3.sum(1) / num_samples

The line s2 = self.w2.trans() * s3 calculates the error of the hidden layer by backpropagating from the classifier layer. The lines gw2 = s3 * a2.trans() / num_samples and gb2 = s3.sum(1) / num_samples calculate the gradients of the weight and the bias respectively.

Note that the data dependencies among these NArrays are as follows:

  • s2 -> {w2, s3}
  • gw2 -> {s3, a2}
  • gb2 -> {s3}

Therefore, s2, gw2 and gb2 are independent of each other and thus could be executed in parallel without any worry about data races.
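To make the independence check concrete, here is a minimal sketch (not Minerva's implementation; the dependency table and helper are hypothetical) of how recorded read-sets reveal which operations may run in parallel:

```python
# Output variable -> set of NArrays it reads, taken from the example above.
deps = {
    "s2":  {"w2", "s3"},
    "gw2": {"s3", "a2"},
    "gb2": {"s3"},
}

def independent(a, b, deps):
    """Two operations are independent if neither reads the other's output."""
    return a not in deps[b] and b not in deps[a]

print(independent("s2", "gw2", deps))   # True: safe to run concurrently
print(independent("gw2", "gb2", deps))  # True: both only read s3/a2
```

Reading a shared input (here s3) is safe; only a read-after-write or write-after-write dependency forces sequential execution.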

Dataflow representation

Based on this observation, Minerva tries to extract these data dependencies automatically. We try to keep this as transparent as possible so that you can write your program as usual.

The idea is to peek into the future. In the above example, if, when s2 = self.w2.trans() * s3 is executed, we knew in advance that two independent operations would follow, then they could be executed at the same time. Such fortune-telling technology is called lazy evaluation.

Basically, when executing code like s2 = self.w2.trans() * s3, Minerva does not evaluate the concrete value, but instead records the operator in a dataflow graph as follows.
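The recording-then-evaluating idea can be sketched in a few lines of Python. This is a hypothetical toy, not Minerva's engine: each node remembers its operator and inputs, and nothing is computed until a concrete value is demanded.

```python
class LazyNode:
    """A node in a toy dataflow graph: an operator plus its input nodes."""
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def get(self):
        # Force evaluation on demand by walking the recorded graph.
        if self.value is None:
            args = [n.get() for n in self.inputs]
            self.value = self.op(*args)
        return self.value

def add(a, b):
    # Build a graph node instead of computing a + b immediately.
    return LazyNode(lambda x, y: x + y, (a, b))

x = LazyNode(None, value=2.0)
y = LazyNode(None, value=3.0)
z = add(x, y)      # nothing is computed yet; only the graph is recorded
print(z.get())     # 5.0 -- evaluation is forced here
```

In Minerva the same trick happens behind the NArray interface, and the recorded graph is handed to the dag engine for asynchronous execution.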

Minerva's underlying engine is a dataflow engine that executes such a dataflow graph asynchronously using multiple threads on multiple GPUs. The user thread (the python thread when using owl, or the main thread when using c++) is in fact generating the dataflow graph for Minerva's dag engine. However, such laziness cannot last forever. To stop the laziness (or to block-wait for execution to finish), one could use one of the following approaches:

  • Extract the concrete value: When the concrete value needs to be extracted from the NArray data structure for printing, checkpointing, debugging and so on, the user thread will wait for the dag engine to finish computing.
    • In C++: std::shared_ptr<float> minerva::NArray::Get()
    • In Python: owl.NArray.to_numpy()
    • Some operations (e.g. count_zero) return the concrete value to users, so they are also blocking calls. Please refer to the API cheatsheet, where blocking calls are marked in bold text.
  • Explicitly wait for some ndarrays: Minerva provides a wait member function on NArray for this. When some_array.wait() is called, the user thread blocks until some_array is concretely evaluated.
    • In C++: void minerva::NArray::Wait()
    • In Python: owl.NArray.wait()
  • Call the system-wise wait function: If this function is called, the user thread will not proceed until all underlying computations finish.
    • In C++: void minerva::MinervaSystem::WaitForAll()
    • In Python: owl.wait_for_all()
  • Call system finalize: Usually used before the main thread exits.
    • In C++: void minerva::MinervaSystem::Finalize()
    • In Python: owl.finalize()
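Since owl itself is not needed to see the pattern, here is a sketch of the same block-wait semantics using only Python's standard library. The thread pool stands in for the dag engine; the mapping to the owl calls above is noted in the comments (the analogy, not the real implementation, is ours).

```python
from concurrent.futures import ThreadPoolExecutor, wait

pool = ThreadPoolExecutor(max_workers=2)  # plays the role of the dag engine

# "Operators" are submitted asynchronously, like lazy NArray ops.
f1 = pool.submit(lambda: sum(range(1000)))
f2 = pool.submit(lambda: sum(range(2000)))

r1 = f1.result()   # analogous to some_array.wait() followed by Get()
wait([f1, f2])     # analogous to owl.wait_for_all()
pool.shutdown()    # analogous to owl.finalize() before exit
print(r1)          # 499500
```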

Work-flow illustration

A complete work-flow is shown in the following figure.

[work flow] TBD

Benefits

  1. Independent operators could be executed in parallel.
  2. Overlapping I/O and computation: Since the real computation is performed by the underlying system, which uses threads other than the main/user thread, the user thread can be used to do I/O at the same time, like the following:

    for i in range(0, 10):
        # data preparation using python
        minibatch_numpy = load_minibatch(i)
        minibatch = owl.from_numpy(minibatch_numpy)
        # do training using minerva
        train_minibatch(minibatch)
  3. Easily implement data parallelism when training on multiple GPUs. (More details in this section).
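Benefit 2 can be illustrated with plain Python threads. In this hypothetical sketch (load_minibatch and train_minibatch are stand-ins, and a worker thread plays the role of Minerva's dag engine), the user thread loads minibatch i+1 while the engine is still training on minibatch i:

```python
import queue
import threading
import time

def load_minibatch(i):
    time.sleep(0.01)           # pretend this is disk I/O
    return list(range(i, i + 4))

def train_minibatch(batch):
    time.sleep(0.01)           # pretend this is GPU compute
    return sum(batch)

work = queue.Queue()
results = []

def engine():                  # plays the role of Minerva's dag engine
    while True:
        batch = work.get()
        if batch is None:      # sentinel: no more work
            break
        results.append(train_minibatch(batch))

t = threading.Thread(target=engine)
t.start()
for i in range(10):
    work.put(load_minibatch(i))  # I/O here overlaps with engine compute
work.put(None)
t.join()
print(len(results))              # 10
```

With ten minibatches, the I/O for batch i+1 runs concurrently with the training of batch i, roughly halving the wall-clock time compared to strictly sequential load-then-train.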

Some intrusive maneuvers

An ideal system should make users feel like they are still writing sequential programs while operations are automatically parallelized. However, Minerva is still at an early stage, so there are some unnatural details to pay attention to.

Performance profiling

Since the computation is performed lazily and asynchronously by engine threads, you cannot measure the real elapsed time easily. An accurate way to measure performance is as follows:

import time
owl.wait_for_all()
start_time = time.time()
'''
The codes you need to measure.
'''
owl.wait_for_all()
elapsed_time = time.time() - start_time

owl.wait_for_all() is a function that blocks until all computations issued before it (not only those for a specific NArray) have finished.

Memory consumption

Since all independent tasks are executed concurrently, sometimes the total amount of memory required exceeds the GPU memory. For example, consider the code shown above:

for i in range(0, 10):
    # data preparation using python
    minibatch_numpy = load_minibatch(i)
    minibatch = owl.from_numpy(minibatch_numpy)
    # do training using minerva
    train_minibatch(minibatch)

All 10 minibatches would be loaded onto the GPU at the same time, which may crash the program. A solution is to add a wait call once every several iterations:

for i in range(0, 10):
    # data preparation using python
    minibatch_numpy = load_minibatch(i)
    minibatch = owl.from_numpy(minibatch_numpy)
    # do training using minerva
    train_minibatch(minibatch)
    # wait every 4 minibatches
    if (i % 4) == 0:
        owl.wait_for_all()

A more natural and non-intrusive way to deal with this is to let the system schedule tasks so that they do not exceed the memory restriction. Minerva's current scheduler is a simple one and needs more work.
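One way such a memory-aware scheduler could work is to cap the number of in-flight minibatches, blocking new submissions when the cap is reached. The sketch below is hypothetical (Minerva does not do this today, per the paragraph above) and uses a counting semaphore as the memory "budget":

```python
import threading
import time

MAX_INFLIGHT = 4                       # pretend only 4 minibatches fit in GPU memory
slots = threading.BoundedSemaphore(MAX_INFLIGHT)
inflight = 0
peak = 0
lock = threading.Lock()

def task(i):
    global inflight, peak
    with lock:
        inflight += 1
        peak = max(peak, inflight)     # track worst-case memory pressure
    time.sleep(0.005)                  # pretend the minibatch occupies memory
    with lock:
        inflight -= 1
    slots.release()                    # free a memory "slot"

threads = []
for i in range(10):
    slots.acquire()                    # block if MAX_INFLIGHT tasks are live
    t = threading.Thread(target=task, args=(i,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
print(peak <= MAX_INFLIGHT)            # True
```

Compared with the periodic owl.wait_for_all() workaround, this throttles submissions continuously instead of draining the whole pipeline every few iterations.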

Turn off dataflow execution

It is also possible to turn off parallel execution and fall back to sequential execution. Just pass --use_dag=false when running applications built on Minerva.