# Interactive Parallel Computing on the Local Machine

*Computers have more than one core.* Wouldn't it be nice if we could use all the cores of our local machine from our [Jupyter][IP] notebook?

[Jupyter][IP] makes this fairly easy. One of your browser tabs is titled "Home". If you switch to that tab, you will see several tabs within the web page. One of them is called "IPython Clusters". Click on "IPython Clusters", increase the number of engines in the "default" profile to 4, and click on Start. The status changes from stopped to running. Once that is done, come back to this tab.

If the "Clusters" tab shows the message

    Clusters tab is now provided by IPython parallel. See IPython parallel for installation details.
    
you need to quit your notebook server (make sure all your notebooks are saved) and run the command

    ipcluster nbextension enable
    
Now, when you start `jupyter notebook` you should see a field that lets you set the number of engines in the "IPython Clusters" tab.




[IP]: http://www.jupyter.org

Now let's see how we access the "Cluster". [IPython][IPy] comes with a module [ipyparallel][IPp] that is used to access the engines we just started. We first need to import `Client`.

[IPp]: https://ipyparallel.readthedocs.io/en/latest/
[IPy]: http://www.ipython.org

In [None]:
from ipyparallel import Client

In [None]:
rc = Client(profile="default")

We can list the ids of the engines attached

In [None]:
rc.ids

and we create views of the engines by slicing.

In [None]:
v01 = rc[0:2] # First two engines (0 and 1)
v23 = rc[2:4] # Engines 2 and 3
dview = rc[:] # All available engines
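Slicing the client follows ordinary Python slice rules, so it can help to try a slice on a plain list first. The sketch below assumes four engines with ids 0 to 3 and uses a stand-in list, `engine_ids`, instead of a live `Client`. It only illustrates which ids each slice selects.

```python
# Stand-in for rc.ids, assuming four running engines (hypothetical values).
engine_ids = [0, 1, 2, 3]

first_two = engine_ids[0:2]      # same ids as the view rc[0:2]
last_two = engine_ids[2:4]       # same ids as the view rc[2:4]
all_ids = engine_ids[:]          # same ids as the view rc[:]
every_second = engine_ids[0::2]  # every second engine, starting at 0

print(first_two, last_two, all_ids, every_second)
```

With a different number of engines the same slices select different ids, so it is worth checking `rc.ids` before hard-coding a slice.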

## Parallel Magic

IPython provides a magic command ``%px`` to execute code in parallel. The target attribute is used to pick the engines you want.

Note that commands prefixed with ``%px`` are *not* executed locally.

In [None]:
%px import numpy as np # import numpy on all engines as np
import numpy as np # do it locally, too.

To execute a command both remotely and locally, you can use ``%%px`` with the ``--local`` option.

In [None]:
%%px --local 
np.__version__ # print the numpy version of the engines. Note how the output is prefixed. It can be accessed that way, too.

The engines run IPython, so magic commands work, too.

In [None]:
%px %matplotlib inline

In [None]:
%matplotlib inline

In [None]:
%px import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

Sometimes it is useful to be able to execute more than a single statement. The cell magic command ``%%px`` lets us do that. The option ``--target`` lets us choose which engines we want to use. Here we are using engines 0 to 3.

In [None]:
%%px --target 0:4
a = np.random.random([10,10])
plt.imshow(a, interpolation="nearest")

Yes, the output can be graphical.

Note that the imports we performed with ``%px`` are not available in our notebook. We can fix that by using

In [None]:
with rc[:].sync_imports():
    import matplotlib.pyplot

Unfortunately, mapping of namespaces (the ``as plt`` part of an import) does not work that way.

## Using the DirectView

The DirectView, as the name implies, lets you control each engine directly. You can push data to a particular (set of) engine(s). You can have the engine(s) execute a command and get results back. You decide if a command should be blocking or not.

We can, for example, create two random 100 by 100 element matrices on each engine, multiply them, and then display the result. On each engine the code would look like this:

In [None]:
a = np.random.random([100, 100])
b = np.random.random([100, 100])
c = a.dot(b)
plt.imshow(c, interpolation="nearest")

As we learned before, we can use the ``%%px`` cell magic to execute this on all engines. Here we use the ``--target`` option to specify every second engine starting at 0. ``%px`` and ``%%px`` use the currently active view. By default that's the first view created. You can make a view active by calling ``view.activate(suffix)``. Use ``view.activate?`` to learn more about suffix.

In [None]:
%%px --target 0::2
a = np.random.random([100, 100])
b = np.random.random([100, 100])
c = a.dot(b)
plt.imshow(c, interpolation="nearest")

The previous calls were executed in blocking mode because the graphical output is blocking. You can ask the view whether it is blocking.

In [None]:
dview.block
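The difference between a blocking and a non-blocking call can be illustrated without a cluster. The sketch below is an analogy only: it swaps in Python's standard `concurrent.futures` module for ipyparallel, and `slow_square` is a made-up stand-in for work done on an engine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Hypothetical stand-in for work done on an engine."""
    time.sleep(0.2)
    return x * x


with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(slow_square, 3)       # non-blocking: returns a handle at once
    launch_time = time.perf_counter() - start  # time needed to hand off the task
    result = future.result()                   # blocking: waits for the result
    total_time = time.perf_counter() - start   # time until the result is available

print("launch: %.4f s, result after: %.4f s, value: %d" % (launch_time, total_time, result))
```

The launch time is much shorter than the time until the result arrives, which is the distinction between non-blocking and blocking execution.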

Let's leave out the imshow command.

In [None]:
%%px 
a = np.random.random([100, 100])
b = np.random.random([100, 100])
c = a.dot(b)

## Exploring Latency and Bandwidth

Latency (the time until something happens) and bandwidth (the amount of data we get through the network per unit of time) are two important properties of your parallel system that define what is practical and what is not. We will use the ``%timeit`` magic to measure these properties. ``%timeit`` and its sibling ``%%timeit`` measure the run time of a statement (a cell in the case of ``%%timeit``) by executing the statement multiple times (by default at least 3 times). For short running routines many loops of 3 executions are performed and the minimum time measured is then displayed. The number of loops and the number of executions can be adjusted. Take a look at the documentation. Give it a try.
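The "loops of executions, then take the minimum" scheme described above can be reproduced outside the notebook with Python's standard `timeit` module. This is a minimal sketch; the statement being timed and the counts of 3 repetitions and 1000 loops are arbitrary choices for illustration.

```python
import timeit

# Statement to time; a stand-in for any short-running call.
stmt = "sum(range(1000))"
loops = 1000

# 3 repetitions of `loops` executions; each entry is the total time of one repetition.
times = timeit.repeat(stmt, repeat=3, number=loops)

# Best (minimum) time per single execution.
best_per_loop = min(times) / loops
print("%d loops, best of %d: %.2f us per loop" % (loops, len(times), best_per_loop * 1e6))
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to run during a repetition.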

Let's first see how long it takes to send off a new task using ``execute`` and ``apply``.

In [None]:
dview.block = False

Let's first execute nothing.

In [None]:
%timeit dview.execute('')

Next we'll use a very minimal function. It just returns its argument. In this case the argument is an empty string.

In [None]:
%timeit dview.apply(lambda x : x, '')

Here, we'll tell every engine to perform a matrix-matrix multiplication (see [Matrix-Matrix Multiplication Using a DirectView](#Matrix-Matrix-Multiplication-Using-a-DirectView) below for more about matrix multiplications).

In [None]:
%timeit dview.execute('c = a.dot(b)')

Now we'll make the execution blocking. This means we are measuring the time the function needs to return a result instead of just the time needed to launch the task.

In [None]:
dview.block=True

In [None]:
%timeit dview.execute('')

In [None]:
%timeit dview.apply(lambda x : x, '')

In [None]:
%timeit dview.execute('c = a.dot(b)')

For comparison, we'll run it without a view.

In [None]:
%timeit a.dot(b)

In [None]:
dview.block=False

We can start about 500 parallel tasks per second and finish about half as many. This gives an estimate of the granularity we need to use this model for efficient parallelization. Any task that takes less time than this will be dominated by the overhead.
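To turn these rates into a rule of thumb, the sketch below converts them into per-task overheads and estimates how much of the wall time is useful work for tasks of different lengths. The rates are the figures quoted above. The additive model (task time plus a fixed overhead) is a simplifying assumption, not a measurement.

```python
launch_rate = 500.0   # tasks started per second (figure quoted above)
finish_rate = 250.0   # tasks completed per second ("about half as many")

launch_overhead = 1.0 / launch_rate  # seconds needed to start one task
round_trip = 1.0 / finish_rate       # seconds until an empty task has finished


def efficiency(task_seconds, overhead_seconds=round_trip):
    """Fraction of wall time spent on useful work in a simple additive model."""
    return task_seconds / (task_seconds + overhead_seconds)


print("launch overhead: %.1f ms, round trip: %.1f ms"
      % (launch_overhead * 1e3, round_trip * 1e3))
for task_seconds in (0.0004, 0.004, 0.04, 0.4):
    print("task of %6.1f ms -> efficiency %.2f"
          % (task_seconds * 1e3, efficiency(task_seconds)))
```

Under this model a task has to run for several tens of milliseconds before the overhead stops dominating.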

To get an idea of the available bandwidth, let's push some arrays to the engines. We make this blocking.

In [None]:
dview.block=True

In [None]:
a = np.random.random(256*1024)

In [None]:
%timeit dview.push(dict(a=a))
%timeit dview.push(dict(a=a[:128*1024]))
%timeit dview.push(dict(a=a[:64*1024]))
%timeit dview.push(dict(a=a[:32*1024]))
%timeit dview.push(dict(a=a[:16*1024]))
%timeit dview.push(dict(a=a[:8*1024]))
%timeit dview.push(dict(a=a[:4*1024]))
%timeit dview.push(dict(a=a[:2*1024]))
%timeit dview.push(dict(a=a[:1024]))

Calculate the bandwidth for the largest array and the smallest array.

In [None]:
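# Array sizes in kB (8 bytes per float64 element) divided by the push times in seconds.
# The times are hard-coded from one run; replace them with your own %timeit results.
bwmax = 256 * 8 / 0.00123
bwmin = 8 / 0.00459
print("The bandwidth is between %.2f kB/s and %.2f kB/s." % (bwmin, bwmax))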

## Matrix-Matrix Multiplication Using a DirectView

Matrix multiplication is one of the favorites in HPC. It's computationally intensive (if done right), easily parallelized with little communication, and the basis of many real-world applications.

Let's say we have two matrices $A$ and $B$, where

$$ A = \left ( \begin{array}{cccc}
                4 & 3 & 1 & 6 \\
                1 & 2 & 0 & 3 \\
                7 & 9 & 2 & 0 \\
                2 & 2 & -1 & 4 \\
               \end{array}
       \right ) $$

and 

$$ B = \left ( \begin{array}{cc}
                \frac{1}{12} & \frac{1}{2} \\
                \frac{1}{9}  & \frac{1}{4} \\
                \frac{1}{3}  &      1      \\
                \frac{1}{7}  & -\frac{1}{3}
                \end{array}
       \right ). $$

To calculate the element of $C = A B$ at row $i$ and column $j$, we perform a dot (scalar) product of the $i$th row of $A$ and the $j$th column of $B$:

$$ C_{ij} = \sum_k A_{i,k} B_{k,j}. $$

For this to work, the number of columns in $A$ has to be equal to the number of rows in $B$.
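As a worked example, the following sketch applies the sum over $k$ literally, with three nested loops, to the matrices $A$ and $B$ shown above and compares the result with ``np.dot``. The loop version is for illustration only and is far slower than NumPy for large matrices.

```python
import numpy as np

A = np.array([[4, 3, 1, 6],
              [1, 2, 0, 3],
              [7, 9, 2, 0],
              [2, 2, -1, 4]], dtype=float)
B = np.array([[1 / 12, 1 / 2],
              [1 / 9, 1 / 4],
              [1 / 3, 1.0],
              [1 / 7, -1 / 3]])

n_rows, n_inner = A.shape
n_inner_b, n_cols = B.shape
# The number of columns of A must equal the number of rows of B.
assert n_inner == n_inner_b

C = np.zeros((n_rows, n_cols))
for i in range(n_rows):           # row of A and of C
    for j in range(n_cols):       # column of B and of C
        for k in range(n_inner):  # summation index
            C[i, j] += A[i, k] * B[k, j]

print(C)
print("matches np.dot:", np.allclose(C, np.dot(A, B)))
```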

We can generate two matrices of size n by n filled with random numbers using ``np.random.random``.

In [None]:
n = 16
A = np.random.random([n, n])
B = np.random.random([n, n])

NumPy includes the dot product. For two-dimensional arrays ``np.dot`` performs a matrix-matrix multiplication.

In [None]:
C = np.dot(A, B)

In [None]:
%timeit np.dot(A, B)

There are different ways to parallelize a matrix-matrix multiplication. Each element of the matrix can be calculated independently.

In [None]:
%%timeit p = len(rc)
C1h = [[rc[(i * n + j) % p].apply(np.dot, A[i,:], B[:,j]) for j in range(n)] for i in range(n)]
dview.wait()

This, however, produces $n^2$ short tasks and the overhead (latency) is just overwhelming.
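The amount of work per task can be counted without running anything on the engines. The sketch below assumes four engines and reproduces the round-robin index `(i * n + j) % p` from the cell above to count how many tasks each engine receives and how many floating-point operations each task contains.

```python
from collections import Counter

n = 16  # matrix size used above
p = 4   # number of engines, assuming the four engines started earlier

# Engine index chosen for element (i, j), as in rc[(i * n + j) % p].
tasks_per_engine = Counter((i * n + j) % p for i in range(n) for j in range(n))

total_tasks = sum(tasks_per_engine.values())
print("total tasks:", total_tasks)
print("tasks per engine:", dict(sorted(tasks_per_engine.items())))

# Each task is one dot product of two length-n vectors:
# n multiplications and n - 1 additions.
flops_per_task = 2 * n - 1
print("floating-point operations per task:", flops_per_task)
```

A few dozen floating-point operations take far less time than it takes to launch a task, which is why this decomposition is dominated by overhead.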

We want to calculate

$$ C = \left ( \begin{array}{cccc}
                4 & 3 & 1 & 6 \\
                1 & 2 & 0 & 3 \\
                7 & 9 & 2 & 0 \\
                2 & 2 & -1 & 4 \\
               \end{array}
       \right ) 
              \left ( \begin{array}{cc}
                \frac{1}{12} & \frac{1}{2} \\
                \frac{1}{9}  & \frac{1}{4} \\
                \frac{1}{3}  &      1      \\
                \frac{1}{7}  & -\frac{1}{3}
                \end{array}
       \right ). 
$$

We can split the matrices into tiles. In the above example, we might use 2 by 2 tiles.

$$ C = \left ( \begin{array} {cc}
               a_{00} & a_{01} \\
               a_{10} & a_{11}
               \end{array} \right )
       \left ( \begin{array} {c}
               b_{00} \\
               b_{10}
               \end{array} \right )
     = \left ( \begin{array} {c}
               a_{00} b_{00} + a_{01} b_{10} \\
               a_{10} b_{00} + a_{11} b_{10}
               \end{array} \right )
               ,
$$

where, for example, $a_{00}= \left ( \begin{array}{cc} 4 & 3 \\ 1 & 2 \end{array} \right )$. $a_{00}b_{00}$ is a matrix-matrix product and the addition of two matrices of the same shape is defined element by element.

In our example, we have two $n$ by $n$ matrices and we are going to split them into quadrants.

In [None]:
n = 1024
A = np.random.random([n, n])
B = np.random.random([n, n])

In [None]:
%timeit np.dot(A,B)

In [None]:
a00 = A[:n // 2, :n // 2]
a01 = A[:n // 2, n // 2:]
a10 = A[n // 2:, :n // 2]
a11 = A[n // 2:, n // 2:]
b00 = B[:n // 2, :n // 2]
b01 = B[:n // 2, n // 2:]
b10 = B[n // 2:, :n // 2]
b11 = B[n // 2:, n // 2:]

The calculation of the partial results in Python looks very similar to the mathematical description above:

In [None]:
c00 = np.dot(a00, b00) + np.dot(a01, b10)
c01 = np.dot(a00, b01) + np.dot(a01, b11)
c10 = np.dot(a10, b00) + np.dot(a11, b10)
c11 = np.dot(a10, b01) + np.dot(a11, b11)
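The four partial results still have to be put back together to obtain the full product. One way to do this, shown here as a sketch, is ``np.block``. The helper `tiled_product` and the small 8 by 8 test matrices are introduced for illustration only, so the check runs instantly and does not overwrite the variables used in this notebook.

```python
import numpy as np


def tiled_product(A, B):
    """Multiply two square matrices of even size by quadrants (illustrative helper)."""
    h = A.shape[0] // 2
    a00, a01, a10, a11 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    b00, b01, b10, b11 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    c00 = np.dot(a00, b00) + np.dot(a01, b10)
    c01 = np.dot(a00, b01) + np.dot(a01, b11)
    c10 = np.dot(a10, b00) + np.dot(a11, b10)
    c11 = np.dot(a10, b01) + np.dot(a11, b11)
    # Reassemble the four quadrants into one matrix.
    return np.block([[c00, c01], [c10, c11]])


rng = np.random.default_rng(0)
A_small = rng.random((8, 8))
B_small = rng.random((8, 8))
C_small = tiled_product(A_small, B_small)
print("matches np.dot:", np.allclose(C_small, np.dot(A_small, B_small)))
```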

In [None]:
%%timeit
c00 = np.dot(a00, b00) + np.dot(a01, b10)
c01 = np.dot(a00, b01) + np.dot(a01, b11)
c10 = np.dot(a10, b00) + np.dot(a11, b10)
c11 = np.dot(a10, b01) + np.dot(a11, b11)

Hm, this is slower than doing it directly...

Next we create one view per engine.

In [None]:
d0 = rc[0]
d1 = rc[1]
d2 = rc[2]
d3 = rc[3]

In [None]:
c00h = d0.apply(lambda a, b, c, d : np.dot(a, b) + np.dot(c, d), a00, b00, a01, b10)
c01h = d1.apply(lambda a, b, c, d : np.dot(a, b) + np.dot(c, d), a00, b01, a01, b11)
c10h = d2.apply(lambda a, b, c, d : np.dot(a, b) + np.dot(c, d), a10, b00, a11, b10)
c11h = d3.apply(lambda a, b, c, d : np.dot(a, b) + np.dot(c, d), a10, b01, a11, b11)

In [None]:
c00h.wait()
c01h.wait()
c10h.wait()
c11h.wait()

In [None]:
c00 = c00h.get()
c01 = c01h.get()
c10 = c10h.get()
c11 = c11h.get()

In [None]:
%%timeit
c00h = d0.apply(lambda a, b, c, d : np.dot(a, b) + np.dot(c, d), a00, b00, a01, b10)
c01h = d1.apply(lambda a, b, c, d : np.dot(a, b) + np.dot(c, d), a00, b01, a01, b11)
c10h = d2.apply(lambda a, b, c, d : np.dot(a, b) + np.dot(c, d), a10, b00, a11, b10)
c11h = d3.apply(lambda a, b, c, d : np.dot(a, b) + np.dot(c, d), a10, b01, a11, b11)
c00h.wait()
c01h.wait()
c10h.wait()
c11h.wait()

Nothing says we have to stop at 4 tiles, nor do we have to use square tiles. We could also recursively subdivide our tiles.

The parallel code is not any faster because our NumPy build already blocks the matrices and uses all cores, but it shows the principle.
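The idea of using more than four tiles, or tiles that are not square, can be sketched with a serial function that loops over tiles of arbitrary size. The function `blocked_matmul` and its parameter names are hypothetical. Each tile product inside the inner loop is an independent piece of work that could be handed to an engine, but this sketch runs everything locally.

```python
import numpy as np


def blocked_matmul(A, B, tile_rows, tile_cols, tile_inner):
    """Multiply A and B tile by tile; tiles at the edges may be smaller."""
    n, k = A.shape
    k_b, m = B.shape
    if k != k_b:
        raise ValueError("columns of A must equal rows of B")
    C = np.zeros((n, m))
    for i in range(0, n, tile_rows):            # tile rows of C
        for j in range(0, m, tile_cols):        # tile columns of C
            for l in range(0, k, tile_inner):   # tiles along the summation index
                a_tile = A[i:i + tile_rows, l:l + tile_inner]
                b_tile = B[l:l + tile_inner, j:j + tile_cols]
                C[i:i + tile_rows, j:j + tile_cols] += np.dot(a_tile, b_tile)
    return C


rng = np.random.default_rng(1)
X = rng.random((6, 5))
Y = rng.random((5, 7))
Z = blocked_matmul(X, Y, tile_rows=4, tile_cols=3, tile_inner=2)
print("matches np.dot:", np.allclose(Z, np.dot(X, Y)))
```

The tile sizes here do not divide the matrix dimensions evenly, which shows that the scheme also works with rectangular and ragged tiles.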