Parallel work on the cluster with IPython Notebook
==================================================

Here we will see how to use IPython notebook to parallelize your work.
This notebook is meant to be run on patron.

Configuring IPython for Cluster work
------------------------------------

IPython handles cluster work in a fairly automated way. But first you need to take care of some configuration. 
The procedure here is adapted from http://ipython.org/ipython-doc/stable/parallel/parallel_intro.html
and from http://ipython.org/ipython-doc/stable/parallel/parallel_process.html
From patron, you need to run 
> ipython profile create --parallel --profile=ssh

This will create, in your home directory, the directory ~/.ipython/profile_ssh.

In this directory, you need to make some changes. 
You can do that by overwriting the files ipcontroller_config.py and ipcluster_config.py with the versions in this repo (profile_ssh folder).

In terms of configuration you are all set. You need to go through this step only once. 


Starting IPython Notebook
-------------------------

Still from patron, start the notebook server with 
> ipython notebook --no-browser --port=62000

Port 62000 is a high-range port, chosen so that it doesn't upset the security settings of the cluster and the Science firewall.

You can then connect from your browser (from the Science network or the Science VPN) by pointing it to
> http://patron.science.ru.nl:62000

The Home window of IPython notebook will show up. Go to the "Clusters" tab.
You should see two profiles, "default" and "ssh" (the profile created above). From "ssh", choose the number of engines you want (how many CPUs you want to use in parallel), e.g. 8 or 16. Click "Start". The cluster should then be ready.


Parallel code in a notebook
---------------------------

Here you prepare the interface to the cluster, the Client object defined below.

In [2]:
from ipyparallel import Client
c = Client(profile='ssh')

Now let's define a relatively lengthy calculation.

In [3]:
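def some_calculations():
    # Import inside the function so that numpy is also available on the engines
    import numpy as np
    a = np.random.uniform(size=[1000,1000])
    # Repeat the matrix product 50 times so that the call takes a few seconds
    for i in range(50):
        b = np.dot(a,a)
    return b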

Let's see how long it takes. This runs on the frontend node 

In [4]:
%timeit some_calculations()

1 loops, best of 3: 6.88 s per loop


Now for the cool part. The instruction below (%timeit is only there to measure the execution time) runs the same code 8 times in parallel, on 8 compute nodes. You can see that it takes only slightly longer than a single run, even though it's doing 8 times as much work!

In [5]:
%timeit c[:].apply_sync(some_calculations)

1 loops, best of 3: 7.76 s per loop
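
To put the two timings side by side, here is a small back-of-the-envelope calculation. It only uses the numbers reported by %timeit above and assumes 8 engines, as in the text; the variable names are made up for this illustration.

```python
# Timings reported by %timeit in the two cells above (seconds)
serial_time = 6.88      # one call on the frontend node
parallel_time = 7.76    # eight calls, one per engine
n_engines = 8           # assumption: the cluster was started with 8 engines

# Time the same eight calls would need if run one after the other
sequential_estimate = n_engines * serial_time

# Effective speedup and parallel efficiency
speedup = sequential_estimate / parallel_time
efficiency = speedup / n_engines

print("speedup: %.2f, efficiency: %.2f" % (speedup, efficiency))
```

The speedup comes out at about 7.1 for 8 engines, i.e. an efficiency of roughly 0.89; the remainder is overhead for sending the function out and getting the results back.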


Of course, you don't have to repeat the same code on all the compute nodes, which would be pointless...

Here we create a view of our cluster.

In [6]:
dview = c[:]

In [7]:
dview

<DirectView [0, 1, 2, 3,...]>

In [8]:
import numpy as np

Let's make a function that takes an argument.

In [9]:
def some_other_calculations(e):
    # numpy must be imported here too: the function body runs on the engines
    import numpy as np
    return np.sqrt(e)

With map_sync, the function is called once for each element of the sequence given as second argument, with the calls spread across the nodes of the view. The results come back in the same order as the inputs.

In [10]:
dview.map_sync(some_other_calculations, np.arange(1,9))

[1.0,
 1.4142135623730951,
 1.7320508075688772,
 2.0,
 2.2360679774997898,
 2.4494897427831779,
 2.6457513110645907,
 2.8284271247461903]

You could give as second argument, for example, a list of sessions, and you would have them all processed in parallel (or at least 8 of them at a time in this case). This is called the "direct" cluster interface, where you control each node explicitly. Other interfaces allow more flexibility and can, for example, balance the load between nodes if some jobs are shorter than others. We'll get to that in another notebook.

A typical workflow might involve setting up and debugging your analysis on one session interactively, e.g. on your computer, then moving the notebook to the cluster and running it on the rest of the data with this mechanism.
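
For the debugging step on your own computer, where no cluster is running, one possible stand-in for map_sync is the standard library's concurrent.futures. This is only a sketch and not part of ipyparallel: local_map_sync is a hypothetical helper name, and math.sqrt replaces np.sqrt so that nothing outside the standard library is needed.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def some_other_calculations(e):
    # Same calculation as above, using math.sqrt instead of np.sqrt
    return math.sqrt(e)


def local_map_sync(func, items, workers=4):
    """Hypothetical local stand-in for dview.map_sync(func, items).

    Runs func on every item with a small thread pool and returns the
    results as a list, in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


results = local_map_sync(some_other_calculations, range(1, 9))
print(results)
```

Once the function behaves as expected locally, the call can be swapped back to dview.map_sync on the cluster with the same function and the same list of arguments.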