# idact - Prometheus sandbox

## Initial setup

Add `idact` to path:

In [1]:
import sys
import os
import bitmath
import logging
import subprocess

sys.path.append('../')  # For running in repo. Unnecessary with pip install.
from idact import *

Hide debug information, set up context manager stack (for testing purposes).

## Add cluster (only first run)

If you need a new SSH key to connect to the cluster, specify its type:

In [2]:
key = KeyType.RSA  # Generate a new RSA key (Default location: ~/.ssh)

If you already have a key, uncomment this and provide its absolute path:

In [3]:
# key = os.path.expanduser('~/.ssh/id_rsa')

If you set `install_key` to False, you will not be asked for a password later, but the key must be installed on the cluster manually.

In [4]:
install_key = True

Add cluster:

In [5]:
cluster = add_cluster(name="pro",
                      user="plggarstka",
                      host="pro.cyfronet.pl",
                      port=22,
                      auth=AuthMethod.PUBLIC_KEY,
                      key=key,
                      install_key=install_key,
                      scratch="$SCRATCH")
save_environment()

2018-11-17 13:57:26 INFO: Generating public-private key pair.


## Load cluster (subsequent runs)

In [6]:
load_environment()
cluster = show_cluster("pro")
cluster

Cluster(pro.cyfronet.pl, 22, plggarstka, auth=AuthMethod.PUBLIC_KEY, key='C:\\Users\\Maciej/.ssh\\id_rsa_vp', install_key=True, disable_sshd=False)

A debug log is saved to `idact.log` for every session, in case you need to troubleshoot or report bugs. You can also change the log level for messages printed to standard output:

In [7]:
set_log_level(logging.INFO)
save_environment()

In [8]:
node = cluster.get_access_node()
node

Node(pro.cyfronet.pl:22, None)

On your first action, you may be asked for a password to install the key, if you chose `install_key=True` while adding the cluster.
You can connect explicitly to do this right now:

In [9]:
node.connect()

2018-11-17 13:57:32 INFO: Installing key using password authentication.
Password for plggarstka@pro.cyfronet.pl:22: 


It's important to save the environment after installing the key, so that it's installed only once:

In [10]:
print(cluster.config.install_key)
save_environment()  # Never install the key again.

False


You can run commands on the login node now:

In [11]:
node.run('whoami')

'plggarstka'

In [12]:
node.run('hostname')

'login01.pro.cyfronet.pl'

## Allocate nodes

In [13]:
nodes = cluster.allocate_nodes(nodes=2,
                               cores=2,
                               memory_per_node=bitmath.GiB(10),
                               walltime=Walltime(minutes=20),
                               native_args={
                                   '--partition': 'plgrid-testing',
                                   '--account': 'intdata'
                               })

2018-11-17 13:58:01 INFO: Creating the ssh directory.


In [14]:
nodes

Nodes([Node(NotAllocated),Node(NotAllocated)], SlurmAllocation(job_id=14224146))

In [15]:
nodes.wait()
nodes

2018-11-17 13:58:10 INFO: Still pending or configuring...


Nodes([Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00),Node(p0213:34531, 2018-11-17 13:18:08.313370+00:00)], SlurmAllocation(job_id=14224146))

## Run commands

In [16]:
nodes[0].run('whoami')

'plggarstka'

In [17]:
nodes[0].run('hostname')

'p0207'

In [18]:
nodes[1].run('squeue')

'JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)\n          14224146 plgrid-te     wrap plggarst  R       0:12      2 p[0207,0213]'

In [19]:
nodes[1].run('hostname')

'p0213'

## Examine node resources

In [20]:
nodes[0].resources.memory_total

GiB(10.0)

In [21]:
nodes[0].resources.memory_usage

GiB(0.022373199462890625)

In [22]:
nodes[0].resources.cpu_cores

2

In [23]:
nodes[0].resources.cpu_usage

0.0

## Tunnel

In [24]:
tunnel = nodes[0].tunnel(here=9000, there=10000)

In [25]:
tunnel

MultiHopTunnel(9000:10000)

In [26]:
tunnel.close()

## Deploy notebook

You need to provide Bash commands that will expose the Python distribution you want to work with on the cluster. It should have `idact` installed. Installing `idact` with pip will also install the required `jupyter` package.

In [27]:
cluster.config.setup_actions.jupyter = ['module load plgrid/tools/python-intel/3.6.2']
save_environment()

To run Jupyter Notebook on the cluster:

In [28]:
nb = nodes[0].deploy_notebook()
nb

JupyterDeployment(8080 -> Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00)

In [29]:
nodes[0].resources.memory_usage

GiB(0.08155441284179688)

In [30]:
nb.local_port

8080

To open the deployed notebook server in a new tab:

In [31]:
nb.open_in_browser()

In [32]:
nodes[0].resources.memory_usage

GiB(0.0815582275390625)

### Push and pull notebook

You can access the deployed notebook from multiple places by first pushing it:

In [33]:
cluster.push_deployment(nb)

2018-11-17 13:58:37 INFO: Pushing deployment: JupyterDeployment(8080 -> Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00)


And then pulling:

In [34]:
deployments = cluster.pull_deployments()
deployments.jupyter_deployments

2018-11-17 13:58:45 INFO: Pulling deployments.
2018-11-17 13:58:48 INFO: Creating the ssh directory.
2018-11-17 13:58:54 INFO: Desired local tunnel port 8080 is taken. Binding to random port instead.
2018-11-17 13:58:56 INFO: Pulled Jupyter deployment: JupyterDeployment(53532 -> Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00)


[JupyterDeployment(53532 -> Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00)]
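
The log above shows the desired local port 8080 being taken and a random port being bound instead. A minimal sketch of that kind of fallback, using only the standard library, could look like the following. The function name `bind_local_port` is hypothetical and this is not idact's implementation, only an illustration of the idea.

```python
import socket


def bind_local_port(desired_port, host="127.0.0.1"):
    """Return a listening socket on desired_port, or on a random free port if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, desired_port))
    except OSError:
        # The desired port is unavailable: start over with a fresh socket
        # and let the operating system pick any free port (port 0).
        sock.close()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind((host, 0))
    sock.listen(1)
    return sock
```

The port that was actually bound can be read back with `sock.getsockname()[1]`, which is analogous to checking `nb_2.local_port` after pulling.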

In [35]:
nb_2 = deployments.jupyter_deployments[0]
nb_2

JupyterDeployment(53532 -> Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00)

In [36]:
nb_2.open_in_browser()

In [37]:
nb_2.cancel()
nb.cancel_local()

2018-11-17 13:59:18 INFO: Cancelling Jupyter deployment.


You can find more information on pushing and pulling deployments in the next sections.

## idact-notebook app

You can allocate nodes and deploy a notebook automatically using the following command:
```
idact-notebook
```
or:
```
python -m idact.notebook
```
Help message:

In [38]:
help_message = subprocess.getoutput(
    "cd .. && {python} -m idact.notebook --help".format(
        python=sys.executable))
print(help_message)

Usage: notebook.py [OPTIONS] CLUSTER_NAME

  A console script that executes a Jupyter Notebook instance on an allocated
  cluster node, and makes it accessible in the local browser.

  CLUSTER_NAME argument is the cluster name to execute the notebook on. It
  must already be present in the config file.

Options:
  -e, --environment TEXT  Environment path. Default: ~/.idact.conf or the
                          value of IDACT_CONFIG_PATH.
  --save-defaults         Save allocation parameters as defaults for next
                          time.
  --reset-defaults        Reset unspecified allocation parameters to defaults.
  --nodes INTEGER         Cluster node count. [Allocation parameter]. Jupyter
                          notebook will be deployed on the first node.
                          Default: 1.
  --cores INTEGER         CPU core count per node. [Allocation parameter].
                          Default: 1
  --memory-per-node TEXT  Memory per node. [Allocation parameter]. Default

For example, to deploy a notebook on a cluster with the same parameters as above, you could call:

```
python -m idact.notebook pro --save-defaults --environment notebooks/.idact-env --nodes 2 --cores 2 --memory-per-node 10GiB --walltime 0:20:00 --native-arg --partition plgrid-testing --native-arg --account intdata
```

The `--save-defaults` flag is optional. It saves the allocation parameters, so next time the following command will have the same effect:
```
python -m idact.notebook pro --environment notebooks/.idact-env
```
The `--environment` argument is optional if you use the default environment location.
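
If you launch the app from a script, it may be convenient to build the argument list programmatically. The sketch below only assembles the command line from the flags shown in the example above. The helper name `notebook_command` and its parameters are hypothetical, not part of idact.

```python
import shlex
import sys


def notebook_command(cluster_name, environment=None, nodes=None, cores=None,
                     memory_per_node=None, walltime=None, native_args=None,
                     save_defaults=False, python=sys.executable):
    """Build the argument list for `python -m idact.notebook`."""
    argv = [python, "-m", "idact.notebook", cluster_name]
    if save_defaults:
        argv.append("--save-defaults")
    options = (("--environment", environment),
               ("--nodes", nodes),
               ("--cores", cores),
               ("--memory-per-node", memory_per_node),
               ("--walltime", walltime))
    for flag, value in options:
        if value is not None:  # Unspecified options are left to the defaults.
            argv += [flag, str(value)]
    for name, value in (native_args or {}).items():
        argv += ["--native-arg", name, str(value)]
    return argv


argv = notebook_command("pro",
                        environment="notebooks/.idact-env",
                        nodes=2, cores=2,
                        memory_per_node="10GiB",
                        walltime="0:20:00",
                        native_args={"--partition": "plgrid-testing",
                                     "--account": "intdata"},
                        save_defaults=True,
                        python="python")
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` without going through a shell.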

The allocation and notebook the application deploys can be pulled from the cluster.

## Deploy Dask

You need to provide a list of Bash commands that will expose a Python distribution you want to deploy Dask with. It will likely be the same distribution as for the Jupyter notebook above.

Deploying Dask requires `dask`, `distributed`, and `bokeh` on the cluster. If you install `idact` with pip, they will be installed automatically.

In [39]:
cluster.config.setup_actions.dask = ['module load plgrid/tools/python-intel/3.6.2']
cluster.config.scratch = '$SCRATCH'
save_environment()

In [40]:
dd = deploy_dask(nodes)
dd

2018-11-17 14:00:08 INFO: Deploying Dask on 2 nodes.
2018-11-17 14:00:08 INFO: Connecting to p0207:53690 (1/2).
2018-11-17 14:00:08 INFO: Connecting to p0213:34531 (2/2).
2018-11-17 14:00:09 INFO: Deploying scheduler on the first node: p0207.
2018-11-17 14:00:21 INFO: Checking scheduler connectivity from p0207 (1/2).
2018-11-17 14:00:21 INFO: Checking scheduler connectivity from p0213 (2/2).
2018-11-17 14:00:21 INFO: Deploying workers.
2018-11-17 14:00:21 INFO: Deploying worker 1/2.
2018-11-17 14:00:38 INFO: Deploying worker 2/2.
2018-11-17 14:00:48 INFO: Validating worker 1/2.
2018-11-17 14:00:48 INFO: Validating worker 2/2.


DaskDeployment(scheduler=tcp://localhost:50426/tcp://172.20.64.207:50426, workers=2)

In [41]:
nodes[0].resources.memory_usage

GiB(0.2924842834472656)

Get Dask client:

In [42]:
client = dd.get_client()
client

Client   Scheduler: tcp://localhost:50426  Dashboard: http://localhost:56524/status
Cluster  Workers: 2  Cores: 4  Memory: 21.47 GB
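
The reported 21.47 GB is consistent with the two nodes allocated with 10 GiB each: the difference is only binary (GiB) versus decimal (GB) units. A short worked conversion in plain Python:

```python
GIB = 2 ** 30  # bytes in one gibibyte
GB = 10 ** 9   # bytes in one gigabyte


def gib_to_gb(gib):
    """Convert an amount in GiB to GB."""
    return gib * GIB / GB


total_gb = gib_to_gb(2 * 10)  # two nodes, 10 GiB each
print(round(total_gb, 2))  # 21.47
```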


In [43]:
nodes[0].resources.cpu_usage

8.0

Computation will work only if Python and library versions match:

In [44]:
#x = client.submit(lambda value: value + 1, 10)
#x.result() == 11
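
To compare versions, it can help to list what is installed locally and check it against the distribution loaded on the cluster. The following is a sketch using only the standard library. The helper name `local_versions` is hypothetical, and the package list is taken from the requirements mentioned above.

```python
import sys
from importlib import metadata


def local_versions(packages=("dask", "distributed", "bokeh")):
    """Return the local Python version and the versions of the given packages."""
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = None  # Not installed locally.
    return versions


print(local_versions())
```

Running the same function on a node, for example from the deployed notebook, gives the values to compare against.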

Diagnostics servers are tunnelled:

In [45]:
dd.diagnostics.addresses

['http://localhost:56524/status',
 'http://localhost:59343/main',
 'http://localhost:58262/main']

To open diagnostics servers in new tabs:

In [46]:
dd.diagnostics.open_all()

### Push and pull Dask deployment

You can synchronize Dask deployments with the cluster, the same way as Jupyter deployments:

In [47]:
cluster.push_deployment(dd)

2018-11-17 14:02:18 INFO: Pushing deployment: DaskDeployment(scheduler=tcp://localhost:50426/tcp://172.20.64.207:50426, workers=2)


In [48]:
deployments = cluster.pull_deployments()
deployments.dask_deployments

2018-11-17 14:02:33 INFO: Pulling deployments.
2018-11-17 14:02:38 INFO: Creating the ssh directory.
2018-11-17 14:02:58 INFO: Desired local tunnel port 50426 is taken. Binding to random port instead.
2018-11-17 14:03:00 INFO: Desired local tunnel port 56524 is taken. Binding to random port instead.
2018-11-17 14:03:01 INFO: Desired local tunnel port 59343 is taken. Binding to random port instead.
2018-11-17 14:03:03 INFO: Desired local tunnel port 58262 is taken. Binding to random port instead.
2018-11-17 14:03:05 INFO: Pulled Jupyter deployment: JupyterDeployment(8080 -> Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00)
2018-11-17 14:03:05 INFO: Pulled Dask deployment: DaskDeployment(scheduler=tcp://localhost:53693/tcp://172.20.64.207:50426, workers=2)
2018-11-17 14:03:11 INFO: Retried and failed: config.retries[Retry.VALIDATE_HTTP_TUNNEL].{count=3, seconds_between=2}
2018-11-17 14:03:11 INFO: Discarding a Jupyter deployment, because it is no longer functional: JupyterDeployment(80

[DaskDeployment(scheduler=tcp://localhost:53693/tcp://172.20.64.207:50426, workers=2)]

In [49]:
dd_2 = deployments.dask_deployments[-1]
dd_2

DaskDeployment(scheduler=tcp://localhost:53693/tcp://172.20.64.207:50426, workers=2)

In [50]:
client_2 = dd_2.get_client()
client_2

Client   Scheduler: tcp://localhost:53693  Dashboard: http://localhost:56524/status
Cluster  Workers: 2  Cores: 4  Memory: 21.47 GB


In [51]:
dd_2.diagnostics.addresses

['http://localhost:53699/status',
 'http://localhost:53705/main',
 'http://localhost:53711/main']

In [52]:
dd_2.diagnostics.open_all()

### Cancel Dask deployments

Each client should be closed:

In [53]:
client.close()
client_2.close()

`cancel` cancels the deployment on the cluster, while `cancel_local` only closes the local tunnels.

In [54]:
dd.cancel()
dd_2.cancel_local()

2018-11-17 14:04:01 INFO: Cancelling worker deployment on p0213.
2018-11-17 14:04:07 INFO: Cancelling worker deployment on p0207.
2018-11-17 14:04:15 INFO: Cancelling scheduler deployment on p0207.


## Push and pull nodes

To access the allocated nodes from the cluster, you need to push their deployment first, same as the notebook and Dask deployments:

In [55]:
cluster.push_deployment(nodes)

2018-11-17 14:04:47 INFO: Pushing deployment: Nodes([Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00),Node(p0213:34531, 2018-11-17 13:18:08.313370+00:00)], SlurmAllocation(job_id=14224146))


Then, you would pull the deployment on the cluster:

In [56]:
deployments = cluster.pull_deployments()
deployments

2018-11-17 14:05:44 INFO: Pulling deployments.
2018-11-17 14:05:47 INFO: Creating the ssh directory.
2018-11-17 14:06:06 INFO: Pulled allocation deployment: Nodes([Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00),Node(p0213:34531, 2018-11-17 13:18:08.313370+00:00)], SlurmAllocation(job_id=14224146))
2018-11-17 14:06:06 INFO: Pulled Jupyter deployment: JupyterDeployment(8080 -> Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00)
2018-11-17 14:06:06 INFO: Pulled Dask deployment: DaskDeployment(scheduler=tcp://localhost:50426/tcp://172.20.64.207:50426, workers=2)
2018-11-17 14:06:16 INFO: Retried and failed: config.retries[Retry.VALIDATE_HTTP_TUNNEL].{count=3, seconds_between=2}
2018-11-17 14:06:16 INFO: Discarding a Jupyter deployment, because it is no longer functional: JupyterDeployment(8080 -> Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00).
2018-11-17 14:06:22 INFO: Retried and failed: config.retries[Retry.VALIDATE_HTTP_TUNNEL].{count=3, seconds_between=2}
2018-11-17 14:06:22 

SynchronizedDeployments(nodes=1, jupyter_deployments=0, dask_deployments=0)

In [57]:
nodes = deployments.nodes[0]
nodes

Nodes([Node(p0207:53690, 2018-11-17 13:18:08.313370+00:00),Node(p0213:34531, 2018-11-17 13:18:08.313370+00:00)], SlurmAllocation(job_id=14224146))

Essentially, this feature is intended for using an allocation in multiple notebooks at once.

Deployments are cleared automatically if they are expired or cancelled. They can also be cleared manually by running:

In [58]:
cluster.clear_pushed_deployments()

2018-11-17 14:07:50 INFO: Clearing deployments.


## Adjust timeouts

Sometimes a timeout occurs during a deployment and may even cause it to fail.

If this happens too often, you may need to adjust the timeouts for your cluster.

To do that, copy the retry name from the info message preceding the failure, which looks similar to this:

```
2018-11-12 22:14:00 INFO: Retried and failed: config.retries[Retry.PORT_INFO].{count=5, seconds_between=5}
```
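
If you collect such messages from `idact.log`, the retry name and its current settings can also be extracted with a regular expression. This is a sketch that assumes the message format shown above. The helper name `parse_failed_retry` is hypothetical.

```python
import re

# Matches e.g. "config.retries[Retry.PORT_INFO].{count=5, seconds_between=5}".
RETRY_PATTERN = re.compile(
    r"config\.retries\[Retry\.(?P<name>\w+)\]"
    r"\.\{count=(?P<count>\d+), seconds_between=(?P<seconds>\d+)\}")


def parse_failed_retry(line):
    """Return (retry name, count, seconds_between), or None if the line does not match."""
    match = RETRY_PATTERN.search(line)
    if match is None:
        return None
    return match["name"], int(match["count"]), int(match["seconds"])


line = ("2018-11-12 22:14:00 INFO: Retried and failed: "
        "config.retries[Retry.PORT_INFO].{count=5, seconds_between=5}")
print(parse_failed_retry(line))  # ('PORT_INFO', 5, 5)
```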

First, you can look up what the current retry config is:

In [59]:
cluster.config.retries[Retry.PORT_INFO]

RetryConfig(count=5, seconds_between=5)

And adjust the retry count and/or seconds between retries:

In [60]:
cluster.config.retries[Retry.PORT_INFO] = set_retry(count=6,
                                                    seconds_between=10)
cluster.config.retries[Retry.PORT_INFO]

RetryConfig(count=6, seconds_between=10)

Alternatively:

In [61]:
cluster.config.retries[Retry.PORT_INFO].count = 6
cluster.config.retries[Retry.PORT_INFO].seconds_between = 10
cluster.config.retries[Retry.PORT_INFO]

RetryConfig(count=6, seconds_between=10)
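
To make the two parameters concrete, here is a generic retry loop in plain Python. It is not idact's implementation. In particular, it assumes that `count` is the number of retries after the first attempt, which may differ from how idact counts them.

```python
import time


def retry(fun, count, seconds_between, sleep=time.sleep):
    """Call fun, retrying up to `count` times and waiting between attempts."""
    for attempt in range(count + 1):
        try:
            return fun()
        except Exception:
            if attempt == count:
                raise  # Out of retries: let the last error propagate.
            sleep(seconds_between)
```

With `count=6` and `seconds_between=10`, an operation that keeps failing would be given up on after roughly a minute of waiting under this assumption, in addition to the time each attempt takes.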

### Defaults

In [62]:
get_default_retries()

{<Retry.PORT_INFO: 0>: RetryConfig(count=5, seconds_between=5),
 <Retry.JUPYTER_JSON: 1>: RetryConfig(count=5, seconds_between=3),
 <Retry.SCHEDULER_CONNECT: 2>: RetryConfig(count=5, seconds_between=2),
 <Retry.DASK_NODE_CONNECT: 3>: RetryConfig(count=3, seconds_between=5),
 <Retry.DEPLOY_DASK_SCHEDULER: 4>: RetryConfig(count=3, seconds_between=5),
 <Retry.DEPLOY_DASK_WORKER: 5>: RetryConfig(count=3, seconds_between=5),
 <Retry.GET_SCHEDULER_ADDRESS: 6>: RetryConfig(count=5, seconds_between=5),
 <Retry.CHECK_WORKER_STARTED: 7>: RetryConfig(count=5, seconds_between=5),
 <Retry.CANCEL_DEPLOYMENT: 8>: RetryConfig(count=5, seconds_between=1),
 <Retry.SQUEUE_AFTER_SBATCH: 9>: RetryConfig(count=3, seconds_between=3),
 <Retry.OPEN_TUNNEL: 10>: RetryConfig(count=3, seconds_between=5),
 <Retry.VALIDATE_HTTP_TUNNEL: 11>: RetryConfig(count=3, seconds_between=2),
 <Retry.TUNNEL_TRY_AGAIN_WITH_ANY_PORT: 12>: RetryConfig(count=1, seconds_between=0)}

## Close

In [63]:
nodes.running()

True

In [64]:
nodes.cancel()

2018-11-17 14:08:20 INFO: Cancelling job 14224146.


In [65]:
nodes.running()

False

In [66]:
node.run('squeue')

'JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)\n          14224146 plgrid-te     wrap plggarst CG      10:11      2 p[0207,0213]'
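
The job state in the output above is `CG` (completing), which matches `nodes.running()` returning `False`. If you need the fields programmatically, the string returned by `node.run('squeue')` can be split into columns. This is a sketch that assumes the default whitespace-separated layout shown above. The helper name `parse_squeue` is hypothetical.

```python
def parse_squeue(output):
    """Parse whitespace-separated squeue output into a list of dicts keyed by column name."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Split each row into at most len(header) fields, so the last column keeps any spaces.
    return [dict(zip(header, line.split(None, len(header) - 1)))
            for line in lines[1:]]


sample = ('JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)\n'
          '          14224146 plgrid-te     wrap plggarst CG      10:11      2 p[0207,0213]')
jobs = parse_squeue(sample)
print(jobs[0]['JOBID'], jobs[0]['ST'])  # 14224146 CG
```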

## Push and pull the environment

When working on a cluster, it may be useful to synchronize the idact config with the local machine. Pushing the environment will merge the local environment into the remote environment.

In [67]:
push_environment(cluster)

2018-11-17 14:08:32 INFO: Pushing the environment to cluster.


You will be able to use that environment when working on the remote notebook.

In [68]:
print(node.run('cat ~/.idact.conf'))

{
    "clusters": {
        "pro": {
            "auth": "PUBLIC_KEY",
            "disableSshd": false,
            "host": "pro.cyfronet.pl",
            "installKey": false,
            "key": "/net/people/plggarstka/.ssh/id_rsa_ip",
            "notebookDefaults": {},
            "port": 22,
            "retries": {
                "CANCEL_DEPLOYMENT": {
                    "count": 5,
                    "secondsBetween": 1
                },
                "CHECK_WORKER_STARTED": {
                    "count": 5,
                    "secondsBetween": 5
                },
                "DASK_NODE_CONNECT": {
                    "count": 3,
                    "secondsBetween": 5
                },
                "DEPLOY_DASK_SCHEDULER": {
                    "count": 3,
                    "secondsBetween": 5
                },
                "DEPLOY_DASK_WORKER": {
                    "count": 3,
                    "secondsBetween": 5
                },
                "GET

The reverse operation is pulling the environment, which merges the remote environment into the local environment. Machine-specific information like the private key path is skipped when pushing or pulling.

In [69]:
pull_environment(cluster)

2018-11-17 14:08:46 INFO: Pulling the environment from cluster.
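
The merge described above can be pictured with plain dictionaries. The sketch below is an illustration only, under the assumption that a merge copies cluster fields from the source into the target and leaves machine-specific fields untouched. The helper name `merge_clusters` and the choice of `key` as the only skipped field are hypothetical, not idact's actual logic.

```python
import copy

# Assumption: 'key' stands in for whatever fields idact treats as machine-specific.
MACHINE_SPECIFIC_FIELDS = ("key",)


def merge_clusters(source, target, skip=MACHINE_SPECIFIC_FIELDS):
    """Return a copy of target updated with source, leaving skipped fields untouched."""
    merged = copy.deepcopy(target)
    for name, config in source.items():
        cluster = merged.setdefault(name, {})
        for field, value in config.items():
            if field not in skip:
                cluster[field] = copy.deepcopy(value)
    return merged


remote = {"pro": {"host": "pro.cyfronet.pl", "port": 22,
                  "key": "/net/people/plggarstka/.ssh/id_rsa_ip"}}
local = {"pro": {"host": "pro.cyfronet.pl",
                 "key": "C:\\Users\\Maciej/.ssh\\id_rsa_vp"}}

# Pulling: the remote environment is merged into the local one.
pulled = merge_clusters(source=remote, target=local)
print(pulled["pro"]["key"])  # The local key path is kept.
```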


You can remove the remote config file if you don't need it for now:

In [70]:
# node.run('rm -v ~/.idact.conf')

## Remove cluster

A cluster can be removed from the environment.

In [71]:
add_cluster(name='fake',
            user='fakeuser',
            host='fakehost',
            port=2222)

2018-11-17 14:09:09 INFO: No auth method specified, defaulting to password-based.


Cluster(fakehost, 2222, fakeuser, auth=AuthMethod.ASK, key=None, install_key=True, disable_sshd=False)

In [72]:
show_clusters()

{'pro': Cluster(pro.cyfronet.pl, 22, plggarstka, auth=AuthMethod.PUBLIC_KEY, key='C:\\Users\\Maciej/.ssh\\id_rsa_vp', install_key=False, disable_sshd=False),
 'fake': Cluster(fakehost, 2222, fakeuser, auth=AuthMethod.ASK, key=None, install_key=True, disable_sshd=False)}

In [73]:
remove_cluster('fake')

In [74]:
show_clusters()

{'pro': Cluster(pro.cyfronet.pl, 22, plggarstka, auth=AuthMethod.PUBLIC_KEY, key='C:\\Users\\Maciej/.ssh\\id_rsa_vp', install_key=False, disable_sshd=False)}