# Host Management

In [1]:
# Following code is needed to preconfigure this notebook
import datetime
import sys
import os
sys.path.insert(0, os.path.abspath('../../..'))

import pyflow as pf

scratchdir = os.path.join('/', 'path', 'to', 'scratch')
filesdir = os.path.join(scratchdir, 'files')
outdir = os.path.join(scratchdir, 'out')


class CourseSuite(pf.Suite):
    """
    This CourseSuite object will be used throughout the course to provide sensible
    defaults without verbosity
    """
    def __init__(self, name, **kwargs):
        
        config = {
            'host': pf.LocalHost(),
            'files': os.path.join(filesdir, name),
            'home': outdir,
            'defstatus': pf.state.suspended
        }
        config.update(kwargs)
        
        super().__init__(name, **config)

**ecFlow** is ultimately a framework for executing tasks, but task execution requires a context. **pyflow** makes use of a `Host` object to supply the context for this execution. As such **pyflow** _requires_ a host object to be defined before it will generate any executable nodes in the tree. The `host` can be set at any level (`Suite`, `Family` or `Task`) and is inherited unless overridden.
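
The inheritance rule can be pictured with a small stand-in model. This is only a sketch: `ToyNode` is a hypothetical class written for this illustration, not part of **pyflow**, and the strings stand in for real host objects.

```python
class ToyNode:
    """Stand-in for a Suite, Family or Task; not a pyflow class."""

    def __init__(self, name, parent=None, host=None):
        self.name = name
        self.parent = parent
        self._host = host  # None means "inherit from the parent"

    @property
    def host(self):
        # Walk up the tree until a node with an explicit host is found.
        node = self
        while node is not None:
            if node._host is not None:
                return node._host
            node = node.parent
        raise LookupError(f"no host defined for {self.name!r} or its ancestors")


suite = ToyNode("suite", host="local")
family = ToyNode("family", parent=suite)                     # inherits "local"
remote = ToyNode("remote_task", parent=family, host="hpc")   # overrides
plain = ToyNode("plain_task", parent=family)                 # inherits "local"

print(plain.host, remote.host)
```

A node with no host anywhere above it raises an error in this model, which mirrors the requirement that a host be defined before executable nodes are generated.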

If the default behaviour of **ecFlow** is required, and task execution is being managed explicitly, the host may be set to `NullHost()` at the `Suite` level. This will suppress all host-related behaviour inside **pyflow**.

For task handling, it is important that the `ecflow_client` is configured (via appropriate environment variables) and that it is correctly called to trigger changes of state in the server. Further, any and all errors that may occur in a script must be correctly caught and reported to the **ecFlow** server.
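
As a rough illustration of what explicit task handling involves, the sketch below wraps a piece of work with the `--init`, `--complete` and `--abort` child commands of `ecflow_client`. The helper name `run_reported` and the injectable `client` argument are assumptions made for this example; they are not part of **pyflow** or **ecFlow**.

```python
import os
import subprocess


def ecflow_client(*args):
    """Call the real ecflow_client, which reads its connection details
    (ECF_HOST, ECF_PORT, ECF_NAME, ECF_PASS, ...) from the environment."""
    subprocess.run(["ecflow_client", *args], check=True)


def run_reported(body, client=ecflow_client):
    """Hypothetical helper: run body() and report its state via `client`."""
    client(f"--init={os.getpid()}")   # job has started
    try:
        body()
    except Exception as exc:
        client(f"--abort={exc}")      # report the failure, then re-raise
        raise
    client("--complete")              # reached only if body() succeeded


# Demonstration with a recording stand-in, so no server is needed.
calls = []
run_reported(lambda: print("working"), client=calls.append)
print(calls)
```

Real job scripts usually do this in shell and also trap signals, which this sketch does not attempt.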

## Host Arguments

Host classes have many configurable options. The options below are available for all host classes, because they configure the base `Host` class. Other than `name`, all of these are optional keyword arguments with plausible defaults.

* `name` - The name used for the host. Required (positional argument).
* `hostname` - The hostname to run the task on. Defaults to `name` if not supplied.
* `scratch_directory` - The path in which tasks will be run, unless otherwise specified. Also to be used within suites when a scratch location is needed.
* `log_directory` - The directory to use for script output. Defaults to `ECF_HOME`, but may need to be changed on systems with batch schedulers to make the output visible to the **ecFlow** server.
* `resources_directory` - The directory to use for suite resources. By default, `scratch_directory` is used.
* `limit` - How many tasks can run on the node simultaneously.
* `extra_paths` - Paths that are to be added to `PATH` on the host.
* `extra_variables` - A dictionary of additional `ECFLOW` variables that should be set to configure the host (e.g. `{'SCHOST': 'hpc'}`).
* `environment_variables` - Additional environment variables to export into all scripts.
* `module_source` - The shell script to source to initialise the module system. Default `None`.
* `modules` - Modules to `module load`.
* `purge_modules` - Should a `module purge` command be run (before loading any modules). Default `False`.
* `label_host` - Whether to create an `exec_host` label on nodes where this host is freshly set. Default `True`.
* `user` - The user running the script. May be used to determine paths, or for login details. Defaults to current user.
* `ecflow_path` - The directory containing the `ecflow_client` executable.
* `server_ecfvars` - If true, don't define `ECF_JOB_CMD`, `ECF_KILL_CMD`, `ECF_STATUS_CMD` and `ECF_OUT` variables and use defaults from server.
* `submit_arguments` - A dictionary of arguments to pass to the scheduler when submitting jobs, in which each key is a label that can be referenced when creating tasks with the `Host` instance.
* `workdir` - Work directory for every task executed within the `Host` instance, if not overridden for a Node.
* `trap_signals` - The list of signals to trap. A default list is used if not set.

## Existing Host Classes

A number of existing host classes have been defined. These can be extended, and alternatives provided.

### `LocalHost`

This is essentially a trivial host. It runs tasks as background processes on the current node, i.e. on the **ecFlow** server, running as the same user as the server. Beyond its use in examples, it is extremely useful for running tasks that update labels, meters, events and variables, because the node is certain to have a working `ecflow_client` and there is no job queuing delay.

In [2]:
host = pf.LocalHost()

### `SSHHost`

Runs a script on a remote host that is accessed over SSH. The `name` argument is treated as the target hostname unless the `hostname` keyword argument is explicitly supplied. By default the user that generated the **pyflow** suite is used, unless the `user` argument is supplied.

The `SSHHost` is special in that it does not require the `ecflow_client` to be installed on the remote host and does not require the presence of any shared filesystems or log servers to make output logs visible to the user. All of the `ecflow_client` commands required are executed on the _server side_, and the script output is piped back through the SSH command.

For these connections to be established, it is necessary that the ecflow server is configured to have SSH access to the target systems using SSH keys. Further, as this requires an SSH connection to be maintained for each of the running commands, it imposes a practical limit on the number of commands that can be run simultaneously on any remote host. There may be value in setting up SSH connections that persist across multiple commands, by making use of the `ControlMaster`, `ControlPath` and `ControlPersist` options in the ssh config file.
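
To make the connection-sharing options concrete, here is a minimal sketch that builds an `ssh` command line with them passed as `-o` options. The function name, the `ControlPath` pattern and the ten-minute persistence are illustrative assumptions, and this is not how **pyflow** itself invokes SSH. The same three options can equally be placed in a `Host` block of the ssh config file.

```python
import shlex


def ssh_command(host, remote_command, user=None, persist="10m"):
    """Build an ssh argv that reuses one master connection per target."""
    target = f"{user}@{host}" if user else host
    return [
        "ssh",
        "-o", "ControlMaster=auto",               # create or reuse a master
        "-o", "ControlPath=~/.ssh/cm-%r@%h:%p",   # one socket per user/host/port
        "-o", f"ControlPersist={persist}",        # keep the master alive when idle
        target,
        remote_command,
    ]


cmd = ssh_command("dhs9999", "hostname", user="max")
print(shlex.join(cmd))  # printable form; pass `cmd` itself to subprocess.run
```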

In [3]:
host = pf.SSHHost('dhs9999', user='max', scratch_directory='/data/a_mounted_filesystem/tmp')

The `SSHHost` class can also take the optional arguments `indirect_host` and `indirect_user`. If `indirect_host` is supplied, a two-hop connection is made: a connection is made to the `indirect_host`, and then a further SSH connection is made from there to the real host. Note that this is not the same as using a `ProxyCommand` configured for a normal SSH connection - the credentials for the second hop are held on the intermediate system. `indirect_user` defaults to `user` if it is not supplied.

In [4]:
host = pf.SSHHost('cloud-mvr001',
                  user='mover-user',
                  indirect_host='cloud-gateway',
                  indirect_user='cloud-user')

### `PBSHost`

Connects to a remote host by SSH, and submits a job on the batch scheduling system. As this task will run asynchronously on a remote system this _requires_ the `ecflow_client` to be available, and if it is not at the default location this should be configured with the `ecflow_path` keyword argument.

It is anticipated that, for real use, this class will be subclassed to add and configure site-specific functionality (such as knowledge of, and handling of, queues).

It is likely that the `log_directory` will need to be modified, and the `ECF_LOGHOST` and `ECF_LOGPORT` variables are likely to be needed to operate with a log server to get output working fully.

### `SLURMHost`

This executes scripts on a remote system by connecting over SSH and submitting to the SLURM job scheduling system. It is very much analogous to the `PBSHost`.

## Limits

`Host` objects accept an argument `limit=`. This can be used to construct a limit (preferably in a sensible location within the suite). Once this has been set up then any `Task` that is created using this host object will automatically be added to the limit for the given host.

Note that this implies that the same host _object_ should be used to configure `Tasks` throughout the suite, rather than just using host objects that refer to the same host.

In [6]:
with CourseSuite('limits', host=pf.LocalHost(limit=3)) as s:
    
    with pf.Family('limits'):
        s.host.build_limits()
        
    pf.Task('t1', script='I am limited')

s
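
The point about reusing the same host _object_ can be illustrated with a toy model. `ToyHost` is a hypothetical stand-in written for this sketch and says nothing about how **pyflow** stores limits internally. It only shows that two objects naming the same machine do not share any bookkeeping.

```python
class ToyHost:
    """Stand-in for a host with a limit; not a pyflow class."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.limited_tasks = []  # tasks registered against *this* object

    def add_task(self, task_name):
        self.limited_tasks.append(task_name)


shared = ToyHost("hpc", limit=3)
for task in ("t1", "t2", "t3", "t4"):
    shared.add_task(task)  # all four compete for the same 3 slots

# A second object that merely names the same machine tracks nothing in common.
other = ToyHost("hpc", limit=3)
other.add_task("t5")

print(len(shared.limited_tasks), len(other.limited_tasks))
```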

## Job Characteristics

In **pyflow**, a task is generated as a synthesis of multiple pieces of information:

- The Task object in the suite - _when_ to run
- The Script object (script attribute on Task) - _what_ to run
- The Host object - *how* to run
 
The combination of these three components provides the information to determine _when_, _what_, and _how_ a task should be executed. The Host object is important as it provides two major components:

1. A mechanism by which a task should be executed. This reduces to the `ECF_JOB_CMD` and associated machinery.
2. Preamble and Postamble material that is used for constructing the script to execute.
 
Unfortunately, the breakdown is not nearly so clear in real life. Consider the case of one of the HPC machines. We can:

- Run a task on the head node as a simple SSHHost
- Submit a serial, fractional or parallel job
- Submit jobs using various (machine specific) resource requirements
 
This is a problem. Conceptually, properties such as the number of cores and nodes, or whether to use hyperthreading or hugepages, are properties of the Task, but they depend very strongly on the Host.

Currently all properties that determine the execution process must belong to the Host. These can be parameterised to use **ecFlow** variables that are set on `Families` or `Tasks`, but this is a bit of a hack. We would like this parameterisation to only be needed if those properties should be changeable at runtime (e.g. by the operators).

The `Host` `submit_arguments` dictionary is used to pass arguments to the scheduler when submitting jobs. Each key in this dictionary is a label that can be referenced when creating tasks with the `Host` instance. This allows for flexible job submission configurations based on the host's capabilities and requirements.

### Example

```python
from pyflow import Suite, Task, SLURMHost

# Create a suite whose tasks are submitted through SLURM
suite = Suite(
    "example_suite",
    host=SLURMHost(
        name="slurm_host",
        # Each key is a label that tasks can reference via submit_arguments
        submit_arguments={"simple_jobs": {"job_name": "%TASK%", "partition": "compute", "time": "01:00:00"}},
        workdir="$JOBSWDIR",
    ),
)
with suite:
    # Add a task to the suite
    Task("example_task", script="echo 'Hello, World!'", submit_arguments="simple_jobs")
```
The above code will generate a task that runs the command `echo 'Hello, World!'` on a SLURM-managed host. The task will be submitted with the specified job name, partition, and time limit. The generated script will look something like this:

```bash (title="example_task.ecf")
#!/bin/bash
# This file is generated by pyflow
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --job-name=%TASK%

[[ -d "$JOBSWDIR" ]] || mkdir -p "$JOBSWDIR"
cd "$JOBSWDIR"
echo "Current working directory: $(pwd)"

%nopp
echo 'Hello, World!'
(...)
```
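
The step from a labelled `submit_arguments` entry to scheduler directive lines can be sketched in plain Python. This is a hedged illustration only: `render_directives` is a hypothetical helper, and the real rendering inside **pyflow** may differ in ordering and formatting.

```python
def render_directives(arguments, prefix="#SBATCH"):
    """Turn {'job_name': 'x'} into ['#SBATCH --job-name=x']."""
    lines = []
    for key, value in arguments.items():
        option = key.replace("_", "-")  # job_name -> --job-name
        lines.append(f"{prefix} --{option}={value}")
    return lines


submit_arguments = {
    "simple_jobs": {"job_name": "%TASK%", "partition": "compute", "time": "01:00:00"},
}

# A task selects one set of arguments by its label.
for line in render_directives(submit_arguments["simple_jobs"]):
    print(line)
```

Keeping the arguments under a label lets several tasks share one submission profile while the host stays responsible for how that profile is expressed to the scheduler.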
