<div>
<center><img src="Flux-logo.svg" width="400"/></center>
</div>

# Module 2: Using Flux for traditional and hierarchical scheduling

Flux provides powerful and advanced scheduling capabilities that are important for exascale systems like El Capitan. In this module, we demonstrate:
1. Traditional batch scheduling with Flux (similar to what is provided by other schedulers like Slurm)
2. Hierarchical scheduling with Flux to achieve higher throughput (novel capability of Flux)

## Traditional batch scheduling with Flux

In traditional batch scheduling (e.g., what Slurm provides), users send requests for resources and jobs to a centralized service (i.e., the scheduler), which stores the requests in a queue and fulfills them as resources become available.

<figure>
<img src="img/single-submit.png">
<figcaption>
<i>Image created by Vanessa Sochat for Flux Framework Components documentation</i></figcaption>
</figure>

Traditional schedulers provide 3 main operations:
1. Submitting jobs
2. Running distributed applications within a job
3. Querying the status of jobs or canceling running jobs

We use Flux to perform these traditional batch scheduling operations in the order shown in this table:

<table>
    <tr>
        <th>Operation</th>
        <th>Slurm</th>
        <th>Flux</th>
    </tr>
    <tr>
        <td>Submitting jobs</td>
        <td><code>sbatch</code></td>
        <td><code>flux batch</code></td>
    </tr>
    <tr>
        <td>Submitting interactive jobs</td>
        <td><code>salloc</code></td>
        <td><code>flux alloc</code></td>
    </tr>
    <tr>
        <td>Running distributed applications with waiting for completion</td>
        <td><code>srun</code></td>
        <td><code>flux run</code></td>
    </tr>
    <tr>
        <td>Running distributed applications without waiting for completion</td>
        <td>N/A</td>
        <td><code>flux submit</code></td>
    </tr>
    <tr>
        <td>Querying the status of jobs</td>
        <td><code>squeue</code>/<code>scontrol show job <i>job_id</i></code></td>
        <td><code>flux jobs</code>/<code>flux job info <i>job_id</i></code></td>
    </tr>
    <tr>
        <td>Canceling running jobs</td>
        <td><code>scancel</code></td>
        <td><code>flux cancel</code></td>
    </tr>
</table>

For a more comprehensive cross-reference between Slurm, Flux, and other schedulers, check out LLNL's [Batch System Cross-Reference Guides](https://hpc.llnl.gov/banks-jobs/running-jobs/batch-system-cross-reference-guides).

### Submitting jobs

Similar to Slurm's `sbatch`, users submit non-interactive, batch script-based jobs using `flux batch`. To see how `flux batch` works, let's start by looking at the batch script `sleep_batch.sh`.

In [None]:
from IPython.display import Code
Code(filename='sleep_batch.sh', language='bash')

Similar to a Slurm batch script, a Flux batch script consists of two main sections:
1. A set of Flux directives defining the arguments that should be passed to `flux batch`
2. The commands defining the job

In `sleep_batch.sh`, there are 3 directives:
1. `#FLUX: --nodes=2`: tells Flux to create an allocation of 2 nodes for this job
2. `#FLUX: --nslots=2`: tells Flux to reserve 2 slots total for this job
3. `#FLUX: --cores-per-slot=1`: tells Flux to reserve 1 core per slot for this job
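
To make the directive format concrete, here is a minimal Python sketch that collects the options written after each `#FLUX:` prefix. It is only an illustration of the format described above, not how Flux itself parses scripts. The `example_script` text is a hypothetical stand-in for `sleep_batch.sh`, and the function name `collect_directive_options` is our own.

```python
import shlex

DIRECTIVE_PREFIX = "#FLUX:"


def collect_directive_options(script_text):
    """Return the options written after each `#FLUX:` prefix, in file order."""
    options = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(DIRECTIVE_PREFIX):
            # Split the remainder the way a shell would, so quoted values survive.
            options.extend(shlex.split(stripped[len(DIRECTIVE_PREFIX):]))
    return options


# Hypothetical stand-in for the top of sleep_batch.sh (not the real file).
example_script = """\
#!/bin/bash
#FLUX: --nodes=2
#FLUX: --nslots=2
#FLUX: --cores-per-slot=1

echo "starting batch job"
"""

directive_options = collect_directive_options(example_script)
print(directive_options)
```

With these directives, the job asks for 2 slots of 1 core each, so 2 cores in total spread over the 2 requested nodes.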

The rest of this batch script contains several `echo` commands followed by 2 `flux run` commands that will sleep for 30 seconds each.

Let's try to run our batch job with `flux batch`. Note that we provide two extra flags to `flux batch`. Similar to Slurm, flags passed on the command line are added to the set of flags specified in the Flux directives. In this case, the `--output=kvs` and `--error=kvs` flags redirect `stdout` and `stderr` to the Flux key-value store (covered in [Module 3](./03_flux_framework.ipynb)), which allows the job's output to be followed with the `flux watch` command.

In [None]:
!flux batch --output=kvs --error=kvs ./sleep_batch.sh
!flux watch $(flux job last)

### Submitting interactive jobs

Similar to Slurm's `salloc`, users can submit interactive jobs using `flux alloc`. When launching an interactive job, you can request resources using the same flags that you would pass to `flux batch` (e.g., `-N` for requesting a number of nodes).

Due to Jupyter's lack of a pseudo-terminal, we cannot show `flux alloc` in this notebook. So, we will open a terminal in Jupyter. To do so, click on `FILE -> NEW -> TERMINAL`. Then, copy and paste the following commands into the terminal:

```bash
$ flux alloc --nodes=2 --nslots=2 --cores-per-slot=1
$ ./hello-batch.sh
$ cat /tmp/hello-batch-1.out
$ cat /tmp/hello-batch-2.out
$ cat /tmp/hello-batch-3.out
$ cat /tmp/hello-batch-4.out
```

The `hello-batch.sh` script (shown below) runs 4 `flux submit` commands that print output to the 4 files that we run `cat` on. It then runs `flux job wait --all`, which waits for all 4 `flux submit` commands to finish.

In [None]:
from IPython.display import Code
Code(filename='hello-batch.sh', language='bash')

#### Optional: connecting to an existing Flux instance using flux proxy

One cool feature that Flux provides is the ability to connect to an existing Flux instance/allocation from any other node of the system using `flux proxy`. To use this command, we first need to get the ID of the Flux instance we want to connect to. Assuming the interactive job we just launched is still running, we can get the job ID (which is the same as the instance ID) using `flux jobs`. Once we have that ID, we can run `flux proxy <ID>` to connect to that Flux instance.

Once we're connected to the interactive allocation, we can run the following job in one terminal:
```bash
$ flux run --nodes=1 sleep inf
```

Then, in the other terminal, we should be able to see that `flux run` by running `flux jobs`.

### Running distributed applications with waiting for completion

Similar to Slurm's `srun`, users can run distributed (e.g., MPI) applications and wait for completion using `flux run`. To see how `flux run` works, let's run the following command.

In [None]:
!flux run -n 4 --label-io --time-limit=5s --env-remove=LD_LIBRARY_PATH hostname

This command does the following:
1. Remove `LD_LIBRARY_PATH` from the environment of each `hostname` program (specified by `--env-remove=LD_LIBRARY_PATH`)
2. Launch 4 copies of the `hostname` program and wait for all of them to complete before finishing (specified by `-n 4`)
3. Prepend the task rank to each line of `stdout` and `stderr` (specified by `--label-io`)
4. Kill the job automatically after 5 seconds (specified by `--time-limit=5s`)

### Running distributed applications without waiting for completion

Unlike Slurm, Flux provides the `flux submit` command to run distributed (e.g., MPI) applications **without** waiting for the application to complete. This allows users to easily run multiple distributed applications in parallel *under the same job*, which is important for many modern HPC applications such as workflows.

To see how `flux submit` works, let's look at `hello-batch.sh` again:

In [None]:
from IPython.display import Code
Code(filename='hello-batch.sh', language='bash')

As you can see, this script runs 4 different `flux submit` commands, each of which prints a message to a different file. If this script used `flux run`, these commands would run one after the other. By using `flux submit` instead, Flux can run all of these `echo` programs in parallel (assuming there are enough resources to do so). This means the job that runs this script can (theoretically) complete up to **4 times faster** than it could using `flux run`.

Because `flux submit` does not wait for jobs, batch scripts that use this command must use another approach for waiting on job completion. To help with this scenario, Flux provides the `flux job wait` command, which waits for the specified job/program (or all of them if the `--all` flag is provided) to complete. *Note that, to use `flux job wait`, you must pass the `--flags=waitable` flag to your Flux command.*
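
The difference between blocking on every launch and submitting everything before waiting once can be sketched in plain Python. This is only an analogy built on `concurrent.futures` threads, not Flux: `task`, `run_blocking`, `submit_then_wait`, and the 0.2 second duration are all hypothetical names and values chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

TASK_SECONDS = 0.2  # hypothetical duration of each stand-in task


def task(index):
    """Stand-in for one program launched inside the job."""
    time.sleep(TASK_SECONDS)
    return f"hello from task {index}"


def run_blocking(executor, count):
    """Analogous to repeated `flux run`: wait for each task before starting the next."""
    start = time.perf_counter()
    results = [executor.submit(task, i).result() for i in range(count)]
    return results, time.perf_counter() - start


def submit_then_wait(executor, count):
    """Analogous to `flux submit` plus one wait: start everything, then block once."""
    start = time.perf_counter()
    futures = [executor.submit(task, i) for i in range(count)]  # returns immediately
    wait(futures)  # a single wait at the end, in the spirit of `flux job wait --all`
    results = [future.result() for future in futures]
    return results, time.perf_counter() - start


with ThreadPoolExecutor(max_workers=4) as executor:
    blocking_results, blocking_elapsed = run_blocking(executor, 4)
    overlapped_results, overlapped_elapsed = submit_then_wait(executor, 4)

print(f"blocking:   {blocking_elapsed:.2f}s")
print(f"overlapped: {overlapped_elapsed:.2f}s")
```

Both variants produce the same results. Only the elapsed time differs, and only as long as there are enough workers (or, in Flux, enough resources) to run the tasks side by side.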

To see `flux submit` in action, let's run `hello-batch.sh` through `flux batch`.

In [None]:
!flux batch --flags=waitable --out /tmp/flux-batch.out -N2 ./hello-batch.sh
!flux job wait $(flux job last)
!cat /tmp/hello-batch-1.out
!cat /tmp/hello-batch-2.out
!cat /tmp/hello-batch-3.out
!cat /tmp/hello-batch-4.out

Flux also includes 2 more convenient options for submitting multiple copies of the same or similar jobs in parallel.

First, there is `flux bulksubmit`. This command enqueues jobs based on a set of inputs which are substituted on the command line, similar to `xargs` and the GNU `parallel` utility. Unlike those programs, the jobs created by `flux bulksubmit` have access to the resources of an entire Flux instance instead of only the local system.

Let's run a simple example of `flux bulksubmit` to see it in action.

In [None]:
!flux bulksubmit --watch --wait echo {} ::: foo bar baz

The flags provided to `flux bulksubmit` tell it to print the output of each job to the terminal and wait for all the jobs to finish before returning.
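
To see what the `{}` substitution amounts to, here is a small Python sketch that builds one command line per input. It models only the simplest form used in the cell above. The helper `expand_bulk_commands` is a hypothetical name, and the sketch does not submit anything to Flux.

```python
import shlex


def expand_bulk_commands(template, inputs):
    """Build one command (a list of words) per input by replacing `{}` in the template."""
    return [
        [word.replace("{}", value) for word in template]
        for value in inputs
    ]


# Mirrors the shape of: flux bulksubmit echo {} ::: foo bar baz
commands = expand_bulk_commands(["echo", "{}"], ["foo", "bar", "baz"])
for command in commands:
    print(shlex.join(command))
```

Each resulting command corresponds to one job that `flux bulksubmit` would enqueue.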

Second, there is the `--cc` flag to `flux submit`. This flag tells Flux to submit multiple copies of a single command, each as a separate job with its own job ID. Unlike `flux bulksubmit`, you cannot substitute arbitrary values into the command. Instead, when using the `--cc` flag, you can only substitute the index of each copy (taken from the range given to `--cc`) using `{cc}`.
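
As a rough illustration of how a range such as `1-10` turns into individual copies, the Python sketch below expands a range and replaces `{cc}` with each value in turn. The helpers `expand_id_range` and `expand_cc` are hypothetical, and the sketch assumes only a single number or a single inclusive `first-last` range. It is not the parser Flux uses.

```python
def expand_id_range(spec):
    """Expand '5' or an inclusive range such as '1-10' into a list of ints.

    Assumption: only these two simple forms are handled in this sketch.
    """
    first, _, last = spec.partition("-")
    start = int(first)
    stop = int(last) if last else start
    return list(range(start, stop + 1))


def expand_cc(template, spec):
    """Return one command per value in the range, with `{cc}` replaced by that value."""
    return [
        [word.replace("{cc}", str(copy_id)) for word in template]
        for copy_id in expand_id_range(spec)
    ]


copies = expand_cc(["echo", "copy-{cc}"], "1-3")
for command in copies:
    print(" ".join(command))
```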

Let's run a simple example of `flux submit --cc`.

In [None]:
!flux submit --cc=1-10 --watch hostname

### Querying the status of jobs

Similar to Slurm's `squeue`, users can check the status of all their jobs using `flux jobs`. To see what information `flux jobs` gives us, let's start a bunch of jobs.

In [None]:
!flux submit hostname
!flux submit -N1 -n2 sleep inf
!flux run hostname
!flux run /bin/false
!flux run -n4 --label-io --time-limit=5s --env-remove=LD_LIBRARY_PATH hostname
!flux submit --cc=1-10 --watch hostname
!flux submit -N1 -n2 sleep inf

To see the status of our pending and running jobs, we will run `flux jobs`.

In [None]:
!flux jobs

Users can also filter or expand what jobs they see by providing flags to `flux jobs`. The full list of flags can be obtained using `flux jobs --help` (for usage statement style) or `flux help jobs` (for man page style).

Let's run the two code cells below to see information on jobs in all states (including completed jobs) and on failed jobs only, respectively.

In [None]:
!flux jobs -a

In [None]:
!flux jobs -f failed

### Canceling running jobs

Similar to Slurm's `scancel`, users can kill running jobs and cancel pending jobs using `flux cancel`. This command can be used to kill/cancel individual jobs or all jobs.

Let's run the command below to cancel the last submitted job. Note that `flux job last` gives us the ID of the most recently submitted job.

In [None]:
!flux cancel $(flux job last)

In [None]:
!flux jobs

Now, let's run `flux cancel --all` to cancel all running and pending jobs.

In [None]:
!flux cancel --all

In [None]:
!flux jobs

## Hierarchical scheduling with Flux

With traditional batch schedulers (e.g., Slurm), all job requests from all users are submitted to one centralized service. In the example shown here, the maximum job throughput of that service is one job per second.

<figure>
<img src="img/single-submit.png">
<figcaption>
<i>Image created by Vanessa Sochat for Flux Framework Components documentation</i></figcaption>
</figure>

The throughput of this approach is limited by the scheduler's ability to process a single job. To improve throughput, Flux introduces the ability to launch multiple Flux instances within an existing Flux instance. This creates a hierarchy of Flux instances across which job requests can be distributed. For example, let's say we create a Flux instance that has control of some number of nodes. We then create 3 child instances (each with its own scheduler and queue). By scheduling across this hierarchy of instances, we get a throughput of 1x3, or 3 jobs per second.
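
The arithmetic in this example can be written as a small Python model. It is an idealized upper bound under two assumptions of ours: every instance has the same number of children, and jobs are spread evenly across the leaf instances. It also ignores the cost of starting the instances themselves. The function names are hypothetical.

```python
def leaf_instances(fanout, levels):
    """Number of leaf schedulers when every instance has `fanout` children.

    levels=1 means a single centralized instance with no children.
    """
    if fanout < 1 or levels < 1:
        raise ValueError("fanout and levels must both be at least 1")
    return fanout ** (levels - 1)


def ideal_throughput(jobs_per_second_per_instance, fanout, levels):
    """Idealized throughput: every leaf scheduler works at full rate in parallel."""
    return jobs_per_second_per_instance * leaf_instances(fanout, levels)


print(ideal_throughput(1, fanout=3, levels=1))  # one centralized scheduler
print(ideal_throughput(1, fanout=3, levels=2))  # the 3 child instances from the example
```

Adding another level multiplies the number of leaf schedulers by the fan-out again, which is why deeper hierarchies raise the bound so quickly.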

<figure>
<img src="img/instance-submit.png">
<figcaption>
<i>Image created by Vanessa Sochat for Flux Framework Components documentation</i></figcaption>
</figure>

By leveraging a hierarchy of Flux instances to achieve a divide-and-conquer approach to scheduling, we can exponentially increase throughput. The figure below (from our [learning guide](https://flux-framework.readthedocs.io/en/latest/guides/learning_guide.html#fully-hierarchical-resource-management-techniques)) shows this exponential increase in an actual experiment. We submit 500 jobs/second using only a three-level hierarchy, whereas a centralized scheduler (1-Level in the figure) achieves only 1 job/second.

<figure>
<img src="img/scaled-submit.png">
<figcaption>
<i>Image from <a href="https://flux-framework.readthedocs.io/en/latest/guides/learning_guide.html#fully-hierarchical-resource-management-techniques">Flux learning guide</a></i></figcaption>
</figure>

There are different ways to create hierarchies of Flux instances. In this tutorial, we will focus on 2 of them:
1. Nested invocations of `flux batch`
2. The `flux tree` command

### Nested invocations of flux batch

As mentioned in the *Traditional batch scheduling with Flux* section, `flux batch` is the command used to submit non-interactive, batch script-based jobs to Flux.

The `flux batch` command can be invoked in a nested fashion within a batch script run by another `flux batch` command. When a job submitted with `flux batch` starts running, Flux creates a new Flux instance over the resources reserved for that job. In other words, before starting the script that the user provides, `flux batch` creates a new child in the hierarchy of Flux instances. Since a Flux instance has the same capabilities no matter where it lies in the hierarchy, this newly created instance can schedule its resources in the same way that a system-wide Flux instance can. As a result, the newly created Flux instance can be used to perform additional `flux batch` commands over its subset of the resources.

To show this in action, let's look at `sub_job1.sh` and `sub_job2.sh`.

In [None]:
from IPython.display import Code
Code(filename='sub_job1.sh', language='bash')

In [None]:
from IPython.display import Code
Code(filename='sub_job2.sh', language='bash')

When scheduled with `flux batch`, `sub_job1.sh` will run in a new Flux instance. It will then run `flux batch` again to run `sub_job2.sh`. Because the second `flux batch` command is within `sub_job1.sh`, the job request produced by the second `flux batch` command will go to the scheduler of the child Flux instance instead of the parent Flux instance.
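
The routing described above can be pictured with a toy Python model in which each instance has its own queue and each batch job creates a child instance. `InstanceModel` and its methods are hypothetical names for illustration only. Nothing here talks to Flux.

```python
class InstanceModel:
    """Toy model of a Flux instance: a name, its own job queue, and child instances."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.queue = []      # job requests handled by this instance's scheduler
        self.children = []   # instances created for batch jobs run here

    def batch(self, script):
        """Queue `script` in this instance and return the child instance that runs it."""
        self.queue.append(script)
        child = InstanceModel(f"{self.name}/{script}", parent=self)
        self.children.append(child)
        return child

    def depth(self):
        """Distance from the top-level instance (0 for the top level)."""
        return 0 if self.parent is None else 1 + self.parent.depth()


parent = InstanceModel("parent")
child = parent.batch("sub_job1.sh")       # the outer `flux batch`
grandchild = child.batch("sub_job2.sh")   # the `flux batch` issued inside sub_job1.sh

print(parent.queue)
print(child.queue)
print(grandchild.depth())
```

The request for `sub_job2.sh` appears only in the child's queue, mirroring how the nested `flux batch` goes to the child instance's scheduler rather than the parent's.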

We can see this in action by running the cell below.

In [None]:
!flux batch -N1 ./sub_job1.sh

Once we have submitted `sub_job1.sh`, we can look at the hierarchy of all the jobs we've run using `flux pstree`. Normally, this command can be used to show the jobs in a Flux instance. However, since we are running in a Jupyter notebook, this command has limited functionality. So, instead of running the bare command, we will run `flux pstree -a` to look at **all** jobs. In a more complex environment with more jobs, this command would show a deeper nesting. You can see examples of more complex outputs [here](https://flux-framework.readthedocs.io/en/latest/jobs/hierarchies.html?h=pstree#flux-pstree-command).

In [None]:
!flux pstree -a

### The flux tree command

`flux tree` is a prototype tool that allows you to easily create a hierarchy of Flux instances and submit work to different levels of it. Alternatively, it can be thought of as a way to create a nested hierarchy of jobs that scale out.

Let's run the command, look at the output, and talk about it.

In [None]:
!flux tree -T2x2 -J 4 -N 1 -c 4 -o ./tree.out -Q easy:fcfs hostname 
!cat ./tree.out

In the above cell, we run `flux tree` and look at the output file. The flags to `flux tree` do the following:
* `-T2x2`: spawn 2 Flux instances under the current instance and then spawn 2 more Flux instances under each of the other 2 (resulting in 4 leaf instances)
* `-J 4`: run 4 jobs in total across the hierarchy
* `-N 1`: deploy this hierarchy across 1 node
* `-c 4`: deploy this hierarchy with 4 cores per node
* `-o ./tree.out`: write performance data for the hierarchy to `./tree.out`
* `-Q easy:fcfs`: use the EASY scheduling policy (backfilling with reservations) in the first level of the hierarchy and use the fcfs policy (first come, first served) in the second (i.e., leaf) level
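
To check the instance counts implied by a topology string, the Python sketch below splits `2x2` into per-level fan-outs and pairs each level with its policy from `easy:fcfs`. The parsing rules are assumptions inferred from the flag descriptions above, not the logic `flux tree` itself uses, and the function names are hypothetical.

```python
def parse_topology(topology):
    """Split a topology string such as '2x2' into a list of per-level fan-outs."""
    return [int(part) for part in topology.split("x")]


def instances_per_level(topology):
    """Return how many instances exist at each level below the current instance."""
    counts = []
    total = 1
    for fanout in parse_topology(topology):
        total *= fanout  # every instance at the previous level spawns `fanout` children
        counts.append(total)
    return counts


def policy_per_level(policies):
    """Split a colon-separated policy string such as 'easy:fcfs' into one entry per level."""
    return policies.split(":")


levels = instances_per_level("2x2")
policies = policy_per_level("easy:fcfs")
for depth, (count, policy) in enumerate(zip(levels, policies), start=1):
    print(f"level {depth}: {count} instance(s), policy {policy}")
```

For `2x2`, this gives 2 instances at the first level and 4 at the second, matching the 4 leaf instances described above.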

With these flags, `flux tree` creates the hierarchy shown in the image below, with each leaf-level instance scheduling the `hostname` program.

<figure>
<img src="img/flux-tree.png">
<figcaption>
<i>Image created by Ian Lumsden based on images by Vanessa Sochat</i></figcaption>
</figure>

For this tutorial, we show `flux tree` with a relatively simple job (i.e., `hostname`). However, since this command accepts any valid jobspec that can be recognized by `flux submit`, it can be used to rapidly deploy much more complex scenarios, including scenarios where different programs are run on each leaf-level instance.

# This concludes Module 2.

In this module, we demonstrated how to:
1. Use Flux for traditional batch scheduling similar to what is provided by other schedulers like Slurm
2. Use Flux for hierarchical scheduling to achieve greater scheduling throughput

To continue with the tutorial, open [Module 3](./03_flux_framework.ipynb).