<div>
<center><img src="Flux-logo.svg" width="400"/></center>
</div>

# Welcome to the Flux Tutorial

> What is Flux Framework? 🤔️
 
Flux is a flexible framework for resource management, built for your site. The framework consists of a suite of projects, tools, and libraries that may be used to build site-custom resource managers for High Performance Computing centers and cloud environments. Flux is a next-generation resource manager and scheduler with many transformative capabilities like hierarchical scheduling and resource management (you can think of it as "fractal scheduling") and directed-graph based resource representations.

> I'm ready! How do I do this tutorial? 😁️

To step through the examples in this notebook, you need to execute cells. To run a cell, press Shift+Enter on your keyboard. If you prefer, you can also paste the shell commands into the <button data-commandLinker-command="terminal:open" data-name="flux" href="#">JupyterLab terminal</button> and execute them there.

This tutorial is split into 3 chapters, each of which has a notebook:
* [Chapter 1: Getting started with Flux](./01_flux_tutorial.ipynb) (you're already here, it's this notebook!)
* [Chapter 2: Flux Plumbing](./02_flux_framework.ipynb)
* [Chapter 3: Lessons learned, next steps, and discussion](./03_flux_tutorial_conclusions.ipynb)

And if you have some extra time and interest, we have supplementary chapters to teach you about advanced (often experimental, or under development) features:

* [Supplementary Chapter 1: Using DYAD to accelerate distributed Deep Learning (DL) training](./supplementary/dyad/dyad_dlio.ipynb)

Let's get started! To provide some brief, added background on Flux and a bit more motivation for our tutorial, "Shift+Enter" the cell below to watch our YouTube video!

In [1]:
%%html
<iframe width="640" height="360" 
    src="https://www.youtube.com/embed/YIwt51dyXOE" 
    title="YouTube video player" 
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" 
    allowfullscreen>
</iframe>

# Getting started with Flux

The code and examples that this tutorial is based on can be found at [flux-framework/Tutorials](https://github.com/flux-framework/Tutorials/tree/master/2024-RADIUSS-AWS). You can also find Python examples in the `flux-workflow-examples` directory from the sidebar navigation in this JupyterLab instance.

## Resources

> Looking for other resources? We got you covered! 🤓️

 - [https://flux-framework.org/](https://flux-framework.org/): Flux Framework portal for projects, releases, and publications
 - [Flux Documentation](https://flux-framework.readthedocs.io/en/latest/)
 - [Flux Framework Cheat Sheet](https://flux-framework.org/cheat-sheet/)
 - [Flux Glossary of Terms](https://flux-framework.readthedocs.io/en/latest/glossary.html)
 - [Flux Comics](https://flux-framework.readthedocs.io/en/latest/comics/fluxonomicon.html): come and meet FluxBird, the pink bird who knows things!
 - [Flux Learning Guide](https://flux-framework.readthedocs.io/en/latest/guides/learning_guide.html): learn about what Flux does, how it works, and real research applications
 - [Getting Started with Flux and Go](https://converged-computing.github.io/flux-go/)
 - [Getting Started with Flux in C](https://converged-computing.github.io/flux-c-examples/) *looking for contributors*

To read the Flux manpages and get help, run `flux help`. To get documentation on a subcommand, run, for example, `flux help config`. Here is an example of running `flux help` right from the notebook. Did you know that we are running in a Flux instance right now?

In [3]:
!flux help

Usage: flux [OPTIONS] COMMAND ARGS
  -h, --help             Display this message.
  -v, --verbose          Be verbose about environment and command search
  -V, --version          Display command and component versions
  -p, --parent           Set environment of parent instead of current instance

For general Flux documentation, please visit
    https://flux-framework.readthedocs.io

run and submit jobs, allocate resources
   submit             submit a job to a Flux instance
   run                run a Flux job interactively
   bulksubmit         submit jobs in bulk to a Flux instance
   alloc              allocate a new Flux instance for interactive use
   batch              submit a batch script to Flux

list and interact with jobs
   jobs               list jobs submitted to Flux
   top                display running Flux jobs
   pstree             display job hierarchies
   cancel             cancel one or more jobs
   pgrep/pkill        search or cancel matching jobs
   job      

<div class="alert alert-block alert-info">
<span style="font-weight:600">Tip:</span> Did you know you can also get help for a specific command? For example, run `flux help jobs` to get information on a sub-command.
</div>

In [30]:
!flux help jobs

FLUX-JOBS(1)                       flux-core                      FLUX-JOBS(1)

NAME
       flux-jobs - list jobs submitted to Flux

SYNOPSIS
       flux jobs [OPTIONS] [JOBID ...]

DESCRIPTION
       flux  jobs is used to list jobs run under Flux. By default only pending
       and running jobs for the current user are listed. Additional  jobs  and
       information  can  be  listed  using options listed below.  Alternately,
       specific job ids can be listed on the command line to only  list  those
       job IDs.

OPTIONS
       -a     List  jobs  in  all  states,  including  inactive jobs.  This is
              shorthand for --filter=pending,running,inactive.

       -A     List jobs of all users. This is shorthand for --user=all.

       -n, --no-header
              For default output, do not output column headers.

       -u, --user=[USERNAME|UID]
              List jobs for a specific username or userid. Specify all for all
              users.

       --name=[JOB NAME]
  

### What does the terminal prompt mean?
For cases when you need a terminal, we will <button data-commandLinker-command="terminal:open" data-name="flux" href="#">provide you with a button</button>! However, you can also select `File -> New -> Terminal` to open one on the fly. Next, let's talk about Flux instances.

# Creating Flux Instances

A Flux instance is a fully functional set of services which manage compute resources under its domain with the capability to launch jobs on those resources. A Flux instance may be running as the default resource manager on a cluster, a job in a resource manager such as Slurm, LSF, or Flux itself, or as a test instance launched locally.

When run as a job in another resource manager, Flux is started like an MPI program, e.g., under Slurm we might run `srun [OPTIONS] flux start [SCRIPT]`. Flux is unique in that a test instance that mimics a multi-node instance can be started locally with simply:

```bash
flux start --test-size=4
```

This offers users a way to learn and test interfaces and commands without access to an HPC cluster.
To start a Flux session with 4 brokers in your notebook container here, run:

In [5]:
!flux start --test-size=4 flux getattr size

4


When you run `flux start` without a command, it will give you an interactive shell to the instance. When you provide a command (as we do above) it will run it and exit. This is what happens for the command above! The output indicates the number of brokers started successfully. As soon as we get and print the size, we exit.

## Flux Resources

When you are interacting with Flux, you will commonly want to know what resources are available to you. Flux uses [hwloc](https://github.com/open-mpi/hwloc) to detect the resources on each node and then to populate its resource graph.

You can access the topology information that Flux collects with the `flux resource` subcommand. Let's run `flux resource list` to see the resources available to us in this notebook:

In [1]:
!flux resource list

     STATE NNODES   NCORES    NGPUS NODELIST
      free      4       40        0 f5af[12550686,12550686,12550686,12550686]
 allocated      0        0        0 
      down      0        0        0 


Flux can also bootstrap its resource graph based on static input files, as in the case of a multi-user system instance set up by site administrators. [More information on Flux's static resource configuration files](https://flux-framework.readthedocs.io/en/latest/adminguide.html#resource-configuration). Flux provides a more standard interface for listing available resources that works regardless of the resource input source: `flux resource`.

In [2]:
# To view status of resources
!flux resource status

     STATE UP NNODES NODELIST
     avail  ✔      4 f5af[12550686,12550686,12550686,12550686]


It might also be the case that you need to see queues. Here is how to do that:

In [32]:
!flux queue list

 DEFAULTTIME  TIMELIMIT     NNODES     NCORES      NGPUS
         inf        inf      0-inf      0-inf      0-inf


# Flux Commands 

Here is how Flux commands map to a scheduler you are likely familiar with, Slurm. A larger table with similar mappings for LSF, Moab, and Slurm can be [viewed here](https://hpc.llnl.gov/banks-jobs/running-jobs/batch-system-cross-reference-guides). For submitting jobs, you can use the `flux` `submit`, `run`, `bulksubmit`, `batch`, and `alloc` commands.

<table>
    <tr>
        <th>Operation</th>
        <th>Slurm</th>
        <th>Flux</th>
    </tr>
    <tr>
        <td>One-off run of a single job (blocking)</td>
        <td><code>srun</code></td>
        <td><code>flux run</code></td>
    </tr>
    <tr>
        <td>One-off run of a single job (interactive)</td>
        <td><code>srun --pty</code></td>
        <td><code>flux run -o pty.interactive</code></td>
    </tr>
    <tr>
        <td>One-off run of a single job (not blocking)</td>
        <td><code>NA</code></td>
        <td><code>flux submit</code></td>
    </tr>
    <tr>
        <td>Bulk submission of jobs (not blocking)</td>
        <td><code>NA</code></td>
        <td><code>flux bulksubmit</code></td>
    </tr>    
    <tr>
        <td>Watching jobs</td>
        <td><code>NA</code></td>
        <td><code>flux watch</code></td>
    </tr>
    <tr>
        <td>Querying the status of jobs</td>
        <td><code>squeue</code>/<code>scontrol show job <i>job_id</i></code></td>
        <td><code>flux jobs</code>/<code>flux job info <i>job_id</i></code></td>
    </tr>
    <tr>
        <td>Canceling running jobs</td>
        <td><code>scancel</code></td>
        <td><code>flux cancel</code></td>
    </tr>
    <tr>
        <td>Submitting batch jobs</td>
        <td><code>sbatch</code></td>
        <td><code>flux batch</code></td>
    </tr>
    <tr>
        <td>Allocation for an interactive instance</td>
        <td><code>salloc</code></td>
        <td><code>flux alloc</code></td>
    </tr>
</table>

## flux run

<div class="alert alert-block" style="background-color:skyblue">
<span style="font-weight:600">Description:</span> One-off run of a single job (blocking)
</div>

The `flux run` command submits a job to Flux (similar to `flux submit`) but then attaches to the job with `flux job attach`, printing the job's stdout/stderr to the terminal and exiting with the same exit code as the job. It's basically doing an interactive submit, because you will be able to watch the output in your terminal, and it will block your terminal until the job completes.

In [5]:
!flux run hostname

399f5da372b0


The output from the previous command is the hostname (a container ID string in this case). If the job exits with a non-zero exit code this will be reported by `flux job attach` (occurs implicitly with `flux run`). For example, execute the following:

In [6]:
!flux run /bin/false

flux-job: task(s) exited with exit code 1
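
As a rough illustration of the exit-code behavior described above, the following Python sketch runs a command with `subprocess` and reports a nonzero exit status. The helper name `run_and_report` is hypothetical, and a Python one-liner stands in for `/bin/false` so that the snippet works even where Flux is not installed. In this notebook you could presumably pass `["flux", "run", "/bin/false"]` instead.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command, wait for it to finish, and return its exit code."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    if completed.returncode != 0:
        print(f"command exited with exit code {completed.returncode}")
    return completed.returncode


# Stand-in for /bin/false: a child process that exits with status 1.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
exit_code = run_and_report(failing)
```

Because `flux run` exits with the same exit code as the job, a wrapper like this can branch on the returned value to decide what to do next.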


A job submitted with `run` can be canceled with two rapid `Ctrl-C`s in succession, or a user can detach from the job with `Ctrl-C Ctrl-Z`. The user can then re-attach to the job by using `flux job attach JOBID`.

`flux submit` and `flux run` also support many other useful flags:

In [7]:
!flux run -n4 --label-io --time-limit=5s --env-remove=LD_LIBRARY_PATH hostname

3: 399f5da372b0
2: 399f5da372b0
1: 399f5da372b0
0: 399f5da372b0


In [13]:
# Uncomment and run this help command if you want to see all the flags for flux run
# !flux run --help

## flux submit

<div class="alert alert-block" style="background-color:skyblue">
<span style="font-weight:600">Description:</span> One-off run of a single job (not blocking)
</div>


The `flux submit` command submits a job to Flux and prints out the jobid. 

In [4]:
# Let's peek at the help for flux submit!
!flux submit --help | head -n 15

usage: flux submit [OPTIONS...] COMMAND [ARGS...]

enqueue a job

positional arguments:
  command                     Job command and arguments

options:
  -h, --help                  show this help message and exit
  -q, --queue=NAME            Submit a job to a specific named queue
  -t, --time-limit=MIN|FSD    Time limit in minutes when no units provided,
                              otherwise in Flux standard duration, e.g. 30s,
                              2d, 1.5h
      --urgency=N             Set job urgency (0-31), hold=0, default=16,
                              expedite=31
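
To make the `--time-limit` description in the help text above concrete, here is a minimal Python sketch of the conversion it describes: a bare number is read as minutes, and otherwise a unit suffix selects the scale. The function name `time_limit_to_seconds` is hypothetical, and the sketch only handles the one-letter suffixes `s`, `m`, `h`, and `d`; it is not Flux's own parser.

```python
def time_limit_to_seconds(value):
    """Convert a --time-limit style string to seconds.

    Simplified sketch: a bare number is read as minutes; otherwise a
    one-letter suffix (s, m, h, d) selects the unit.
    """
    units = {"s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
    text = value.strip()
    if text and text[-1] in units:
        return float(text[:-1]) * units[text[-1]]
    return float(text) * 60.0  # no unit given: minutes


# The examples from the help text, plus a bare number.
for example in ("30s", "2d", "1.5h", "10"):
    print(example, "->", time_limit_to_seconds(example), "seconds")
```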


In [2]:
!flux submit hostname

ƒScZH3DbD


`submit` supports common options like `--nodes`, `--ntasks`, and `--cores-per-task`. There are short option equivalents (`-N`, `-n`, and `-c`, respectively) of these options as well. `--cores-per-task=1` is the default.

In [3]:
!flux submit -N1 -n2 sleep inf

ƒSdrJJshH
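
If you drive Flux from Python scripts, it can help to assemble the command line programmatically. The sketch below builds the same argument list as the cell above using the short options `-N`, `-n`, and `-c`. The helper `build_submit_argv` is a hypothetical name, and the sketch only constructs the list; it does not submit anything.

```python
import shlex


def build_submit_argv(command, nodes=None, ntasks=None, cores_per_task=None):
    """Build an argument list for `flux submit` from keyword options."""
    argv = ["flux", "submit"]
    if nodes is not None:
        argv.append(f"-N{nodes}")
    if ntasks is not None:
        argv.append(f"-n{ntasks}")
    if cores_per_task is not None:
        argv.append(f"-c{cores_per_task}")
    return argv + list(command)


argv = build_submit_argv(["sleep", "inf"], nodes=1, ntasks=2)
print(shlex.join(argv))
```

The resulting list could then be handed to `subprocess.run` in an environment where Flux is available.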


## flux bulksubmit

<div class="alert alert-block" style="background-color:skyblue">
<span style="font-weight:600">Description:</span> Bulk submission of jobs (not blocking)
</div>

The `flux bulksubmit` command enqueues jobs based on a set of inputs which are substituted on the command line, similar to `xargs` and the GNU `parallel` utility, except the jobs have access to the resources of an entire Flux instance instead of only the local system.

In [8]:
!flux bulksubmit --watch --wait echo {} ::: foo bar baz

ƒSqGSA7dh
ƒSqGSA7di
ƒSqGSA7dj
bar
baz
foo
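
The substitution idea behind the command above can be sketched in a few lines of Python. This is only an illustration of how `{}` is replaced by each input after `:::`. The function `expand_bulk_commands` is a hypothetical helper, it covers just the simplest single-input case, and it does not submit any jobs.

```python
def expand_bulk_commands(template, inputs):
    """Return one argv list per input, replacing "{}" with that input."""
    return [[word.replace("{}", item) for word in template] for item in inputs]


# Mirrors: flux bulksubmit echo {} ::: foo bar baz
commands = expand_bulk_commands(["echo", "{}"], ["foo", "bar", "baz"])
for argv in commands:
    print(" ".join(argv))
```

Each resulting command corresponds to one enqueued job, which is why three job IDs are printed in the output above.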


The `--cc` option (akin to "carbon copy") to `submit` makes repeated submission even easier via `flux submit --cc=IDSET`:

In [None]:
!flux submit --cc=1-4 --watch hostname

Try it in the JupyterLab terminal with a progress bar and jobs/s rate report: `flux submit --cc=1-100 --watch --progress --jps hostname`

Note that `--wait` is implied by `--watch`, meaning that when you are watching jobs, you are also waiting for them to finish.
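
To see what an `IDSET` such as `1-4` stands for, here is a simplified Python sketch that expands a comma-separated list of ids and ranges into individual integers. The function `expand_idset` is a hypothetical helper written for this illustration, and it assumes only plain ids and `first-last` ranges; it is not the parser Flux uses.

```python
def expand_idset(idset):
    """Expand a simple idset such as "1-4" or "0,2-3" into a list of ints."""
    ids = []
    for part in idset.strip("[]").split(","):
        if "-" in part:
            first, last = part.split("-")
            ids.extend(range(int(first), int(last) + 1))
        else:
            ids.append(int(part))
    return ids


print(expand_idset("1-4"))    # one entry per copy of the job
print(len(expand_idset("1-100")))
```

Under this reading, `--cc=1-4` submits four copies of the job, one for each id in the set.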

Of course, Flux can launch more than just single-node, single-core jobs.  We can submit multiple heterogeneous jobs and Flux will co-schedule the jobs while also ensuring no oversubscription of resources (e.g., cores).

Note: in this tutorial, we cannot assume that the host you are running on has multiple cores, thus the examples below only vary the number of nodes per job.  Varying the `cores-per-task` is also possible on Flux when the underlying hardware supports it (e.g., a multi-core node).

In [10]:
!flux submit --nodes=2 --ntasks=2 --cores-per-task=1 --job-name simulation sleep inf
!flux submit --nodes=1 --ntasks=1 --cores-per-task=1 --job-name analysis sleep inf

ƒT2VDkrT1
ƒT2azp48w


## flux watch

<div class="alert alert-block" style="background-color:skyblue">
<span style="font-weight:600">Description:</span> 👀️ Watching jobs
</div>

Wouldn't it be cool to submit a job and then watch it? Well, yeah! We can do this with `flux watch`. Let's run a fun example and then watch the output. We have sleeps interspersed with echos only to show you the live action! 🥞️
Also note a nice trick: you can always use `flux job last` to get the last JOBID.
Here is an example (not runnable in a notebook cell, since each `!` command runs in its own shell and variables do not persist between them) for getting and saving a job ID:

```bash
flux submit hostname
JOBID=$(flux job last)
```

And then you could use the variable `$JOBID` in your subsequent script or interactions with Flux! So what makes `flux watch` different from `flux job attach`? Aside from the fact that `flux watch` is read-only, it can watch many (or even all, with `flux watch --all`) jobs at once!
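
The job-ID capture from the bash snippet above can also be done from a Python cell, where the value does persist between cells. The sketch below shells out to `flux job last` with `subprocess`. The helper name `last_jobid` is hypothetical, and it returns `None` when Flux is not on the `PATH` or the command fails.

```python
import shutil
import subprocess


def last_jobid():
    """Return the most recent job ID as a string, or None if unavailable."""
    if shutil.which("flux") is None:
        return None  # Flux is not installed in this environment
    result = subprocess.run(
        ["flux", "job", "last"], capture_output=True, text=True
    )
    if result.returncode != 0:
        return None  # for example, no job has been submitted yet
    return result.stdout.strip()


jobid = last_jobid()
print(jobid)
```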

In [11]:
!flux submit ./flux-workflow-examples/job-watch/job-watch.sh
!flux watch $(flux job last)

ƒTR3HXBfD
25 chocolate chip pancakes on the table... 25 chocolate chip pancakes! 🥞️
Eat a stack, for a snack, 15 chocolate chip pancakes on the table! 🥄️
15 chocolate chip pancakes on the table... 15 chocolate chip pancakes! 🥞️
Throw a stack... it makes a smack! 15 chocolate chip pancakes on the wall! 🥞️
You got some cleaning to do 🧽️


## flux jobs

<div class="alert alert-block" style="background-color:skyblue">
<span style="font-weight:600">Description:</span> Querying the status of jobs
</div>

We can now list the jobs in the queue with `flux jobs`, and we should see the jobs that we just submitted. Jobs that are instances are colored blue in the output, red jobs are failed jobs, and green jobs are those that completed successfully. Note that the JupyterLab notebook may not display these colors. You will be able to see them in the terminal.

In [12]:
!flux jobs

       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
   ƒT2azp48w jovyan   analysis    R      1      1   1.267m 399f5da372b0
   ƒT2VDkrT1 jovyan   simulation  R      2      2   1.271m 399f5da372b[0,0]
   ƒSdrJJshH jovyan   sleep       R      2      1   2.127m 399f5da372b0


You might also want to see "all" jobs with `-a`.

In [None]:
!flux jobs -a

## flux cancel

<div class="alert alert-block" style="background-color:skyblue">
<span style="font-weight:600">Description:</span> Canceling running jobs
</div>

Since some of the jobs we see in the table above won't ever exit (and we didn't specify a time limit), let's cancel them all now and free up the resources.

In [13]:
# This was previously flux cancelall -f
!flux cancel --all
!flux jobs

flux-cancel: Canceled 3 jobs (0 errors)
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO


## flux batch

<div class="alert alert-block" style="background-color:skyblue">
<span style="font-weight:600">Description:</span> Submitting batch jobs
</div>

We can use the `flux batch` command to easily create nested Flux instances. When `flux batch` is invoked, Flux will automatically create a nested instance that spans the resources allocated to the job, and then Flux runs the batch script passed to `flux batch` on rank 0 of the nested instance. "Rank" refers to the rank of the Tree-Based Overlay Network (TBON) used by the [Flux brokers](https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man1/flux-broker.html).

While a batch script is expected to launch parallel jobs using `flux run` or `flux submit` at this level, nothing prevents the script from further batching other sub-batch-jobs using the `flux batch` interface, if desired.

In [14]:
!flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh
!flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh

ƒThKfdhKD
ƒThRLkwsm


The contents of `sleep_batch.sh`:

In [15]:
from IPython.display import Code
Code(filename='sleep_batch.sh', language='bash')

In [16]:
# Here we are submitting a job that generates output, and asking to write it to /tmp/cheese.txt
!flux submit --out /tmp/cheese.txt echo "Sweet dreams 🌚️ are made of cheese, who am I to diss a brie? 🧀️"

# This will show us JOBIDs
!flux jobs

# We can even see jobs in sub-instances with "-R" (for recursive)
!flux jobs -R

# You could copy a JOBID from above and paste it in the line below to examine the job's resources and output
# or get the last jobid with "flux job last" (this is what we will do here)
# JOBID="ƒFoRYVpt7"

# Note here we are using flux job last to see the last one
# The "R" here asks for the resource spec
!flux job info $(flux job last) R

# When we attach it will direct us to our output file
!flux job attach $(flux job last)

# And we can look at the output file to see our expected output!
from IPython.display import Code
Code(filename='/tmp/cheese.txt', language='text')

ƒU5u1XQcf
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
   ƒThRLkwsm jovyan   ./sleep_b+  R      2      2   51.17s 399f5da372b[0,0]
   ƒThKfdhKD jovyan   ./sleep_b+  R      2      2   51.39s 399f5da372b[0,0]
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
   ƒThRLkwsm jovyan   ./sleep_b+  R      2      2   51.34s 399f5da372b[0,0]
   ƒThKfdhKD jovyan   ./sleep_b+  R      2      2   51.56s 399f5da372b[0,0]

ƒThRLkwsm:
    ƒEgNEfjm jovyan   sleep       R      2      2   20.11s 399f5da372b[0,0]

ƒThKfdhKD:
    ƒEga6ZzX jovyan   sleep       R      2      2   20.32s 399f5da372b[0,0]
{"version": 1, "execution": {"R_lite": [{"rank": "3", "children": {"core": "7"}}], "nodelist": ["399f5da372b0"], "starttime": 1721424338, "expiration": 4875020774}}
0: stdout redirected to /tmp/cheese.txt
0: stderr redirected to /tmp/cheese.txt
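
The `R` document printed by `flux job info` above is plain JSON, so it is easy to inspect from Python. The sketch below parses the exact string from the output and pulls out the node list, broker ranks, and cores. The function `summarize_r` is a hypothetical helper, and it assumes only the keys visible in that output.

```python
import json

# The R document copied from the cell output above.
r_text = (
    '{"version": 1, "execution": {"R_lite": [{"rank": "3", '
    '"children": {"core": "7"}}], "nodelist": ["399f5da372b0"], '
    '"starttime": 1721424338, "expiration": 4875020774}}'
)


def summarize_r(text):
    """Extract the node list, ranks, and cores from an R JSON string."""
    execution = json.loads(text)["execution"]
    return {
        "nodelist": execution["nodelist"],
        "ranks": [entry["rank"] for entry in execution["R_lite"]],
        "cores": [entry["children"]["core"] for entry in execution["R_lite"]],
    }


summary = summarize_r(r_text)
print(summary)
```

Here the job was assigned core `7` on broker rank `3`, both of which are kept as strings because that is how they appear in the document.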


We can again list all jobs, including completed ones, with `flux jobs -a`:

In [None]:
!flux jobs -a

To restrict the output to failed jobs (i.e., jobs that exit with a nonzero exit code, time out, or are canceled or killed), run:

In [18]:
!flux jobs -f failed

       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
   ƒSixhuHXu jovyan   false       F      1      1   0.070s 399f5da372b0


## flux alloc

<div class="alert alert-block" style="background-color:skyblue">
<span style="font-weight:600">Description:</span> Allocation for an interactive instance
</div>

You might want to request an allocation for a set of resources and then attach to it interactively. This is the goal of `flux alloc`. Since we can't easily do that in a cell, try opening up the <button data-commandLinker-command="terminal:open" data-name="flux" href="#">JupyterLab terminal</button> and running:

```bash
# Look at the resources you have outside of the allocation
flux resource list

# Request an allocation with 2 "nodes" - a subset of what you have in total
flux alloc -N 2

# See the resources you are given
flux resource list

# You can exit from the allocation like this!
exit
```

# The Flux Hierarchy 🍇️

One unique feature of the Flux Framework scheduler is its ability to submit jobs within instances, where an instance can be thought of as a level in a graph. Let's start with a basic image: this is what it might look like to submit to a scheduler that is not graph-based, where all jobs go to a central job queue or database. Note that our maximum job throughput is one job per second.

![img/single-submit.png](img/single-submit.png)

The throughput is limited by the workload manager's ability to process a single job. We can improve upon this by simply adding another level, perhaps with three instances. For example, let's say we create a Flux allocation or batch job that has control of some number of child nodes. We might launch three new instances (each with its own scheduler and queue) at level two, and all of a sudden, we get a throughput of 1x3, or three jobs per second.

![img/instance-submit.png](img/instance-submit.png)


The throughput can increase exponentially because we are essentially submitting to different schedulers. The example above is not impressive, but our [learning guide](https://flux-framework.readthedocs.io/en/latest/guides/learning_guide.html#fully-hierarchical-resource-management-techniques) (Figure 10) has a beautiful example of how it can scale, done via an actual experiment. We were able to submit 500 jobs/second using only three levels, vs. close to 1 job/second with one level.

![img/scaled-submit.png](img/scaled-submit.png)
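
The "1x3" arithmetic from the two-level example can be written as a small idealized model. The Python sketch below assumes every instance at the deepest level accepts jobs at the same rate and that each instance spawns the same number of children. The function `ideal_throughput` is a hypothetical helper, it ignores all overhead, and it is not the method used for the experiment in the learning guide.

```python
def ideal_throughput(jobs_per_second, fanout, levels):
    """Idealized aggregate submission rate for a uniform hierarchy.

    Assumes each instance spawns `fanout` children per level and that
    every instance at the deepest level accepts `jobs_per_second`.
    """
    leaf_instances = fanout ** (levels - 1)
    return jobs_per_second * leaf_instances


print(ideal_throughput(1, 3, 1))  # single scheduler
print(ideal_throughput(1, 3, 2))  # three child instances at level two
```

With one level the model gives one job per second, and with three instances at level two it gives three, matching the figures described earlier in this section.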

And for an interesting detail, you can vary the scheduler algorithm or topology within each sub-instance, meaning that you can do some fairly interesting things with scheduling work, and all without stressing the top level system instance. Next, let's look at a prototype tool called `flux-tree` that you can use to see how this works.

## flux tree

Flux tree is a prototype tool that allows you to easily submit work to different levels of your Flux instance or, more specifically, to create a nested hierarchy of jobs that scale out. Let's run the command, look at the output, and talk about it.

In [1]:
!flux tree -T2x2 -J 4 -N 1 -c 4 -o ./tree.out -Q easy:fcfs hostname
! cat ./tree.out

flux-job: task(s) exited with exit code 1
flux-job: task(s) exited with exit code 1
awk: line 1: syntax error at or near :
e8cfed35f636
flux-tree-helper: ERROR: Expecting value: line 1 column 160 (char 159)
Jul 05 05:20:32.333883 UTC broker.err[0]: rc2.0: flux tree -N1 -c1 --leaf --prefix=tree.1.1 --njobs=1 -- hostname Exited (rc=1) 0.6s
awk: line 1: syntax error at or near :
flux-tree-helper: ERROR: Expecting value: line 1 column 156 (char 155)
Jul 05 05:20:33.523886 UTC broker.err[0]: rc2.0: flux tree -N1 -c2 --topology=2 --queue-policy=fcfs --prefix=tree.1 --njobs=2 -- hostname Exited (rc=1) 2.4s
awk: line 1: syntax error at or near :
flux-tree-helper: ERROR: Expecting value: line 1 column 155 (char 154)
cat: ./tree.out: No such file or directory


In the above, we are running `flux-tree` and looking at the output file. What is happening is that the `flux tree` command is creating a hierarchy of instances. Based on their names you can tell that:

 - `2x2` in the command is the topology
 - It says to create two flux instances, and make them each spawn two more.
 - `tree` is the root
 - `tree.1` is the first instance
 - `tree.2` is the second instance
 - `tree.1.1` and `tree.1.2` refer to the nested instances under `tree.1`
 - `tree.2.1` and `tree.2.2` refer to the nested instances under `tree.2`
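
The naming scheme in the list above can be reproduced with a short Python sketch that walks a topology string such as `2x2` level by level. The function `tree_instance_names` is a hypothetical helper written for this illustration; it only generates names and does not launch any instances.

```python
def tree_instance_names(topology, prefix="tree"):
    """List instance names level by level for a topology such as "2x2"."""
    fanouts = [int(part) for part in topology.split("x")]
    names = [prefix]
    current = [prefix]
    for fanout in fanouts:
        # Every instance at this level spawns `fanout` numbered children.
        current = [
            f"{parent}.{index}"
            for parent in current
            for index in range(1, fanout + 1)
        ]
        names.extend(current)
    return names


print(tree_instance_names("2x2"))
```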
 
We provided the command `hostname` to this script, but a more complex example would generate more interesting hierarchies,
with different functionality for each. Note that although this is just a dummy prototype, you could use `flux-tree` for actual work,
or more likely, you would want to use `flux batch` to submit multiple commands within a single Flux instance to take advantage of the same
hierarchy. 

## flux batch

Let's return to flux batch, but now with our new knowledge about flux instances! Flux tree is actually an experimental command that you won't encounter in the wild. Instead, you will likely interact with your nested flux instances with `flux batch`. Let's start with a batch script `hello-batch.sh`.

### hello-batch.sh


In [19]:
from IPython.display import Code
Code(filename='hello-batch.sh', language='bash')

When we provide this script to `flux batch`, Flux is going to:

1. Create a Flux instance with the top-level resources you specify
2. Submit jobs to the scheduler controlled by the broker of that sub-instance
3. Run the four jobs, with `--flags=waitable` and `flux job wait --all` to wait for the output file

Within the batch script, you can add `--wait` or `--flags=waitable` to individual jobs, and use `flux queue drain` to wait for the queue to drain, _or_ `flux job wait --all` to wait for the jobs you flagged to finish. 

Note that when you submit a batch job, you'll get a job id back for the _batch job_, and usually when you look at the output of that with `flux job attach $jobid` you will see the output file(s) where the internal contents are written. Since we want to print the output file easily to the terminal, we are waiting for the batch job by adding the `--flags=waitable` and then waiting for it. Let's try to run our batch job now.

In [20]:
! flux batch --flags=waitable --out /tmp/flux-batch.out -N2 ./hello-batch.sh
! flux job wait
! cat /tmp/hello-batch-1.out
! cat /tmp/hello-batch-2.out
! cat /tmp/hello-batch-3.out
! cat /tmp/hello-batch-4.out

ƒY424FfiX
ƒY424FfiX
Hello job 1 from 399f5da372b0 💛️
Hello job 2 from 399f5da372b0 💚️
Hello job 3 from 399f5da372b0 💙️
Hello job 4 from 399f5da372b0 💜️


Excellent! Now let's look at another batch example. Here we have two job scripts:

- sub_job1.sh: Is going to be run with `flux batch` and submit sub_job2.sh
- sub_job2.sh: Is going to be submitted by sub_job1.sh.

You can see that below.

In [31]:
Code(filename='sub_job1.sh', language='bash')

In [32]:
Code(filename='sub_job2.sh', language='bash')

In [22]:
# Submit it!
!flux batch -N1 ./sub_job1.sh

ƒYRzZSFuy


And now that we've submitted, let's look at the hierarchy for all the jobs we just ran. Here is how to try `flux pstree`, which normally can show jobs in an instance, but it has limited functionality given we are in a notebook! So instead of just running the single command, let's add "-a" to indicate "show me ALL jobs."
More complex jobs and in a different environment would have deeper nesting. You can [see examples here](https://flux-framework.readthedocs.io/en/latest/jobs/hierarchies.html?h=pstree#flux-pstree-command).

In [23]:
!flux pstree -a

.
├── ./sub_job1.sh
│   └── ./sub_job2.sh
│       └── sleep:R
├── ./hello-batch.sh:CD
├── 2*[./sleep_batch.sh:CD]
├── 4*[echo:CD]
├── sleep:CA
├── simulation:CA
├── analysis:CA
├── job-watch.sh:CD
├── 13*[hostname:CD]
└── false:F


You can also try a more detailed view with `flux pstree -a -X`!

# Process and Job Utilities ⚙️

## Flux uptime

Did someone say... [uptime](https://youtu.be/SYRlTISvjww?si=zDlvpWbBljUmZw_Q)? ☝️🕑️🕺️

Flux provides an `uptime` utility to display properties of the Flux instance, such as its current state, how long it has been running, its size, and whether scheduling is disabled or stopped. The output shows how long the instance has been up, the instance owner, the instance depth (depth in the Flux hierarchy), and the size of the instance (number of brokers).

In [31]:
!flux uptime

 22:16:17 run 1.8h,  owner jovyan,  depth 0,  size 4


## Flux top 
Flux provides a full-featured version of `top` for nested Flux instances and jobs. In the JupyterLab terminal, invoke `flux top` to see the "sleep" jobs. If they have already completed, you can resubmit them. 

We recommend not running `flux top` in the notebook as it is not designed to display output from a command that runs continuously.

## Flux pstree 
In analogy to `top`, Flux provides `flux pstree`. Try it out in the <button data-commandLinker-command="terminal:open" data-name="flux" href="#">JupyterLab terminal</button> or here in the notebook.

## Flux proxy

> To interact with a job hierarchy

Flux proxy is used to route messages to and from a Flux instance. We can use `flux proxy` to connect to a running Flux instance and then submit more nested jobs inside it. You may want to edit `sleep_batch.sh` with the JupyterLab text editor (double click the file in the window on the left) to sleep for `60` or `120` seconds. Then from the <button data-commandLinker-command="terminal:open" data-name="flux" href="#">JupyterLab terminal</button> run the commands below!

```bash
# The terminal will start at the root, ensure you are in the right spot!
# jovyan - that's you! 
cd /home/jovyan/flux-radiuss-tutorial-2023/notebook/

# Outputs the JOBID
flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh

# Put the JOBID into an environment variable
JOBID=$(flux job last)

# See the flux process tree
flux pstree -a

# Connect to the Flux instance corresponding to JOBID above
flux proxy ${JOBID}

# Note the depth is now 1 and the size is 2: we're one level deeper in a Flux hierarchy and we have only 2 brokers now.
flux uptime

# This instance has 2 "nodes" and 2 cores allocated to it
flux resource list

# Have you used the top command in your terminal? We have one for flux!
flux top
```

`flux top` was pretty cool, right? 😎️

# Python Submission API 🐍️
Flux also provides first-class Python bindings which can be used to submit jobs programmatically. The following script shows this with the `flux.job.submit()` call:

In [24]:
import os
import json
import flux
from flux.job import JobspecV1
from flux.job.JobID import JobID

In [25]:
f = flux.Flux() # connect to the running Flux instance
compute_jobreq = JobspecV1.from_command(
    command=["./compute.py", "120"], num_tasks=1, num_nodes=1, cores_per_task=1
) # construct a jobspec
compute_jobreq.cwd = os.path.expanduser("~/flux-tutorial/flux-workflow-examples/job-submit-api/") # set the CWD
print(JobID(flux.job.submit(f,compute_jobreq)).f58) # submit and print out the jobid (in f58 format)

ƒZoXw7Pdq


### `flux.job.get_job(handle, jobid)` to get job info

In [26]:
# This is a new command to get info about your job from the id!
fluxjob = flux.job.submit(f,compute_jobreq)
fluxjobid = JobID(fluxjob.f58)
print(f"🎉️ Hooray, we just submitted {fluxjobid}!")

# Here is how to get your info. The first argument is the flux handle, then the jobid
jobinfo = flux.job.get_job(f, fluxjobid)
print(json.dumps(jobinfo, indent=4))

🎉️ Hooray, we just submitted ƒZrqPNeNb!
{
    "t_depend": 1721425098.682836,
    "t_run": 0.0,
    "t_cleanup": 0.0,
    "t_inactive": 0.0,
    "duration": 0.0,
    "expiration": 0.0,
    "name": "compute.py",
    "cwd": "/home/jovyan/flux-tutorial/flux-workflow-examples/job-submit-api/",
    "queue": "",
    "project": "",
    "bank": "",
    "ntasks": 1,
    "ncores": 1,
    "nnodes": 1,
    "priority": 16,
    "ranks": "",
    "nodelist": "",
    "success": "",
    "result": "",
    "waitstatus": "",
    "id": 72552617607168,
    "t_submit": 1721425098.6718118,
    "t_remaining": 0.0,
    "state": "SCHED",
    "username": "jovyan",
    "userid": 1000,
    "urgency": 16,
    "runtime": 0.0,
    "status": "SCHED",
    "returncode": "",
    "dependencies": [],
    "annotations": {},
    "exception": {
        "occurred": "",
        "severity": "",
        "type": "",
        "note": ""
    }
}
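
The same job appears above in two forms: the F58 string `ƒZrqPNeNb` in the greeting and the integer `id` field `72552617607168` in the job info. The sketch below shows how the two relate, assuming that F58 is the letter `ƒ` followed by a base58 encoding of the integer using the Bitcoin-style alphabet. That assumption is consistent with the sample values above. The helper names `to_f58` and `from_f58` are hypothetical; in real code the `JobID` class already performs this conversion (as in `JobID(...).f58` earlier).

```python
# Bitcoin-style base58 alphabet (an assumption): omits 0, O, I, and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
PREFIX = "ƒ"


def to_f58(jobid: int) -> str:
    """Encode a non-negative integer job ID as an F58-style string."""
    if jobid < 0:
        raise ValueError("job IDs are non-negative")
    digits = ""
    while True:
        jobid, remainder = divmod(jobid, 58)
        digits = ALPHABET[remainder] + digits
        if jobid == 0:
            break
    return PREFIX + digits


def from_f58(text: str) -> int:
    """Decode an F58-style string back into its integer job ID."""
    if not text.startswith(PREFIX):
        raise ValueError("expected the string to start with " + PREFIX)
    value = 0
    for char in text[len(PREFIX):]:
        value = value * 58 + ALPHABET.index(char)
    return value


print(to_f58(72552617607168))   # the integer "id" from the job info above
print(from_f58("ƒZrqPNeNb"))    # the F58 form printed at submission
```

Both directions round-trip the values shown in the cell output, which is a quick way to confirm that a job ID copied from `flux jobs` and an integer ID from `get_job` refer to the same job.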


In [27]:
!flux jobs -a | grep compute

   ƒZrqPNeNb jovyan   compute.py  F      1      1   0.009s 399f5da372b0
   ƒZoXw7Pdq jovyan   compute.py  F      1      1   0.011s 399f5da372b0


Under the hood, the `Jobspec` class is creating a YAML document that ultimately gets serialized as JSON and sent to Flux for ingestion, validation, queueing, scheduling, and eventually execution.  We can dump the raw JSON jobspec that is submitted, where we can see the exact resources requested and the task set to be executed on those resources.

In [28]:
print(compute_jobreq.dumps(indent=2))

{
  "resources": [
    {
      "type": "node",
      "count": 1,
      "with": [
        {
          "type": "slot",
          "count": 1,
          "with": [
            {
              "type": "core",
              "count": 1
            }
          ],
          "label": "task"
        }
      ]
    }
  ],
  "tasks": [
    {
      "command": [
        "./compute.py",
        "120"
      ],
      "slot": "task",
      "count": {
        "per_slot": 1
      }
    }
  ],
  "attributes": {
    "system": {
      "duration": 0,
      "cwd": "/home/jovyan/flux-tutorial/flux-workflow-examples/job-submit-api/"
    }
  },
  "version": 1
}
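
To make the structure of that dump easier to read, here is a minimal sketch that assembles an equivalent dictionary by hand with only the standard library. It is illustrative only: `make_jobspec` and its parameters are hypothetical names, not part of the Flux API, and in practice `JobspecV1.from_command` builds (and validates) this for you.

```python
import json


def make_jobspec(command, cwd, nodes=1, slots_per_node=1, cores_per_slot=1):
    """Hand-build a dictionary shaped like the jobspec dump above."""
    # Resources nest from the outside in: node -> slot -> core.
    core = {"type": "core", "count": cores_per_slot}
    slot = {"type": "slot", "count": slots_per_node, "with": [core], "label": "task"}
    node = {"type": "node", "count": nodes, "with": [slot]}
    # The task refers to the slot by its label and runs once per slot.
    task = {"command": list(command), "slot": "task", "count": {"per_slot": 1}}
    return {
        "resources": [node],
        "tasks": [task],
        "attributes": {"system": {"duration": 0, "cwd": cwd}},
        "version": 1,
    }


jobspec = make_jobspec(
    ["./compute.py", "120"],
    cwd="/home/jovyan/flux-tutorial/flux-workflow-examples/job-submit-api/",
)
print(json.dumps(jobspec, indent=2))
```

The point to notice is the link between the two halves: the `label` on the slot in `resources` is the name that the entry in `tasks` uses in its `slot` field.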


### `flux.job.JobspecV1` to create job specifications

Flux describes work using a standard format called the [Jobspec](https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_25.html). While you could write the YAML or JSON by hand, it is much easier to use the provided Python functions, which take high-level metadata (command, resources, etc.) and generate the jobspec for you. We can then replicate our previous example of submitting multiple heterogeneous jobs using these Python helpers and check that Flux co-schedules them.

In [40]:
# Here we create our job specification from a command
compute_jobreq = JobspecV1.from_command(
    command=["./compute.py", "120"], num_tasks=4, num_nodes=2, cores_per_task=2
)

# This is the "current working directory" (cwd)
compute_jobreq.cwd = os.path.expanduser("~/flux-workflow-examples/job-submit-api/")
print(JobID(flux.job.submit(f, compute_jobreq)))

# Here is a second I/O job
io_jobreq = JobspecV1.from_command(
    command=["./io-forwarding.py", "120"], num_tasks=1, num_nodes=1, cores_per_task=1
)
io_jobreq.cwd = os.path.expanduser("~/flux-workflow-examples/job-submit-api/")
print(JobID(flux.job.submit(f, io_jobreq)))

ƒKf7Aiz4P
ƒKf7rkefh


In [41]:
!flux jobs -a | grep compute

   ƒKf7Aiz4P jovyan   compute.py  R      4      2   2.727s 7db0bdd6f[967,967]
   ƒKNequHnF jovyan   compute.py  F      4      2   0.012s 7db0bdd6f[967,967]
   ƒKLZWG53M jovyan   compute.py  F      1      1   0.037s 7db0bdd6f967
   ƒKKdeYAGo jovyan   compute.py  F      1      1   0.012s 7db0bdd6f967


### `FluxExecutor` for bulk submission

We can use the `FluxExecutor` class to submit large numbers of jobs to Flux. This method uses Python's `concurrent.futures` interface. Here is an example snippet from `~/flux-workflow-examples/async-bulk-job-submit/bulksubmit_executor.py`:

```python
with FluxExecutor() as executor:
    compute_jobspec = JobspecV1.from_command(args.command)
    futures = [executor.submit(compute_jobspec) for _ in range(args.njobs)]
    # wait for the jobid for each job, as a proxy for the job being submitted
    for fut in futures:
        fut.jobid()
    # all jobs submitted - print timings
```

In [29]:
# Submit a FluxExecutor based script.
%run ./flux-workflow-examples/async-bulk-job-submit/bulksubmit_executor.py -n200 /bin/sleep 0

bulksubmit_executor: submitted 200 jobs in 0.24s. 831.05job/s
bulksubmit_executor: First job finished in about 0.254s
|██████████████████████████████████████████████████████████| 100.0% (278.2 job/s)
bulksubmit_executor: Ran 200 jobs in 0.9s. 221.8 job/s
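
If the `concurrent.futures` pattern is new to you, the following sketch shows the same submit-then-collect shape using only the standard library. It is a stand-in, not Flux code: `ThreadPoolExecutor` replaces `FluxExecutor`, and `fake_job` is a hypothetical function that takes the place of a real job.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_job(index):
    """Hypothetical stand-in for a job: simply return its own index."""
    return index


njobs = 200
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # submit() returns immediately with a future for each piece of work
    futures = [executor.submit(fake_job, i) for i in range(njobs)]
    # result() blocks until that particular future has completed
    results = [fut.result() for fut in futures]
elapsed = time.perf_counter() - start

print(f"ran {njobs} stand-in jobs in {elapsed:.3f}s ({njobs / elapsed:.1f} job/s)")
```

The design idea carries over directly: submission is decoupled from completion, so all 200 jobs can be queued before any result is waited on.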


# Deeper Dive into Flux Internals 🧐️

## flux queue

Flux has a command for controlling the queue within the `job-manager`: `flux queue`.  This includes disabling job submission, re-enabling it, waiting for the queue to become idle or empty, and checking the queue status:

In [46]:
!flux queue disable "maintenance outage"
!flux queue enable
!flux queue -h

Job submission is disabled: maintenance outage
Job submission is enabled
usage: flux-queue [-h] {status,list,enable,disable,start,stop,drain,idle} ...

optional arguments:
  -h, --help            show this help message and exit

subcommands:

  {status,list,enable,disable,start,stop,drain,idle}


## flux getattr

> Get attributes about your system and environment

Each Flux instance has a set of attributes, established at startup, that affect how Flux operates, such as `rank`, `size`, and `local-uri` (the Unix socket usable for communicating with Flux). Many of these attributes can be modified at runtime, such as `log-stderr-level` (1 logs only critical messages to stderr while 7 logs everything, including debug messages). Here is an example set that you might be interested in looking at:

In [47]:
!flux getattr rank
!flux getattr size
!flux getattr local-uri
!flux setattr log-stderr-level 3
!flux lsattr -v

0
4
local:///tmp/flux-rWMT6G/local-0
broker.boot-method                      simple
broker.critical-ranks                   0-1
broker.mapping                          [[0,1,4,1]]
broker.pid                              8
broker.quorum                           4
broker.quorum-timeout                   1m
broker.rc1_path                         /etc/flux/rc1
broker.rc3_path                         /etc/flux/rc3
broker.starttime                        1712894811.07
conf.shell_initrc                       /etc/flux/shell/initrc.lua
conf.shell_pluginpath                   /usr/lib/flux/shell/plugins
config.path                             -
content.backing-module                  content-sqlite
content.hash                            sha1
hostlist                                993a4f[746854,746854,746854,746854]
instance-level                          0
jobid                                   -
local-uri                               local:///tmp/flux-rWMT6G/local-0
log-critical-level   
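
If you want to work with these attributes in a script, one simple approach is to parse the `name value` lines into a dictionary. The sketch below does that for a few lines copied from the output above; `parse_lsattr` is a hypothetical helper, and the parsing assumes each line holds a name, some whitespace, and then the value.

```python
SAMPLE = """\
broker.boot-method                      simple
broker.critical-ranks                   0-1
broker.quorum                           4
broker.quorum-timeout                   1m
config.path                             -
instance-level                          0
"""


def parse_lsattr(text):
    """Turn 'name   value' lines into a {name: value} dictionary of strings."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            name, value = parts
            attrs[name] = value.strip()
    return attrs


attrs = parse_lsattr(SAMPLE)
print(attrs["broker.quorum"])
print(attrs["instance-level"])
```

Values stay as strings here because attributes mix numbers, durations such as `1m`, and placeholders such as `-`; convert individual fields only where you know their type.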

## flux module

Services within a Flux instance are implemented by modules. To query and manage broker modules, use `flux module`. Modules that we have already interacted with directly in this tutorial include `resource` (via `flux resource`), `job-ingest` (via `flux` and the Python API), `job-list` (via `flux jobs`), and `job-manager` (via `flux queue`), and we will interact with the `kvs` module in a few cells. For the most part, services are implemented by modules of the same name (e.g., `kvs` implements the `kvs` service and thus the `kvs.lookup` RPC). In some circumstances, where multiple implementations of a service exist, a module with a different name implements the service (e.g., in this instance, `sched-fluxion-qmanager` provides the `sched` service and thus `sched.alloc`, but in another instance `sched-simple` might provide the `sched` service).

In [48]:
!flux module list

Module                   Idle  S Service
job-info                    4  R 
sched-fluxion-resource      4  R 
heartbeat                   1  R 
job-manager                 1  R 
connector-local             0  R 
content-sqlite              2  R content-backing
kvs                         2  R 
resource                    2  R 
kvs-watch                   4  R 
job-exec                    4  R 
barrier                  idle  R 
job-list                    4  R 
content                     2  R 
sched-fluxion-qmanager      4  R sched
cron                     idle  R 
job-ingest                  4  R 
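
The mapping between services and modules described above can be read straight off this table. As a rough sketch, the code below parses a few rows copied from the output and builds a service-to-module lookup. `service_providers` is a hypothetical helper; it assumes that a module with an empty `Service` column provides the service of its own name, and that multiple services, if present, are comma-separated.

```python
MODULE_LIST = """\
Module                   Idle  S Service
job-manager                 1  R
content-sqlite              2  R content-backing
kvs                         2  R
sched-fluxion-qmanager      4  R sched
"""


def service_providers(text):
    """Map each service name to the module that provides it."""
    providers = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        module = fields[0]
        # Assumption: with no explicit service, the module name is the service.
        services = fields[3].split(",") if len(fields) > 3 else [module]
        for service in services:
            providers[service] = module
    return providers


providers = service_providers(MODULE_LIST)
print(providers["sched"])
print(providers["kvs"])
```

In this instance the `sched` lookup returns `sched-fluxion-qmanager`, while `kvs` maps to the module of the same name.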


See the [Flux Management Notebook](02_flux_framework.ipynb) for a short tutorial on unloading and reloading the Fluxion (Flux scheduler) modules.

## flux dmesg

If you need some additional help debugging your Flux setup, you might be interested in `flux dmesg`, which is akin to the [Linux dmesg](https://man7.org/linux/man-pages/man1/dmesg.1.html) but delivers messages for Flux.

In [7]:
!flux dmesg

## flux exec

Flux provides a built-in mechanism for executing commands on nodes without requiring a job or resource allocation: `flux exec`. It is typically used by system administrators to run administrative commands and to load or unload modules across multiple ranks simultaneously.

In [52]:
!flux exec -r 2 flux getattr rank # only execute on rank 2

2


In [53]:
!flux exec flux getattr rank # execute on all ranks

0
1
3
2


# This concludes Chapter 1! 📗️

In this module, we covered:
1. Submitting jobs with Flux
2. The Flux Hierarchy
3. Flux Process and Job Utilities
4. Deeper Dive into Flux Internals

To continue with the tutorial, open [Chapter 2](./02_flux_framework.ipynb).