installing flux view an on demand volume is almost working! (#68)
* installing flux view an on demand volume is almost working!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Sep 27, 2023
1 parent 8fdddab commit f8dbe0e
Showing 14 changed files with 911 additions and 140 deletions.
5 changes: 5 additions & 0 deletions docs/_static/data/addons.json
@@ -43,5 +43,10 @@
"name": "volume-secret",
"description": "secret volume type",
"family": "volume"
},
{
"name": "workload-flux",
"description": "hierarchical graph-based scheduler and resource manager",
"family": "workload"
}
]
57 changes: 57 additions & 0 deletions docs/getting_started/addons.md
@@ -157,6 +157,63 @@ spec:

**Note that we have support for a custom application container, but haven't written any good examples yet!**

## Workload

### workload-flux

If you need to "throw" Flux Framework into your container to use as a scheduler, you can do that with an addon!

> Yes, it's astounding. 🦩️

This works by way of the same trick that we use for other addons that have a complex (and/or large) install setup:

- We build the software into an isolated spack "copy" view
  - The software is then (generally) at some `/opt/view` and `/opt/software`
- The flux container is added as a sidecar container to your pod for your replicated job
  - Additional setup / configuration is done here
- We then create an empty volume that is shared by your metric or scaled application
- The entire tree is copied over into the empty volume
- When the copy is done, indicated by the final touch of a file, the updated container entrypoint is run
  - This typically means we have taken your metric command and wrapped it in a Flux submit
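
The copy-then-signal handshake in the steps above can be sketched in Go. This is a minimal illustration and not the operator's code: the marker file name `copy-done.txt`, the helpers `waitForFile` and `wrapInFluxSubmit`, and the polling interval are all assumptions made for the example. Only `flux submit` and its `-n` task flag come from the Flux command line.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// waitForFile polls until path exists or the timeout elapses.
// (Hypothetical helper; the real entrypoint logic may differ.)
func waitForFile(path string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for %s", path)
		}
		time.Sleep(interval)
	}
}

// wrapInFluxSubmit prefixes a command with "flux submit".
// The -n flag is only added when tasks is set, matching the "tasks" option.
func wrapInFluxSubmit(command, tasks string) string {
	parts := []string{"flux", "submit"}
	if tasks != "" {
		parts = append(parts, "-n", tasks)
	}
	parts = append(parts, command)
	return strings.Join(parts, " ")
}

func main() {
	// Temporary directory standing in for the shared empty volume
	shared, err := os.MkdirTemp("", "flux-view")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(shared)
	marker := filepath.Join(shared, "copy-done.txt") // assumed marker name

	// Sidecar side: after the copy finishes, touch the marker file
	go func() {
		time.Sleep(50 * time.Millisecond)
		if f, err := os.Create(marker); err == nil {
			f.Close()
		}
	}()

	// Application side: block until the marker appears, then build the wrapped command
	if err := waitForFile(marker, 10*time.Millisecond, 5*time.Second); err != nil {
		panic(err)
	}
	fmt.Println(wrapInFluxSubmit("lmp -in in.reaxc.hns", "4"))
}
```

Polling for a file keeps the two containers decoupled: the application container only needs read access to the shared volume, not a network connection to the sidecar.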

It's really cool because it means you can run a metric / application with Flux without needing
to install it into your container to begin with. The one important detail is that the general
operating system of the view must match that of your container. The current view uses Rocky Linux;
however, the image is customizable (and we can provide other bases if/when requested). Here are
the arguments you can customize under the addon -> options.

| Name | Description | Type | Default |
|-----|-------------|------------|------|
| mount | Path to mount flux view in application container | string | /opt/share |
| tasks | Number of tasks `-n` to give to flux (not provided if not set) | string | unset |
| image | Customize the container image | string | `ghcr.io/rse-ops/spack-flux-rocky-view:tag-8` |
| fluxUser | The flux user (currently not used, but TBA) | string | flux |
| fluxUid | The flux user ID (currently not used, but TBA) | string | 1004 |
| interactive | Run flux in interactive mode | string | "false" |
| connectTimeout | How long ZeroMQ should wait to retry | string | "5s" |
| quorum | The number of brokers to require before starting the cluster | string | (total brokers or pods) |
| debugZeroMQ | Turn on ZeroMQ debugging | string | "false" |
| logLevel | Customize the flux log level | string | "6" |
| queuePolicy | Queue policy for flux to use | string | fcfs |
| workerLetter | The letter that the worker job is expected to have | string | w |
| launcherLetter | The letter that the launcher job is expected to have | string | w |
| workerIndex | The index of the replicated job for the worker | string | 0 |
| launcherIndex | The index of the replicated job for the launcher | string | 0 |
| preCommand | Pre-command logic to run in launcher/workers before flux is started (after setup in flux container) | string | unset |

Note that the number of pods for flux, the namespace, and the service name all default to
the values in your MetricSet.
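
To make the defaulting rules concrete, here is a small Go sketch of how user options could be overlaid on the defaults from the table, with `quorum` falling back to the pod count of the MetricSet. The `metricSet` struct, `fluxDefaults`, and `resolveOptions` are hypothetical names for illustration, and plain strings stand in for the `intstr.IntOrString` values the operator uses.

```go
package main

import (
	"fmt"
	"strconv"
)

// metricSet is a hypothetical stand-in for the MetricSet fields used here.
type metricSet struct {
	Pods int
}

// fluxDefaults returns a subset of the documented defaults.
func fluxDefaults(set metricSet) map[string]string {
	return map[string]string{
		"mount":          "/opt/share",
		"interactive":    "false",
		"connectTimeout": "5s",
		"logLevel":       "6",
		"queuePolicy":    "fcfs",
		// quorum defaults to the total number of brokers (pods)
		"quorum": strconv.Itoa(set.Pods),
	}
}

// resolveOptions overlays user-provided options on top of the defaults.
func resolveOptions(set metricSet, user map[string]string) map[string]string {
	resolved := fluxDefaults(set)
	for key, value := range user {
		resolved[key] = value
	}
	return resolved
}

func main() {
	opts := resolveOptions(metricSet{Pods: 4}, map[string]string{
		"preCommand": ". /opt/intel/mpi/latest/env/vars.sh",
	})
	fmt.Println(opts["quorum"], opts["mount"], opts["preCommand"])
}
```

Options that are "unset" by default, such as `tasks` and `preCommand`, are simply absent from the map until the user provides them.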

**Important**: the flux addon is currently supported for metric types that:

1. Have the launcher / worker design (so the hostlist.txt is present in the PWD)
2. Have scp installed, as the shared certificate needs to be copied from the lead broker to all followers
3. Ideally have munge installed (we do try to install it, but it is better if it is already there)

We also currently run flux as root. This is considered bad practice, but probably OK
for this early development work. We don't see a need to have shared namespace / operator
environments at this point, which is why a non-root flux user has not been added yet.

## Performance

### perf-hpctoolkit
32 changes: 32 additions & 0 deletions examples/addons/flux-lammps/metrics.yaml
@@ -0,0 +1,32 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 4
  logging:
    interactive: true

  metrics:

    # Running more scaled lammps is our main goal
    - name: app-lammps

      # This flux addon is built on rocky, and we can provide additional os bases
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky

      options:
        command: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS

      # Add on flux, which will mount a volume and wrap lammps
      addons:
        - name: workload-flux
          options:
            # Ensure intel environment is setup
            preCommand: . /opt/intel/mpi/latest/env/vars.sh
            workdir: /opt/lammps/examples/reaxff/HNS
9 changes: 5 additions & 4 deletions pkg/addons/addons.go
@@ -26,6 +26,7 @@ var (
 	AddonFamilyPerformance = "performance"
 	AddonFamilyVolume = "volume"
 	AddonFamilyApplication = "application"
+	AddonFamilyWorkload = "workload"
 )

 // A general metric is a container added to a JobSet
@@ -37,7 +38,7 @@ type Addon interface {
 	Description() string

 	// Options and exportable attributes
-	SetOptions(*api.MetricAddon)
+	SetOptions(*api.MetricAddon, *api.MetricSet)
 	Options() map[string]intstr.IntOrString
 	ListOptions() map[string][]intstr.IntOrString
 	MapOptions() map[string]map[string]intstr.IntOrString
@@ -65,7 +66,7 @@ type AddonBase struct {
 	mapOptions map[string]map[string]intstr.IntOrString
 }

-func (b *AddonBase) SetOptions(metric *api.MetricAddon) {}
+func (b *AddonBase) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {}
 func (b *AddonBase) CustomizeEntrypoints([]*specs.ContainerSpec, []*jobset.ReplicatedJob) {}

 func (b *AddonBase) Validate() bool {
@@ -97,7 +98,7 @@ func (b *AddonBase) MapOptions() map[string]map[string]intstr.IntOrString {
 }

 // GetAddon looks up and validates an addon
-func GetAddon(a *api.MetricAddon) (Addon, error) {
+func GetAddon(a *api.MetricAddon, set *api.MetricSet) (Addon, error) {

 	// We don't want to change the addon interface/struct itself
 	template, ok := Registry[a.Name]
@@ -111,7 +112,7 @@ func GetAddon(a *api.MetricAddon) (Addon, error) {
 	addon := reflect.New(templateType.Type()).Interface().(Addon)

 	// Set options before validation
-	addon.SetOptions(a)
+	addon.SetOptions(a, set)

 	// Validate the addon
 	if !addon.Validate() {
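
The hunks above look up a registered template and use `reflect.New` to build a fresh addon, so that setting options never mutates the registry entry, and they now pass the MetricSet alongside the addon spec. A self-contained sketch of that pattern follows. The lowercase types (`metricAddon`, `metricSet`, `addonBase`, `fluxAddon`, `volumeAddon`) and the validation rule are simplified, hypothetical stand-ins and not the operator's real definitions.

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified, hypothetical stand-ins for api.MetricAddon and api.MetricSet.
type metricAddon struct{ Name string }
type metricSet struct{ Pods int }

// Addon mirrors only the two interface methods relevant here.
type Addon interface {
	SetOptions(*metricAddon, *metricSet)
	Validate() bool
}

// addonBase supplies no-op defaults that concrete addons can override.
type addonBase struct{}

func (b *addonBase) SetOptions(*metricAddon, *metricSet) {}
func (b *addonBase) Validate() bool                      { return true }

// volumeAddon relies entirely on the defaults from addonBase.
type volumeAddon struct{ addonBase }

// fluxAddon reads the pod count from the set, which is why the set is passed in.
type fluxAddon struct {
	addonBase
	pods int
}

func (f *fluxAddon) SetOptions(a *metricAddon, set *metricSet) { f.pods = set.Pods }
func (f *fluxAddon) Validate() bool                            { return f.pods > 0 }

// registry holds template instances that are never modified.
var registry = map[string]Addon{
	"workload-flux": &fluxAddon{},
	"volume-secret": &volumeAddon{},
}

func getAddon(a *metricAddon, set *metricSet) (Addon, error) {
	template, ok := registry[a.Name]
	if !ok {
		return nil, fmt.Errorf("%s is not a registered addon", a.Name)
	}
	// Dereference the template pointer to get the struct type, then allocate a new one
	templateType := reflect.Indirect(reflect.ValueOf(template)).Type()
	addon := reflect.New(templateType).Interface().(Addon)

	// Set options before validation
	addon.SetOptions(a, set)
	if !addon.Validate() {
		return nil, fmt.Errorf("addon %s did not validate", a.Name)
	}
	return addon, nil
}

func main() {
	set := &metricSet{Pods: 4}
	first, err := getAddon(&metricAddon{Name: "workload-flux"}, set)
	if err != nil {
		panic(err)
	}
	second, _ := getAddon(&metricAddon{Name: "workload-flux"}, set)
	fmt.Println("distinct instances:", first != second)
}
```

Because `addonBase` carries the no-op `SetOptions`, changing the interface signature only requires touching the base and the addons that override it, which matches the shape of this commit.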
8 changes: 4 additions & 4 deletions pkg/addons/commands.go
@@ -42,9 +42,9 @@ func (a *PerfAddon) CustomizeEntrypoints(
 	}
 }

-func (a *PerfAddon) SetOptions(metric *api.MetricAddon) {
+func (a *PerfAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
 	a.Identifier = perfCommandsName
-	a.SetSharedCommandOptions(metric)
+	a.SetSharedCommandOptions(addon)
 }

 // addContainerCaps adds capabilities to a container spec
@@ -102,9 +102,9 @@ func (m CommandAddon) Family() string {
 	return AddonFamilyApplication
 }

-func (a *CommandAddon) SetOptions(metric *api.MetricAddon) {
+func (a *CommandAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
 	a.Identifier = commandsName
-	a.SetSharedCommandOptions(metric)
+	a.SetSharedCommandOptions(addon)
 }

 // Set custom options / attributes for the metric
4 changes: 2 additions & 2 deletions pkg/addons/containers.go
@@ -139,8 +139,8 @@ func (a *ApplicationAddon) setDefaultEntrypoint() {
 }

 // Calling the default allows a custom application that uses this to do the same
-func (a *ApplicationAddon) SetOptions(metric *api.MetricAddon) {
-	a.SetDefaultOptions(metric)
+func (a *ApplicationAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
+	a.SetDefaultOptions(addon)
 }

 // Underlying function that can be shared