2 changes: 1 addition & 1 deletion .env
@@ -11,7 +11,7 @@
PROJECT_NAME="airstack"
# If you've run ./airstack.sh setup, then this will auto-generate from the git commit hash every time a change is made
# to a Dockerfile or docker-compose.yaml file. Otherwise this can also be set explicitly to make a release version.
-VERSION="0.18.0-alpha.6"
+VERSION="0.18.0-alpha.7"
# Choose "dev" or "prebuilt". "dev" is for mounted code that must be built live. "prebuilt" is for built ros_ws baked into the image
DOCKER_IMAGE_BUILD_MODE="dev"
# Where to push and pull images from. Can replace with your docker hub username if using docker hub.
153 changes: 153 additions & 0 deletions .github/orchestrator/README.md
@@ -0,0 +1,153 @@
# AirStack CI Orchestrator

Long-running service that watches GitHub for queued workflow jobs and spawns one ephemeral OpenStack instance per job. The orchestrator VM is the only host that holds the GitHub PAT and the OpenStack application credential; each worker is destroyed after a single job.

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│  Orchestrator VM (airstack-ci-cd-orchestrator)              │
│                                                             │
│  airstack-orchestrator.service → orchestrator.py            │
│    spawn loop (every 15s):                                  │
│      • GET /repos/<repo>/actions/runs?status=queued         │
│      • POST /repos/<repo>/actions/runners/generate-jitconfig│
│      • openstack server create (image, flavor, user_data)   │
│      • record (job_id → server_id) in state.json            │
│    reap loop (every 30s):                                   │
│      • job completed → openstack server delete              │
│      • job age > N min → force delete (straggler)           │
│      • owned but not in state → orphan reap                 │
│                                                             │
│  /etc/airstack-orchestrator/                                │
│    config.yaml                                              │
│    github-pat                                               │
│  /home/orchestrator/.config/openstack/clouds.yaml           │
│  /var/lib/airstack-orchestrator/state.json                  │
└─────────┬─────────────────────────────────┬─────────────────┘
          │ Nova / Neutron API              │ GitHub REST API
          ▼                                 ▼
┌──────────────────────────────────┐ ┌──────────────────────┐
│  Ephemeral worker (per job)      │ │  GitHub Actions      │
│  Image: Ubuntu-24.04-GPU-Headless│ │  workflow_job queue  │
│  cloud-init:                     │ └──────────────────────┘
│    install docker + nv toolkit   │
│    download GH runner            │
│    run.sh --jitconfig <token>    │
│    shutdown -h +1                │
└──────────────────────────────────┘
```
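
The job-selection step of the spawn loop can be sketched as a small pure function. This is an illustration only, not the code in `orchestrator.py`: the function name `select_jobs_to_spawn` and the job shape (a dict with `id` and `labels`) are assumptions. The label rule (a job's labels must be a superset of the configured `runner_labels`) and the `max_concurrent` cap are taken from `config.example.yaml`.

```python
def select_jobs_to_spawn(queued_jobs, runner_labels, in_flight_job_ids, max_concurrent):
    """Return the queued jobs that should get a new server this iteration.

    Hypothetical helper: each job is assumed to be a dict with "id" and "labels".
    """
    required = set(runner_labels)
    # Never exceed the configured number of simultaneous instances.
    free_slots = max(0, max_concurrent - len(in_flight_job_ids))
    selected = []
    for job in queued_jobs:
        if len(selected) >= free_slots:
            break
        if job["id"] in in_flight_job_ids:
            continue  # a server was already spawned for this job
        # Spawn only when the job asks for every label we poll for.
        if required.issubset(job["labels"]):
            selected.append(job)
    return selected
```

Keeping the selection free of network calls makes the concurrency cap and the label match easy to unit-test separately from the GitHub and OpenStack clients.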

Key properties:

- **Truly ephemeral**: every job runs on a clean VM. No Docker layer cache pollution, no leftover networks, no carry-over from prior runs.
- **PAT isolation**: the GitHub PAT lives only on the orchestrator. Workers receive a single-use [JIT runner config](https://docs.github.com/en/rest/actions/self-hosted-runners?apiVersion=2022-11-28#create-configuration-for-a-just-in-time-runner-for-a-repository) — a base64 token bound to one runner registration, valid only for a short window.
- **Application-credential auth**: the orchestrator authenticates to OpenStack with an application credential (revocable, scoped, no password), not the user's `openrc.sh`.
- **Crash-safe reaping**: every server we spawn is tagged with `airstack-role=ephemeral-runner`. The reap loop force-deletes any owned server not present in `state.json`, so a crashed orchestrator can't leak instances.
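
To make the PAT-isolation property concrete, here is a hedged sketch of how the JIT config request could be built with the standard library. The helper name `build_jitconfig_request`, the runner name, and `runner_group_id=1` (assumed to be the default runner group) are illustrative assumptions; the endpoint and the `name`, `runner_group_id`, `labels`, and `work_folder` body fields follow the GitHub REST documentation linked above. Only the orchestrator ever sees `pat`; the worker receives just the `encoded_jit_config` value from the response.

```python
import json
import urllib.request


def build_jitconfig_request(repo, pat, runner_name, labels, runner_group_id=1):
    """Build (but do not send) the generate-jitconfig request.

    Hypothetical helper; runner_group_id=1 is assumed to be the default group.
    """
    url = f"https://api.github.com/repos/{repo}/actions/runners/generate-jitconfig"
    payload = {
        "name": runner_name,
        "runner_group_id": runner_group_id,
        "labels": labels,
        "work_folder": "_work",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {pat}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )
```

Sending the request with `urllib.request.urlopen(req)` and reading `encoded_jit_config` from the JSON response would yield the value rendered into the cloud-init template.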

## One-time setup

### 1. Create OpenStack application credential

On your local workstation (not the orchestrator VM):

```bash
source ~/.airlabcloud/openrc.sh
openstack application credential create airstack-orchestrator \
--description "AirStack CI orchestrator — spawns ephemeral test runners"
```

The output prints `id` and `secret`. Build a `clouds.yaml`:

```yaml
clouds:
  airstack:
    auth_type: v3applicationcredential
    auth:
      auth_url: https://airlab-cloud.andrew.cmu.edu:5000/v3/
      application_credential_id: <id from above>
      application_credential_secret: <secret from above>
    region_name: Airlab
    interface: public
    identity_api_version: 3

### 2. Stage credentials on the orchestrator VM

```bash
# clouds.yaml: install for the orchestrator user (created in step 3)
scp clouds.yaml ubuntu@<orchestrator-ip>:/tmp/clouds.yaml

# GitHub PAT: needs `Actions: read/write` and `Administration: read/write`
# (fine-grained) or classic `repo` scope.
scp ~/.airlabcloud/airstack-github-pat.txt \
    ubuntu@<orchestrator-ip>:/tmp/github-pat
```

### 3. Run setup.sh

On the orchestrator VM:

```bash
git clone https://github.com/castacks/AirStack.git /tmp/airstack
sudo bash /tmp/airstack/.github/orchestrator/setup.sh
```

`setup.sh` creates the `orchestrator` system user, builds the Python venv, copies `orchestrator.py` and `cloud-init.yaml.j2` into `/opt/airstack-orchestrator/`, scaffolds `/etc/airstack-orchestrator/`, installs the systemd unit, and consumes `/tmp/github-pat`.

You still need to put the `clouds.yaml` in place under the orchestrator user's home:

```bash
sudo install -d -o orchestrator -g orchestrator -m 0700 \
    /home/orchestrator/.config/openstack
sudo install -o orchestrator -g orchestrator -m 0600 \
    /tmp/clouds.yaml /home/orchestrator/.config/openstack/clouds.yaml
sudo shred -u /tmp/clouds.yaml
```

### 4. Fill in `/etc/airstack-orchestrator/config.yaml`

Edit the placeholders the example ships with:

| Field | What goes here | How to find it |
|------|---------------|----------------|
| `flavor_name` | OpenStack flavor with GPU + enough disk | `openstack flavor list` |
| `network_name` | Network the workers attach to | `openstack network list` |
| `keypair_name` | SSH keypair for break-glass access | `openstack keypair list` |
| `security_group` | Outbound 443 must be allowed | `openstack security group list` |
| `availability_zone` | Optional AZ for the spawned instance; leave empty to let Nova pick | `openstack availability zone list` |
| `repo` | `owner/name` of the repo to poll | from GitHub URL |
| `runner_version` | Version tag from [actions/runner releases](https://github.com/actions/runner/releases) | check before each major upgrade |
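
Before starting the service, it can help to confirm that no placeholder was left empty. The check below is a sketch under assumptions: `REQUIRED_FIELDS` and `unfilled_fields` are hypothetical names, and the config is assumed to have already been parsed into a plain dict (for example by a YAML loader). `availability_zone` is left out on purpose because the table marks it optional.

```python
# Fields from the table above that must be non-empty; availability_zone is optional.
REQUIRED_FIELDS = (
    "flavor_name",
    "network_name",
    "keypair_name",
    "security_group",
    "repo",
    "runner_version",
)


def unfilled_fields(config):
    """Return the required keys that are missing or still blank."""
    return [
        key
        for key in REQUIRED_FIELDS
        if not str(config.get(key) or "").strip()
    ]
```

An empty list means every required placeholder has a value; it does not verify that the named flavor, network, keypair, or security group actually exists in OpenStack.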

### 5. Start the service

```bash
sudo systemctl enable --now airstack-orchestrator.service
journalctl -u airstack-orchestrator.service -f
```

You should see `orchestrator started: repo=... labels=... max_concurrent=N` and then periodic poll activity.

## End-to-end verification

```bash
# Trigger a fast build-only run.
gh workflow run integration-tests.yml -f marks=build_docker

# Within ~30s, a server should appear:
openstack server list --metadata airstack-role=ephemeral-runner

# Watch GitHub → Actions → Runners — the ephemeral runner should appear,
# pick up the job, then disappear.

# Within ~30s of job completion, the server should be gone:
openstack server list --metadata airstack-role=ephemeral-runner
```

## Operational notes

- **State file**: `/var/lib/airstack-orchestrator/state.json` is the in-flight job tracker. Wiping it triggers an orphan sweep on the next reap iteration — owned servers will be force-deleted. Don't wipe it while jobs are mid-flight unless that's what you want.
- **Stuck instance**: any server older than `max_job_minutes` (default 90) is force-deleted regardless of GitHub job status. Bump this if liveliness/autonomy runs grow longer than ~75 minutes.
- **PAT rotation**: `sudo install -o root -g orchestrator -m 0640 /tmp/new-pat /etc/airstack-orchestrator/github-pat && sudo systemctl restart airstack-orchestrator.service`.
- **Pause spawning** (e.g. for maintenance): `sudo systemctl stop airstack-orchestrator.service`. Already-spawned workers will still complete their jobs and self-shutdown; on restart, the reap loop deletes them.
- **Logs**: `journalctl -u airstack-orchestrator.service -f`. Cloud-init logs from individual workers are visible only via `openstack console log show <server>` while the worker is running.
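
The three reap rules described above (completed job, straggler, orphan) can be illustrated with a pure function. This is a sketch under assumptions, not the logic in `orchestrator.py`: the name `plan_reap` and the state layout (`job_id` mapped to a dict with `server_id` and `spawned_at` in seconds) are hypothetical, since the real `state.json` schema is not shown here.

```python
def plan_reap(owned_server_ids, state, completed_job_ids, now_s, max_job_minutes=90):
    """Return {server_id: reason} for servers the reap loop should delete.

    Hypothetical shapes: `state` maps job_id -> {"server_id": str, "spawned_at": float}.
    """
    to_delete = {}
    tracked = set()
    for job_id, entry in state.items():
        server_id = entry["server_id"]
        tracked.add(server_id)
        age_min = (now_s - entry["spawned_at"]) / 60
        if job_id in completed_job_ids:
            to_delete[server_id] = "completed"
        elif age_min > max_job_minutes:
            to_delete[server_id] = "straggler"  # past the hard ceiling
    # Anything we own that the state file does not mention is an orphan.
    for server_id in owned_server_ids:
        if server_id not in tracked:
            to_delete[server_id] = "orphan"
    return to_delete
```

Calling it with an empty `state` marks every owned server as an orphan, which is why wiping the state file while jobs are running deletes their servers.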
40 changes: 40 additions & 0 deletions .github/orchestrator/airstack-orchestrator.service
@@ -0,0 +1,40 @@
[Unit]
Description=AirStack CI Orchestrator (spawns ephemeral OpenStack runners)
Documentation=https://github.com/castacks/AirStack/tree/main/.github/orchestrator
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=orchestrator
Group=orchestrator
WorkingDirectory=/opt/airstack-orchestrator

# Application credential lives in the orchestrator user's home so openstacksdk
# finds it via the default cloud-config search path.
Environment=HOME=/home/orchestrator
Environment=OS_CLIENT_CONFIG_FILE=/home/orchestrator/.config/openstack/clouds.yaml

ExecStart=/opt/airstack-orchestrator/venv/bin/python \
    /opt/airstack-orchestrator/orchestrator.py \
    --config /etc/airstack-orchestrator/config.yaml \
    --pat /etc/airstack-orchestrator/github-pat \
    --state /var/lib/airstack-orchestrator/state.json \
    --template /opt/airstack-orchestrator/cloud-init.yaml.j2

Restart=always
RestartSec=10

# Allow draining loops on stop (SIGTERM handled by orchestrator.py).
TimeoutStopSec=30
KillSignal=SIGTERM

# Hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/lib/airstack-orchestrator
PrivateTmp=true

[Install]
WantedBy=multi-user.target
71 changes: 71 additions & 0 deletions .github/orchestrator/cloud-init.yaml.j2
@@ -0,0 +1,71 @@
#cloud-config
# Rendered per-spawn by orchestrator.py with two Jinja variables:
# encoded_jit_config - single-use base64 JIT config from GitHub
# runner_version - GitHub Actions runner version (e.g. 2.319.1)
#
# The base image (Ubuntu-24.04-GPU-Headless) already has NVIDIA drivers.
# This cloud-init adds Docker (with the compose plugin), nvidia-container-toolkit,
# downloads the GitHub Actions runner, registers it with the JIT config, runs
# exactly one job (a JIT-configured runner is ephemeral and exits after one
# job), and shuts the VM down. The orchestrator then deletes the server.

package_update: true
package_upgrade: false
packages:
  - jq
  - curl
  - ca-certificates
  - gnupg

write_files:
  - path: /usr/local/bin/airstack-runner-bootstrap.sh
    permissions: "0755"
    owner: root:root
    content: |
      #!/usr/bin/env bash
      set -euxo pipefail

      # Install Docker (with compose plugin) from Docker's official channel.
      # get.docker.com handles apt repo setup + nvidia-container-toolkit-compatible
      # docker-ce, plus the docker-compose-plugin we need for `airstack up`.
      curl -fsSL https://get.docker.com | sh

      # nvidia-container-toolkit is required for GPU containers (liveliness /
      # autonomy tests). The base image has the NVIDIA *drivers* but we still
      # need the container runtime hooks here.
      distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
      curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
        | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
      curl -fsSL "https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list" \
        | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
        > /etc/apt/sources.list.d/nvidia-container-toolkit.list
      apt-get update
      apt-get install -y nvidia-container-toolkit
      nvidia-ctk runtime configure --runtime=docker
      systemctl restart docker

      usermod -aG docker ubuntu

      # GitHub Actions runner.
      RUNNER_VERSION="{{ runner_version }}"
      RUNNER_DIR=/home/ubuntu/actions-runner
      mkdir -p "$RUNNER_DIR"
      cd "$RUNNER_DIR"
      curl -fsSL -o runner.tar.gz \
        "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz"
      tar xzf runner.tar.gz
      rm runner.tar.gz
      chown -R ubuntu:ubuntu "$RUNNER_DIR"

      # Run exactly one job under the ubuntu user. The JIT config is single-use
      # and ephemeral, so run.sh exits after one job completes.
      sudo -u ubuntu --preserve-env=HOME -H bash -c \
        "cd '$RUNNER_DIR' && ./run.sh --jitconfig '{{ encoded_jit_config }}'" \
        || echo "runner exited non-zero (job failure or runner error)"

      # Backstop: power down. The orchestrator's reap loop is the authoritative
      # deleter: it sees the GitHub job complete and calls Nova delete.
      shutdown -h +1

runcmd:
  - /usr/local/bin/airstack-runner-bootstrap.sh
60 changes: 60 additions & 0 deletions .github/orchestrator/config.example.yaml
@@ -0,0 +1,60 @@
# AirStack CI orchestrator configuration.
# Copy to /etc/airstack-orchestrator/config.yaml and fill in placeholders.

# --- OpenStack target ---

# Cloud profile name in ~/.config/openstack/clouds.yaml.
openstack_cloud: airstack

# Ubuntu-24.04-GPU-Headless (confirmed available on airlab-cloud).
image_id: a891a6fe-5e4f-4b84-a6c9-482848c8f972

# OpenStack flavor with GPU + enough disk for Docker + sim images.
# Look up with: openstack flavor list
flavor_name: ""

# OpenStack network the ephemeral instance attaches to. Must allow outbound
# 443 to api.github.com (no inbound is required: the runner makes an outbound
# long-poll connection to GitHub).
network_name: ""

# OpenStack keypair injected into the instance for break-glass SSH access.
# The orchestrator never SSHes into workers itself.
keypair_name: ""

# Security group applied to spawned instances. Outbound 443 must be allowed.
security_group: ""

# OpenStack availability zone to spawn instances in (e.g. nova, gpu-zone-1).
# Leave empty to let Nova pick.
availability_zone: ""

# --- GitHub ---

# owner/name of the repo whose queued workflow_jobs to pick up.
repo: "castacks/AirStack"

# Labels the orchestrator polls for. A queued workflow_job whose `labels`
# array is a superset of this list gets a server spawned for it.
runner_labels:
  - self-hosted
  - airstack-ephemeral

# GitHub Actions runner version (must exist as a release tag at
# https://github.com/actions/runner/releases).
runner_version: "2.319.1"

# --- Limits ---

# Maximum simultaneous in-flight ephemeral instances.
max_concurrent: 3

# Hard ceiling for a single job. Past this age the reaper force-deletes the
# server even if GitHub still reports the job as in-progress. Must comfortably
# exceed the longest expected job (autonomy/liveliness runs).
max_job_minutes: 90

# --- Polling intervals (seconds) ---

spawn_poll_interval_s: 15
reap_poll_interval_s: 30