48 changes: 31 additions & 17 deletions apps/README.md
@@ -56,27 +56,41 @@ Add `env` to inject config:
}
```

Add `expose` to ask DD to route a public hostname to a workload's port.
Two shapes:

**Per-agent (auto)** — URL is derived from the agent's UUID; good for
anything that's naturally per-VM:

```json
{ "expose": { "hostname_label": "my-label", "port": 8081 } }
```

Becomes `<agent-hostname-base>-my-label.devopsdefender.com` (one level
deep, so Universal SSL covers it).
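
For example, on a hypothetical agent whose hostname base is `a1b2c3d4`
(the real base is derived from the agent's UUID):

```sh
# { "expose": { "hostname_label": "gpu", "port": 8081 } } on that
# agent surfaces the workload at:
#   https://a1b2c3d4-gpu.devopsdefender.com
```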

**Vanity claim** — a stable short URL directly under the zone apex.
First agent to register the claim wins; DNS uniqueness arbitrates. If
another agent tries to deploy the same spec, the CP returns 409.

```json
{ "expose": { "claim_hostname": "nvidia-smi", "port": 8081 } }
```

Becomes `nvidia-smi.devopsdefender.com`. When the owning agent dies,
the CP's collector releases the claim so the next deploy can take it.
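
A minimal sketch of the collision, assuming a CP reachable at `$CP_URL`
(the URL and VM name are placeholders; `vm_name` and `extra_ingress` are
the fields dd-agent actually sends on `/register`):

```sh
# A second agent boots with the same claim in its workload specs; its
# dd-agent forwards @nvidia-smi:8081 on /register, and the CP answers
# 409 because DNS uniqueness arbitrates.
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST "$CP_URL/register" \
  -H 'Content-Type: application/json' \
  -d '{"vm_name": "agent-b",
       "extra_ingress": [{"claim_hostname": "nvidia-smi", "port": 8081}]}'
# prints: 409
```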

At agent boot, `apps/_infra/local-agents.sh` collects every `expose`
entry into `DD_EXTRA_INGRESS`. Claims are marked with a `@` prefix
(`@nvidia-smi:8081`) to distinguish them from auto-labels. dd-agent
parses the env var, splits it into the two variants, and forwards both
on `/register`. The CP prepends them to the agent's cloudflared
tunnel ingress and provisions the CNAMEs + CF Access apps. Easyenclave
itself ignores the field; it's a DD-level hint about tunnel routing.
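
Concretely, an agent booting one auto label and one claim carries
something like this (values illustrative):

```sh
DD_EXTRA_INGRESS="gpu:8081,@nvidia-smi:8081"
# dd-agent splits on ',' and sorts entries by the '@' prefix: claims
# into its claims list, the rest into its auto-label extras.
```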

Per-workload ingress is boot-time **and** runtime — any workload
POSTed via `/deploy` with an `expose` block also gets added to the
agent's tunnel via `/ingress/replace`.
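
A sketch of such a runtime deploy. The `/deploy` endpoint and its
GitHub Actions OIDC bearer auth are real; the agent URL, token, and
spec below are placeholders:

```sh
curl -X POST "$AGENT_URL/deploy" \
  -H "Authorization: Bearer $OIDC_JWT" \
  -H 'Content-Type: application/json' \
  -d '{
        "app_name": "hello-web",
        "expose": { "hostname_label": "web", "port": 9000 },
        "cmd": ["/bin/busybox", "httpd", "-f", "-p", "9000"]
      }'
# The agent deploys the workload, upserts web:9000 into its live
# extras, and POSTs the full list to the CP via /ingress/replace.
```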

## Templates

28 changes: 21 additions & 7 deletions apps/_infra/local-agents.sh
@@ -78,14 +78,28 @@ bake() {
}

# Extract `expose` entries from a stream of baked workloads and emit
# them as a comma-separated string — the shape dd-agent expects in
# $DD_EXTRA_INGRESS. Two variants per entry:
#
# label:port — auto per-agent (e.g. `web:9000` routes
# `<agent>-web.<domain>` to localhost:9000)
# @claim:port — vanity zone-apex claim (e.g. `@nvidia-smi:8081`
# routes `nvidia-smi.<domain>` to localhost:8081
# on the first agent to register it)
#
# Using plain text (not JSON) avoids quote-escaping when the value
# gets substituted into the dd-agent workload template's
# `"DD_EXTRA_INGRESS=${DD_EXTRA_INGRESS}"` env entry: embedded `"`
# would close the outer JSON string early and produce invalid JSON
# (jq: "Invalid numeric literal").
extract_extra_ingress() {
jq -rs 'map(
select(.expose)
| if .expose.claim_hostname
then "@\(.expose.claim_hostname):\(.expose.port)"
else "\(.expose.hostname_label):\(.expose.port)"
end
) | join(",")'
}

[ -r "$BASE" ] || { echo "missing $BASE" >&2; exit 1; }
2 changes: 1 addition & 1 deletion apps/web-nvidia-smi/workload.json
@@ -1,6 +1,6 @@
{
"app_name": "web-nvidia-smi",
"expose": { "hostname_label": "gpu", "port": 8081 },
"expose": { "claim_hostname": "nvidia-smi", "port": 8081 },
"cmd": [
"/bin/busybox", "sh", "-c",
"until [ -x /var/lib/easyenclave/bin/podman ]; do sleep 2; done\nexec /var/lib/easyenclave/bin/podman run --rm --name web-nvidia-smi --network=host --device=/dev/nvidia0 --device=/dev/nvidiactl --device=/dev/nvidia-uvm docker.io/nvidia/cuda:12.6.1-base-ubuntu22.04 sh -c 'set -e; apt-get update -qq && apt-get install -y -qq --no-install-recommends netcat-openbsd >/dev/null; while true; do (printf \"HTTP/1.0 200 OK\\r\\nContent-Type: text/plain\\r\\n\\r\\n\"; nvidia-smi) | nc -l -p 8081 -q 1; done'"
126 changes: 98 additions & 28 deletions src/agent.rs
@@ -60,6 +60,12 @@ struct St {
/// agent forwards the full list on every /ingress/replace call
/// so the CP's PUT is a straight replacement.
extras: Arc<RwLock<Vec<(String, u16)>>>,
/// Live set of vanity zone-apex claims (`@name:port`). Same
/// lifecycle as `extras` — seeded from boot, appended by runtime
/// deploys. The CP rejects /ingress/replace with 409 if a claim
/// is already owned by another agent, so the local list can hold
/// unconfirmed claims momentarily until the next replace reconciles.
claims: Arc<RwLock<Vec<(String, u16)>>>,
/// Verifier for GitHub Actions OIDC JWTs — the auth on /deploy
/// and /exec. CI workflows in the DD_OWNER org can call them
/// without any shared secret; anyone else is denied at claim
@@ -120,6 +126,7 @@ pub async fn run() -> Result<()> {
started: Instant::now(),
ita_token,
extras: Arc::new(RwLock::new(cfg.extra_ingress.clone())),
claims: Arc::new(RwLock::new(cfg.claims.clone())),
gh,
};

@@ -152,10 +159,19 @@ struct Bootstrap {
async fn register(cfg: &Cfg, ita_token: &str) -> Result<Bootstrap> {
let http = reqwest::Client::new();
let url = format!("{}/register", cfg.cp_url.trim_end_matches('/'));
// Each entry carries EITHER `hostname_label` (auto per-agent) OR
// `claim_hostname` (vanity zone-apex claim). The CP rejects the
// whole register with 409 if any claim collides with another
// live agent's claim — DNS uniqueness is the lock.
let extra_ingress: Vec<serde_json::Value> = cfg
.extra_ingress
.iter()
.map(|(label, port)| serde_json::json!({"hostname_label": label, "port": port}))
.chain(
cfg.claims
.iter()
.map(|(name, port)| serde_json::json!({"claim_hostname": name, "port": port})),
)
.collect();
let body = serde_json::json!({
"vm_name": cfg.common.vm_name,
@@ -233,12 +249,22 @@ async fn health(State(s): State<St>) -> Json<serde_json::Value> {
.unwrap_or_default();
let m = metrics::collect().await;
let ita_token = s.ita_token.read().await.clone();
// /health reports both auto-labeled extras and vanity claims
// under `extra_ingress` so the CP's collector can rebuild the
// per-agent state after a CP restart without a fresh /register.
let extra_ingress: Vec<serde_json::Value> = s
.extras
.read()
.await
.iter()
.map(|(label, port)| serde_json::json!({"hostname_label": label, "port": port}))
.chain(
s.claims
.read()
.await
.iter()
.map(|(name, port)| serde_json::json!({"claim_hostname": name, "port": port})),
)
.collect();

Json(serde_json::json!({
@@ -450,53 +476,93 @@ async fn deploy(

let response = s.ee.deploy(spec).await?;

if let Some(entry) = expose {
if let Err(e) = push_extra_ingress(&s, entry).await {
// Soft-fail: the workload is deployed, the owner just can't
// reach it from the public internet yet. Better than failing
// the whole /deploy and leaving the caller unsure whether
// the process is running.
eprintln!("agent: /ingress/replace failed (workload still running): {e}");
}
}

Ok(Json(response))
}

/// Parsed form of a workload's `expose:` block. Each workload may
/// declare at most one of these.
enum ExposeEntry {
Auto { label: String, port: u16 },
Claim { name: String, port: u16 },
}

/// Extract `expose.hostname_label`/`expose.claim_hostname` + `expose.port`
/// from a DeployRequest JSON body. Returns None if `expose` is missing
/// or malformed; the caller treats that as "no runtime ingress
/// requested" and moves on.
fn parse_expose(spec: &serde_json::Value) -> Option<ExposeEntry> {
let expose = spec.get("expose")?;
let port = expose.get("port")?.as_u64()?;
if port == 0 || port > u16::MAX as u64 {
return None;
}
let port = port as u16;
if let Some(name) = expose.get("claim_hostname").and_then(|v| v.as_str()) {
if name.is_empty() {
return None;
}
return Some(ExposeEntry::Claim {
name: name.to_string(),
port,
});
}
if let Some(label) = expose.get("hostname_label").and_then(|v| v.as_str()) {
if label.is_empty() {
return None;
}
return Some(ExposeEntry::Auto {
label: label.to_string(),
port,
});
}
None
}

/// Upsert a workload expose entry (auto-labeled or vanity) into the
/// live state and POST the full reconciled ingress to the CP's
/// `/ingress/replace` endpoint. The CP re-PUTs the tunnel config,
/// upserts CNAMEs, and provisions CF Access apps. Returns 409-like
/// errors when a claim collides with another agent.
async fn push_extra_ingress(s: &St, entry: ExposeEntry) -> Result<()> {
match entry {
ExposeEntry::Auto { label, port } => {
let mut guard = s.extras.write().await;
if let Some(existing) = guard.iter_mut().find(|(l, _)| *l == label) {
existing.1 = port;
} else {
guard.push((label, port));
}
}
ExposeEntry::Claim { name, port } => {
let mut guard = s.claims.write().await;
if let Some(existing) = guard.iter_mut().find(|(n, _)| *n == name) {
existing.1 = port;
} else {
guard.push((name, port));
}
}
}

let extras_snapshot = s.extras.read().await.clone();
let claims_snapshot = s.claims.read().await.clone();
let body_extras: Vec<serde_json::Value> = extras_snapshot
.iter()
.map(|(l, p)| serde_json::json!({"hostname_label": l, "port": p}))
.chain(
claims_snapshot
.iter()
.map(|(n, p)| serde_json::json!({"claim_hostname": n, "port": p})),
)
.collect();
let ita_token = s.ita_token.read().await.clone();
let body = serde_json::json!({
@@ -519,7 +585,11 @@ async fn push_extra_ingress(s: &St, label: String, port: u16) -> Result<()> {
"ingress/replace {url} → {status}: {text}"
)));
}
eprintln!("agent: ingress/replace ok ({} extras total)", extras.len());
eprintln!(
"agent: ingress/replace ok ({} auto + {} claims)",
extras_snapshot.len(),
claims_snapshot.len()
);
Ok(())
}
