DEVOP-571: add SECURITY-RUNBOOK.md (Shai-Hulud incident response) by srt0422 · Pull Request #3 · allora-network/.github

srt0422 · 2026-05-13T07:22:58Z

Summary

Org-wide incident response runbook for Shai-Hulud-class supply-chain compromise. Lives in the .github org repo so it surfaces on every repo's Security tab.

What landed

SECURITY-RUNBOOK.md at the repo root, covering all the required scenarios:

- Detection sources: Falco, IOC sweep, Dependabot, secret scanning, manual report; channel + owner per source.
- Triage decision tree in plain text (works on phone, in PDF, pasted into Slack).
- Scenario A: dev workstation compromise (disconnect → revoke → wipe → rebuild, with the credential-by-credential checklist).
- Scenario B: CI runner compromise (disable workflow, audit blast radius, rotate, audit recent publishes).
- Scenario C: compromised package we published (yank/deprecate, advisory, clean rebuild + downstream notify).
- Scenario D: cluster pod compromise (cordon, capture-before-delete forensics, rotate SA-scoped secrets, uncordon).
- Token rotation cadence: quarterly default, with "rotate immediately if" triggers.
- Tabletop exercise schedule: annual, Q1, with explicit format and skip-approval rule.
- Appendix: gh search incantations, cosign verify cookbook, sweep trigger, node drain.
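The quarterly rotation default lends itself to a mechanical check. The following is a minimal sketch and is not taken from the runbook: the function name `rotation_due`, the epoch-second inputs, and the reading of "quarterly" as 90 days are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of SECURITY-RUNBOOK.md).
# Usage: rotation_due CREATED_EPOCH NOW_EPOCH [MAX_AGE_DAYS]
# Succeeds (exit 0) when the credential is at least MAX_AGE_DAYS old.
# The 90-day default is an assumed reading of "quarterly".
rotation_due() {
    created="$1"
    now="$2"
    max_age_days="${3:-90}"
    age_days=$(( (now - created) / 86400 ))
    [ "$age_days" -ge "$max_age_days" ]
}

# Example: a token created 100 days before "now" is past the window.
now=$(date +%s)
created=$(( now - 100 * 86400 ))
if rotation_due "$created" "$now"; then
    echo "rotate: token is past the quarterly window"
fi
```

Age is only the default path. Any "rotate immediately if" trigger would bypass a check like this and force rotation regardless of age.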

Style is operational, not compliance-flavored: short imperative steps, explicit owner per step, each scenario sectioned as Stop the bleed / Audit blast radius / Restore service / Close-out so the on-call can skim during an actual incident.

Linear

https://linear.app/alloralabs/issue/DEVOP-571

Test plan

- Walk a DevOps engineer through Scenario A start-to-finish and check whether the credential-by-credential list misses anything they actually use.
- Confirm the runbook renders correctly on the org Security tab after merge.
- First annual tabletop (DEVOP-573) will be the real test; the runbook should then be updated based on what was slow or ambiguous.

🤖 Generated with Claude Code

Summary by cubic

Adds an org-wide incident response runbook for Shai-Hulud–class supply‑chain compromises. Lives in allora-network/.github as SECURITY-RUNBOOK.md so it shows on every repo’s Security tab and satisfies DEVOP-571.

New Features
- Detection sources with channel and owner.
- Plain‑text triage decision tree.
- Four scenarios with step‑by‑step actions: developer workstation, CI runner, compromised publish, and cluster pod (Stop the bleed → Audit → Restore → Close‑out), including deriving the ServiceAccount from saved pod YAML during Scenario D audits to avoid races after pod deletion.
- Token rotation cadence with immediate‑rotate triggers.
- Annual tabletop schedule and close‑out rules.
- Appendix with quick commands for gh, cosign, and kubectl.
Bug Fixes
- Scenario D SA grep fallback now uses POSIX [[:space:]] to work on BSD/macOS grep.

Written for commit 6574da0. Summary will update on new commits.

Org-wide incident response runbook for Shai-Hulud-class supply-chain compromise. Lives in the .github org repo so it surfaces on every repo's Security tab.

Covers all required scenarios from the acceptance criteria:

* Detection sources table (Falco, IOC sweep, Dependabot, secret scanning, manual report) with channel + owner per source.
* Triage decision tree: flowchart in plain text so it works on phone / in PDF / pasted into Slack.
* Scenario A: developer workstation compromise (disconnect, revoke, wipe, rebuild; explicit credential-by-credential checklist).
* Scenario B: CI runner compromise (disable workflow, audit blast radius, rotate every secret in scope, audit recent publishes, restore on a clean rebuild).
* Scenario C: compromised package published from our org (yank + deprecate, advisory, publish clean rebuild from clean environment, downstream notification, IOC list update).
* Scenario D: cluster pod compromise (cordon, forensic capture in order, delete, audit ServiceAccount scope, rotate, uncordon).
* Token rotation cadence: quarterly default, with per-credential rules and "rotate immediately if" triggers.
* Tabletop exercise schedule: annual, with explicit format and skip-approval rule.
* Appendix of useful commands (gh search, cosign verify, sweep trigger, node drain).

The runbook deliberately reads like an actual operational document, not a compliance artifact: short imperative sentences, explicit owner per step, no passive voice. Each scenario is sectioned as "Stop the bleed," "Audit blast radius," "Restore service," "Close-out" so an on-call can skim and find their position.

Refs: https://linear.app/alloralabs/issue/DEVOP-571

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai

cubic analysis

1 issue found across 1 file

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="SECURITY-RUNBOOK.md">

<violation number="1" location="SECURITY-RUNBOOK.md:349">
P2: Deriving the ServiceAccount from `$POD` after the pod is deleted can fail and break blast-radius auditing. Capture SA from the previously saved pod YAML (or before deletion) so audit steps remain executable.</violation>
</file>

Linked issue analysis

Linked issue: DEVOP-571: Write SECURITY-RUNBOOK.md in .github org repo

| Status | Acceptance criteria | Notes |
| --- | --- | --- |
| ✅ | Detection sources: where alerts fire (Falco → Slack, IOC sweep → GitHub Issue, Dependabot, secret scanning push protection) | SECURITY-RUNBOOK.md contains a '1. Detection sources' table listing Falco → #security-alerts, IOC sweep → GitHub issue workflow, Dependabot, and secret scanning push protection. |
| ✅ | Triage decision tree: confirm vs. false-positive; who pages whom | The runbook includes a Triage decision tree flowchart with branching for false positives and instructions on who to page (on-call, publisher). |
| ✅ | Dev machine suspected infected: disconnect, revoke tokens, wipe, rebuild | Scenario A lists immediate disconnect, token revocation steps (detailed per credential), evidence-preservation guidance, and wipe + reinstall + credential reissue. |
| ✅ | CI runner suspected infected: disable workflow, rotate every secret in the env, audit recent publishes/pushes, yank+republish if needed | Scenario B instructs disabling workflows or scaling the runner pool to 0, enumerates rotating every credential in scope, and auditing recent publishes (with escalation to Scenario C if suspect). |
| ✅ | Compromised package published: yank from npm/PyPI/Harbor, publish corrected version from clean environment, notify downstream | Scenario C details yank/deprecate/unpublish behavior per registry, steps to publish a corrected version from a clean environment, and notifying downstream consumers plus updating IOC lists. |
| ✅ | Cluster pod suspected compromised: cordon node, capture forensic data via Falco/audit logs, delete pod, rotate secrets the pod could read, post-mortem | Scenario D prescribes cordoning the node, forensic-capture commands (kubectl describe/logs/exec), deleting the pod or scaling to 0, listing and rotating secrets the pod could read, and a mandatory post-mortem. |
| ✅ | Token rotation cadence: quarterly for any long-lived credential not on OIDC | Section 7 contains a token rotation table specifying quarterly rotation for PATs/npm/PyPI/Harbor and notes about migrating to OIDC. |
| ✅ | Tabletop exercise schedule: annual (see DEVOP-573) | Section 8 defines an annual Q1 tabletop exercise, format, attendees, and skip rules referencing DEVOP-573. |
| ❌ | PR merged | Merge is an acceptance condition but cannot be satisfied by the patch diff itself; the PR adds the runbook but was not yet merged at the time of this analysis. |

Architecture diagram

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant Slack as Slack #security-alerts
    participant Falco as Falco (Cluster Runtime)
    participant IOC as Daily IOC Sweep
    participant Oncall as DevOps On-Call
    participant Package as Package Registry (npm/PyPI)
    participant Cluster as Kubernetes Cluster
    
    Note over Dev,Cluster: Detection Sources
    
    Falco->>Oncall: Alert: container behavior violation
    Falco->>Slack: Cross-post alert via Falcosidekick
    IOC->>GH: Auto-file issue in incident-response repo
    IOC->>Slack: Cross-post alert
    GH->>Slack: Dependabot alert
    GH->>Dev: Secret scanning push blocked
    Dev->>Slack: Manual report: suspicious activity
    
    Note over Oncall,Cluster: Triage Decision Tree (5-10 min)
    
    Oncall->>Oncall: Acknowledge in Slack within 5 min
    alt Falco alert & known false-positive
        Oncall->>Slack: Ack, tune rule in flux-*/falco/rules.yaml
    else IOC match: package@version
        alt We published that package?
            Oncall->>Package: Scenario C: yank/deprecate package
            Oncall->>Oncall: Page publisher + on-call
        else Not our publish
            Oncall->>GH: Pin known-good version, open PR
        end
    else Secret detected
        Oncall->>Oncall: Rotate secret immediately (see §7)
        Oncall->>GH: Audit usage for last 90 days
    else Weird workstation behavior
        Oncall->>Dev: Scenario A
    else Weird CI runner behavior
        Oncall->>GH: Scenario B
    else Weird pod behavior
        Oncall->>Cluster: Scenario D
    else Unknown alert
        Oncall->>Slack: Dig in, file follow-up ticket
    end
    
    Note over Dev,Cluster: Scenario A - Workstation Compromise
    
    Dev->>Dev: Disconnect machine from network
    Dev->>Slack: Post alert from phone
    Dev->>GH: Revoke all PATs and SSH keys
    Dev->>Package: Revoke npm/PyPI tokens
    Dev->>Dev: Revoke AWS access keys
    Dev->>Dev: Wipe + reinstall OS
    Dev->>Dev: Reissue fine-grained credentials only
    
    Note over GH,Cluster: Scenario B - CI Runner Compromise
    
    Oncall->>GH: gh workflow disable <name>
    Oncall->>Cluster: Scale Arc runner replicas to 0
    Oncall->>Oncall: List runner access scope
    Oncall->>Oncall: Check GitHub Actions audit log
    
    Note over Dev,Cluster: Scenario C - Compromised Package
    
    Oncall->>Package: Yank/deprecate package version
    Oncall->>GH: Publish GHSA advisory
    Oncall->>Dev: Notify downstream consumers
    
    Note over GH,Cluster: Scenario D - Cluster Pod Compromise
    
    Oncall->>Cluster: kubectl cordon node
    Oncall->>Cluster: kubectl describe pod > capture.txt
    Oncall->>Cluster: kubectl delete pod
    Oncall->>Cluster: kubectl uncordon node
    
    Note over Dev,Oncall: Token Rotation Cadence (Quarterly)
    Note over GH,Cluster: Tabletop Exercise (Annual Q1)
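
The triage branches in the diagram above can be read as a simple dispatch on alert type. The sketch below is illustrative only: the runbook's decision tree is plain text, and the function name `triage_route` and the alert labels (`falco-known-fp`, `ioc-match`, and so on) are hypothetical names invented here.

```shell
#!/bin/sh
# Hypothetical dispatcher mirroring the triage branches in the diagram.
# Usage: triage_route ALERT_TYPE [WE_PUBLISHED(yes|no)]
# Alert type labels are invented for this sketch.
triage_route() {
    alert="$1"
    we_published="${2:-no}"
    case "$alert" in
        falco-known-fp)
            echo "Ack in Slack, tune Falco rule" ;;
        ioc-match)
            # The IOC branch splits on whether the org published the package.
            if [ "$we_published" = "yes" ]; then
                echo "Scenario C"
            else
                echo "Pin known-good version, open PR"
            fi ;;
        secret)
            echo "Rotate secret immediately, audit usage for last 90 days" ;;
        workstation)
            echo "Scenario A" ;;
        ci-runner)
            echo "Scenario B" ;;
        pod)
            echo "Scenario D" ;;
        *)
            # Unknown alerts still get an owner and a ticket.
            echo "Dig in, file follow-up ticket" ;;
    esac
}

triage_route ioc-match yes
triage_route pod
```

The catch-all branch matters as much as the named ones: an alert that fits no category still ends in a follow-up ticket rather than being dropped.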

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

After step 3 deletes the pod (or scales the deployment to 0), a live `kubectl get pod` lookup for the ServiceAccount in step 4 will fail or return the SA of a freshly-recreated replacement pod. Read the SA from the snapshot captured in step 2 so blast-radius auditing stays executable after containment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
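
A minimal sketch of reading the ServiceAccount from a saved snapshot instead of the live API follows. The runbook's actual command is not shown in this conversation, so the function name `sa_from_snapshot` and the sample manifest are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read spec.serviceAccountName from a pod manifest that
# was saved BEFORE the pod was deleted, so the lookup never touches the live API.
sa_from_snapshot() {
    snapshot="$1"
    # Match the key only when it is the first field on the line, which skips
    # managedFields entries such as "f:serviceAccountName: {}".
    sa=$(awk '$1 == "serviceAccountName:" { print $2; exit }' "$snapshot")
    # Pods that set no ServiceAccount run as "default" in their namespace.
    printf '%s\n' "${sa:-default}"
}

# Demo against a stand-in snapshot (sample data, not from a real cluster).
snapshot=$(mktemp)
cat > "$snapshot" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  serviceAccountName: example-sa
  containers: []
EOF
sa_from_snapshot "$snapshot"   # prints: example-sa
rm -f "$snapshot"
```

In an incident the snapshot would be whatever file the forensic-capture step wrote, for example the output of `kubectl get pod "$POD" -o yaml` redirected to disk before deletion. Because the helper only reads that file, it returns the same answer whether or not a replacement pod has since been scheduled.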

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="SECURITY-RUNBOOK.md">

<violation number="1" location="SECURITY-RUNBOOK.md:356">
P1: Replace the non-portable `\s` with `[[:space:]]` in the grep fallback so the command works correctly on both GNU and BSD/macOS systems.</violation>
</file>

BSD grep (default on macOS) does not honor the `\s` Perl-style shorthand inside `-E` patterns. Switch to `[[:space:]]` so the fallback works identically on GNU and BSD/macOS systems.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
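
The portable form of such a pattern might look like the sketch below. The exact grep command in the runbook is not reproduced in this conversation, so the function name `sa_line` and the pattern itself are assumptions; only the substitution of `[[:space:]]` for `\s` comes from the commit message.

```shell
#!/bin/sh
# Hypothetical fallback: print the serviceAccountName line from a manifest on
# stdin. [[:space:]] is a POSIX character class, so the pattern does not rely
# on the \s shorthand.
sa_line() {
    grep -E '^[[:space:]]*serviceAccountName:[[:space:]]+[^[:space:]]+'
}

# Matches both space- and tab-separated input (sample data).
printf '  serviceAccountName: example-sa\n' | sa_line
printf '\tserviceAccountName:\texample-sa\n' | sa_line
```

Anchoring at the start of the line keeps the pattern from matching `f:serviceAccountName` entries that can appear under `managedFields` in saved pod YAML.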

Documents the inaugural Shai-Hulud-class tabletop exercise: an injected "eliza-allora-plugin was published with a postinstall payload yesterday at 4pm" scenario that walks the team end-to-end through the SECURITY-RUNBOOK (DEVOP-571).

The doc is operational, not a writeup. It contains:

* The injected scenario, including the specific exfil mechanics, the IOC discovery timeline, and the T+0 trigger.
* Pre-assigned roles (incident lead, communicator, executor, BE rep, FE rep, founder observer-only) with explicit don't-skip-a-role rule.
* Six phases keyed to runbook sections, each with a target elapsed time and explicit success/failure modes the facilitator watches for.
* The 30-minute time-to-clean-republish target broken into 4 phases (T+5 / T+10 / T+20 / T+30) so participants can self-check progress mid-exercise.
* A debrief script (6 questions, in order) that produces ticket inputs verbatim from the team's own language.
* Output checklist for the facilitator (Linear tickets, runbook PR, lessons-learned section update, next-year calendar invite).
* Notes-from-runbook-author section identifying the three seams in the runbook that the exercise should specifically stress.

The exercise itself is a team activity and is NOT considered complete until the run + debrief actually happen. DEVOP-573 stays In Review until the facilitator schedules and runs the live session.

Blocks-by: DEVOP-571 (runbook). PR allora-network#3 in this repo authors the runbook; this PR cross-references it.

Refs: https://linear.app/alloralabs/issue/DEVOP-573

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

srt0422 mentioned this pull request May 13, 2026

DEVOP-572: add CONTRIBUTING.md with dev-workstation security guidance #4

Merged

3 tasks

cubic-dev-ai Bot reviewed May 13, 2026

Comment thread SECURITY-RUNBOOK.md Outdated

cubic-dev-ai Bot reviewed May 13, 2026

Comment thread SECURITY-RUNBOOK.md Outdated

srt0422 added the shai-hulud label (Shai-Hulud supply-chain defense work) May 13, 2026

srt0422 requested review from Kouteki, TaniBuilds, gh-allora and spooktheducks May 14, 2026 05:48

srt0422 added the needs-human-review label May 14, 2026

srt0422 assigned spooktheducks May 14, 2026

spooktheducks merged commit 7365652 into allora-network:main May 14, 2026
1 check passed

spooktheducks merged 3 commits into allora-network:main from srt0422:devop-571-security-runbook
