
DEVOP-571: add SECURITY-RUNBOOK.md (Shai-Hulud incident response) #3

Merged
spooktheducks merged 3 commits into allora-network:main from srt0422:devop-571-security-runbook on May 14, 2026

Conversation


srt0422 commented on May 13, 2026

Summary

Org-wide incident response runbook for Shai-Hulud-class supply-chain compromise. Lives in the .github org repo so it surfaces on every repo's Security tab.

What landed

SECURITY-RUNBOOK.md at the repo root, covering all the required scenarios:

  • Detection sources — Falco, IOC sweep, Dependabot, secret scanning, manual report; channel + owner per source.
  • Triage decision tree in plain text (works on phone, in PDF, pasted into Slack).
  • Scenario A — dev workstation compromise (disconnect → revoke → wipe → rebuild, with the credential-by-credential checklist).
  • Scenario B — CI runner compromise (disable workflow, audit blast radius, rotate, audit recent publishes).
  • Scenario C — compromised package we published (yank/deprecate, advisory, clean rebuild + downstream notify).
  • Scenario D — cluster pod compromise (cordon, capture-before-delete forensics, rotate SA-scoped secrets, uncordon).
  • Token rotation cadence — quarterly default, with "rotate immediately if" triggers.
  • Tabletop exercise schedule — annual, Q1, with explicit format and skip-approval rule.
  • Appendix — gh search incantations, cosign verify cookbook, sweep trigger, node drain.

Style is operational, not compliance-flavored: short imperative steps, explicit owner per step, each scenario sectioned as Stop the bleed / Audit blast radius / Restore service / Close-out so the on-call can skim during an actual incident.
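As an illustration of the "Stop the bleed" phase for Scenario B, the sketch below disables every active workflow in one repository. It is not taken from the runbook: `disable_workflows` is a hypothetical helper name, `acme/widgets` is a placeholder repository, and the sketch assumes `gh` is installed and authenticated with permission to manage Actions.

```sh
# Sketch only: disable every active workflow in one repository.
# `disable_workflows` is a hypothetical helper, not part of the runbook.
disable_workflows() (
  repo="$1"   # OWNER/REPO
  # `gh workflow list` omits already-disabled workflows unless --all is passed,
  # and returns at most 50 entries unless --limit is raised.
  gh workflow list --repo "$repo" --json id --jq '.[].id' |
    while read -r id; do
      # stdin is redirected so the inner command cannot consume the id list
      gh workflow disable "$id" --repo "$repo" < /dev/null
    done
)

# Usage (placeholder repository name):
#   disable_workflows acme/widgets
```

Disabling by numeric ID rather than by name avoids quoting problems with workflow names that contain spaces.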

Linear

https://linear.app/alloralabs/issue/DEVOP-571

Test plan

  • Walk a DevOps engineer through Scenario A start-to-finish and check whether the credential-by-credential list misses anything they actually use.
  • Confirm the runbook renders correctly on the org Security tab after merge.
  • First annual tabletop (DEVOP-573) will be the real test; the runbook should then be updated based on what was slow or ambiguous.

🤖 Generated with Claude Code


Summary by cubic

Adds an org-wide incident response runbook for Shai-Hulud–class supply‑chain compromises. Lives in allora-network/.github as SECURITY-RUNBOOK.md so it shows on every repo’s Security tab and satisfies DEVOP-571.

  • New Features

    • Detection sources with channel and owner.
    • Plain‑text triage decision tree.
    • Four scenarios with step‑by‑step actions: developer workstation, CI runner, compromised publish, and cluster pod (Stop the bleed → Audit → Restore → Close‑out), including deriving the ServiceAccount from saved pod YAML during Scenario D audits to avoid races after pod deletion.
    • Token rotation cadence with immediate‑rotate triggers.
    • Annual tabletop schedule and close‑out rules.
    • Appendix with quick commands for gh, cosign, and kubectl.
  • Bug Fixes

    • Scenario D SA grep fallback now uses POSIX [[:space:]] to work on BSD/macOS grep.

Written for commit 6574da0. Summary will update on new commits.

Org-wide incident response runbook for Shai-Hulud-class supply-chain
compromise. Lives in the .github org repo so it surfaces on every repo's
Security tab.

Covers all required scenarios from the acceptance criteria:

* Detection sources table (Falco, IOC sweep, Dependabot, secret scanning,
  manual report) with channel + owner per source.
* Triage decision tree — flowchart in plain text so it works on phone /
  in PDF / pasted into Slack.
* Scenario A: developer workstation compromise (disconnect, revoke,
  wipe, rebuild — explicit credential-by-credential checklist).
* Scenario B: CI runner compromise (disable workflow, audit blast radius,
  rotate every secret in scope, audit recent publishes, restore on a
  clean rebuild).
* Scenario C: compromised package published from our org (yank +
  deprecate, advisory, publish clean rebuild from clean environment,
  downstream notification, IOC list update).
* Scenario D: cluster pod compromise (cordon, forensic capture in order,
  delete, audit ServiceAccount scope, rotate, uncordon).
* Token rotation cadence — quarterly default, with per-credential rules
  and "rotate immediately if" triggers.
* Tabletop exercise schedule — annual, with explicit format and
  skip-approval rule.
* Appendix of useful commands (gh search, cosign verify, sweep trigger,
  node drain).
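For the npm side of Scenario C, deprecating the bad version might look like the sketch below. The helper name `deprecate_compromised`, the package name, and the advisory ID are all placeholders, and the runbook's actual per-registry steps may differ; only `npm deprecate <pkg>@<version> <message>` is a real npm command.

```sh
# Sketch only: mark one published npm version as compromised.
# `deprecate_compromised` is a hypothetical helper, not part of the runbook.
deprecate_compromised() (
  if [ "$#" -ne 3 ]; then
    echo "usage: deprecate_compromised PKG VERSION ADVISORY" >&2
    exit 2
  fi
  pkg="$1"; version="$2"; advisory="$3"
  # Deprecation warns installers but leaves the tarball downloadable;
  # removal of the version itself is a separate decision.
  npm deprecate "${pkg}@${version}" \
    "Compromised release, do not install. See ${advisory}"
)

# Usage (all values are placeholders):
#   deprecate_compromised @example/widget 1.2.3 GHSA-xxxx-xxxx-xxxx
```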

The runbook deliberately reads like an actual operational document, not
a compliance artifact: short imperative sentences, explicit owner per
step, no passive voice. Each scenario uses the same four headings ("Stop
the bleed," "Audit blast radius," "Restore service," "Close-out") so an
on-call can skim and find their position.

Refs: https://linear.app/alloralabs/issue/DEVOP-571

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai (bot) left a comment


cubic analysis

1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="SECURITY-RUNBOOK.md">

<violation number="1" location="SECURITY-RUNBOOK.md:349">
P2: Deriving the ServiceAccount from `$POD` after the pod is deleted can fail and break blast-radius auditing. Capture SA from the previously saved pod YAML (or before deletion) so audit steps remain executable.</violation>
</file>

Linked issue analysis

Linked issue: DEVOP-571: Write SECURITY-RUNBOOK.md in .github org repo

| Status | Acceptance criteria | Notes |
|---|---|---|
| | Detection sources: where alerts fire (Falco → Slack, IOC sweep → GitHub Issue, Dependabot, secret scanning push protection) | SECURITY-RUNBOOK.md contains a '1. Detection sources' table listing Falco → #security-alerts, IOC sweep → GitHub issue workflow, Dependabot, and secret scanning push protection. |
| | Triage decision tree: confirm vs. false-positive; who pages whom | The runbook includes a Triage decision tree flowchart with branching for false positives and instructions on who to page (on-call, publisher). |
| | Dev machine suspected infected: disconnect, revoke tokens, wipe, rebuild | Scenario A lists immediate disconnect, token revocation steps (detailed per credential), evidence-preservation guidance, and wipe + reinstall + reissue of credentials. |
| | CI runner suspected infected: disable workflow, rotate every secret in the env, audit recent publishes/pushes, yank+republish if needed | Scenario B instructs disabling workflows or scaling the runner pool to 0, enumerates rotating every credential in scope, and auditing recent publishes (with escalation to Scenario C if suspect). |
| | Compromised package published: yank from npm/PyPI/Harbor, publish corrected version from clean environment, notify downstream | Scenario C details yank/deprecate/unpublish behavior per registry, steps to publish a corrected version from a clean environment, and notifying downstream consumers plus updating IOC lists. |
| | Cluster pod suspected compromised: cordon node, capture forensic data via Falco/audit logs, delete pod, rotate secrets the pod could read, post-mortem | Scenario D prescribes cordoning the node, forensic-capture commands (kubectl describe/logs/exec), deleting or scaling to 0, listing and rotating secrets the pod could read, and a mandatory post-mortem. |
| | Token rotation cadence: quarterly for any long-lived credential not on OIDC | Section 7 contains a token rotation table specifying quarterly rotation for PATs/npm/PyPI/Harbor and notes about migrating to OIDC. |
| | Tabletop exercise schedule: annual (see DEVOP-573) | Section 8 defines an annual Q1 tabletop exercise, format, attendees, and skip rules referencing DEVOP-573. |
| | PR merged | Merge is an acceptance condition but cannot be satisfied by the patch diff itself; the PR is adding the runbook but is not merged yet. |
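The quarterly rotation rule can be expressed as a simple age check. The sketch below is an assumption-laden illustration, not the runbook's mechanism: it approximates "quarterly" as 90 days, takes timestamps as epoch seconds, and `needs_rotation` is a hypothetical helper name.

```sh
# Sketch only: decide whether a long-lived credential is due for rotation.
ROTATION_DAYS=90   # assumption: "quarterly" approximated as 90 days

# needs_rotation ISSUED_EPOCH [NOW_EPOCH]
# Succeeds (exit 0) when the credential is at least ROTATION_DAYS old.
needs_rotation() (
  issued="$1"
  now="${2:-$(date +%s)}"
  age_days=$(( (now - issued) / 86400 ))
  [ "$age_days" -ge "$ROTATION_DAYS" ]
)

# Usage (placeholder timestamp):
#   needs_rotation 1767225600 && echo "rotate this token"
```

The "rotate immediately if" triggers are event-driven and would bypass this check entirely.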
Architecture diagram
sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant Slack as Slack #security-alerts
    participant Falco as Falco (Cluster Runtime)
    participant IOC as Daily IOC Sweep
    participant Oncall as DevOps On-Call
    participant Package as Package Registry (npm/PyPI)
    participant Cluster as Kubernetes Cluster
    
    Note over Dev,Cluster: Detection Sources
    
    Falco->>Oncall: Alert: container behavior violation
    Falco->>Slack: Cross-post alert via Falcosidekick
    IOC->>GH: Auto-file issue in incident-response repo
    IOC->>Slack: Cross-post alert
    GH->>Slack: Dependabot alert
    GH->>Dev: Secret scanning push blocked
    Dev->>Slack: Manual report: suspicious activity
    
    Note over Oncall,Cluster: Triage Decision Tree (5-10 min)
    
    Oncall->>Oncall: Acknowledge in Slack within 5 min
    alt Falco alert & known false-positive
        Oncall->>Slack: Ack, tune rule in flux-*/falco/rules.yaml
    else IOC match: package@version
        alt We published that package?
            Oncall->>Package: Scenario C: yank/deprecate package
            Oncall->>Oncall: Page publisher + on-call
        else Not our publish
            Oncall->>GH: Pin known-good version, open PR
        end
    else Secret detected
        Oncall->>Oncall: Rotate secret immediately (see §7)
        Oncall->>GH: Audit usage for last 90 days
    else Weird workstation behavior
        Oncall->>Dev: Scenario A
    else Weird CI runner behavior
        Oncall->>GH: Scenario B
    else Weird pod behavior
        Oncall->>Cluster: Scenario D
    else Unknown alert
        Oncall->>Slack: Dig in, file follow-up ticket
    end
    
    Note over Dev,Cluster: Scenario A - Workstation Compromise
    
    Dev->>Dev: Disconnect machine from network
    Dev->>Slack: Post alert from phone
    Dev->>GH: Revoke all PATs and SSH keys
    Dev->>Package: Revoke npm/PyPI tokens
    Dev->>Dev: Revoke AWS access keys
    Dev->>Dev: Wipe + reinstall OS
    Dev->>Dev: Reissue fine-grained credentials only
    
    Note over GH,Cluster: Scenario B - CI Runner Compromise
    
    Oncall->>GH: gh workflow disable <name>
    Oncall->>Cluster: Scale Arc runner replicas to 0
    Oncall->>Oncall: List runner access scope
    Oncall->>Oncall: Check GitHub Actions audit log
    
    Note over Dev,Cluster: Scenario C - Compromised Package
    
    Oncall->>Package: Yank/deprecate package version
    Oncall->>GH: Publish GHSA advisory
    Oncall->>Dev: Notify downstream consumers
    
    Note over GH,Cluster: Scenario D - Cluster Pod Compromise
    
    Oncall->>Cluster: kubectl cordon node
    Oncall->>Cluster: kubectl describe pod > capture.txt
    Oncall->>Cluster: kubectl delete pod
    Oncall->>Cluster: kubectl uncordon node
    
    Note over Dev,Oncall: Token Rotation Cadence (Quarterly)
    Note over GH,Cluster: Tabletop Exercise (Annual Q1)
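
The triage branches in the diagram above amount to a lookup from alert category to next action. A minimal sketch follows; the category labels on the left are hypothetical names invented for this example, while the actions on the right paraphrase the diagram.

```sh
# Sketch only: map an alert category to the triage branch from the diagram.
# Category labels are hypothetical; they are not identifiers from the runbook.
route_alert() {
  case "$1" in
    falco-known-fp)  echo "ack and tune Falco rule" ;;
    ioc-our-package) echo "Scenario C" ;;
    ioc-third-party) echo "pin known-good version and open PR" ;;
    secret-detected) echo "rotate secret and audit last 90 days" ;;
    workstation)     echo "Scenario A" ;;
    ci-runner)       echo "Scenario B" ;;
    pod)             echo "Scenario D" ;;
    *)               echo "dig in and file follow-up ticket" ;;
  esac
}

# Usage:
#   route_alert ci-runner
```

The catch-all branch mirrors the diagram's "Unknown alert" path, so an unrecognized category still produces an action.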

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread SECURITY-RUNBOOK.md Outdated
After step 3 deletes the pod (or scales the deployment to 0), a live
`kubectl get pod` lookup for the ServiceAccount in step 4 will fail or
return the SA of a freshly-recreated replacement pod. Read the SA from
the snapshot captured in step 2 so blast-radius auditing stays
executable after containment.
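
The ordering described above can be sketched as follows. This is an illustration rather than the runbook's actual commands: `contain_pod` is a hypothetical helper, and the namespace, pod name, and output directory are placeholders.

```sh
# Sketch only: snapshot first, contain second, audit from the snapshot.
# `contain_pod` is a hypothetical helper, not part of the runbook.
contain_pod() (
  ns="$1"; pod="$2"; out="$3"
  # Step 2: save the manifest while the pod still exists.
  kubectl get pod "$pod" -n "$ns" -o yaml > "$out/pod.yaml" || exit 1
  # Step 3: contain. Status output goes to stderr so stdout carries only the SA.
  kubectl delete pod "$pod" -n "$ns" >&2
  # Step 4: read the ServiceAccount from the snapshot, never from the live API.
  awk '$1 == "serviceAccountName:" { print $2; exit }' "$out/pod.yaml"
)

# Usage (placeholder names):
#   sa="$(contain_pod prod ingest-7f9c ./evidence)"
```

With the name in hand, something like `kubectl auth can-i --list --as="system:serviceaccount:$ns:$sa" -n "$ns"` could enumerate what that ServiceAccount is allowed to do.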

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cubic-dev-ai (bot) left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="SECURITY-RUNBOOK.md">

<violation number="1" location="SECURITY-RUNBOOK.md:356">
P1: Replace the non-portable `\s` with `[[:space:]]` in the grep fallback so the command works correctly on both GNU and BSD/macOS systems.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread SECURITY-RUNBOOK.md Outdated
BSD grep (default on macOS) does not honor the `\s` Perl-style
shorthand inside `-E` patterns. Switch to `[[:space:]]` so the
fallback works identically on GNU and BSD/macOS systems.
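
A portable form of such a fallback might look like the sketch below. The runbook's exact command is not shown on this page, so `sa_from_yaml` and its pattern are assumptions; the point is only that `[[:space:]]` is a POSIX character class.

```sh
# Sketch only: extract serviceAccountName from a saved pod manifest using
# POSIX character classes instead of the \s shorthand.
# `sa_from_yaml` is a hypothetical helper, not the runbook's command.
sa_from_yaml() {
  grep -E '^[[:space:]]*serviceAccountName:[[:space:]]*' "$1" |
    head -n 1 |
    sed -E 's/^[[:space:]]*serviceAccountName:[[:space:]]*//'
}

# Usage (placeholder path):
#   sa_from_yaml ./evidence/pod.yaml
```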

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
srt0422 added the shai-hulud label (Shai-Hulud supply-chain defense work) on May 13, 2026
spooktheducks merged commit 7365652 into allora-network:main on May 14, 2026
1 check passed
srt0422 added a commit to srt0422/.github that referenced this pull request May 22, 2026
Documents the inaugural Shai-Hulud-class tabletop exercise: an injected
"eliza-allora-plugin was published with a postinstall payload yesterday
at 4pm" scenario that walks the team end-to-end through the
SECURITY-RUNBOOK (DEVOP-571).

The doc is operational, not a writeup. It contains:

* The injected scenario, including the specific exfil mechanics, the
  IOC discovery timeline, and the T+0 trigger.
* Pre-assigned roles (incident lead, communicator, executor, BE rep,
  FE rep, founder observer-only) with explicit don't-skip-a-role rule.
* Six phases keyed to runbook sections, each with a target elapsed
  time and explicit success/failure modes the facilitator watches
  for.
* The 30-minute time-to-clean-republish target broken into 4 phases
  (T+5 / T+10 / T+20 / T+30) so participants can self-check progress
  mid-exercise.
* A debrief script (6 questions, in order) that produces ticket
  inputs verbatim from the team's own language.
* Output checklist for the facilitator (Linear tickets, runbook PR,
  lessons-learned section update, next-year calendar invite).
* Notes-from-runbook-author section identifying the three seams in
  the runbook that the exercise should specifically stress.
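
The mid-exercise self-check against the T+5 / T+10 / T+20 / T+30 checkpoints could be sketched like this. `next_checkpoint` is a hypothetical helper; what each checkpoint covers is defined in the exercise doc and is not reproduced here.

```sh
# Sketch only: given elapsed minutes since T+0, print the checkpoint the team
# should currently be working toward.
next_checkpoint() {
  elapsed="$1"
  for t in 5 10 20 30; do
    if [ "$elapsed" -le "$t" ]; then
      echo "T+$t"
      return 0
    fi
  done
  # Past the 30-minute time-to-clean-republish target.
  echo "over target"
}

# Usage:
#   next_checkpoint 12
```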

The exercise itself is a team activity and is NOT considered complete
until the run + debrief actually happen. DEVOP-573 stays In Review
until the facilitator schedules and runs the live session.

Blocked-by: DEVOP-571 (runbook). PR allora-network#3 in this repo authors the
runbook; this PR cross-references it.

Refs: https://linear.app/alloralabs/issue/DEVOP-573

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

needs-human-review, shai-hulud (Shai-Hulud supply-chain defense work)


3 participants