DEVOP-571: add SECURITY-RUNBOOK.md (Shai-Hulud incident response) #3
Merged
spooktheducks merged 3 commits on May 14, 2026
Org-wide incident response runbook for Shai-Hulud-class supply-chain compromise. Lives in the .github org repo so it surfaces on every repo's Security tab.

Covers all required scenarios from the acceptance criteria:

* Detection sources table (Falco, IOC sweep, Dependabot, secret scanning, manual report) with channel + owner per source.
* Triage decision tree: flowchart in plain text so it works on phone / in PDF / pasted into Slack.
* Scenario A: developer workstation compromise (disconnect, revoke, wipe, rebuild; explicit credential-by-credential checklist).
* Scenario B: CI runner compromise (disable workflow, audit blast radius, rotate every secret in scope, audit recent publishes, restore on a clean rebuild).
* Scenario C: compromised package published from our org (yank + deprecate, advisory, publish clean rebuild from clean environment, downstream notification, IOC list update).
* Scenario D: cluster pod compromise (cordon, forensic capture in order, delete, audit ServiceAccount scope, rotate, uncordon).
* Token rotation cadence: quarterly default, with per-credential rules and "rotate immediately if" triggers.
* Tabletop exercise schedule: annual, with explicit format and skip-approval rule.
* Appendix of useful commands (gh search, cosign verify, sweep trigger, node drain).

The runbook deliberately reads like an actual operational document, not a compliance artifact: short imperative sentences, explicit owner per step, no passive voice. Each scenario starts with "Stop the bleed," "Audit blast radius," "Restore service," "Close-out" so an on-call can skim and find their position.

Refs: https://linear.app/alloralabs/issue/DEVOP-571
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cubic analysis
1 issue found across 1 file
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="SECURITY-RUNBOOK.md">
<violation number="1" location="SECURITY-RUNBOOK.md:349">
P2: Deriving the ServiceAccount from `$POD` after the pod is deleted can fail and break blast-radius auditing. Capture SA from the previously saved pod YAML (or before deletion) so audit steps remain executable.</violation>
</file>
Linked issue analysis
Linked issue: DEVOP-571: Write SECURITY-RUNBOOK.md in .github org repo
| Status | Acceptance criteria | Notes |
|---|---|---|
| ✅ | Detection sources: where alerts fire (Falco → Slack, IOC sweep → GitHub Issue, Dependabot, secret scanning push protection) | SECURITY-RUNBOOK.md contains a '1. Detection sources' table listing Falco → #security-alerts, IOC sweep → GitHub issue workflow, Dependabot, and secret scanning push protection. |
| ✅ | Triage decision tree: confirm vs. false-positive; who pages whom | The runbook includes a Triage decision tree flowchart with branching for false positives and instructions on who to page (on-call, publisher). |
| ✅ | Dev machine suspected infected: disconnect, revoke tokens, wipe, rebuild | Scenario A lists immediate disconnect, revoke token steps (detailed per-credential), preserve evidence guidance, and wipe+reinstall + reissue credentials. |
| ✅ | CI runner suspected infected: disable workflow, rotate every secret in the env, audit recent publishes/pushes, yank+republish if needed | Scenario B instructs disabling workflows or scaling runner pool to 0, enumerates rotating every credential in scope, and auditing recent publishes (with escalation to Scenario C if suspect). |
| ✅ | Compromised package published: yank from npm/PyPI/Harbor, publish corrected version from clean environment, notify downstream | Scenario C details yanking/deprecating/unpublishing behavior per registry, steps to publish corrected version from a clean environment, and notifying downstream consumers plus updating IOC lists. |
| ✅ | Cluster pod suspected compromised: cordon node, capture forensic data via Falco/audit logs, delete pod, rotate secrets the pod could read, post-mortem | Scenario D prescribes cordoning the node, running forensic-capture commands (kubectl describe/logs/exec), deleting the pod or scaling to 0, listing and rotating the secrets the pod could read, and a mandatory post-mortem. |
| ✅ | Token rotation cadence: quarterly for any long-lived credential not on OIDC | Section 7 contains a token rotation table specifying quarterly rotation for PATs/npm/PyPI/Harbor and notes about migrating to OIDC. |
| ✅ | Tabletop exercise schedule: annual (see DEVOP-573) | Section 8 defines an annual Q1 tabletop exercise, format, attendees, and skip rules referencing DEVOP-573. |
| ❌ | PR merged | Merge is an acceptance condition but cannot be satisfied by the patch diff itself; the PR is adding the runbook but is not merged yet. |
Architecture diagram
sequenceDiagram
participant Dev as Developer
participant GH as GitHub
participant Slack as Slack #security-alerts
participant Falco as Falco (Cluster Runtime)
participant IOC as Daily IOC Sweep
participant Oncall as DevOps On-Call
participant Package as Package Registry (npm/PyPI)
participant Cluster as Kubernetes Cluster
Note over Dev,Cluster: Detection Sources
Falco->>Oncall: Alert: container behavior violation
Falco->>Slack: Cross-post alert via Falcosidekick
IOC->>GH: Auto-file issue in incident-response repo
IOC->>Slack: Cross-post alert
GH->>Slack: Dependabot alert
GH->>Dev: Secret scanning push blocked
Dev->>Slack: Manual report: suspicious activity
Note over Oncall,Cluster: Triage Decision Tree (5-10 min)
Oncall->>Oncall: Acknowledge in Slack within 5 min
alt Falco alert & known false-positive
Oncall->>Slack: Ack, tune rule in flux-*/falco/rules.yaml
else IOC match: package@version
alt We published that package?
Oncall->>Package: Scenario C: yank/deprecate package
Oncall->>Oncall: Page publisher + on-call
else Not our publish
Oncall->>GH: Pin known-good version, open PR
end
else Secret detected
Oncall->>Oncall: Rotate secret immediately (see §7)
Oncall->>GH: Audit usage for last 90 days
else Weird workstation behavior
Oncall->>Dev: Scenario A
else Weird CI runner behavior
Oncall->>GH: Scenario B
else Weird pod behavior
Oncall->>Cluster: Scenario D
else Unknown alert
Oncall->>Slack: Dig in, file follow-up ticket
end
Note over Dev,Cluster: Scenario A - Workstation Compromise
Dev->>Dev: Disconnect machine from network
Dev->>Slack: Post alert from phone
Dev->>GH: Revoke all PATs and SSH keys
Dev->>Package: Revoke npm/PyPI tokens
Dev->>Dev: Revoke AWS access keys
Dev->>Dev: Wipe + reinstall OS
Dev->>Dev: Reissue fine-grained credentials only
Note over GH,Cluster: Scenario B - CI Runner Compromise
Oncall->>GH: gh workflow disable <name>
Oncall->>Cluster: Scale Arc runner replicas to 0
Oncall->>Oncall: List runner access scope
Oncall->>Oncall: Check GitHub Actions audit log
Note over Dev,Cluster: Scenario C - Compromised Package
Oncall->>Package: Yank/deprecate package version
Oncall->>GH: Publish GHSA advisory
Oncall->>Dev: Notify downstream consumers
Note over GH,Cluster: Scenario D - Cluster Pod Compromise
Oncall->>Cluster: kubectl cordon node
Oncall->>Cluster: kubectl describe pod > capture.txt
Oncall->>Cluster: kubectl delete pod
Oncall->>Cluster: kubectl uncordon node
Note over Dev,Oncall: Token Rotation Cadence (Quarterly)
Note over GH,Cluster: Tabletop Exercise (Annual Q1)
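The Scenario D ordering in the diagram can be sketched as a dry-run helper that only prints the commands, so the on-call can review the order before touching the cluster. This is a minimal sketch, not the runbook's actual commands: the function name `scenario_d_plan`, the node/pod/namespace arguments, and the `pod-snapshot.yaml` file name are assumptions for illustration. The `kubectl` subcommands themselves (`cordon`, `describe`, `get -o yaml`, `delete`, `uncordon`) are standard.

```shell
#!/bin/sh
# Dry-run sketch of the Scenario D ordering: cordon, capture, delete, uncordon.
# Nothing here talks to a cluster; the function only prints the commands.
# The argument names and capture file names are hypothetical placeholders.

scenario_d_plan() {
  node=$1
  pod=$2
  ns=$3
  printf '%s\n' \
    "kubectl cordon $node" \
    "kubectl describe pod $pod -n $ns > capture.txt" \
    "kubectl get pod $pod -n $ns -o yaml > pod-snapshot.yaml" \
    "kubectl delete pod $pod -n $ns" \
    "# audit ServiceAccount scope and rotate secrets before uncordoning" \
    "kubectl uncordon $node"
}
```

Calling `scenario_d_plan node-1 web-0 prod` prints six lines in execution order. Both capture commands come before the delete, because evidence taken from the live pod is gone once the pod is removed.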
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
After step 3 deletes the pod (or scales the deployment to 0), a live `kubectl get pod` lookup for the ServiceAccount in step 4 will fail or return the SA of a freshly-recreated replacement pod. Read the SA from the snapshot captured in step 2 so blast-radius auditing stays executable after containment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
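The fix described above can be illustrated with a small helper that reads the ServiceAccount from a saved snapshot rather than from the live pod. This is a sketch under assumptions: the function name `sa_from_snapshot` and the snapshot file name are hypothetical, and the runbook's actual command may differ.

```shell
#!/bin/sh
# Read the ServiceAccount from a saved pod YAML snapshot, not the live pod.
# Assumption: step 2 saved the snapshot before deletion, for example with
#   kubectl get pod "$POD" -n "$NS" -o yaml > pod-snapshot.yaml
# The file name pod-snapshot.yaml is a hypothetical placeholder.

sa_from_snapshot() {
  snapshot=$1
  # Print the value of the first serviceAccountName field in the snapshot.
  sa=$(awk '$1 == "serviceAccountName:" { print $2; exit }' "$snapshot")
  # A pod spec that omits the field runs as the "default" ServiceAccount.
  printf '%s\n' "${sa:-default}"
}
```

Because the helper only reads a file, it still works after the pod is deleted or the deployment is scaled to 0, and it cannot pick up a replacement pod's ServiceAccount by accident.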
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="SECURITY-RUNBOOK.md">
<violation number="1" location="SECURITY-RUNBOOK.md:356">
P1: Replace the non-portable `\s` with `[[:space:]]` in the grep fallback so the command works correctly on both GNU and BSD/macOS systems.</violation>
</file>
BSD grep (default on macOS) does not honor the `\s` Perl-style shorthand inside `-E` patterns. Switch to `[[:space:]]` so the fallback works identically on GNU and BSD/macOS systems.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
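A portable version of such a grep fallback might look like the sketch below. The runbook's exact pattern is not shown in this conversation, so the field name, the function name `sa_grep_fallback`, and the snapshot argument are assumptions. The only point illustrated is the POSIX `[[:space:]]` class, which both GNU and BSD grep accept.

```shell
#!/bin/sh
# Portable grep fallback using the POSIX [[:space:]] class instead of \s.
# The field name and the snapshot file argument are illustrative assumptions.

sa_grep_fallback() {
  snapshot=$1
  # Match the field regardless of space or tab indentation, keep the first
  # hit, then strip everything up to the value.
  grep -E '^[[:space:]]*serviceAccountName:[[:space:]]*' "$snapshot" \
    | head -n 1 \
    | sed 's/^[[:space:]]*serviceAccountName:[[:space:]]*//'
}
```

The `sed` expression uses only basic regular-expression syntax, so it does not depend on any extended-regex flag being available.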
srt0422 added a commit to srt0422/.github that referenced this pull request on May 22, 2026
Documents the inaugural Shai-Hulud-class tabletop exercise: an injected "eliza-allora-plugin was published with a postinstall payload yesterday at 4pm" scenario that walks the team end-to-end through the SECURITY-RUNBOOK (DEVOP-571).

The doc is operational, not a writeup. It contains:

* The injected scenario, including the specific exfil mechanics, the IOC discovery timeline, and the T+0 trigger.
* Pre-assigned roles (incident lead, communicator, executor, BE rep, FE rep, founder observer-only) with explicit don't-skip-a-role rule.
* Six phases keyed to runbook sections, each with a target elapsed time and explicit success/failure modes the facilitator watches for.
* The 30-minute time-to-clean-republish target broken into 4 phases (T+5 / T+10 / T+20 / T+30) so participants can self-check progress mid-exercise.
* A debrief script (6 questions, in order) that produces ticket inputs verbatim from the team's own language.
* Output checklist for the facilitator (Linear tickets, runbook PR, lessons-learned section update, next-year calendar invite).
* Notes-from-runbook-author section identifying the three seams in the runbook that the exercise should specifically stress.

The exercise itself is a team activity and is NOT considered complete until the run + debrief actually happen. DEVOP-573 stays In Review until the facilitator schedules and runs the live session.

Blocks-by: DEVOP-571 (runbook). PR allora-network#3 in this repo authors the runbook; this PR cross-references it.

Refs: https://linear.app/alloralabs/issue/DEVOP-573
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Org-wide incident response runbook for Shai-Hulud-class supply-chain compromise. Lives in the `.github` org repo so it surfaces on every repo's Security tab.
What landed
`SECURITY-RUNBOOK.md` at the repo root, covering all the required scenarios.
Style is operational, not compliance-flavored: short imperative steps, explicit owner per step, each scenario sectioned as Stop the bleed / Audit blast radius / Restore service / Close-out so the on-call can skim during an actual incident.
Linear
https://linear.app/alloralabs/issue/DEVOP-571
Test plan
🤖 Generated with Claude Code
Summary by cubic
Adds an org-wide incident response runbook for Shai-Hulud-class supply-chain compromises. Lives in `allora-network/.github` as `SECURITY-RUNBOOK.md` so it shows on every repo's Security tab and satisfies DEVOP-571.
New Features
`gh`, `cosign`, and `kubectl`.
Bug Fixes
`[[:space:]]` to work on BSD/macOS grep.
Written for commit 6574da0. Summary will update on new commits.