Project: AI Ops Agent Platform
Version: 1.0
Status: Draft
Last updated: 2025-05-13
| Asset | Classification | Impact if Compromised |
|---|---|---|
| Kubernetes API access (read) | High | Attacker can enumerate all workloads, configs, and pod IPs |
| Prometheus metrics data | Medium | Reveals service topology and traffic patterns |
| Runbook content | Medium | Internal operational procedures exposed |
| LLM system prompt | Medium | Agent persona and guardrails bypassed |
| User conversation history | High | Potentially contains sensitive operational context |
| PostgreSQL audit logs | High | Evidence tampering, regulatory non-compliance |
| Anthropic API key | Critical | Unlimited LLM spend; reputational damage |
| Keycloak admin credentials | Critical | Full identity system takeover |
| Actor | Capability | Likely Vector |
|---|---|---|
| External attacker | Low-Medium | Public-facing Web UI / API injection |
| Malicious insider | Medium-High | Abuse of authorized access; privilege escalation |
| Compromised CI/CD | Medium | Injecting malicious container images |
| Prompt injection via metric data | Low-Medium | Crafted metric labels containing LLM instructions |
External surface:
- Kong API Gateway (HTTPS, public or internal network)
- Web UI (HTTPS)
- Slack webhook receiver
Internal surface:
- FastAPI service (cluster-internal)
- Prometheus, Qdrant, Redis, PostgreSQL (cluster-internal)
- Kubernetes API server (cluster-internal, mTLS)
- Anthropic API (outbound HTTPS to external)
All user-facing interfaces authenticate via Keycloak. The flow:
1. User accesses Web UI / REST API / CLI
2. Redirected to Keycloak login (or CLI device flow)
3. Keycloak issues JWT access token (RS256, 1-hour TTL)
4. Token presented to Kong API Gateway on every request
5. Kong validates token signature against Keycloak JWKS endpoint
6. Kong forwards validated claims to FastAPI in X-User-* headers
The JWT contains:
- `sub`: unique user ID
- `preferred_username`: display name
- `aiops_namespaces`: list of authorized Kubernetes namespaces (custom claim, set by Keycloak mapper)
- `aiops_roles`: list of platform roles (`viewer`, `analyst`, `admin`)
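To make the claim set concrete, the following is a minimal sketch of how a downstream service might turn the claims into a typed user context. It assumes the token signature has already been validated by Kong (step 5 of the flow) and that the claims arrive as a plain dictionary. `UserContext`, `KNOWN_ROLES`, and `user_context_from_claims` are hypothetical names, not part of the platform.

```python
from dataclasses import dataclass

# Roles listed in the aiops_roles claim description; anything else is ignored.
KNOWN_ROLES = ("viewer", "analyst", "admin")


@dataclass(frozen=True)
class UserContext:
    """Hypothetical container for the identity claims described above."""
    user_id: str
    username: str
    namespaces: tuple
    roles: tuple


def user_context_from_claims(claims: dict) -> UserContext:
    """Build a UserContext from already-verified JWT claims.

    Fails closed: a token without the authorization claims is rejected
    rather than treated as having an empty scope.
    """
    required = ("sub", "aiops_namespaces", "aiops_roles")
    missing = [name for name in required if name not in claims]
    if missing:
        raise ValueError(f"missing required claims: {missing}")
    return UserContext(
        user_id=str(claims["sub"]),
        username=str(claims.get("preferred_username", "")),
        namespaces=tuple(claims["aiops_namespaces"]),
        # Drop unrecognized role names instead of trusting them.
        roles=tuple(r for r in claims["aiops_roles"] if r in KNOWN_ROLES),
    )
```

Rejecting tokens that lack the custom claims, rather than defaulting to an empty scope, is a design choice of this sketch; the source does not specify the behavior.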
| Role | Query metrics | Query K8s API | Receive recommendations | Trigger scheduled jobs |
|---|---|---|---|---|
| `viewer` | Own namespaces only | Own namespaces only | No | No |
| `analyst` | Own namespaces only | Own namespaces only | Yes | No |
| `admin` | All namespaces | All namespaces | Yes | Yes |
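The role matrix can be expressed as a small lookup table. This is a sketch only: `Permissions`, `ROLE_PERMISSIONS`, and `effective_permissions` are hypothetical names, and the rule that a user with several roles receives the union of their permissions is an assumption the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permissions:
    all_namespaces: bool            # False means "own namespaces only"
    receive_recommendations: bool
    trigger_scheduled_jobs: bool


# One entry per row of the role table.
ROLE_PERMISSIONS = {
    "viewer": Permissions(False, False, False),
    "analyst": Permissions(False, True, False),
    "admin": Permissions(True, True, True),
}


def effective_permissions(roles) -> Permissions:
    """Union of the permissions of every recognized role (assumed policy).

    Unknown role names grant nothing, so an empty or unrecognized role
    list yields the most restrictive result.
    """
    granted = [ROLE_PERMISSIONS[r] for r in roles if r in ROLE_PERMISSIONS]
    return Permissions(
        all_namespaces=any(p.all_namespaces for p in granted),
        receive_recommendations=any(p.receive_recommendations for p in granted),
        trigger_scheduled_jobs=any(p.trigger_scheduled_jobs for p in granted),
    )
```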
The namespace scope in the JWT is enforced at two layers:
- Tool Router (application-level): injects `namespace_filter` into every tool call.
- Kubernetes ServiceAccount (infrastructure-level): the read-only ClusterRole covers all namespaces, but the Tool Router filters results before they reach the agent.
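The two layers can be illustrated with a short sketch of what the Tool Router might do. It assumes tool arguments are a dictionary with an optional `namespace` key and that each result item carries a `namespace` field; both shapes, along with `NamespaceDenied`, `inject_namespace_filter`, and `filter_results`, are hypothetical. Handling of the `admin` all-namespaces case is left out.

```python
from typing import Any


class NamespaceDenied(Exception):
    """Raised when a tool call targets a namespace outside the user's scope."""


def inject_namespace_filter(tool_args: dict, allowed: list) -> dict:
    """Return a copy of tool_args with namespace_filter set from the JWT scope."""
    requested = tool_args.get("namespace")
    if requested is not None and requested not in allowed:
        raise NamespaceDenied(requested)
    scoped = dict(tool_args)  # never mutate the caller's arguments
    scoped["namespace_filter"] = (
        [requested] if requested is not None else sorted(allowed)
    )
    return scoped


def filter_results(items: list, allowed: list) -> list:
    """Second check: drop results outside the allowed namespaces.

    This matters because the ClusterRole itself can read every namespace.
    """
    allowed_set = set(allowed)
    return [item for item in items if item.get("namespace") in allowed_set]
```

Filtering the results as well as the request means a tool that ignores `namespace_filter` still cannot leak data from other namespaces to the agent.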
Internal services communicate using Kubernetes ServiceAccount tokens (projected volumes, auto-rotated) and mutual TLS enforced by Kong. No service accepts plain HTTP.
| Secret | Storage | Rotation |
|---|---|---|
| Anthropic API key | Kubernetes Secret + optional HashiCorp Vault | Quarterly or on suspected exposure |
| Prometheus auth token | Kubernetes Secret | Quarterly |
| Alertmanager auth token | Kubernetes Secret | Quarterly |
| Qdrant API key | Kubernetes Secret | Quarterly |
| PostgreSQL password | Kubernetes Secret | Quarterly |
| Redis password | Kubernetes Secret | Quarterly |
| Keycloak admin credentials | Kubernetes Secret + external KMS | On breach or personnel change |
| Kong admin token | Kubernetes Secret | Quarterly |
- Secrets are mounted as files into containers via `volumeMounts`, never as environment variables in the container spec (prevents exposure via `/proc/{pid}/environ`).
- Each service has its own Kubernetes ServiceAccount; secrets are bound to that ServiceAccount via RBAC.
- No secret is ever logged, printed, or included in error messages.
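As an illustration of the file-mount rule, a service could load a secret as follows. This is a sketch under assumptions: the mount directory `/etc/aiops/secrets` and the helper name `read_secret` are hypothetical, and the real mount path depends on the pod spec.

```python
from pathlib import Path


def read_secret(name: str, base_dir: str = "/etc/aiops/secrets") -> str:
    """Read a secret from a mounted file instead of an environment variable.

    base_dir is an assumed mount path. Error messages mention only the
    file path, never the secret value.
    """
    path = Path(base_dir) / name
    value = path.read_text(encoding="utf-8").strip()
    if not value:
        raise ValueError(f"secret file {path} is empty")
    return value
```

Reading the file at the point of use, rather than caching it at import time, would let a process see a rotated value once the mounted file is updated; whether the platform does this is not stated in the source.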
For environments where Vault is available, secrets are fetched dynamically using the Vault Agent Injector sidecar pattern:
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "aiops-api"
  vault.hashicorp.com/agent-inject-secret-anthropic: "secret/aiops/anthropic-key"

- All external HTTPS uses TLS 1.3 minimum.
- All intra-cluster communication uses mTLS (Kong enforces this at the gateway; Kubernetes Network Policies restrict direct pod-to-pod communication).
- Prometheus scrape targets are configured with TLS client certificates where the monitoring target supports it.
| Store | Encryption |
|---|---|
| PostgreSQL | AES-256 via storage-layer encryption (Kubernetes PV encryption or dm-crypt) |
| Redis | AES-256 via storage-layer encryption; Redis AUTH password required |
| Qdrant | AES-256 via storage-layer encryption |
| MinIO | Server-side encryption (AES-256-GCM) with per-object keys |
- The agent does not log raw metric values to PostgreSQL — only the PromQL query string and result hash.
- Conversation history stored in PostgreSQL is subject to a 90-day retention policy enforced by a nightly PostgreSQL job.
- Redis session data has a 2-hour TTL; after expiry, the session is purged automatically.
- No PII is extracted from metric labels or Kubernetes annotations by design.
External data (metrics, k8s API output, runbook content) is injected into the LLM context using structured XML tags. The system prompt explicitly instructs the model to treat tag content as data, not instructions.
Additional mitigations:
- Metric label values are truncated to 256 characters before injection.
- Kubernetes resource names matching common injection patterns (`; DROP`, `<script>`, `IGNORE PREVIOUS INSTRUCTIONS`) are flagged and sanitized before being included in prompts.
- All tool outputs are validated against expected JSON schemas before injection.
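The mitigations above can be sketched as a small sanitization step. This is an illustration, not the platform's implementation: the regular expressions, the `[REDACTED]` replacement, the tag name, and the function names are all assumptions, and for brevity the sketch applies truncation and pattern checks to any string rather than distinguishing metric labels from resource names.

```python
import re
from xml.sax.saxutils import escape

MAX_LABEL_LEN = 256  # truncation limit stated in the mitigations list

# Hypothetical encodings of the three example patterns.
INJECTION_PATTERNS = [
    re.compile(r";\s*DROP\b", re.IGNORECASE),
    re.compile(r"<\s*script\b", re.IGNORECASE),
    re.compile(r"IGNORE\s+PREVIOUS\s+INSTRUCTIONS", re.IGNORECASE),
]


def sanitize_value(value: str):
    """Truncate, then flag and redact known injection patterns.

    Returns (clean_value, flagged) so the caller can log a security event.
    """
    clean = value[:MAX_LABEL_LEN]
    flagged = any(p.search(clean) for p in INJECTION_PATTERNS)
    if flagged:
        for pattern in INJECTION_PATTERNS:
            clean = pattern.sub("[REDACTED]", clean)
    return clean, flagged


def wrap_as_data(tag: str, value: str) -> str:
    """Wrap external data in an XML tag, escaping markup characters so the
    value cannot close the tag early or open a new one."""
    clean, _ = sanitize_value(value)
    return f"<{tag}>{escape(clean)}</{tag}>"
```

Pattern matching of this kind only catches known phrasings, so it complements the structured tagging and system-prompt instruction rather than replacing them.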
# Kubernetes NetworkPolicy: restrict aiops-api egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aiops-api-egress
  namespace: aiops-system
spec:
  podSelector:
    matchLabels:
      app: aiops-api
  policyTypes:
    - Egress
  egress:
    # Allow: Prometheus (monitoring namespace)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - port: 9090
    # Allow: Kubernetes API server
    - to:
        - ipBlock:
            cidr: 10.0.0.1/32  # Kubernetes API server IP
      ports:
        - port: 6443
    # Allow: Qdrant, Redis, PostgreSQL (same namespace)
    - to:
        - podSelector: {}
      ports:
        - port: 6333  # Qdrant
        - port: 6379  # Redis
        - port: 5432  # PostgreSQL
    # Allow: Anthropic API (HTTPS egress)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - port: 443

Every event is written to the PostgreSQL `audit_events` table:
CREATE TABLE audit_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type TEXT NOT NULL,       -- 'query', 'tool_call', 'recommendation', 'auth_failure'
    user_id TEXT NOT NULL,
    session_id TEXT,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    agent_version TEXT NOT NULL,    -- git SHA of agent config
    -- Query events
    user_message TEXT,
    tool_calls JSONB,               -- array of {tool, inputs_hash, result_hash}
    response_hash TEXT,             -- SHA-256 of full response text
    -- Auth / security events
    namespace TEXT,
    ip_address INET,
    failure_reason TEXT,
    -- Metadata
    latency_ms INTEGER,
    token_count INTEGER
);

The audit log does not store:
- Raw LLM response text (only its hash, to verify integrity without storing sensitive data).
- Prometheus query results (only the PromQL query string).
- Kubernetes resource content (only the resource type and namespace queried).
- User credentials, tokens, or passwords.

Integrity controls:
- Audit log rows are append-only; no `UPDATE` or `DELETE` permissions are granted to the application role.
- A separate database role (`audit_archiver`) exports logs to MinIO nightly, where objects are immutable (object lock enabled).
- Log exports are SHA-256 signed using a key stored in the external KMS.
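The hash-only approach for audit rows can be sketched as follows. This is a minimal illustration under assumptions: the helper names are hypothetical, tool inputs and results are assumed to be JSON-serializable, and hashing a canonical JSON encoding (sorted keys, compact separators) is one possible convention that the source does not specify.

```python
import hashlib
import json
from typing import Any


def sha256_hex(value: Any) -> str:
    """SHA-256 hex digest of a string, or of canonical JSON for other values."""
    if isinstance(value, str):
        payload = value.encode("utf-8")
    else:
        # Sorted keys make the digest independent of dict insertion order.
        payload = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_query_event(user_id: str, agent_version: str, user_message: str,
                      tool_calls: list, response_text: str) -> dict:
    """Assemble the hash-only fields of an audit_events row for a query.

    Raw tool inputs, tool results, and the response text are reduced to
    digests; none of them appear in the returned record.
    """
    return {
        "event_type": "query",
        "user_id": user_id,
        "agent_version": agent_version,
        "user_message": user_message,
        "tool_calls": [
            {
                "tool": call["tool"],
                "inputs_hash": sha256_hex(call["inputs"]),
                "result_hash": sha256_hex(call["result"]),
            }
            for call in tool_calls
        ],
        "response_hash": sha256_hex(response_text),
    }
```

A stored digest lets an investigator confirm that a response or tool result retained elsewhere matches what the agent actually produced, without the audit table itself holding that content.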
| Signal | Action |
|---|---|
| Unusual spike in Anthropic API token usage | Rotate API key immediately; review audit logs |
| Agent querying namespaces outside user's scope | Bug or auth bypass — take API server offline; investigate |
| Prompt injection pattern detected in tool result | Log as security event; block the offending metric label |
| Failed login spike in Keycloak | Rate-limit source IP; alert security team |
| Responsibility | Contact |
|---|---|
| Platform security | Platform Security team |
| LLM / AI concerns | AI Safety lead |
| Infrastructure breach | Site Reliability team |
| Data breach notification | Legal / DPO |
1. Rotate the Anthropic API key via the Anthropic console.
2. Update the Kubernetes Secret in the same namespace as the deployment:
   kubectl create secret generic anthropic-key -n aiops-system \
     --from-literal=key=<new_key> --dry-run=client -o yaml | kubectl apply -f -
3. Rolling restart the aiops-api deployment:
   kubectl rollout restart deployment/aiops-api -n aiops-system
4. Review audit_events for the prior 24h, looking for unusual query patterns.
5. Notify the Anthropic security team if credential theft is suspected.
| Requirement | Implementation |
|---|---|
| Data residency | All components deploy within a single Kubernetes cluster; choose a region that meets residency requirements |
| GDPR / data minimization | 90-day audit log retention; no PII stored in metric labels by design |
| SOC 2 Type II (audit trail) | Append-only audit log; nightly signed export to MinIO |
| Least privilege | Read-only Kubernetes ServiceAccount; per-service secret RBAC |
| Vulnerability management | Container images scanned by Trivy on each CI build; critical CVEs block deployment |