SECURITY.md

Project: AI Ops Agent Platform
Version: 1.0
Status: Draft
Last updated: 2025-05-13

1. Threat Model

1.1 Assets to Protect

Asset	Classification	Impact if Compromised
Kubernetes API access (read)	High	Attacker can enumerate all workloads, configs, and pod IPs
Prometheus metrics data	Medium	Reveals service topology and traffic patterns
Runbook content	Medium	Internal operational procedures exposed
LLM system prompt	Medium	Agent persona and guardrails exposed, making them easier to bypass
User conversation history	High	Potentially contains sensitive operational context
PostgreSQL audit logs	High	Evidence tampering, regulatory non-compliance
Anthropic API key	Critical	Unlimited LLM spend; reputational damage
Keycloak admin credentials	Critical	Full identity system takeover

1.2 Threat Actors

Actor	Capability	Likely Vector
External attacker	Low-Medium	Public-facing Web UI / API injection
Malicious insider	Medium-High	Abuse of authorized access; privilege escalation
Compromised CI/CD	Medium	Injecting malicious container images
Prompt injection via metric data	Low-Medium	Crafted metric labels containing LLM instructions

1.3 Attack Surface

External surface:
  - Kong API Gateway (HTTPS, public or internal network)
  - Web UI (HTTPS)
  - Slack webhook receiver

Internal surface:
  - FastAPI service (cluster-internal)
  - Prometheus, Qdrant, Redis, PostgreSQL (cluster-internal)
  - Kubernetes API server (cluster-internal, mTLS)
  - Anthropic API (outbound HTTPS to external)

2. Authentication & Authorization

2.1 User Authentication (Keycloak OIDC)

All user-facing interfaces authenticate via Keycloak. The flow:

1. User accesses Web UI / REST API / CLI
2. Redirected to Keycloak login (or CLI device flow)
3. Keycloak issues JWT access token (RS256, 1-hour TTL)
4. Token presented to Kong API Gateway on every request
5. Kong validates token signature against Keycloak JWKS endpoint
6. Kong forwards validated claims to FastAPI in X-User-* headers

The JWT contains:

sub: unique user ID
preferred_username: display name
aiops_namespaces: list of authorized Kubernetes namespaces (custom claim, set by Keycloak mapper)
aiops_roles: list of platform roles (viewer, analyst, admin)
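As a rough illustration of step 6, the sketch below shows how the FastAPI side might turn the forwarded headers into a typed user context. The document only specifies the `X-User-*` prefix, so the exact header names (`X-User-Sub`, `X-User-Aiops-Namespaces`, and so on), the comma-separated encoding, and the `UserContext` type are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class UserContext:
    """Claims forwarded by the gateway (hypothetical shape)."""
    user_id: str
    username: str
    namespaces: frozenset[str]
    roles: frozenset[str]


def _split_csv(value: str) -> frozenset[str]:
    """Split a comma-separated header value, dropping empty entries."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def user_context_from_headers(headers: Mapping[str, str]) -> UserContext:
    """Build a UserContext from gateway-forwarded headers.

    Header names are assumptions; the design only says "X-User-*".
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_id = lowered.get("x-user-sub", "").strip()
    if not user_id:
        # Fail closed: a request without an identity is never treated as valid.
        raise PermissionError("missing user identity header")
    return UserContext(
        user_id=user_id,
        username=lowered.get("x-user-preferred-username", "").strip(),
        namespaces=_split_csv(lowered.get("x-user-aiops-namespaces", "")),
        roles=_split_csv(lowered.get("x-user-aiops-roles", "")),
    )
```

This only makes sense if the service is reachable solely through the gateway; otherwise a client could set these headers itself.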

2.2 Role-Based Access Control

Role	Query metrics	Query K8s API	Receive recommendations	Trigger scheduled jobs
`viewer`	Own namespaces only	Own namespaces only	No	No
`analyst`	Own namespaces only	Own namespaces only	Yes	No
`admin`	All namespaces	All namespaces	Yes	Yes
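The role table can be expressed as a small capability matrix. The following is a minimal sketch, not the platform's actual implementation; the capability keys (`all_namespaces`, `recommendations`, `trigger_jobs`) and function names are hypothetical, while the role names and their permissions are transcribed from the table.

```python
from typing import Iterable

# Capability matrix transcribed from the role table; key names are assumptions.
ROLE_CAPABILITIES = {
    "viewer":  {"all_namespaces": False, "recommendations": False, "trigger_jobs": False},
    "analyst": {"all_namespaces": False, "recommendations": True,  "trigger_jobs": False},
    "admin":   {"all_namespaces": True,  "recommendations": True,  "trigger_jobs": True},
}


def has_capability(roles: Iterable[str], capability: str) -> bool:
    """True if any of the user's roles grants the capability."""
    return any(
        ROLE_CAPABILITIES.get(role, {}).get(capability, False) for role in roles
    )


def can_query_namespace(
    roles: Iterable[str], own_namespaces: Iterable[str], namespace: str
) -> bool:
    """Apply the 'own namespaces only' rule, with admin allowed everywhere."""
    roles = list(roles)
    if has_capability(roles, "all_namespaces"):
        return True
    # Unknown roles grant nothing, even for a namespace listed in the token.
    has_known_role = any(role in ROLE_CAPABILITIES for role in roles)
    return has_known_role and namespace in set(own_namespaces)
```

Unknown roles are denied by default, so adding a new role requires an explicit entry in the matrix.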

The namespace scope in the JWT is enforced at two layers:

Tool Router (application-level): injects namespace_filter into every tool call.
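Kubernetes ServiceAccount (infrastructure-level): the agent's ServiceAccount is bound to a read-only ClusterRole. Because that ClusterRole covers all namespaces, per-user namespace scoping depends on the Tool Router, which filters results before they reach the agent.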
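A sketch of the application-level layer, under stated assumptions: the `namespace_filter` key comes from the text, but the function names, the use of `None` to mean "no restriction", and the shape of tool inputs and results are hypothetical.

```python
from typing import Any, Iterable, Optional


def scope_tool_call(
    tool_input: dict[str, Any],
    allowed_namespaces: Iterable[str],
    all_namespaces: bool = False,
) -> dict[str, Any]:
    """Return a copy of the tool input with namespace_filter injected."""
    scoped = dict(tool_input)
    # Always overwrite: a namespace_filter supplied by the caller or the
    # model is never trusted. None is used here to mean "no restriction".
    scoped["namespace_filter"] = None if all_namespaces else sorted(allowed_namespaces)
    return scoped


def filter_results(
    items: Iterable[dict[str, Any]], namespace_filter: Optional[list[str]]
) -> list[dict[str, Any]]:
    """Drop results outside the filter before they reach the agent."""
    if namespace_filter is None:
        return list(items)
    allowed = set(namespace_filter)
    return [item for item in items if item.get("namespace") in allowed]
```

Filtering the results as well as the request means a tool that ignores the filter still cannot leak out-of-scope resources to the agent.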

2.3 Service-to-Service Authentication

Internal services communicate using Kubernetes ServiceAccount tokens (projected volumes, auto-rotated) and mutual TLS enforced by Kong. No service accepts plain HTTP.

3. Secret Management

3.1 Secret Storage Strategy

Secret	Storage	Rotation
Anthropic API key	Kubernetes Secret + optional HashiCorp Vault	Quarterly or on suspected exposure
Prometheus auth token	Kubernetes Secret	Quarterly
Alertmanager auth token	Kubernetes Secret	Quarterly
Qdrant API key	Kubernetes Secret	Quarterly
PostgreSQL password	Kubernetes Secret	Quarterly
Redis password	Kubernetes Secret	Quarterly
Keycloak admin credentials	Kubernetes Secret + external KMS	On breach or personnel change
Kong admin token	Kubernetes Secret	Quarterly

3.2 Secret Access Rules

Secrets are mounted as files into containers via volumeMounts, never as environment variables in the container spec (prevents exposure via /proc/{pid}/environ).
Each service has its own Kubernetes ServiceAccount; secrets are bound to that ServiceAccount via RBAC.
No secret is ever logged, printed, or included in error messages.
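To illustrate the file-mount rule, here is a minimal sketch of how a service might load a secret at startup. The function name and any mount path passed to it are hypothetical. Error messages name only the file path, never its contents.

```python
from pathlib import Path


def read_secret(path: str) -> str:
    """Read a secret from a mounted file and strip the trailing newline."""
    try:
        value = Path(path).read_text(encoding="utf-8").strip()
    except OSError:
        # Report only the path; suppress the original exception chain so
        # nothing read from the file can end up in a traceback.
        raise RuntimeError(f"secret file not readable: {path}") from None
    if not value:
        raise RuntimeError(f"secret file is empty: {path}")
    return value
```

Reading the file on each use, rather than caching it for the process lifetime, would let a rotated secret take effect without a restart; that trade-off is left open here.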

3.3 Optional HashiCorp Vault Integration

For environments where Vault is available, secrets are fetched dynamically using the Vault Agent Injector sidecar pattern:

annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "aiops-api"
  vault.hashicorp.com/agent-inject-secret-anthropic: "secret/aiops/anthropic-key"

4. Data Security

4.1 Data in Transit

All external HTTPS connections require TLS 1.3 or later.
All intra-cluster communication uses mTLS (Kong enforces this at the gateway; Kubernetes Network Policies restrict direct pod-to-pod communication).
Prometheus scrape targets are configured with TLS client certificates where the monitoring target supports it.

4.2 Data at Rest

Store	Encryption
PostgreSQL	AES-256 via storage-layer encryption (Kubernetes PV encryption or dm-crypt)
Redis	AES-256 via storage-layer encryption; Redis AUTH password required
Qdrant	AES-256 via storage-layer encryption
MinIO	Server-side encryption (AES-256-GCM) with per-object keys

4.3 Data Minimization

The agent does not log raw metric values to PostgreSQL — only the PromQL query string and result hash.
Conversation history stored in PostgreSQL is subject to a 90-day retention policy enforced by a nightly PostgreSQL job.
Redis session data has a 2-hour TTL; after expiry, the session is purged automatically.
No PII is extracted from metric labels or Kubernetes annotations by design.
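The "query string and result hash" rule could look roughly like the sketch below. The canonical-JSON encoding, the field names, and the `prometheus_query` tool name are assumptions; the document specifies only that the PromQL string and a hash of the result are stored.

```python
import hashlib
import json
from typing import Any


def result_hash(result: Any) -> str:
    """SHA-256 over a canonical JSON encoding of the result."""
    # Sorted keys and fixed separators make the hash independent of
    # dictionary ordering and whitespace.
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_entry_for_query(promql: str, result: Any) -> dict[str, str]:
    """Build the record to persist: the query and a hash, never raw values."""
    return {
        "tool": "prometheus_query",  # hypothetical tool name
        "query": promql,
        "result_hash": result_hash(result),
    }
```

A hash lets an auditor later confirm that a replayed query produced the same data without the audit store ever holding the metric values.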

4.4 Prompt Injection Mitigation

External data (metrics, Kubernetes API output, runbook content) is injected into the LLM context inside structured XML tags. The system prompt explicitly instructs the model to treat tag content as data, not instructions.

Additional mitigations:

Metric label values are truncated to 256 characters before injection.
Kubernetes resource names matching common injection patterns (; DROP, <script>, IGNORE PREVIOUS INSTRUCTIONS) are flagged and sanitized before being included in prompts.
All tool outputs are validated against expected JSON schemas before injection.
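The truncation, pattern flagging, and XML wrapping described above might be sketched as follows. The 256-character limit and the three example patterns come from the text; the function names, the `[removed]` replacement token, and the tag name in the usage note are assumptions. Pattern matching of this kind is a coarse filter and should not be read as a complete defence.

```python
import re
from xml.sax.saxutils import escape

MAX_LABEL_LEN = 256

# Patterns taken from the examples in the text; a real list would be longer.
SUSPICIOUS = re.compile(
    r";\s*DROP|<script|IGNORE PREVIOUS INSTRUCTIONS", re.IGNORECASE
)


def sanitize_label(value: str) -> tuple[str, bool]:
    """Truncate a label value and neutralise known injection patterns.

    Returns the cleaned value and whether anything was flagged.
    """
    truncated = value[:MAX_LABEL_LEN]
    flagged = bool(SUSPICIOUS.search(truncated))
    if flagged:
        truncated = SUSPICIOUS.sub("[removed]", truncated)
    return truncated, flagged


def wrap_as_data(tag: str, content: str) -> str:
    """Wrap external content in an XML tag, escaping markup characters.

    Escaping stops the content from closing the tag early and placing
    text outside the data region.
    """
    return f"<{tag}>{escape(content)}</{tag}>"
```

For example, `wrap_as_data("metric_data", label)` would produce a block that the system prompt can refer to by tag name.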

5. Network Policies

# Kubernetes NetworkPolicy: restrict aiops-api egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aiops-api-egress
  namespace: aiops-system
spec:
  podSelector:
    matchLabels:
      app: aiops-api
  policyTypes:
    - Egress
  egress:
    # Allow: Prometheus (monitoring namespace)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - port: 9090
    # Allow: Kubernetes API server
    - to:
        - ipBlock:
            cidr: 10.0.0.1/32  # Kubernetes API server IP
      ports:
        - port: 6443
    # Allow: Qdrant, Redis, PostgreSQL (same namespace)
    - to:
        - podSelector: {}
      ports:
        - port: 6333  # Qdrant
        - port: 6379  # Redis
        - port: 5432  # PostgreSQL
    # Allow: Anthropic API (HTTPS egress)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - port: 443
    # Allow: DNS resolution. Without this rule the pod cannot resolve
    # service names or the Anthropic API hostname once egress is restricted.
    # Assumes the cluster DNS service runs in kube-system.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

6. Audit Logging

6.1 What is Logged

Every auditable event (query, tool call, recommendation, or authentication failure) is written to the PostgreSQL audit_events table:

CREATE TABLE audit_events (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type    TEXT NOT NULL,  -- 'query', 'tool_call', 'recommendation', 'auth_failure'
    user_id       TEXT NOT NULL,
    session_id    TEXT,
    timestamp     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    agent_version TEXT NOT NULL,  -- git SHA of agent config
    -- Query events
    user_message  TEXT,
    tool_calls    JSONB,          -- array of {tool, inputs_hash, result_hash}
    response_hash TEXT,           -- SHA-256 of full response text
    -- Auth / security events
    namespace     TEXT,
    ip_address    INET,
    failure_reason TEXT,
    -- Metadata
    latency_ms    INTEGER,
    token_count   INTEGER
);

6.2 What is NOT Logged

Raw LLM response text (only its hash, to verify integrity without storing sensitive data).
Prometheus query results (only the PromQL query string).
Kubernetes resource content (only the resource type and namespace queried).
User credentials, tokens, or passwords.

6.3 Log Integrity

Audit log rows are append-only; no UPDATE or DELETE permissions are granted to the application role.
A separate database role (audit_archiver) exports logs to MinIO nightly, where objects are immutable (object lock enabled).
Each log export is hashed with SHA-256, and the digest is signed using a key stored in the external KMS.
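The export integrity check can be illustrated with the standard library. This sketch uses HMAC-SHA-256 with a locally supplied key purely as a stand-in; in the design described here the key stays in the external KMS, and signing would go through that service's API, which is not shown. Function names are hypothetical.

```python
import hashlib
import hmac


def export_digest(data: bytes) -> str:
    """SHA-256 digest of an exported log file."""
    return hashlib.sha256(data).hexdigest()


def sign_export(data: bytes, key: bytes) -> str:
    """Stand-in for the KMS signing call: HMAC-SHA-256 over the export."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_export(data: bytes, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the expected and provided signatures."""
    return hmac.compare_digest(sign_export(data, key), signature)
```

A bare digest detects accidental corruption only; the keyed signature is what prevents someone with write access to the export from recomputing a matching value.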

7. Incident Response

7.1 Compromise Indicators

Signal	Action
Unusual spike in Anthropic API token usage	Rotate API key immediately; review audit logs
Agent querying namespaces outside user's scope	Bug or auth bypass: take the aiops-api service offline; investigate
Prompt injection pattern detected in tool result	Log as security event; block the offending metric label
Failed login spike in Keycloak	Rate-limit source IP; alert security team
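As one possible way to detect the first signal, the sketch below flags a token-usage sample that sits well above a recent baseline. The three-sigma threshold, the minimum sample count, the 10% spread floor, and the function name are all assumptions chosen for illustration, not values from this design.

```python
from statistics import mean, pstdev


def is_token_spike(
    history: list[int],
    current: int,
    sigma: float = 3.0,
    min_samples: int = 12,
) -> bool:
    """Return True if `current` is far above the recent baseline.

    `history` holds token counts for previous intervals (for example,
    hourly totals of the audit log's token_count column).
    """
    if len(history) < min_samples:
        # Too little data for a meaningful baseline; do not alert.
        return False
    baseline = mean(history)
    spread = pstdev(history)
    # Floor the spread at 10% of the baseline so a perfectly flat
    # history does not flag small, ordinary fluctuations.
    threshold = baseline + sigma * max(spread, 0.1 * baseline)
    return current > threshold
```

A check like this would only raise the alert; the rotation and log review in the table remain manual actions.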

7.2 Key Contacts

Responsibility	Contact
Platform security	Platform Security team
LLM / AI concerns	AI Safety lead
Infrastructure breach	Site Reliability team
Data breach notification	Legal / DPO

7.3 API Key Compromise Runbook

1. Rotate the Anthropic API key via the Anthropic console.
2. Update the Kubernetes Secret in the same namespace as the deployment:
     kubectl create secret generic anthropic-key -n aiops-system \
       --from-literal=key=<new_key> --dry-run=client -o yaml | kubectl apply -f -
3. Rolling restart the aiops-api deployment:
     kubectl rollout restart deployment/aiops-api -n aiops-system
4. Review audit_events for the prior 24h, looking for unusual query patterns.
5. Notify the Anthropic security team if credential theft is suspected.

8. Compliance Considerations

Requirement	Implementation
Data residency	All components deploy within a single Kubernetes cluster; choose a region that meets residency requirements
GDPR / data minimization	90-day audit log retention; no PII stored in metric labels by design
SOC 2 Type II (audit trail)	Append-only audit log; nightly signed export to MinIO
Least privilege	Read-only Kubernetes ServiceAccount; per-service secret RBAC
Vulnerability management	Container images scanned by Trivy on each CI build; critical CVEs block deployment
