duranalberto/aiops-agent

SECURITY.md — Security & Compliance Design

Project: AI Ops Agent Platform
Version: 1.0
Status: Draft
Last updated: 2025-05-13


1. Threat Model

1.1 Assets to Protect

| Asset | Classification | Impact if Compromised |
|---|---|---|
| Kubernetes API access (read) | High | Attacker can enumerate all workloads, configs, and pod IPs |
| Prometheus metrics data | Medium | Reveals service topology and traffic patterns |
| Runbook content | Medium | Internal operational procedures exposed |
| LLM system prompt | Medium | Agent persona and guardrails bypassed |
| User conversation history | High | Potentially contains sensitive operational context |
| PostgreSQL audit logs | High | Evidence tampering, regulatory non-compliance |
| Anthropic API key | Critical | Unlimited LLM spend; reputational damage |
| Keycloak admin credentials | Critical | Full identity system takeover |

1.2 Threat Actors

| Actor | Capability | Likely Vector |
|---|---|---|
| External attacker | Low-Medium | Public-facing Web UI / API injection |
| Malicious insider | Medium-High | Abuse of authorized access; privilege escalation |
| Compromised CI/CD | Medium | Injecting malicious container images |
| Prompt injection via metric data | Low-Medium | Crafted metric labels containing LLM instructions |

1.3 Attack Surface

External surface:
  - Kong API Gateway (HTTPS, public or internal network)
  - Web UI (HTTPS)
  - Slack webhook receiver

Internal surface:
  - FastAPI service (cluster-internal)
  - Prometheus, Qdrant, Redis, PostgreSQL (cluster-internal)
  - Kubernetes API server (cluster-internal, mTLS)
  - Anthropic API (outbound HTTPS to external)

2. Authentication & Authorization

2.1 User Authentication (Keycloak OIDC)

All user-facing interfaces authenticate via Keycloak. The flow:

1. User accesses Web UI / REST API / CLI
2. Redirected to Keycloak login (or CLI device flow)
3. Keycloak issues JWT access token (RS256, 1-hour TTL)
4. Token presented to Kong API Gateway on every request
5. Kong validates token signature against Keycloak JWKS endpoint
6. Kong forwards validated claims to FastAPI in X-User-* headers

The JWT contains:

  • sub: unique user ID
  • preferred_username: display name
  • aiops_namespaces: list of authorized Kubernetes namespaces (custom claim, set by Keycloak mapper)
  • aiops_roles: list of platform roles (viewer, analyst, admin)
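
As a rough illustration of step 6, the FastAPI side could rebuild a user context from the headers Kong forwards. The header names (`X-User-Id`, `X-User-Name`, `X-User-Namespaces`, `X-User-Roles`) and the comma-separated encoding are assumptions made for this sketch; the design above only specifies the `X-User-*` prefix.

```python
from dataclasses import dataclass
from typing import Mapping

# Platform roles listed in the aiops_roles claim.
KNOWN_ROLES = {"viewer", "analyst", "admin"}


@dataclass(frozen=True)
class UserContext:
    user_id: str
    username: str
    namespaces: frozenset
    roles: frozenset


def _split_csv(value: str) -> frozenset:
    """Split a comma-separated header value, dropping empty items."""
    return frozenset(item.strip() for item in value.split(",") if item.strip())


def user_context_from_headers(headers: Mapping[str, str]) -> UserContext:
    """Build a UserContext from Kong-forwarded headers (header names are hypothetical)."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    user_id = lowered.get("x-user-id", "").strip()
    if not user_id:
        raise PermissionError("missing X-User-Id header")
    roles = _split_csv(lowered.get("x-user-roles", ""))
    unknown = roles - KNOWN_ROLES
    if unknown:
        raise PermissionError(f"unknown roles: {sorted(unknown)}")
    return UserContext(
        user_id=user_id,
        username=lowered.get("x-user-name", "").strip(),
        namespaces=_split_csv(lowered.get("x-user-namespaces", "")),
        roles=roles,
    )
```

Because these headers are trusted as-is, a helper like this is only safe when the FastAPI service is reachable exclusively through Kong.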

2.2 Role-Based Access Control

| Role | Query metrics | Query K8s API | Receive recommendations | Trigger scheduled jobs |
|---|---|---|---|---|
| viewer | Own namespaces only | Own namespaces only | No | No |
| analyst | Own namespaces only | Own namespaces only | Yes | No |
| admin | All namespaces | All namespaces | Yes | Yes |

The namespace scope in the JWT is enforced at two layers:

  1. Tool Router (application-level): injects namespace_filter into every tool call.
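  2. Kubernetes ServiceAccount (infrastructure-level): the agent's ServiceAccount is bound to a read-only ClusterRole, so it can never modify cluster state. Because that ClusterRole covers all namespaces, namespace scoping itself is not enforced at this layer; the Tool Router filters results before they reach the agent.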
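
The role matrix and the Tool Router's namespace injection could be sketched as follows. This is a minimal illustration, not the project's actual router: the tool names (`prometheus_query`, `k8s_get`), the action names, and the convention that `namespace_filter=None` means "all namespaces" are assumptions.

```python
from typing import Any, Dict, FrozenSet, Iterable, List, Optional

# Permission matrix transcribed from the role table; action names are hypothetical.
PERMISSIONS = {
    "viewer": {"query_metrics", "query_k8s"},
    "analyst": {"query_metrics", "query_k8s", "receive_recommendations"},
    "admin": {"query_metrics", "query_k8s", "receive_recommendations",
              "trigger_scheduled_jobs"},
}

# Hypothetical mapping from tool name to the action it requires.
TOOL_ACTIONS = {"prometheus_query": "query_metrics", "k8s_get": "query_k8s"}


def is_allowed(roles: Iterable[str], action: str) -> bool:
    """True if any of the user's roles grants the action."""
    return any(action in PERMISSIONS.get(role, set()) for role in roles)


def build_tool_call(tool: str, inputs: Dict[str, Any],
                    roles: FrozenSet[str], namespaces: FrozenSet[str]) -> Dict[str, Any]:
    """Return a tool call with namespace_filter injected from the user's claims."""
    action = TOOL_ACTIONS.get(tool)
    if action is None or not is_allowed(roles, action):
        raise PermissionError(f"{tool} not permitted for roles {sorted(roles)}")
    # Admins may query all namespaces (None); everyone else is limited to the JWT claim.
    namespace_filter = None if "admin" in roles else sorted(namespaces)
    # Overwrite any caller-supplied filter so the agent cannot widen its own scope.
    return {"tool": tool, "inputs": {**inputs, "namespace_filter": namespace_filter}}


def filter_results(items: Iterable[Dict[str, Any]],
                   namespace_filter: Optional[List[str]]) -> List[Dict[str, Any]]:
    """Drop results outside the allowed namespaces before they reach the agent."""
    if namespace_filter is None:
        return list(items)
    allowed = set(namespace_filter)
    return [item for item in items if item.get("namespace") in allowed]
```

Filtering results a second time after the tool returns matters here because the underlying ClusterRole can read every namespace.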

2.3 Service-to-Service Authentication

Internal services communicate using Kubernetes ServiceAccount tokens (projected volumes, auto-rotated) and mutual TLS enforced by Kong. No service accepts plain HTTP.


3. Secret Management

3.1 Secret Storage Strategy

| Secret | Storage | Rotation |
|---|---|---|
| Anthropic API key | Kubernetes Secret + optional HashiCorp Vault | Quarterly or on suspected exposure |
| Prometheus auth token | Kubernetes Secret | Quarterly |
| Alertmanager auth token | Kubernetes Secret | Quarterly |
| Qdrant API key | Kubernetes Secret | Quarterly |
| PostgreSQL password | Kubernetes Secret | Quarterly |
| Redis password | Kubernetes Secret | Quarterly |
| Keycloak admin credentials | Kubernetes Secret + external KMS | On breach or personnel change |
| Kong admin token | Kubernetes Secret | Quarterly |

3.2 Secret Access Rules

  • Secrets are mounted as files into containers via volumeMounts, never as environment variables in the container spec (prevents exposure via /proc/{pid}/environ).
  • Each service has its own Kubernetes ServiceAccount; secrets are bound to that ServiceAccount via RBAC.
  • No secret is ever logged, printed, or included in error messages.
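
A service following these rules might load a file-mounted secret along the lines below. The mount directory `/var/run/secrets/aiops` and the `Secret` wrapper are hypothetical; the real path depends on the `volumeMounts` in each pod spec.

```python
from pathlib import Path

# Hypothetical mount point; the actual path is set by the pod's volumeMounts.
DEFAULT_SECRET_DIR = Path("/var/run/secrets/aiops")


class Secret:
    """Wrapper whose repr/str never reveal the value, to avoid accidental logging."""

    def __init__(self, value: str) -> None:
        self._value = value

    def reveal(self) -> str:
        """Return the raw value; call only at the point of use."""
        return self._value

    def __repr__(self) -> str:
        return "Secret(****)"

    __str__ = __repr__


def read_secret(name: str, secret_dir: Path = DEFAULT_SECRET_DIR) -> Secret:
    """Read one secret from its mounted file."""
    # Accept plain file names only, so a name cannot escape the mount directory.
    if name in {"", ".", ".."} or Path(name).name != name:
        raise ValueError("invalid secret name")
    value = (secret_dir / name).read_text(encoding="utf-8").strip()
    if not value:
        raise ValueError(f"secret file {name!r} is empty")
    return Secret(value)
```

Reading the file at the point of use, rather than caching the value at startup, also lets a process observe a rotated secret once the kubelet refreshes the mounted volume.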

3.3 Optional HashiCorp Vault Integration

For environments where Vault is available, secrets are fetched dynamically using the Vault Agent Injector sidecar pattern:

annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "aiops-api"
  vault.hashicorp.com/agent-inject-secret-anthropic: "secret/aiops/anthropic-key"

4. Data Security

4.1 Data in Transit

  • All external HTTPS traffic requires TLS 1.3 or later.
  • All intra-cluster communication uses mTLS (Kong enforces this at the gateway; Kubernetes Network Policies restrict direct pod-to-pod communication).
  • Prometheus scrape targets are configured with TLS client certificates where the monitoring target supports it.

4.2 Data at Rest

| Store | Encryption |
|---|---|
| PostgreSQL | AES-256 via storage-layer encryption (Kubernetes PV encryption or dm-crypt) |
| Redis | AES-256 via storage-layer encryption; Redis AUTH password required |
| Qdrant | AES-256 via storage-layer encryption |
| MinIO | Server-side encryption (AES-256-GCM) with per-object keys |

4.3 Data Minimization

  • The agent does not log raw metric values to PostgreSQL — only the PromQL query string and result hash.
  • Conversation history stored in PostgreSQL is subject to a 90-day retention policy enforced by a nightly PostgreSQL job.
  • Redis session data has a 2-hour TTL; after expiry, the session is purged automatically.
  • No PII is extracted from metric labels or Kubernetes annotations by design.
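
The first rule (store the PromQL query and a result hash, never the raw values) could look roughly like this. The tool name `prometheus_query` and the canonical-JSON encoding are assumptions chosen for the sketch.

```python
import hashlib
import json
from typing import Any, Dict


def result_hash(result: Any) -> str:
    """SHA-256 over a canonical JSON encoding, so equal results hash equally."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def metric_audit_entry(promql: str, result: Any) -> Dict[str, str]:
    """Build the record that gets persisted: query string and hash only."""
    return {
        "tool": "prometheus_query",  # hypothetical tool name
        "query": promql,
        "result_hash": result_hash(result),
    }
```

Sorting keys before hashing keeps the hash stable when the same result is serialized with a different key order.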

4.4 Prompt Injection Mitigation

External data (metrics, Kubernetes API output, runbook content) is injected into the LLM context inside structured XML tags. The system prompt explicitly instructs the model to treat the content of those tags as data, not instructions.

Additional mitigations:

  • Metric label values are truncated to 256 characters before injection.
  • Kubernetes resource names matching common injection patterns (; DROP, <script>, IGNORE PREVIOUS INSTRUCTIONS) are flagged and sanitized before being included in prompts.
  • All tool outputs are validated against expected JSON schemas before injection.
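
A minimal sketch of the truncation, pattern flagging, and XML wrapping steps is shown below. For brevity it applies one helper to any external string, whereas the text applies truncation to metric labels and pattern checks to resource names. The tag names, the redaction placeholder, and the exact regular expressions are assumptions; a real deployment would likely maintain a longer pattern list.

```python
import re
from typing import Tuple
from xml.sax.saxutils import escape

MAX_LABEL_LEN = 256

# Patterns named in the text above, matched case-insensitively.
INJECTION_PATTERNS = [
    re.compile(r";\s*DROP\b", re.IGNORECASE),
    re.compile(r"<\s*script\b", re.IGNORECASE),
    re.compile(r"IGNORE\s+PREVIOUS\s+INSTRUCTIONS", re.IGNORECASE),
]


def sanitize(value: str) -> Tuple[str, bool]:
    """Truncate the value, then flag and redact it if it matches a known pattern."""
    truncated = value[:MAX_LABEL_LEN]
    flagged = any(pattern.search(truncated) for pattern in INJECTION_PATTERNS)
    if flagged:
        # Hypothetical placeholder; the flag lets the caller log a security event.
        truncated = "[REDACTED: suspected injection]"
    return truncated, flagged


def wrap_as_data(tag: str, value: str) -> str:
    """Wrap external data in an XML tag for inclusion in the LLM context."""
    clean, _ = sanitize(value)
    # Escaping <, > and & prevents the value from closing the tag early.
    return f"<{tag}>{escape(clean)}</{tag}>"
```

For example, `wrap_as_data("metric_label", "x</metric_label>y")` yields `<metric_label>x&lt;/metric_label&gt;y</metric_label>`, so the payload cannot break out of its tag.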

5. Network Policies

# Kubernetes NetworkPolicy: restrict aiops-api egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aiops-api-egress
  namespace: aiops-system
spec:
  podSelector:
    matchLabels:
      app: aiops-api
  policyTypes:
    - Egress
  egress:
    # Allow: DNS resolution. Without this rule the pod cannot resolve service or
    # external hostnames. Assumes CoreDNS/kube-dns runs in kube-system with the
    # standard k8s-app=kube-dns label.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow: Prometheus (monitoring namespace)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - port: 9090
    # Allow: Kubernetes API server
    - to:
        - ipBlock:
            cidr: 10.0.0.1/32  # Placeholder: replace with your cluster's Kubernetes API server IP
      ports:
        - port: 6443
    # Allow: Qdrant, Redis, PostgreSQL (same namespace)
    - to:
        - podSelector: {}
      ports:
        - port: 6333  # Qdrant
        - port: 6379  # Redis
        - port: 5432  # PostgreSQL
    # Allow: Anthropic API (HTTPS egress)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - port: 443

6. Audit Logging

6.1 What is Logged

Every query, tool call, recommendation, and authentication failure is written to the PostgreSQL audit_events table:

CREATE TABLE audit_events (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type    TEXT NOT NULL,  -- 'query', 'tool_call', 'recommendation', 'auth_failure'
    user_id       TEXT NOT NULL,
    session_id    TEXT,
    timestamp     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    agent_version TEXT NOT NULL,  -- git SHA of agent config
    -- Query events
    user_message  TEXT,
    tool_calls    JSONB,          -- array of {tool, inputs_hash, result_hash}
    response_hash TEXT,           -- SHA-256 of full response text
    -- Auth / security events
    namespace     TEXT,
    ip_address    INET,
    failure_reason TEXT,
    -- Metadata
    latency_ms    INTEGER,
    token_count   INTEGER
);

6.2 What is NOT Logged

  • Raw LLM response text (only its hash, to verify integrity without storing sensitive data).
  • Prometheus query results (only the PromQL query string and a hash of the result).
  • Kubernetes resource content (only the resource type and namespace queried).
  • User credentials, tokens, or passwords.

6.3 Log Integrity

  • Audit log rows are append-only; no UPDATE or DELETE permissions are granted to the application role.
  • A separate database role (audit_archiver) exports logs to MinIO nightly, where objects are immutable (object lock enabled).
  • Each log export is hashed with SHA-256, and the digest is signed using a key stored in the external KMS.
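
The integrity check on an export could be sketched as below. HMAC-SHA256 with a locally held key is used here purely as a stand-in: in the design described above the key stays in the external KMS, so the signing step would be a call to that service instead. The manifest layout is also an assumption.

```python
import hashlib
import hmac
from typing import Dict


def sign_export(data: bytes, key: bytes) -> Dict[str, str]:
    """Return a manifest with the export's SHA-256 digest and a keyed signature."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        # Stand-in for a KMS signing call; the key would not leave the KMS.
        "signature": hmac.new(key, data, hashlib.sha256).hexdigest(),
    }


def verify_export(data: bytes, key: bytes, manifest: Dict[str, str]) -> bool:
    """Recompute digest and signature and compare them in constant time."""
    expected = sign_export(data, key)
    return (
        hmac.compare_digest(expected["sha256"], manifest["sha256"])
        and hmac.compare_digest(expected["signature"], manifest["signature"])
    )
```

Storing the manifest next to the object-locked export lets an auditor detect any later modification of the exported rows.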

7. Incident Response

7.1 Compromise Indicators

| Signal | Action |
|---|---|
| Unusual spike in Anthropic API token usage | Rotate API key immediately; review audit logs |
| Agent querying namespaces outside user's scope | Possible bug or auth bypass: take the aiops-api service offline and investigate |
| Prompt injection pattern detected in tool result | Log as security event; block the offending metric label |
| Failed login spike in Keycloak | Rate-limit source IP; alert security team |

7.2 Key Contacts

| Responsibility | Contact |
|---|---|
| Platform security | Platform Security team |
| LLM / AI concerns | AI Safety lead |
| Infrastructure breach | Site Reliability team |
| Data breach notification | Legal / DPO |

7.3 API Key Compromise Runbook

1. Rotate the Anthropic API key via the Anthropic console.
2. Update the Kubernetes Secret in the aiops-system namespace (the same namespace as the deployment):
     kubectl create secret generic anthropic-key -n aiops-system \
       --from-literal=key=<new_key> --dry-run=client -o yaml | kubectl apply -n aiops-system -f -
3. Restart the aiops-api deployment with a rolling restart:
     kubectl rollout restart deployment/aiops-api -n aiops-system
4. Review audit_events for the prior 24 hours, looking for unusual query patterns.
5. Notify the Anthropic security team if credential theft is suspected.

8. Compliance Considerations

| Requirement | Implementation |
|---|---|
| Data residency | All components deploy within a single Kubernetes cluster; choose a region that meets residency requirements |
| GDPR / data minimization | 90-day audit log retention; no PII stored in metric labels by design |
| SOC 2 Type II (audit trail) | Append-only audit log; nightly signed export to MinIO |
| Least privilege | Read-only Kubernetes ServiceAccount; per-service secret RBAC |
| Vulnerability management | Container images scanned by Trivy on each CI build; critical CVEs block deployment |
