AI security posture management (AI-SPM) is a comprehensive approach to maintaining the security and integrity of artificial intelligence (AI) and machine learning (ML) systems. It involves continuous monitoring, assessment, and improvement of the security posture of AI models, data, and infrastructure. AI-SPM includes identifying and addressing vulnerabilities, misconfigurations, and potential risks associated with AI adoption, as well as ensuring compliance with relevant privacy and security regulations.
This opensource project dedicated to implementing Enterprise level AI-SPM. By doing so organizations can proactively protect their AI systems from threats, minimize data exposure, and maintain the trustworthiness of their AI applications (agents, mpc servers, models and more).
Your organization is putting everything it’s got into AI applications—are you prepared to secure them?
Before you answer, think about these specific questions:
Can you identify all the shadow AI (including AI models, agents and associated resources) that's in your environment?
Are you effectively securing AI data to prevent data poisoning, bias and compliance breaches?
Do you know how to prioritize critical AI risks with context?
Are you confident that you can detect and respond quickly to suspicious activity in AI pipelines?
If you answered “not sure,” or “no” to even one of those questions, then you should take a closer look in to this project. It’s the way to see the current state of your AI ecosystem security.
Discover your AI models , agents, and associated resources security. Identify risks across AI application supply chains/piplines and agents - that can lead to data exfiltration and misuse of resources. Implement proper governance controls around AI usage.
- Quick how to deploy 101
- Project Information
- Platform at a Glance
- Features
- Roadmap
- Installation
- Local SSO with Keycloak + Traefik
- Troubleshooting
- Environment Reference
- Usage
- Tech Stack
- Architecture Overview
- Contributing
- 👤 Author: Dany Shapiro
- 📦 Version: 1.0.0
- 📄 License: Apache-2.0
- 📂 Repository: https://github.com/dshapi/AI-SPM
Get Orbyx AI SPM running locally in a few simple steps. Prerequisites:
brew install mkcert istioctl
mkcert -installsudo apt-get update
sudo apt-get install -y libnss3-tools # mkcert needs this to trust the CA in browsers (SSL suport)sudo dnf install -y nss-toolscurl -fsSLo /tmp/mkcert "https://github.com/FiloSottile/mkcert/releases/latest/download/mkcert-v1.4.4-linux-amd64"
sudo install -m 0755 /tmp/mkcert /usr/local/bin/mkcert
curl -fsSL https://istio.io/downloadIstio | sh -
sudo install -m 0755 istio-*/bin/istioctl /usr/local/bin/istioctl
mkcert -installIf you're on arm64 Linux, swap linux-amd64 → linux-arm64 in the mkcert URL.
clone the repo.
Run from /Users/danyshapiro/PycharmProjects/AISPM. Each step is idempotent.
export KUBECONFIG=$HOME/.kube/kind-aispm.yaml
./deploy/scripts/kind-cluster.sh init # cluster + registry + metrics-server
./deploy/scripts/kind-storage.sh up # MinIO + flink bucket
./deploy/scripts/kind-databases-ha.sh up # CNPG + Bitnami Redis Sentinel
# Push AISPM service images to the local registry the kind nodes pull from:
docker compose build
docker images --format '{{.Repository}}' | grep '^aispm-' | sort -u | while read img; do
docker tag "${img}:latest" "localhost:5001/${img}:latest"
docker push "localhost:5001/${img}:latest"
done
# Alias for chart templates that hardcode `local-path`:
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-path
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
EOF
SKIP_FALCO=1 SKIP_KYVERNO=1 \
VALUES_EXTRA=deploy/helm/aispm/values.dev-multinode.yaml \
./deploy/scripts/bootstrap-cluster.shEnd-to-end on a fresh machine: about 20 minutes. Subsequent runs that only re-deploy the AISPM chart take about 5 minutes.
Once the bootstrap completes, navigate to:
| What | URL |
|---|---|
| Chat UI | https://aispm.local:30443 |
| Admin Portal | https://aispm.local:30443/admin/ |
Click Sign In on either page — a demo JWT is minted automatically, no account needed.
That's it! You're up and running.
| Microservices | 16 |
| OPA Policies | 6 |
| Kafka Topics | 12+ |
| Admin User Interface | 1 ( Admin portal ) |
| Supported Models | Anthropic / OpenAI-compatible endpoint / 3rd party model imprort |
| Compliance Framework | NIST AI RMF (GOVERN / MAP / MEASURE / MANAGE) |
| Feature | Description | Component |
|---|---|---|
| RS256 JWT Auth | Every API request validated against platform-generated RSA key pair. Tokens are short-lived and audience-scoped. | CPM API |
| Role-Based Access | Roles (spm:admin, spm:auditor, user) enforced on all SPM endpoints via OPA policy evaluation per request. |
OPA / CPM API |
| Dev Token Endpoint | /dev-token generates 24-hour demo JWTs signed by the platform's own private key — no external IdP needed for development. |
CPM API |
| Per-User Rate Limiting | Sliding window in Redis: 60 req/min with burst allowance of 10. Returns 429 with retry headers. |
CPM API / Redis |
| Tenant Isolation | All events, topics, and audit records are scoped by tenant_id. Multi-tenant from day one. |
All services |
| Feature | Description | Component |
|---|---|---|
| Guard Model Screening | Every prompt passes through Llama Guard 3 (8B) before reaching the LLM. Blocks harmful content with category labels. | Guard Model |
| Prompt Injection Detection | Memory service scans writes for injection patterns: ignore previous instructions, act as if, override instructions etc. |
Memory Service |
| OPA Prompt Policy | Rego policy evaluates posture score, intent drift, guard verdict, and auth context. Decisions: allow / escalate / block. |
OPA |
| Posture-Based Blocking | Requests with risk score ≥ 0.70 are auto-blocked. 0.30–0.70 escalated. Below 0.30 allowed. | Policy Decider |
| Intent Drift Detection | Jaccard similarity tracks deviation from session baseline. High drift triggers escalation. | Flink CEP |
| Feature | Description | Component |
|---|---|---|
| Secret Scanning | Regex detects API keys (sk-, ghp_, AKIA*), Bearer tokens, passwords in LLM responses. |
CPM API |
| PII Detection | Detects email addresses, US SSNs, and phone numbers in responses. Triggers redaction or block via OPA output policy. | CPM API |
| Output Redaction | Matched secrets and PII replaced with [REDACTED-SECRET] / [REDACTED-PII] before reaching the user. |
CPM API |
| OPA Output Policy | Second-pass policy evaluation on LLM output. Considers contains_secret, contains_pii, and LLM verdict. |
OPA |
| Output Guard LLM | Optional second-pass LLM semantic scan for subtle policy violations not caught by regex. | Output Guard |
| Feature | Description | Component |
|---|---|---|
| Model Registry | Full lifecycle: register → approve → freeze → retire. Tracked with provider, version, risk tier, and approver. | SPM API / DB |
| Model Gate | CPM API checks SPM approval status before every LLM call. Unapproved models return 403. Fail-closed by design. |
CPM API / OPA |
| Risk Tier Classification | Models classified as low / medium / high risk. Influences OPA policy thresholds and compliance evidence requirements. |
SPM API |
| Multi-Model Support | Swap between Claude Haiku, Sonnet, Opus via ANTHROPIC_MODEL env var. Architecture supports any OpenAI-compatible endpoint. |
CPM API |
| Model Freeze | Freeze controller suspends a model from serving traffic in real time via Kafka freeze_control topic. |
Freeze Controller |
| Feature | Description | Component |
|---|---|---|
| Baseline vs Current Comparison | Compares approved agent posture snapshots with current runtime state to detect model, tool, identity, runtime, RAG, and guardrail drift. | platform_shared.posture_drift |
| Risk-Based Reapproval | Classifies drift as low, medium, high, or critical and recommends accept, re-review, rollback, or disable actions. | SPM API / Agent Control Plane |
| Evidence Hashing | Produces stable evidence hashes for audit records and compliance reporting. | Audit / Compliance |
| Feature | Description | Component |
|---|---|---|
| Web Search | Claude autonomously searches the web via Tavily API when prompted about current events or real-time data. | CPM API / Tavily |
| Web Fetch | Claude fetches and reads any URL provided by the user. HTML cleaned with BeautifulSoup before injection into context. | CPM API |
| Tool Authorization | OPA tool_policy.rego evaluates every tool call against posture score, intent, and auth context before execution. |
OPA / Executor |
| Tool Execution Pipeline | Tool requests flow: tool_request → OPA auth → Executor → tool_result. Side-effect tools require approval. |
Executor / Agent |
| Approval Workflow | Write/send/delete tools emit to approval_request topic and await approval_result before executing. |
Executor |
| Feature | Description | Component |
|---|---|---|
| Cross-Session Memory | Conversation history stored in Redis with 30-day TTL. Claude receives last 20 turns as context on every request. | CPM API / Redis |
| Integrity Verification | Every memory write generates a SHA-256 hash. Reads verify the hash — integrity_ok=False triggers a security alert. |
Memory Service |
| Namespace Scoping | Three namespaces: session (1h TTL), longterm (30d TTL), system (24h TTL). OPA policy controls access per namespace. |
Memory Service |
| Injection Protection | Memory writes scanned for prompt injection patterns before storage. Malicious writes are rejected and audited. | Memory Service |
| Soft Delete | Memory deletes create tombstones rather than hard deleting. Audit trail preserved for forensics. | Memory Service |
| Metric | Description |
|---|---|
spm_model_risk_score |
Per-model gauge updated on every posture event. Labels: model_id, tenant_id. |
spm_enforcement_actions_total |
Counter tracking block / escalate / allow decisions. Labels: action, tenant_id. |
spm_snapshot_lag_seconds |
Seconds since last posture snapshot write. Updated every 15s by background thread. |
spm_compliance_coverage_pct |
NIST AI RMF coverage % per function. Labels: function (GOVERN, MAP, MEASURE, MANAGE, OVERALL). |
Engineering Dashboard
- Model Risk Score over time (time-series)
- Enforcement Actions total (stat)
- Snapshot Lag (gauge with thresholds)
- Model Lifecycle Status (table — name, version, status, risk tier, approver)
- Web Tool Calls — every search/fetch with user, session, exact query (table)
- Tool Type Breakdown — Search vs Fetch split (donut chart)
- Blocked Requests — guard blocks, output blocks, model gate blocks with reason (table)
Compliance Dashboard
- NIST AI RMF Coverage per function (gauge panels)
- Overall Coverage % (stat)
- Compliance Gap Table (table — control, status, evidence)
| Feature | Description | Component |
|---|---|---|
| Tamper-Evident Audit Log | All events written to Kafka audit topic and mirrored to audit_export table in PostgreSQL. ON CONFLICT DO NOTHING ensures idempotency. |
SPM Aggregator / DB |
| NIST AI RMF Alignment | Compliance evidence mapped to GOVERN, MAP, MEASURE, MANAGE functions. Coverage % computed per function. | SPM API / DB |
| MITRE ATLAS TTP Mapping | CEP maps behavioural patterns to ATLAS TTPs (e.g. AML.T0048, AML.T0051.000). Attached to security alerts. |
Flink CEP |
| Compliance Evidence | Attach evaluation results, test reports, and approval notes to each model as structured evidence records. | SPM API |
| Startup Audit Record | Platform startup writes an audit record per tenant. Baseline timestamp for forensic investigation. | Startup Orchestrator |
| Feature | Description | Component |
|---|---|---|
| Burst Detection | Tracks request volume in a 2-minute window. >5 events triggers burst alert with ATLAS TTP code. | Flink CEP |
| Sustained Volume Detection | 1-hour rolling window detects sustained high-volume usage (>15 events). | Flink CEP |
| Critical Combo Detection | Specific signal combinations (e.g. exfiltration + high posture + PII) trigger immediate critical escalation. | Flink CEP |
| Session Signal Accumulation | Signals accumulate across a session. Repeated suspicious signals compound the risk score. | Flink CEP |
| Posture Snapshot History | Risk scores snapshotted every 5 minutes per model per tenant. Rolling average over configurable N snapshots. | SPM Aggregator |
| Topic | Publisher | Consumer |
|---|---|---|
{tenant}.raw |
CPM API | Processor |
{tenant}.posture_enriched |
Processor | Policy Decider, Flink CEP, SPM Aggregator |
{tenant}.decision |
Policy Decider | Agent |
{tenant}.tool_request |
Agent / Tool Parser | Executor |
{tenant}.tool_result |
Executor | Agent |
{tenant}.audit |
All services | SPM Aggregator → audit_export |
{tenant}.memory_request |
Agent | Memory Service |
{tenant}.memory_result |
Memory Service | Agent |
{tenant}.approval_request |
Executor | (human reviewer) |
{tenant}.freeze_control |
Freeze Controller | All consumers |
| Service | Role |
|---|---|
| Startup Orchestrator | Validates OPA policies, waits for Kafka, creates topics, registers models, smoke-tests all policies on boot. |
| Processor | Enriches raw events with posture scoring, intent analysis, CEP signals. Publishes PostureEnrichedEvent. |
| Policy Decider | Evaluates OPA prompt policy on enriched events. Publishes DecisionEvent. |
| Agent Orchestrator | Plans tool execution and memory access based on OPA intent manifest. |
| Executor | Runs authorised tools. Implements tool registry with approval flow for side-effect operations. |
| Tool Parser | Extracts and validates structured tool calls from LLM output before forwarding to executor. |
| Memory Service | Scoped key-value store in Redis with integrity hashing, injection protection, and soft delete. |
| Output Guard | Optional second-pass LLM semantic scan of responses for subtle policy violations. |
| Retrieval Gateway | RAG-ready retrieval service. Scores document chunks for trust before injecting into LLM context. |
| Freeze Controller | Real-time model suspension via Kafka. Freeze propagates to all consumers within milliseconds. |
| Policy Simulator | Dry-run any policy change before deployment. Returns allow/block/escalate without touching live traffic. |
| SPM Aggregator | Consumes posture and audit events, writes to PostgreSQL, updates Prometheus metrics. |
| SPM API | REST API for model registry, compliance evidence, approval workflow, and audit export. |
| Guard Model | Llama Guard 3 (8B) inference service. Screens every prompt for harmful content categories. |
| Feature | Description |
|---|---|
| Orbyx Admin Portal | An AI Security Posture Management control plane providing real-time visibility, risk detection, and policy enforcement across agents, models, and context flows.. |
| Orbyx Chat UI | React + Vite chat interface with landing state, simulated streaming, model selector, and New Chat button. |
| Tool Use Badges | Web search and fetch tool calls rendered as blue pill badges above the response text. |
| Security Footer | Persistent footer: "All messages are screened by the Orbyx security layer" — visible on every message. |
| Mock Fallback | UI falls back to mock responses when API is unreachable. Graceful degradation for demos. |
| Cross-Session Memory UI | Claude remembers previous conversations across sessions — no user action required. |
| Model Selector | Switch between Claude Haiku / Sonnet / Opus from the chat header or landing page. |
Features not yet implemented — candidates for the next sprint:
- Human-in-the-loop escalation — middle-risk requests (0.30–0.70) route to a human reviewer queue
- Automated compliance reports — one-click PDF/DOCX export of NIST AI RMF posture for auditors
- Model drift detection — alert when a model's risk score distribution shifts after a provider update
- Shadow mode — run a candidate model in parallel without serving its responses, compare metrics
- Cost tracking — token spend per tenant/user/model tracked in Prometheus and Grafana
- Alerting — Slack/email when blocked requests spike above configurable threshold
- Hallucination scoring — post-response confidence estimation using a lightweight verifier model
- Local model support — Ollama/vLLM integration for HuggingFace models on Apple Silicon or GPU
- A/B model routing — split traffic between two approved models and compare quality/risk metrics
- Fine-grained tool RBAC — different user roles get access to different tools
- Session replay — replay any conversation in the audit UI for incident investigation
Orbyx AI SPM v3.0 · April 2026
- Prerequisites
- Clone & Configure
- API Keys
- First Boot
- Verify the Platform
- Access the UI & Dashboards
- Run the Smoke Test
- Stopping & Cleaning Up
- Troubleshooting
- Environment Reference
| Tool | Minimum version | Notes |
|---|---|---|
| Docker | 24+ | The kind cluster runs as Docker containers; any Docker daemon works (Docker Desktop, OrbStack, Colima, native dockerd on Linux). |
| Docker Compose | v2.20+ | Used for docker compose build to produce service images that get pushed to the kind-side registry. |
| Git | any | To clone the repo |
| Make | any | brew install make (macOS) / apt install make (Linux) |
| 4 GB free RAM | — | Kafka + all services |
| 2 GB free disk | — | Images + volumes |
All images are published for
linux/arm64. The compose file already sets the correct platform tags.
git clone https://github.com/your-org/orbyx-aispm.git
cd orbyx-aispmCopy the example environment file:
cp .env.example .envDo not commit your
.envfile — it is already in.gitignore.
Open .env in any editor and fill in the two required secrets:
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxx
ANTHROPIC_MODEL=claude-sonnet-4-6 # or claude-haiku-4-5-20251001 / claude-opus-4-6Get a key at console.anthropic.com.
TAVILY_API_KEY=tvly-xxxxxxxxxxxxGet a free key at app.tavily.com. Without this key, the web search tool will silently skip search calls and Claude will answer from its training data only.
GROQ_API_KEY=gsk_xxxxxxxxxxxx
HUNT_MODEL=llama-3.3-70b-versatile # model used by the threat-hunting agentGet a free key at console.groq.com.
Groq is used in two places:
- Threat Hunting Agent —
GROQ_API_KEYis required. Without it thethreat-hunting-agentservice will refuse to start. - Guard Model (Llama Guard) — optional. Without a key the guard falls back to a built-in regex classifier (still functional, just less accurate).
make upThis single command will:
- Build all Docker images from source
- Start the full infrastructure stack (Kafka, Redis, PostgreSQL, OPA, Prometheus, Grafana)
- Run the startup orchestrator, which automatically:
- Generates RSA key-pair into
./keys/(used for JWT signing) - Creates Kafka topics and ACLs per tenant
- Seeds OPA with the default policy bundle
- Registers the default AI model in the SPM registry
- Generates RSA key-pair into
- Start all platform services (API, Guard Model, CEP, SPM, UI, etc.)
The orchestrator exits when provisioning is complete. Expect the first build to take 3–5 minutes depending on your internet speed. Subsequent starts are near-instant.
You'll see this when it's ready:
✓ Platform started.
Admin chat: http://localhost:3001/
Admin: http://localhost:3001/admin
API: http://localhost:8080
Guard Model: http://localhost:8200
Freeze Controller: http://localhost:8090
Policy Simulator: http://localhost:8091
OPA: http://localhost:8181
Check that all services are healthy:
make statusExpected output shows all containers as Up or healthy. The API and Guard Model health endpoints will return JSON {"status": "ok"}.
Alternatively:
docker compose ps| Service | URL | Credentials |
|---|---|---|
| Orbyx Chat UI | http://localhost:3000 | Auto-login via JWT (click "Sign In") |
| Grafana | http://localhost:3001 | admin / admin (change on first login) |
| Prometheus | http://localhost:9090 | No auth |
| OPA | http://localhost:8181 | No auth |
| SPM API | http://localhost:8092 | JWT Bearer token required |
| Policy Simulator | http://localhost:8091 | JWT Bearer token required |
Three dashboards are pre-provisioned and load automatically:
- AI SPM Overview — posture scores, enforcement actions, risk trends
- Engineering — tool calls, blocked requests, model performance, CEP events
- Compliance — NIST AI RMF control coverage, audit trail
Send a real request through the full pipeline and verify end-to-end:
make smoke-testThis will:
- Mint a demo JWT
- Send
"What meetings do I have today?"→ expects a Claude response - Send a prompt injection attempt → expects
HTTP 400(blocked)
A passing run ends with:
✓ Smoke test PASSED
Stop all services (keeps data):
docker compose down
# with auth overlay:
docker compose -f compose.yml -f compose.auth.yml downStart local Docker Compose stack:
docker compose up -dBootstrap or re-deploy the full K8s/Helm stack:
bash deploy/scripts/bootstrap-cluster.shStop and wipe all data (volumes, generated keys):
make clean
⚠️ make cleandeletes the RSA keys in./keys/and the Keycloak realm volume (keycloak-data). New keys are auto-generated on next boot (invalidates existing JWTs). You will also need to redo the first-time Keycloak setup.
Check the orchestrator logs:
docker compose logs startup-orchestratorCommon causes: Kafka not ready in time. Re-run make up — it is idempotent.
Use the service name (not the container name) with docker compose:
docker compose restart startup-orchestrator # ✓ correct
docker compose restart cpm-startup-orchestrator # ✗ wrongThis indicates a model gate rejection. Check that LLM_MODEL_ID in .env is blank:
LLM_MODEL_ID=Then restart the API:
docker compose up -d --build apiThe model name in your .env is outdated. Update to a current model:
ANTHROPIC_MODEL=claude-sonnet-4-6Current valid model IDs:
| Label | Model ID |
|---|---|
| Claude Haiku | claude-haiku-4-5-20251001 |
| Claude Sonnet | claude-sonnet-4-6 |
| Claude Opus | claude-opus-4-6 |
Panels populate after the first real request is processed. Run make smoke-test to generate events, then refresh the dashboard.
If any port (3000, 3001, 8080, etc.) is already in use, edit compose.yml and change the host-side port mapping:
ports:
- "3100:3000" # change 3000 → 3100 (host:container)The audit_export table is missing the session_id column added in migration 002. Run the migration once while the stack is up:
# Option A — via Alembic
docker compose exec spm-api alembic upgrade head
# Option B — direct SQL
docker compose exec spm-db psql -U spm_rw -d spm -c "
ALTER TABLE audit_export ADD COLUMN IF NOT EXISTS session_id VARCHAR(64);
CREATE INDEX IF NOT EXISTS idx_audit_export_session_id ON audit_export (session_id);
"The threat-hunting-agent requires a Groq API key. Set it in .env:
GROQ_API_KEY=gsk_xxxxxxxxxxxxGet a free key at console.groq.com, then rebuild the service:
docker compose up -d --build threat-hunting-agentdocker compose up -d --build api # rebuild API only
docker compose up -d --build ui # rebuild UI only
docker compose up -d --build spm-aggregator # rebuild SPM aggregator
docker compose up -d --build threat-hunting-agent # rebuild threat hunting agentThe following variables can be tuned in .env. All have sane defaults and only the API keys need to be set for a working installation.
| Variable | Default | Description |
|---|---|---|
ANTHROPIC_API_KEY |
(required) | Anthropic API key |
ANTHROPIC_MODEL |
claude-sonnet-4-6 |
Claude model to use |
TAVILY_API_KEY |
(optional) | Tavily key for web search tool |
GROQ_API_KEY |
(required for threat hunting) | Groq key — powers the Threat Hunting Agent LLM and optionally accelerates Llama Guard 3 |
HUNT_MODEL |
llama-3.3-70b-versatile |
Groq model used by the threat-hunting agent |
HUNT_BATCH_WINDOW_SEC |
30 |
Kafka batch window for the threat-hunting agent (seconds) |
THREATHUNTING_AI_INTERVAL_SEC |
300 |
Proactive threat scan interval (seconds) |
TENANTS |
t1 |
Comma-separated tenant IDs |
RATE_LIMIT_RPM |
60 |
Max requests per minute per user |
GUARD_MODEL_ENABLED |
true |
Enable/disable content guard |
POSTURE_BLOCK_THRESHOLD |
0.70 |
Risk score at which requests are blocked |
CEP_SHORT_WINDOW_SEC |
120 |
Burst detection window (seconds) |
CEP_LONG_WINDOW_SEC |
3600 |
Sustained volume window (seconds) |
MEMORY_LONGTERM_TTL_SEC |
2592000 |
Cross-session memory TTL (30 days) |
SPM_SNAPSHOT_INTERVAL_SEC |
300 |
Posture snapshot interval (5 min) |
GRAFANA_ADMIN_PASSWORD |
admin |
Grafana admin password |
REDIS_PASSWORD |
(blank) | Redis password (blank = no auth) |
SPM_DB_PASSWORD |
spmpass |
PostgreSQL password for SPM DB |
LLM_MODEL_ID |
(blank) | SPM model registry ID (leave blank to bypass gate) |
make up # Start everything
make down # Stop everything
make status # Health check
make logs # Tail all logs
make logs-api # Tail API logs only
make smoke-test # End-to-end test
make token # Mint a demo user JWT
make admin-token # Mint an admin JWT
make freeze # Freeze demo user (requires admin token)
make unfreeze # Unfreeze demo user
make clean # Wipe all data and keys- Prerequisites
- Clone & Configure
- API Keys
- First Boot
- Verify the Platform
- Access the UI & Dashboards
- Run the Smoke Test
- Stopping & Cleaning Up
- Troubleshooting
- Environment Reference
| Tool | Minimum version | Notes |
|---|---|---|
| Docker | 24+ | The kind cluster runs as Docker containers; any Docker daemon works (Docker Desktop, OrbStack, Colima, native dockerd on Linux). |
| Docker Compose | v2.20+ | Used for docker compose build to produce service images that get pushed to the kind-side registry. |
| Git | any | To clone the repo |
| Make | any | brew install make (macOS) / apt install make (Linux) |
| 4 GB free RAM | — | Kafka + all services |
| 2 GB free disk | — | Images + volumes |
Apple Silicon (M1/M2/M3): All images are published for
linux/arm64. The compose file already sets the correct platform tags.
git clone https://github.com/your-org/orbyx-aispm.git
cd orbyx-aispmCopy the example environment file:
cp .env.example .envDo not commit your
.envfile — it is already in.gitignore.
Open .env in any editor and fill in the two required secrets:
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxx
ANTHROPIC_MODEL=claude-sonnet-4-6 # or claude-haiku-4-5-20251001 / claude-opus-4-6Get a key at console.anthropic.com.
TAVILY_API_KEY=tvly-xxxxxxxxxxxxGet a free key at app.tavily.com. Without this key, the web search tool will silently skip search calls and Claude will answer from its training data only.
GROQ_API_KEY=gsk_xxxxxxxxxxxx
HUNT_MODEL=llama-3.3-70b-versatile # model used by the threat-hunting agentGet a free key at console.groq.com.
Groq is used in two places:
- Threat Hunting Agent —
GROQ_API_KEYis required. Without it thethreat-hunting-agentservice will refuse to start. - Guard Model (Llama Guard) — optional. Without a key the guard falls back to a built-in regex classifier (still functional, just less accurate).
make upThis single command will:
- Build all Docker images from source
- Start the full infrastructure stack (Kafka, Redis, PostgreSQL, OPA, Prometheus, Grafana)
- Run the startup orchestrator, which automatically:
- Generates RSA key-pair into
./keys/(used for JWT signing) - Creates Kafka topics and ACLs per tenant
- Seeds OPA with the default policy bundle
- Registers the default AI model in the SPM registry
- Generates RSA key-pair into
- Start all platform services (API, Guard Model, CEP, SPM, UI, etc.)
The orchestrator exits when provisioning is complete. Expect the first build to take 3–5 minutes depending on your internet speed. Subsequent starts are near-instant.
You'll see this when it's ready:
✓ Platform started.
Chat: http://localhost:3001
Admin portal: http://localhost:3001/admin
API: http://localhost:8080
Guard Model: http://localhost:8200
Freeze Controller: http://localhost:8090
Policy Simulator: http://localhost:8091
OPA: http://localhost:8181
Check that all services are healthy:
make statusExpected output shows all containers as Up or healthy. The API and Guard Model health endpoints will return JSON {"status": "ok"}.
Alternatively:
docker compose ps| Service | URL | Credentials |
|---|---|---|
| Orbyx Chat UI | http://localhost:3000 | Auto-login via JWT (click "Sign In") |
| Grafana | http://localhost:3001 | admin / admin (change on first login) |
| Prometheus | http://localhost:9090 | No auth |
| OPA | http://localhost:8181 | No auth |
| SPM API | http://localhost:8092 | JWT Bearer token required |
| Policy Simulator | http://localhost:8091 | JWT Bearer token required |
Three dashboards are pre-provisioned and load automatically:
- AI SPM Overview — posture scores, enforcement actions, risk trends
- Engineering — tool calls, blocked requests, model performance, CEP events
- Compliance — NIST AI RMF control coverage, audit trail
Send a real request through the full pipeline and verify end-to-end:
make smoke-testThis will:
- Mint a demo JWT
- Send
"What meetings do I have today?"→ expects a Claude response - Send a prompt injection attempt → expects
HTTP 400(blocked)
A passing run ends with:
✓ Smoke test PASSED
Stop all services (keeps data):
docker compose down
# with auth overlay:
docker compose -f compose.yml -f compose.auth.yml downStart local Docker Compose stack:
docker compose up -dBootstrap or re-deploy the full K8s/Helm stack:
bash deploy/scripts/bootstrap-cluster.shStop and wipe all data (volumes, generated keys):
make clean
⚠️ make cleandeletes the RSA keys in./keys/and the Keycloak realm volume (keycloak-data). New keys are auto-generated on next boot (invalidates existing JWTs). You will also need to redo the first-time Keycloak setup.
Check the orchestrator logs:
docker compose logs startup-orchestratorCommon causes: Kafka not ready in time. Re-run make up — it is idempotent.
Use the service name (not the container name) with docker compose:
docker compose restart startup-orchestrator # ✓ correct
docker compose restart cpm-startup-orchestrator # ✗ wrongThis indicates a model gate rejection. Check that LLM_MODEL_ID in .env is blank:
LLM_MODEL_ID=Then restart the API:
docker compose up -d --build apiThe model name in your .env is outdated. Update to a current model:
ANTHROPIC_MODEL=claude-sonnet-4-6Current valid model IDs:
| Label | Model ID |
|---|---|
| Claude Haiku | claude-haiku-4-5-20251001 |
| Claude Sonnet | claude-sonnet-4-6 |
| Claude Opus | claude-opus-4-6 |
Panels populate after the first real request is processed. Run make smoke-test to generate events, then refresh the dashboard.
If any port (3000, 3001, 8080, etc.) is already in use, edit compose.yml and change the host-side port mapping:
ports:
- "3100:3000" # change 3000 → 3100 (host:container)The audit_export table is missing the session_id column added in migration 002. Run the migration once while the stack is up:
# Option A — via Alembic
docker compose exec spm-api alembic upgrade head
# Option B — direct SQL
docker compose exec spm-db psql -U spm_rw -d spm -c "
ALTER TABLE audit_export ADD COLUMN IF NOT EXISTS session_id VARCHAR(64);
CREATE INDEX IF NOT EXISTS idx_audit_export_session_id ON audit_export (session_id);
"The threat-hunting-agent requires a Groq API key. Set it in .env:
GROQ_API_KEY=gsk_xxxxxxxxxxxxGet a free key at console.groq.com, then rebuild the service:
docker compose up -d --build threat-hunting-agentdocker compose up -d --build api # rebuild API only
docker compose up -d --build ui # rebuild UI only
docker compose up -d --build spm-aggregator # rebuild SPM aggregator
docker compose up -d --build threat-hunting-agent # rebuild threat hunting agentThe following variables can be tuned in .env. All have sane defaults and only the API keys need to be set for a working installation.
| Variable | Default | Description |
|---|---|---|
ANTHROPIC_API_KEY |
(required) | Anthropic API key |
ANTHROPIC_MODEL |
claude-sonnet-4-6 |
Claude model to use |
TAVILY_API_KEY |
(optional) | Tavily key for web search tool |
GROQ_API_KEY |
(required for threat hunting) | Groq key — powers the Threat Hunting Agent LLM and optionally accelerates Llama Guard 3 |
HUNT_MODEL |
llama-3.3-70b-versatile |
Groq model used by the threat-hunting agent |
HUNT_BATCH_WINDOW_SEC |
30 |
Kafka batch window for the threat-hunting agent (seconds) |
THREATHUNTING_AI_INTERVAL_SEC |
300 |
Proactive threat scan interval (seconds) |
TENANTS |
t1 |
Comma-separated tenant IDs |
RATE_LIMIT_RPM |
60 |
Max requests per minute per user |
GUARD_MODEL_ENABLED |
true |
Enable/disable content guard |
POSTURE_BLOCK_THRESHOLD |
0.70 |
Risk score at which requests are blocked |
CEP_SHORT_WINDOW_SEC |
120 |
Burst detection window (seconds) |
CEP_LONG_WINDOW_SEC |
3600 |
Sustained volume window (seconds) |
MEMORY_LONGTERM_TTL_SEC |
2592000 |
Cross-session memory TTL (30 days) |
SPM_SNAPSHOT_INTERVAL_SEC |
300 |
Posture snapshot interval (5 min) |
GRAFANA_ADMIN_PASSWORD |
admin |
Grafana admin password |
REDIS_PASSWORD |
(blank) | Redis password (blank = no auth) |
SPM_DB_PASSWORD |
spmpass |
PostgreSQL password for SPM DB |
LLM_MODEL_ID |
(blank) | SPM model registry ID (leave blank to bypass gate) |
make up # Start everything
make down # Stop everything
make status # Health check
make logs # Tail all logs
make logs-api # Tail API logs only
make smoke-test # End-to-end test
make token # Mint a demo user JWT
make admin-token # Mint an admin JWT
make freeze # Freeze demo user (requires admin token)
make unfreeze # Unfreeze demo user
make clean # Wipe all data and keysOpen http://localhost:3000 in your browser.
-
Click Sign In — a demo JWT is minted automatically.
-
Type a message in the input box and press Enter or click Send.
-
Claude will respond. If a web search or web fetch was used, you'll see a badge above the reply:
🔍 Searched: "latest AI news"🌐 Fetched: https://example.com -
Use the model selector (top-right) to switch between Haiku, Sonnet, and Opus.
Claude remembers your previous messages across sessions for 30 days. You can refer back to earlier conversations naturally — no need to repeat context.
Some prompts are automatically blocked by the platform:
| Block type | Example trigger | HTTP code |
|---|---|---|
| Prompt injection | "Ignore previous instructions…" | 400 |
| High posture score | Repeated suspicious patterns | 400 |
| Model gate | Unapproved model ID in request | 403 |
| Output guard | Sensitive data in LLM response | 400 |
When a request is blocked the UI shows a red error message explaining why.
Mint tokens and manage users from the terminal:
# Mint a regular user token
make token
# Mint an admin token
make admin-token
# Freeze a user (blocks all their requests)
make freeze
# Unfreeze a user
make unfreezeOpen http://localhost:3001 → login with admin / admin.
| Dashboard | What to look at |
|---|---|
| AI SPM Overview | Real-time posture scores, enforcement actions, risk trends per tenant |
| Engineering | Tool call counts, blocked requests with reasons, CEP events, model latency |
| Compliance | NIST AI RMF control coverage, 30-day audit trail |
Dashboards auto-refresh every 30 seconds. Use the time-range picker (top-right) to zoom into a specific window.
Base URL: http://localhost:8092
All endpoints require a Bearer token. Use make admin-token or make spm-token-auditor to get one.
# List registered AI models
TOKEN=$(make admin-token -s)
curl -H "Authorization: Bearer $TOKEN" http://localhost:8092/models
# Register a new model
curl -X POST http://localhost:8092/models \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name":"my-model","version":"1.0","provider":"openai","risk_tier":"limited"}'
# NIST AI RMF compliance report
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8092/compliance/nist-airm/reportTest policy changes against sample events before rolling them out:
make simulateOr call the API directly at http://localhost:8091/simulate with a JSON payload of candidate policy + sample events. The response shows which events would be allowed, escalated, or blocked under the new policy.
make logs # all services
make logs-api # API only
make logs-spm-api # SPM API only
docker compose logs -f guard-model # any service by name- Open Grafana → Engineering dashboard → Blocked Requests table
- Note the
reasonandsession_id - Search logs:
make logs-api | grep <session_id>
TOKEN=$(make admin-token -s)
curl -H "Authorization: Bearer $TOKEN" \
"http://localhost:8092/posture?tenant_id=t1&user_id=user-demo-1"make spm-complianceReturns a JSON report mapping NIST AI RMF controls to pass/fail/partial status based on current platform configuration.
A full reference of every technology, library, and external service used in the platform.
| Component | Technology | Version | Role |
|---|---|---|---|
| Container runtime | Docker + Docker Compose | Compose v2 | Runs all services locally |
| Message broker | Apache Kafka (Confluent) | 7.6.1 | Event streaming backbone — audit, posture, CEP events |
| Cache / memory | Redis | 7 (Alpine) | Session memory, long-term conversation history, rate limiting |
| Database | PostgreSQL | 16 (Alpine) | SPM audit log, posture snapshots, model registry |
| Policy engine | Open Policy Agent (OPA) | 0.70.0 | Rego-based request policy evaluation |
| Metrics | Prometheus | v2.55.1 | Scrapes all service /metrics endpoints |
| Dashboards | Grafana | 11.4.0 | Pre-provisioned AI SPM, Engineering, and Compliance dashboards |
All backend services are written in Python 3.11 and served with FastAPI + Uvicorn.
| Service | Port | Description |
|---|---|---|
api |
8080 | Main gateway — auth, guard, LLM proxy, rate limiting |
guard-model |
8200 | Content moderation (Llama Guard 3 via Groq or regex fallback) |
freeze-controller |
8090 | Admin freeze/unfreeze of users and tenants |
policy-simulator |
8091 | Dry-run policy evaluation against sample events |
spm-api |
8092 | Model registry, posture API, compliance reports |
spm-aggregator |
— | Kafka consumer → Postgres writer, Prometheus metrics |
processor |
— | Kafka consumer — enriches raw events with posture scores |
memory-service |
— | Manages session, long-term, and system memory in Redis |
output-guard |
— | Second-pass LLM scan on Claude's responses |
policy-decider |
— | Evaluates OPA decisions and emits enforcement events |
retrieval-gateway |
— | Context retrieval for RAG (tool results, calendar, etc.) |
tool-parser |
— | Parses and validates tool call requests |
executor |
— | Executes approved tool calls |
agent |
— | Orchestrates multi-step agentic workflows |
agent-orchestrator |
8094 | Session lifecycle, risk scoring, threat finding storage, case management |
threat-hunting-agent |
— | Autonomous AI threat hunter — 9 proactive scans + LangChain/Groq LLM |
startup-orchestrator |
— | One-shot init container: keys, Kafka topics, OPA seed |
| Library | Version | Used for |
|---|---|---|
| FastAPI | 0.115.x | REST API framework |
| Uvicorn | 0.30–0.32 | ASGI server |
| Pydantic | 2.9 | Request/response validation |
| anthropic | 0.40.0 | Claude API client (tool use, streaming) |
| kafka-python-ng | 2.2.3 | Kafka producer/consumer |
| redis | 5.1–5.2 | Redis client |
| PyJWT + cryptography | 2.9–2.10 / 43.0 | RS256 JWT signing and verification |
| httpx | 0.27.2 | Async HTTP client (tool fetch, inter-service calls) |
| requests | 2.32 | Sync HTTP client |
| SQLAlchemy (asyncio) | 2.0.36 | Async ORM for SPM database |
| asyncpg | 0.30.0 | Async PostgreSQL driver |
| psycopg2-binary | 2.9.9 | Sync PostgreSQL driver |
| groq | 0.11.0 | Groq client for Llama Guard 3 inference |
| tavily-python | 0.5.0 | Web search tool (Tavily API) |
| beautifulsoup4 + lxml | 4.12.3 / 5.3.0 | HTML parsing for web fetch tool |
| prometheus-client | 0.21.1 | Exposes /metrics endpoint |
| prometheus-fastapi-instrumentator | 7.0.0 | Auto-instruments FastAPI with Prometheus |
| weasyprint | 62.3 | PDF report generation (compliance exports) |
| Technology | Version | Role |
|---|---|---|
| React | 18.3 | UI framework |
| Vite | 5.4 | Build tool and dev server |
| react-markdown | 9.0 | Renders Claude's markdown responses |
| remark-gfm | 4.0 | GitHub-flavored markdown (tables, strikethrough, etc.) |
The UI is a single-page app served by an Nginx container (ui) on port 3000. No external CSS framework — fully custom design with CSS variables for theming.
| Service | Purpose | Required |
|---|---|---|
| Anthropic Claude | LLM backend (Haiku / Sonnet / Opus) | ✅ Yes |
| Tavily | Real-time web search tool for Claude | |
| Groq | Threat Hunting Agent LLM (Llama 3.3 70B) + Llama Guard 3 content moderation | ✅ Required (threat hunting) / |
| Component | Technology | Notes |
|---|---|---|
| Authentication | RS256 JWT | Key-pair auto-generated at startup into ./keys/ |
| Authorization | OPA + Rego | Policy-as-code, evaluated per request |
| Content moderation | Llama Guard 3 (Groq) | Falls back to regex classifier if no Groq key |
| Output scanning | Second-pass LLM guard | Checks Claude responses for sensitive data leakage |
| Rate limiting | In-process Redis counter | Configurable RPM per user |
| Prompt injection detection | Guard model + CEP patterns | Pattern-matched and ML-scored |
| Layer | Technology | Details |
|---|---|---|
| Metrics | Prometheus | Scraped from all services every 15 s |
| Dashboards | Grafana | 3 pre-provisioned dashboards, auto-loaded via provisioning config |
| Audit log | PostgreSQL (audit_export table) |
Every request written as JSONB with full event payload |
| Structured logs | Python logging → stdout |
Collected by Docker, viewable via make logs |
| Posture snapshots | PostgreSQL + Prometheus | 5-min bucketed risk scores per tenant |
User prompt
│
▼
JWT Auth → Rate Limit → Guard Model (content check)
│
▼
OPA Policy Evaluation
│
▼
Memory Load (Redis — last 20 turns, 30-day TTL)
│
▼
Claude API (tool loop — up to 3 rounds)
│ ├── web_search → Tavily
│ └── web_fetch → httpx + BeautifulSoup
▼
Output Guard (second-pass LLM scan)
│
▼
Audit Event → Kafka → SPM Aggregator → PostgreSQL + Prometheus
│
▼
Response → User
The threat-hunting-agent is an autonomous AI security service that continuously scans the platform for threats — independent of user-triggered requests. It runs a LangChain agent backed by Groq + Llama 3.3 70B Versatile and fires on two triggers:
- Kafka consumer — reacts to session events in near real-time
- Scheduler — runs a full proactive scan cycle every 5 minutes (configurable via
THREATHUNTING_AI_INTERVAL_SEC)
Every scan cycle runs all 9 detectors in parallel. Each detector queries live data (Postgres, Redis, /proc) and produces structured findings that the agent analyses with the LLM before posting to the orchestrator.
| Scan | What it detects |
|---|---|
exposed_credentials |
API keys, tokens, and passwords stored in Redis under unexpected namespaces |
sensitive_data_exposure |
PII patterns, DB connection strings in Redis (broader sweep) |
unused_open_ports |
Internal service ports reachable that should not be (misconfigured or rogue services) |
unexpected_listen_ports |
Ports in LISTEN state in /proc/net/tcp not on the allowed service list |
overprivileged_tools |
AI models in the registry with unacceptable risk tier still set to active |
runtime_anomaly_detection |
High-frequency actors, enforcement block clusters (3+/session/hour), session storms (5+/actor/10 min) |
prompt_secret_exfiltration |
API keys and bearer tokens inside prompt/response text in the audit log |
data_leakage_detection |
SSNs, credit card numbers, email addresses in agent response text |
tool_misuse_detection |
High tool-call frequency (>20/actor/hour), rapid chaining (>5 calls/session/min), high block ratios |
When a scan finds an anomaly the LLM produces a structured threat finding (severity, hypothesis, evidence, recommended actions) which is:
- Posted to the agent-orchestrator via
POST /api/v1/threat-findings - Deduplicated by batch hash — the same pattern won't flood the findings tab
- Automatically prioritised (risk score, recency, occurrence count)
- Escalated to a Case when
should_open_case=trueand priority score ≥ 0.40
GROQ_API_KEY=gsk_xxxxxxxxxxxx # required — service won't start without it
HUNT_MODEL=llama-3.3-70b-versatile # LLM model (any Groq-hosted model)
HUNT_BATCH_WINDOW_SEC=30 # Kafka batch window
THREATHUNTING_AI_INTERVAL_SEC=300 # Proactive scan interval (seconds)The threat hunting collectors query the session_id column on the audit_export table. If you are upgrading an existing installation run the migration before starting the agent:
# Option A — via Alembic (recommended)
docker compose exec spm-api alembic upgrade head
# Option B — direct SQL (if Alembic is unavailable)
docker compose exec spm-db psql -U spm_rw -d spm -c "
ALTER TABLE audit_export ADD COLUMN IF NOT EXISTS session_id VARCHAR(64);
CREATE INDEX IF NOT EXISTS idx_audit_export_session_id ON audit_export (session_id);
"| Topic | Producers | Consumers |
|---|---|---|
{tenant}.raw_events |
API gateway | Processor, CEP |
{tenant}.posture_events |
Processor | SPM Aggregator, Policy Decider |
{tenant}.enforcement_actions |
Policy Decider, Freeze Controller | SPM Aggregator |
{tenant}.audit_export |
API gateway, SPM services | SPM Aggregator → Postgres |
| Layer | Language | Runtime |
|---|---|---|
| All backend services | Python 3.11 | CPython |
| Frontend | JavaScript (ESM) | Node 20 (build only), Nginx (serve) |
| Policy | Rego | OPA 0.70 |
| Infrastructure config | YAML / Dockerfile | Docker Compose v2 |
| Database migrations | SQL | PostgreSQL 16 |
| Build automation | Make | GNU Make |
Thanks for your interest in contributing! Here's everything you need to get started.
- Fork the repository and clone your fork
- Follow INSTALL.md to get the platform running locally
- Create a feature branch:
git checkout -b feat/your-feature-name
Most services are hot-reloaded in development. After editing Python files, rebuild only the affected service:
docker compose up -d --build api # API changes
docker compose up -d --build spm-api # SPM API changes
docker compose up -d --build ui # Frontend changesmake test # unit tests (no Docker needed)
make smoke-test # end-to-end test against running platformTests live in tests/. Please add or update tests for any new behaviour.
make logs # all services
make logs-api # single service- One concern per PR — keep changes focused and reviewable
- Write a clear description — what changed and why
- Include tests — new features and bug fixes should have test coverage
- Pass CI — all tests must be green before review
- Update docs — if you change behaviour, update the relevant
.mdfile
Branch naming:
| Type | Pattern |
|---|---|
| Feature | feat/short-description |
| Bug fix | fix/short-description |
| Docs | docs/short-description |
| Refactor | refactor/short-description |
services/ # Backend microservices (Python / FastAPI)
ui/ # Frontend (React + Vite)
platform_shared/ # Shared Python modules (JWT, Kafka, models)
spm/ # SPM policy and compliance definitions
opa/ # OPA Rego policies
grafana/ # Dashboard JSON and provisioning config
prometheus/ # Scrape config
tests/ # Unit and integration tests
scripts/ # Dev utilities (JWT minting, etc.)
Please open a GitHub Issue and include:
- A clear description of the problem
- Steps to reproduce
- Relevant logs (
make logs-apioutput) - Your environment (OS, Docker version, chip architecture)
- Python — follow PEP 8; use type hints where practical
- JavaScript — standard ESM; no external linting config required
- Commits — use Conventional Commits (
feat:,fix:,docs:, etc.)
The auth overlay adds a full OIDC login flow in front of the admin portal, running entirely on localhost. No real domain or TLS certificate required.
| Component | Role |
|---|---|
| Traefik v3 | Reverse proxy. Routes aispm.local → admin UI via the ForwardAuth middleware. Uses a static file provider — no Docker socket required. |
| Keycloak 24 | OIDC identity provider running in dev mode. Realm config is persisted to ./DataVolumes/keycloak/ (host bind mount) so it survives restarts and docker compose down -v. |
| traefik-forward-auth | Sits in front of every protected route. Inspects every request via X-Forwarded-Uri — including /_oauth callbacks — and sets a signed session cookie on aispm.local. |
Add these entries to /etc/hosts on your Mac:
sudo sh -c 'echo "127.0.0.1 keycloak.local auth.local aispm.local" >> /etc/hosts'# Start the full stack with auth overlay:
docker compose -f compose.yml -f compose.auth.yml up -d
# Stop (data preserved in ./DataVolumes/):
docker compose -f compose.yml -f compose.auth.yml downOnly required once — Keycloak persists the realm to ./DataVolumes/keycloak/h2/.
- Start the stack (see above) then open http://keycloak.local:8180/admin/ (
admin/admin) - Top-left dropdown → Create realm → name:
aispm→ Create - Realm Settings → General tab → Require SSL → set to none → Save
- Required for local-dev HTTP. If left at the default (
external), Keycloak rejects every non-localhost request with the page "We are sorry... HTTPS required" and traefik-forward-auth's token exchange silently fails. - Repeat the same toggle on the master realm if you want to log into the admin UI via
keycloak.local(master defaults toexternaltoo).
- Required for local-dev HTTP. If left at the default (
- Clients → Create client
- Client ID:
traefik-forward-auth - Turn ON Client authentication → Next
- Valid redirect URIs:
http://aispm.local/_oauth - Web origins:
http://aispm.local→ Save
- Client ID:
- Credentials tab → copy Client secret → paste into
.env.auth:PROVIDERS_OIDC_CLIENT_SECRET=<paste here> - Realm roles → Create role → name:
spm:admin→ Save. Repeat forspm:auditor.- The spm-api enforces these via
require_admin/require_auditor. Withoutspm:adminin the JWT roles claim, every integration write endpoint returns 403.
- The spm-api enforces these via
- Users → Create user → set username and email → Create
- Credentials tab → set password → turn OFF Temporary → Save password
- Role mapping tab → Assign role → tick
spm:admin→ Assign - Restart forward-auth:
docker compose -f compose.auth.yml up -d --force-recreate traefik-forward-auth
If you'd rather skip the UI, the same setup via kcadm — useful when the master realm's "HTTPS required" gate is locking you out of the admin console:
KC=/opt/keycloak/bin/kcadm.sh
docker compose exec keycloak $KC config credentials \
--server http://localhost:8080 --realm master --user admin --password admin
# Disable SSL gate on both realms (master = admin console, aispm = the app)
docker compose exec keycloak $KC update realms/master -s sslRequired=NONE
docker compose exec keycloak $KC update realms/aispm -s sslRequired=NONE
# Create the two realm roles spm-api expects
docker compose exec keycloak $KC create roles -r aispm -s name=spm:admin
docker compose exec keycloak $KC create roles -r aispm -s name=spm:auditor
# Create user, set password, assign admin role
docker compose exec keycloak $KC create users -r aispm \
-s username=dany -s enabled=true -s email=dany@example.com
docker compose exec keycloak $KC set-password -r aispm \
--username dany --new-password dany
docker compose exec keycloak $KC add-roles -r aispm \
--uusername dany --rolename spm:admin| URL | What |
|---|---|
| http://aispm.local/admin | Admin portal (SSO protected — redirects to Keycloak login) |
| http://keycloak.local:8180/admin/ | Keycloak admin console (master realm, admin/admin) |
| http://localhost:9091/dashboard/ | Traefik routing dashboard |
| File | Purpose |
|---|---|
compose.auth.yml |
Compose overlay — adds Traefik, Keycloak, and traefik-forward-auth. Keycloak data is bind-mounted from ./DataVolumes/keycloak/. |
auth/traefik.yml |
Traefik static config (file provider, entrypoints, dashboard on :9091) |
auth/traefik-dynamic.yml |
Route + middleware definitions. Important: there is a single aispm router covering every path — including /_oauth. The SSO middleware itself recognizes the OIDC callback via X-Forwarded-Uri and short-circuits it. Do not add a separate router that routes /_oauth directly to the forward-auth backend service (e.g. aispm-oauth: service: auth-svc) — that strips the X-Forwarded-* headers and forward-auth then can't tell it's a callback, falling into an infinite redirect loop. |
.env.auth |
OIDC client ID, client secret, and cookie signing secret |
| Symptom | Likely cause | Fix |
|---|---|---|
Browser endlessly bounces between aispm.local/_oauth?code=... and Keycloak |
auth/traefik-dynamic.yml has a router that sends /_oauth directly to forward-auth as a backend, skipping the middleware (drops X-Forwarded-Uri). |
Remove the /_oauth router; let the catch-all aispm router with the sso middleware handle every path. |
| Keycloak page: "We are sorry… HTTPS required" | Realm sslRequired is external (default). |
kcadm update realms/<realm> -s sslRequired=NONE for both master and aispm, OR access the admin console via http://localhost:8180/ (localhost bypasses the gate). |
Login succeeds but /_oauth returns "Cookie not found" |
Stale _forward_auth_csrf cookies from a previous failed run. |
Close all incognito windows (don't just open a new tab) and start a fresh session. |
| spm-api returns 403 "spm:admin role required" after login | The JWT user has no spm:admin realm role. |
In Keycloak: realm aispm → Users → user → Role mapping → assign spm:admin. |
LOG_LEVEL=debug doesn't take effect |
Compose only reloads env on recreate. | docker compose -f compose.auth.yml up -d --force-recreate traefik-forward-auth |
Postgres, Keycloak, Redis, Grafana, and the agent-orchestrator all bind-mount their state under ./DataVolumes/ instead of using Docker named volumes. Layout:
DataVolumes/
├── spm-db/ ← Postgres data dir (UID 999 inside container)
├── keycloak/h2/ ← Keycloak embedded H2 DB (realms, users, secrets)
├── redis/ ← Redis AOF / dump
├── grafana/ ← Grafana SQLite + dashboards
└── agent-orchestrator/ ← Orchestrator SQLite session log
The directories are tracked via .gitkeep; their contents are gitignored (see .gitignore). To reset any one of them, stop the relevant service, rm -rf the directory contents, and restart.
Migrations live in spm/alembic/versions/. The CI workflow runs alembic upgrade head automatically, but local containers do not — if you pull new migrations, run them by hand:
cd spm
SPM_DB_URL="postgresql://spm_rw:spmpass@localhost:5432/spm" alembic upgrade headIf the DB's alembic_version row points at a revision that no longer exists in spm/alembic/versions/ (can happen after switching branches or restoring a snapshot), alembic upgrade head errors with Can't locate revision identified by 'NNN'. Reset the bookmark to the latest revision actually present, then re-run:
cd spm
# Replace 003 with whatever is the highest revision file present
SPM_DB_URL="..." alembic stamp --purge 003
SPM_DB_URL="..." alembic upgrade headThe repo's migrations are written to be idempotent (ADD COLUMN IF NOT EXISTS, CREATE INDEX IF NOT EXISTS), so re-running a stamped revision is safe.






