AI-powered Kubernetes deployment pipeline that autonomously detects failures, diagnoses root causes with an LLM agent, and rolls back — before your on-call engineer gets paged.
Deploy  →  Watch  →   Detect    →  Diagnose   →  Rollback
  │          │           │             │             │
 Helm      Kafka     Prometheus      Claude       Helm SDK
 SDK       Events    Anomalies     + pgvector       <30s
DeployIQ is a production-grade deployment automation system built in Go. You push code, it deploys via Helm, and it watches for failures through Kafka event streaming. When something goes wrong, an LLM agent powered by Claude iteratively calls real Kubernetes tools to diagnose the root cause and execute a rollback automatically.
No more waking someone up at 3am for an ImagePullBackOff.
┌─────────────────────────────────────────────────────────────────────┐
│ DeployIQ System │
│ │
│ ┌──────────┐ ┌─────────────┐ ┌────────────────────────────┐ │
│ │ CLI │ │ REST API │ │ LLM Diagnostic Agent │ │
│ │ (cobra) │───▶│ (chi) │───▶│ │ │
│ │ │ │ │ │ ┌──────────────────────┐ │ │
│ │ deploy │ │ /deploy │ │ │ Claude API │ │ │
│ │ rollback │ │ /diagnose │ │ │ + Tool-Use Loop │ │ │
│ │ status │ │ /incidents │ │ │ │ │ │
│ │ incidents│ │ /rollback │ │ │ query_metrics ──▶ │ │ │
│ └──────────┘ └─────────────┘ │ │ read_logs ──▶ │ │ │
│ │ │ search_incidents──▶ │ │ │
│ ┌──────────┐ ┌─────────────┐ │ │ get_deploy_info ──▶ │ │ │
│ │ Helm │ │ Kafka │ │ │ execute_rollback──▶ │ │ │
│ │ SDK │ │ Streaming │ │ └──────────────────────┘ │ │
│ │ │ │ │ │ │ │ │
│ │ Deploy() │ │ DeployStart │ │ ▼ │ │
│ │ Rollback │ │ HealthFail │───▶│ pgvector RAG │ │
│ │ History()│ │ Anomaly │ │ (past incident lookup) │ │
│ └──────────┘ └─────────────┘ └────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Observability Stack │ │
│ │ Prometheus ──▶ Grafana Alertmanager ──▶ Slack │ │
│ │ 6 custom metrics alerting rules │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Infrastructure (Terraform) │ │
│ │ kind / EKS │ kube-prometheus-stack │ Strimzi │ pgvec │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
When an anomaly is detected, the agent runs an iterative tool-use loop, calling a real backend (Prometheus, the Kubernetes API, or the pgvector incident store) at each step:
Alert received (error_rate=0.82, threshold=0.05)
│
▼
① query_metrics ──▶ Prometheus: error_rate=82%, p99=4.2s
│
▼
② read_logs ──▶ K8s pod logs: NullPointerException in PaymentProcessor
│
▼
③ search_past_incidents ──▶ pgvector cosine sim: similar incident (0.91) resolved by rollback
│
▼
④ get_deployment_info ──▶ K8s: 0/3 pods ready, revision 2 deployed 4m ago
│
▼
⑤ Structured diagnosis:
{
  "severity": "critical",
  "root_cause": "Null pointer in PaymentProcessor.process() — code regression in v2",
  "recommended_action": "Rollback to revision 1 immediately",
  "rollback_performed": true
}
| | Feature | Details |
|---|---|---|
| 🚀 | Helm-native deploys | Helm SDK — no shell-out. Install, upgrade, rollback, history |
| 🤖 | LLM diagnostic agent | Claude API with 5 tools, iterative loop (max 5 iterations) |
| 🧠 | RAG over past incidents | pgvector cosine similarity — grounded diagnoses from history |
| 📡 | Real-time event streaming | Kafka — typed events, consumer groups, graceful shutdown |
| 📊 | Custom observability | 6 Prometheus metrics, auto-provisioned Grafana dashboards |
| 🔄 | Auto-rollback | Agent-triggered or manual via API, <30s mean-time-to-rollback |
| 🏗️ | Full IaC | Terraform modules for K8s, monitoring, Kafka, PostgreSQL |
| 🐳 | Production Docker | Multi-stage build, <20MB alpine image, multi-arch (amd64/arm64) |
| ✅ | CI/CD pipeline | GitHub Actions: test + lint + build + 70% coverage gate + GHCR push |
| 🔌 | Works without API key | StubLLMClient drives realistic tool-use sequence for local dev |
Prerequisites: Go 1.22+, Docker, kind, helm, kubectl
# 1. Clone
git clone https://github.com/dhruv7539/deployiq.git
cd deployiq
# 2. Start local services (Postgres, Kafka, Prometheus, Grafana)
make local-up
# 3. Build
make build
# 4. Run the server (works without API key — uses stub LLM)
./bin/deployiq-server

Open:
- DeployIQ API: http://localhost:8080
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin / admin)
# Deploy a sample app to your local kind cluster
helm install sample-app ./helm/sample-app --namespace default
# Inject a failure (bad image)
kubectl set image deployment/sample-app app=nginx:does-not-exist
# Watch DeployIQ detect and diagnose it
curl -X POST http://localhost:8080/api/v1/agent/diagnose \
-H "Content-Type: application/json" \
-d '{
"release": "sample-app",
"namespace": "default",
"metric": "error_rate",
"value": 1.0,
"threshold": 0.05,
"anomaly_type": "ImagePullBackOff"
}'
# Check the stored incident
curl http://localhost:8080/api/v1/incidents | jq .
# Rollback
curl -X POST http://localhost:8080/api/v1/releases/default/sample-app/rollback \
-H "Content-Type: application/json" \
-d '{"revision": 1, "reason": "agent-diagnosed failure"}'
# Verify Prometheus counter incremented
curl -s http://localhost:8080/metrics | grep deployiq_agent

# Deploy
./bin/deployiq deploy \
--release api-server \
--chart ./helm/sample-app \
--namespace production \
--image ghcr.io/myorg/api:v1.2.3
# Rollback
./bin/deployiq rollback --release api-server --revision 3
# Check deployment status
./bin/deployiq status --release api-server --namespace production
# List recent incidents
./bin/deployiq incidents list
# Run agent diagnosis manually
./bin/deployiq agent diagnose --deployment api-server --namespace production

| Method | Endpoint | Description |
|---|---|---|
| GET | /healthz | Liveness probe |
| GET | /readyz | Readiness probe |
| GET | /metrics | Prometheus metrics |
| GET | /api/v1/deployments?namespace=default | List deployments |
| GET | /api/v1/deployments/{namespace}/{name} | Get deployment |
| POST | /api/v1/deployments | Trigger deploy |
| GET | /api/v1/incidents | List recent incidents |
| GET | /api/v1/incidents/{id} | Get incident |
| POST | /api/v1/incidents/{id}/resolve | Resolve incident |
| POST | /api/v1/agent/diagnose | Run LLM diagnosis |
| GET | /api/v1/releases/{namespace}/{release}/history | Helm history |
| POST | /api/v1/releases/{namespace}/{release}/rollback | Rollback release |
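For callers who prefer Go over curl, the diagnose endpoint in the table above could be wrapped roughly as follows. The request fields are taken from the curl example earlier in this README. The response shape is an assumption: it reuses the fields of the structured diagnosis shown earlier, which the API may or may not return verbatim. `PostDiagnose` and `newStubServer` are hypothetical names, and the stub server exists only so the sketch runs without a live DeployIQ instance.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// DiagnoseRequest mirrors the JSON body of the curl example.
type DiagnoseRequest struct {
	Release     string  `json:"release"`
	Namespace   string  `json:"namespace"`
	Metric      string  `json:"metric"`
	Value       float64 `json:"value"`
	Threshold   float64 `json:"threshold"`
	AnomalyType string  `json:"anomaly_type"`
}

// Diagnosis is an ASSUMED response shape based on the structured diagnosis example.
type Diagnosis struct {
	Severity          string `json:"severity"`
	RootCause         string `json:"root_cause"`
	RecommendedAction string `json:"recommended_action"`
	RollbackPerformed bool   `json:"rollback_performed"`
}

// PostDiagnose sends req to POST {baseURL}/api/v1/agent/diagnose and decodes the reply.
func PostDiagnose(client *http.Client, baseURL string, req DiagnoseRequest) (Diagnosis, error) {
	var out Diagnosis
	body, err := json.Marshal(req)
	if err != nil {
		return out, err
	}
	resp, err := client.Post(baseURL+"/api/v1/agent/diagnose", "application/json", bytes.NewReader(body))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return out, fmt.Errorf("diagnose: unexpected status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

// newStubServer stands in for DeployIQ and echoes the anomaly type as the root cause.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req DiagnoseRequest
		if r.Method != http.MethodPost || r.URL.Path != "/api/v1/agent/diagnose" ||
			json.NewDecoder(r.Body).Decode(&req) != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(Diagnosis{
			Severity:          "critical",
			RootCause:         req.AnomalyType,
			RecommendedAction: "rollback",
		})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	d, err := PostDiagnose(srv.Client(), srv.URL, DiagnoseRequest{
		Release: "sample-app", Namespace: "default", Metric: "error_rate",
		Value: 1.0, Threshold: 0.05, AnomalyType: "ImagePullBackOff",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", d)
}
```

Against a real server, pass `http.DefaultClient` (or a client with a timeout) and `http://localhost:8080` instead of the stub's client and URL.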
deployiq.yaml or environment variables:
server:
  host: "0.0.0.0"
  port: 8080
database:
  host: localhost
  port: 5432
  user: deployiq
  password: deployiq
  name: deployiq
kafka:
  brokers: ["localhost:9092"]
  topic: "deployiq.events"
  group_id: "deployiq-consumer"
prometheus:
  url: "http://localhost:9090"
llm:
  provider: anthropic
  model: "claude-sonnet-4-20250514"
  api_key: ""  # or set ANTHROPIC_API_KEY env var
k8s:
  kubeconfig: ""  # defaults to ~/.kube/config
  in_cluster: false

# Required for real LLM diagnosis
export ANTHROPIC_API_KEY=sk-ant-...
# Optional — Slack alerts
export SLACK_WEBHOOK_URL=https://hooks.slack.com/...

| Metric | Type | Labels | Description |
|---|---|---|---|
| deployiq_deploys_total | Counter | namespace, status | Total deploys by outcome |
| deployiq_deploy_duration_seconds | Histogram | namespace | Deploy duration |
| deployiq_rollbacks_total | Counter | namespace, reason | Total rollbacks |
| deployiq_agent_diagnoses_total | Counter | severity | Agent diagnoses by severity |
| deployiq_health_check_failures_total | Counter | namespace, deployment | Health check failures |
| deployiq_active_deployments | Gauge | namespace | Currently active deployments |
deployiq/
├── cmd/
│ ├── deployiq/ # CLI entrypoint (cobra commands)
│ └── server/ # API server entrypoint
├── internal/
│ ├── agent/ # LLM agent, tool definitions, RAG retriever
│ │ ├── agent.go # Core diagnose loop + Anthropic client
│ │ ├── tools.go # 5 tool implementations (real K8s calls)
│ │ ├── rag.go # pgvector similarity retrieval
│ │ └── prompts.go # System prompt + alert prompt builder
│ ├── api/ # HTTP router, handlers, middleware
│ ├── config/ # Viper config loading
│ ├── deployer/ # Helm SDK deploy/rollback + health checker
│ ├── incidents/ # PostgreSQL + pgvector incident store
│ ├── k8s/ # Kubernetes client wrapper (client-go)
│ ├── notifier/ # Slack notifications
│ ├── observability/ # Prometheus metrics + Grafana client
│ └── streaming/ # Kafka producer/consumer
├── helm/
│ ├── deployiq/ # DeployIQ server Helm chart
│ └── sample-app/ # Sample app for testing
├── terraform/
│ ├── modules/
│ │ ├── k8s-cluster/ # kind cluster
│ │ ├── monitoring/ # kube-prometheus-stack
│ │ ├── kafka/ # Strimzi operator
│ │ └── database/ # PostgreSQL + pgvector
│ └── environments/
│ ├── local/ # kind-based local setup
│ └── cloud/ # EKS-based cloud setup
├── dashboards/ # Grafana dashboard JSON
├── alerting/ # Prometheus alerting rules
├── docs/ # Architecture, agent tools, setup docs
├── scripts/ # Setup, seed, demo scripts
├── docker-compose.yml # Local dev stack
├── Dockerfile # Multi-stage build (<20MB)
└── Makefile # build, test, lint, local-up, seed
| Layer | Technology |
|---|---|
| Language | Go 1.22, CGO disabled, static binaries |
| CLI | cobra + viper |
| HTTP | go-chi/chi v5 |
| Kubernetes | client-go + helm.sh/helm/v3 SDK |
| LLM | anthropic-sdk-go v1.30 (function calling) |
| Embeddings | OpenAI text-embedding-3-small (1536-dim) |
| Vector DB | pgvector (IVFFlat, cosine ops, 100 lists) |
| Database driver | pgx/v5 |
| Event streaming | segmentio/kafka-go |
| Observability | prometheus/client_golang v1.23 |
| Dashboards | Grafana HTTP API |
| IaC | Terraform 1.7+ with Helm + Kubernetes providers |
| Containers | Docker multi-stage, alpine:3.19 final, <20MB |
| CI/CD | GitHub Actions, GHCR, multi-arch (amd64/arm64) |
| Local K8s | kind (Kubernetes in Docker) |
make build # compile CLI + server to ./bin/
make test # run all tests with race detector
make lint # golangci-lint
make local-up # start docker-compose services
make local-down # stop docker-compose services
make seed # seed sample incidents into pgvector
make docker-build # build Docker image

Running tests without external dependencies:
# All tests use stubs: no API key, no DB, no K8s cluster needed
go test ./... -race -cover

DeployIQ ships with a StubLLMClient that drives a realistic 4-step tool-use sequence against your real cluster:
StubLLM → query_metrics (hits real Prometheus)
→ read_logs (pulls real pod logs from K8s)
→ search_past_incidents (searches real incident store)
→ get_deployment_info (reads real deployment state)
→ structured JSON diagnosis
Every tool call hits your actual cluster. Only the final reasoning is canned. Set ANTHROPIC_API_KEY to replace the canned reasoning with real Claude reasoning over the same data.
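One way the stub-versus-Claude switch could be wired is sketched below. The names are hypothetical (`LLMClient`, `stubClient`, `anthropicClient`, `selectClient`), and the precedence is an assumption: the config value wins, the `ANTHROPIC_API_KEY` environment variable is the fallback, and the stub is used when neither is set. The real project may resolve this differently, for example through viper.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// LLMClient is a hypothetical interface; the real one would expose a diagnose call.
type LLMClient interface {
	Name() string
}

// stubClient stands in for StubLLMClient: canned reasoning, no API key needed.
type stubClient struct{}

func (stubClient) Name() string { return "stub" }

// anthropicClient stands in for a client that would call the Claude API.
type anthropicClient struct{ apiKey string }

func (anthropicClient) Name() string { return "anthropic" }

// selectClient picks the client. Assumed precedence: config value first,
// then the ANTHROPIC_API_KEY environment variable, then the stub.
// getenv is injected so the choice can be tested without touching the environment.
func selectClient(configKey string, getenv func(string) string) LLMClient {
	key := strings.TrimSpace(configKey)
	if key == "" {
		key = strings.TrimSpace(getenv("ANTHROPIC_API_KEY"))
	}
	if key == "" {
		return stubClient{}
	}
	return anthropicClient{apiKey: key}
}

func main() {
	client := selectClient("", os.Getenv)
	fmt.Println("LLM client:", client.Name())
}
```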
type DeployEvent struct {
	// Type is one of: DeployStarted | DeployCompleted | HealthCheckFailed |
	// AnomalyDetected | RollbackInitiated
	Type      EventType
	Release   string
	Namespace string
	Revision  int
	Image     string
	Metric    string
	Value     float64
	Threshold float64
	Timestamp time.Time
}

- Slack alert integration with diagnosis summary
- pgvector store for production (currently in-memory for dev)
- Grafana dashboard auto-provisioning via API
- Multi-cluster support
- Web UI for incident timeline
- OpenTelemetry traces through the agent loop
PRs and issues welcome. See CONTRIBUTING.md.
- Fork the repo
- Create a feature branch (git checkout -b feat/my-feature)
- Make changes, then run make test && make lint
- Submit a PR describing what you changed and why
MIT — see LICENSE.
Built with Go, Claude API, and a healthy fear of 3am pages.