DeployIQ

AI-powered Kubernetes deployment pipeline that autonomously detects failures, diagnoses root causes with an LLM agent, and rolls back — before your on-call engineer gets paged.

Deploy → Watch → Detect → Diagnose → Rollback
  │        │        │          │          │
Helm     Kafka   Prometheus  Claude    Helm SDK
 SDK    Events   Anomalies  + pgvector  <30s

What Is DeployIQ?

DeployIQ is a production-grade deployment automation system built in Go. You push code; DeployIQ deploys it via Helm and watches for failures through Kafka event streaming. When something goes wrong, an LLM agent powered by Claude iteratively calls real diagnostic tools (Prometheus metrics, Kubernetes pod logs and deployment state, past-incident search) to find the root cause and execute a rollback automatically.

No more waking someone up at 3am for an ImagePullBackOff.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          DeployIQ System                            │
│                                                                     │
│  ┌──────────┐    ┌─────────────┐    ┌────────────────────────────┐ │
│  │  CLI     │    │  REST API   │    │    LLM Diagnostic Agent    │ │
│  │ (cobra)  │───▶│  (chi)      │───▶│                            │ │
│  │          │    │             │    │  ┌──────────────────────┐  │ │
│  │ deploy   │    │ /deploy     │    │  │  Claude API          │  │ │
│  │ rollback │    │ /diagnose   │    │  │  + Tool-Use Loop     │  │ │
│  │ status   │    │ /incidents  │    │  │                      │  │ │
│  │ incidents│    │ /rollback   │    │  │  query_metrics   ──▶ │  │ │
│  └──────────┘    └─────────────┘    │  │  read_logs       ──▶ │  │ │
│                                     │  │  search_incidents──▶ │  │ │
│  ┌──────────┐    ┌─────────────┐    │  │  get_deploy_info ──▶ │  │ │
│  │  Helm    │    │   Kafka     │    │  │  execute_rollback──▶ │  │ │
│  │  SDK     │    │  Streaming  │    │  └──────────────────────┘  │ │
│  │          │    │             │    │             │               │ │
│  │ Deploy() │    │ DeployStart │    │             ▼               │ │
│  │ Rollback │    │ HealthFail  │───▶│      pgvector RAG          │ │
│  │ History()│    │ Anomaly     │    │  (past incident lookup)    │ │
│  └──────────┘    └─────────────┘    └────────────────────────────┘ │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Observability Stack                                         │  │
│  │  Prometheus ──▶ Grafana    Alertmanager ──▶ Slack           │  │
│  │  6 custom metrics          alerting rules                   │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Infrastructure (Terraform)                                  │  │
│  │  kind / EKS  │  kube-prometheus-stack  │  Strimzi  │  pgvec │  │
│  └──────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

The Agent Loop

When an anomaly is detected, the agent runs an iterative tool-use loop, querying a live backend (Prometheus, the Kubernetes API, or the pgvector incident store) at each step:

Alert received (error_rate=0.82, threshold=0.05)
  │
  ▼
① query_metrics  ──▶  Prometheus: error_rate=82%, p99=4.2s
  │
  ▼
② read_logs      ──▶  K8s pod logs: NullPointerException in PaymentProcessor
  │
  ▼
③ search_past_incidents ──▶  pgvector cosine sim: similar incident (0.91) resolved by rollback
  │
  ▼
④ get_deployment_info   ──▶  K8s: 0/3 pods ready, revision 2 deployed 4m ago
  │
  ▼
⑤ Structured diagnosis:
   {
     "severity": "critical",
     "root_cause": "Null pointer in PaymentProcessor.process() — code regression in v2",
     "recommended_action": "Rollback to revision 1 immediately",
     "rollback_performed": true
   }
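
The loop above can be sketched in a few lines of Go. This is a minimal illustration, not DeployIQ's actual agent code: the `Step`, `LLM`, `Tool`, and `Diagnose` names are hypothetical, the tools are in-memory fakes, and the only detail taken from this README is the iteration cap of 5.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is one model turn: either a tool request or a final diagnosis.
// All type and function names in this sketch are hypothetical.
type Step struct {
	Tool      string // set when the model wants a tool to run
	Diagnosis string // set when the model is finished
}

// LLM abstracts the model. It sees the tool results gathered so far.
type LLM interface {
	Next(observations []string) Step
}

// Tool runs one diagnostic action and returns text for the model to read.
type Tool func() (string, error)

// ErrNoDiagnosis is returned when the iteration cap is hit first.
var ErrNoDiagnosis = errors.New("agent: iteration limit reached without a diagnosis")

// Diagnose alternates model turns and tool calls until the model returns a
// diagnosis or maxIter turns have been used.
func Diagnose(llm LLM, tools map[string]Tool, maxIter int) (string, error) {
	var observations []string
	for i := 0; i < maxIter; i++ {
		step := llm.Next(observations)
		if step.Diagnosis != "" {
			return step.Diagnosis, nil
		}
		tool, ok := tools[step.Tool]
		if !ok {
			// Feed the mistake back to the model instead of aborting.
			observations = append(observations, "error: unknown tool "+step.Tool)
			continue
		}
		out, err := tool()
		if err != nil {
			out = "error: " + err.Error()
		}
		observations = append(observations, step.Tool+": "+out)
	}
	return "", ErrNoDiagnosis
}

// scriptedLLM requests a fixed list of tools, then returns a canned diagnosis.
type scriptedLLM struct{ plan []string }

func (s scriptedLLM) Next(obs []string) Step {
	if len(obs) < len(s.plan) {
		return Step{Tool: s.plan[len(obs)]}
	}
	return Step{Diagnosis: fmt.Sprintf("diagnosis after %d tool calls", len(obs))}
}

func main() {
	tools := map[string]Tool{
		"query_metrics": func() (string, error) { return "error_rate=0.82", nil },
		"read_logs":     func() (string, error) { return "NullPointerException in PaymentProcessor", nil },
	}
	llm := scriptedLLM{plan: []string{"query_metrics", "read_logs"}}
	diagnosis, err := Diagnose(llm, tools, 5)
	fmt.Println(diagnosis, err)
}
```

With a cap of 5, four tool calls plus one final diagnosis turn fit exactly, which matches the five steps in the diagram.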

Features

	Feature	Details
🚀	Helm-native deploys	Helm SDK — no shell-out. Install, upgrade, rollback, history
🤖	LLM diagnostic agent	Claude API with 5 tools, iterative loop (max 5 iterations)
🧠	RAG over past incidents	pgvector cosine similarity — grounded diagnoses from history
📡	Real-time event streaming	Kafka — typed events, consumer groups, graceful shutdown
📊	Custom observability	6 Prometheus metrics, auto-provisioned Grafana dashboards
🔄	Auto-rollback	Agent-triggered or manual via API, <30s mean-time-to-rollback
🏗️	Full IaC	Terraform modules for K8s, monitoring, Kafka, PostgreSQL
🐳	Production Docker	Multi-stage build, <20MB alpine image, multi-arch (amd64/arm64)
✅	CI/CD pipeline	GitHub Actions: test + lint + build + 70% coverage gate + GHCR push
🔌	Works without API key	StubLLMClient drives realistic tool-use sequence for local dev

Quick Start

Prerequisites: Go 1.22+, Docker, kind, helm, kubectl

# 1. Clone
git clone https://github.com/dhruv7539/deployiq.git
cd deployiq

# 2. Start local services (Postgres, Kafka, Prometheus, Grafana)
make local-up

# 3. Build
make build

# 4. Run the server (works without API key — uses stub LLM)
./bin/deployiq-server

Open:

DeployIQ API: http://localhost:8080
Prometheus: http://localhost:9090
Grafana: http://localhost:3000 (login: admin / admin)

Try It End-to-End

# Deploy a sample app to your local kind cluster
helm install sample-app ./helm/sample-app --namespace default

# Inject a failure (bad image)
kubectl set image deployment/sample-app app=nginx:does-not-exist

# Ask the agent to diagnose it by posting the alert manually
curl -X POST http://localhost:8080/api/v1/agent/diagnose \
  -H "Content-Type: application/json" \
  -d '{
    "release": "sample-app",
    "namespace": "default",
    "metric": "error_rate",
    "value": 1.0,
    "threshold": 0.05,
    "anomaly_type": "ImagePullBackOff"
  }'

# Check the stored incident
curl http://localhost:8080/api/v1/incidents | jq .

# Rollback
curl -X POST http://localhost:8080/api/v1/releases/default/sample-app/rollback \
  -H "Content-Type: application/json" \
  -d '{"revision": 1, "reason": "agent-diagnosed failure"}'

# Verify the agent diagnosis counter incremented
curl -s http://localhost:8080/metrics | grep deployiq_agent
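
The same diagnose call can be made from Go. The sketch below only builds the request; the JSON keys and the endpoint path come from the curl example above, while the `DiagnoseRequest` type and `NewDiagnoseRequest` helper are hypothetical names introduced for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// DiagnoseRequest mirrors the JSON body of the curl example.
// The Go type is an assumption; only the JSON keys come from the README.
type DiagnoseRequest struct {
	Release     string  `json:"release"`
	Namespace   string  `json:"namespace"`
	Metric      string  `json:"metric"`
	Value       float64 `json:"value"`
	Threshold   float64 `json:"threshold"`
	AnomalyType string  `json:"anomaly_type"`
}

// NewDiagnoseRequest builds a POST to /api/v1/agent/diagnose on baseURL.
func NewDiagnoseRequest(baseURL string, body DiagnoseRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("encode diagnose request: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/agent/diagnose", bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build diagnose request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := NewDiagnoseRequest("http://localhost:8080", DiagnoseRequest{
		Release:     "sample-app",
		Namespace:   "default",
		Metric:      "error_rate",
		Value:       1.0,
		Threshold:   0.05,
		AnomalyType: "ImagePullBackOff",
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	// To send it against a running server: resp, err := http.DefaultClient.Do(req)
	fmt.Println(req.Method, req.URL.String())
}
```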

CLI Usage

# Deploy
./bin/deployiq deploy \
  --release api-server \
  --chart ./helm/sample-app \
  --namespace production \
  --image ghcr.io/myorg/api:v1.2.3

# Rollback
./bin/deployiq rollback --release api-server --revision 3

# Check deployment status
./bin/deployiq status --release api-server --namespace production

# List recent incidents
./bin/deployiq incidents list

# Run agent diagnosis manually
./bin/deployiq agent diagnose --deployment api-server --namespace production

API Reference

Method	Endpoint	Description
`GET`	`/healthz`	Liveness probe
`GET`	`/readyz`	Readiness probe
`GET`	`/metrics`	Prometheus metrics
`GET`	`/api/v1/deployments?namespace=default`	List deployments
`GET`	`/api/v1/deployments/{namespace}/{name}`	Get deployment
`POST`	`/api/v1/deployments`	Trigger deploy
`GET`	`/api/v1/incidents`	List recent incidents
`GET`	`/api/v1/incidents/{id}`	Get incident
`POST`	`/api/v1/incidents/{id}/resolve`	Resolve incident
`POST`	`/api/v1/agent/diagnose`	Run LLM diagnosis
`GET`	`/api/v1/releases/{namespace}/{release}/history`	Helm history
`POST`	`/api/v1/releases/{namespace}/{release}/rollback`	Rollback release

Configuration

Configure DeployIQ through deployiq.yaml or environment variables:

server:
  host: "0.0.0.0"
  port: 8080

database:
  host: localhost
  port: 5432
  user: deployiq
  password: deployiq
  name: deployiq

kafka:
  brokers: ["localhost:9092"]
  topic: "deployiq.events"
  group_id: "deployiq-consumer"

prometheus:
  url: "http://localhost:9090"

llm:
  provider: anthropic
  model: "claude-sonnet-4-20250514"
  api_key: ""          # or set ANTHROPIC_API_KEY env var

k8s:
  kubeconfig: ""       # defaults to ~/.kube/config
  in_cluster: false

Environment variables:

# Required for real LLM diagnosis
export ANTHROPIC_API_KEY=sk-ant-...

# Optional — Slack alerts
export SLACK_WEBHOOK_URL=https://hooks.slack.com/...

Custom Prometheus Metrics

Metric	Type	Labels	Description
`deployiq_deploys_total`	Counter	`namespace`, `status`	Total deploys by outcome
`deployiq_deploy_duration_seconds`	Histogram	`namespace`	Deploy duration
`deployiq_rollbacks_total`	Counter	`namespace`, `reason`	Total rollbacks
`deployiq_agent_diagnoses_total`	Counter	`severity`	Agent diagnoses by severity
`deployiq_health_check_failures_total`	Counter	`namespace`, `deployment`	Health check failures
`deployiq_active_deployments`	Gauge	`namespace`	Currently active deployments
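
To check one of these counters from a test or script instead of grepping `/metrics`, something like the following could work. It is a rough sketch using only the standard library: `SumCounter` is a hypothetical helper, the sample values are made up for the demo, and the parser handles only the simple `name{labels} value` lines of the Prometheus text format.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// SumCounter adds up every sample of the named metric in Prometheus text
// exposition output, across all label combinations.
func SumCounter(exposition, name string) (float64, error) {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The sample name ends at '{' (labels) or at the first space.
		end := strings.IndexAny(line, "{ ")
		if end < 0 || line[:end] != name {
			continue
		}
		rest := line[end:]
		if rest[0] == '{' {
			idx := strings.LastIndex(rest, "}")
			if idx < 0 {
				return 0, fmt.Errorf("malformed sample: %q", line)
			}
			rest = rest[idx+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("sample has no value: %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("parse value in %q: %w", line, err)
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	// Illustrative payload; the numbers are invented for this example.
	sample := `# TYPE deployiq_agent_diagnoses_total counter
deployiq_agent_diagnoses_total{severity="critical"} 2
deployiq_agent_diagnoses_total{severity="warning"} 1
deployiq_active_deployments{namespace="default"} 4
`
	total, err := SumCounter(sample, "deployiq_agent_diagnoses_total")
	fmt.Println(total, err)
}
```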

Project Structure

deployiq/
├── cmd/
│   ├── deployiq/          # CLI entrypoint (cobra commands)
│   └── server/            # API server entrypoint
├── internal/
│   ├── agent/             # LLM agent, tool definitions, RAG retriever
│   │   ├── agent.go       # Core diagnose loop + Anthropic client
│   │   ├── tools.go       # 5 tool implementations (real K8s calls)
│   │   ├── rag.go         # pgvector similarity retrieval
│   │   └── prompts.go     # System prompt + alert prompt builder
│   ├── api/               # HTTP router, handlers, middleware
│   ├── config/            # Viper config loading
│   ├── deployer/          # Helm SDK deploy/rollback + health checker
│   ├── incidents/         # PostgreSQL + pgvector incident store
│   ├── k8s/               # Kubernetes client wrapper (client-go)
│   ├── notifier/          # Slack notifications
│   ├── observability/     # Prometheus metrics + Grafana client
│   └── streaming/         # Kafka producer/consumer
├── helm/
│   ├── deployiq/          # DeployIQ server Helm chart
│   └── sample-app/        # Sample app for testing
├── terraform/
│   ├── modules/
│   │   ├── k8s-cluster/   # kind cluster
│   │   ├── monitoring/    # kube-prometheus-stack
│   │   ├── kafka/         # Strimzi operator
│   │   └── database/      # PostgreSQL + pgvector
│   └── environments/
│       ├── local/         # kind-based local setup
│       └── cloud/         # EKS-based cloud setup
├── dashboards/            # Grafana dashboard JSON
├── alerting/              # Prometheus alerting rules
├── docs/                  # Architecture, agent tools, setup docs
├── scripts/               # Setup, seed, demo scripts
├── docker-compose.yml     # Local dev stack
├── Dockerfile             # Multi-stage build (<20MB)
└── Makefile               # build, test, lint, local-up, seed

Tech Stack

Layer	Technology
Language	Go 1.22, CGO disabled, static binaries
CLI	cobra + viper
HTTP	go-chi/chi v5
Kubernetes	client-go + helm.sh/helm/v3 SDK
LLM	anthropic-sdk-go v1.30 (tool use)
Embeddings	OpenAI text-embedding-3-small (1536-dim)
Vector DB	pgvector (IVFFlat, cosine ops, 100 lists)
Database driver	pgx/v5
Event streaming	segmentio/kafka-go
Observability	prometheus/client_golang v1.23
Dashboards	Grafana HTTP API
IaC	Terraform 1.7+ with Helm + Kubernetes providers
Containers	Docker multi-stage, alpine:3.19 final, <20MB
CI/CD	GitHub Actions, GHCR, multi-arch (amd64/arm64)
Local K8s	kind (Kubernetes in Docker)
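
For readers unfamiliar with the "cosine ops" entry in the Vector DB row, the measure itself is small enough to show in Go. This is an illustrative sketch, not code from the repository: `CosineSimilarity` is a hypothetical helper, and in DeployIQ the comparison is done inside PostgreSQL by pgvector, whose `<=>` operator returns cosine distance (1 minus this similarity).

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// CosineSimilarity returns dot(a, b) / (|a| * |b|) for two embeddings of
// equal length. Values near 1 mean the vectors point the same way.
func CosineSimilarity(a, b []float64) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, errors.New("vectors must be non-empty and of equal length")
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("cosine similarity is undefined for a zero vector")
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB)), nil
}

func main() {
	// Tiny 3-dimensional stand-ins for the 1536-dimensional embeddings.
	current := []float64{0.9, 0.1, 0.3}
	past := []float64{0.8, 0.2, 0.4}
	sim, err := CosineSimilarity(current, past)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("similarity=%.2f distance=%.2f\n", sim, 1-sim)
}
```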

Development

make build        # compile CLI + server to ./bin/
make test         # run all tests with race detector
make lint         # golangci-lint
make local-up     # start docker-compose services
make local-down   # stop docker-compose services
make seed         # seed sample incidents into pgvector
make docker-build # build Docker image

Running tests without external dependencies:

# All tests use stubs — no API key, no DB, no K8s cluster needed
go test ./... -race -cover

How It Works Without an API Key

DeployIQ ships with a StubLLMClient that drives a realistic 4-step tool-use sequence against your real cluster:

StubLLM → query_metrics (hits real Prometheus)
        → read_logs     (pulls real pod logs from K8s)
        → search_past_incidents (searches real incident store)
        → get_deployment_info  (reads real deployment state)
        → structured JSON diagnosis

Every tool call hits your actual cluster; only the final reasoning is canned. Set ANTHROPIC_API_KEY to get real Claude reasoning over the same data.
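
One way to wire up this fallback is to pick the client from the API key at startup. The sketch below assumes that approach. `StubLLMClient` is the name used in this README, but its `Plan` field, the `LLMClient` interface, `AnthropicClient`, and `NewLLMClient` are hypothetical and may not match the real code.

```go
package main

import (
	"fmt"
	"os"
)

// LLMClient is a hypothetical minimal interface shared by both clients.
type LLMClient interface {
	Name() string
}

// StubLLMClient replays a fixed tool sequence. The Plan field is an
// assumption made for this sketch.
type StubLLMClient struct {
	Plan []string
}

func (StubLLMClient) Name() string { return "stub" }

// AnthropicClient stands in for the real Claude-backed client.
type AnthropicClient struct {
	apiKey string
}

func (AnthropicClient) Name() string { return "anthropic" }

// NewLLMClient returns the stub when no API key is configured.
func NewLLMClient(apiKey string) LLMClient {
	if apiKey == "" {
		return StubLLMClient{Plan: []string{
			"query_metrics",
			"read_logs",
			"search_past_incidents",
			"get_deployment_info",
		}}
	}
	return AnthropicClient{apiKey: apiKey}
}

func main() {
	client := NewLLMClient(os.Getenv("ANTHROPIC_API_KEY"))
	fmt.Println("using LLM client:", client.Name())
}
```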

Kafka Event Schema

// DeployEvent is the message published to the Kafka events topic.
type DeployEvent struct {
    // Type is one of DeployStarted, DeployCompleted, HealthCheckFailed,
    // AnomalyDetected, or RollbackInitiated.
    Type      EventType
    Release   string
    Namespace string
    Revision  int
    Image     string
    Metric    string
    Value     float64
    Threshold float64
    Timestamp time.Time
}
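
The README does not state the wire format, so the following is a sketch that assumes events are JSON-encoded and that `EventType` is a string. The JSON tags and the `Encode`, `Decode`, and `IsAnomalous` helpers are hypothetical additions, and the field values are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// EventType is assumed to be a string enum; the real definition may differ.
type EventType string

const (
	DeployStarted     EventType = "DeployStarted"
	DeployCompleted   EventType = "DeployCompleted"
	HealthCheckFailed EventType = "HealthCheckFailed"
	AnomalyDetected   EventType = "AnomalyDetected"
	RollbackInitiated EventType = "RollbackInitiated"
)

// DeployEvent repeats the schema above with assumed JSON tags.
type DeployEvent struct {
	Type      EventType `json:"type"`
	Release   string    `json:"release"`
	Namespace string    `json:"namespace"`
	Revision  int       `json:"revision"`
	Image     string    `json:"image"`
	Metric    string    `json:"metric"`
	Value     float64   `json:"value"`
	Threshold float64   `json:"threshold"`
	Timestamp time.Time `json:"timestamp"`
}

// IsAnomalous reports whether an anomaly event's value exceeds its threshold.
func (e DeployEvent) IsAnomalous() bool {
	return e.Type == AnomalyDetected && e.Value > e.Threshold
}

// Encode serializes an event for use as a Kafka message value.
func Encode(e DeployEvent) ([]byte, error) { return json.Marshal(e) }

// Decode parses a Kafka message value back into an event.
func Decode(raw []byte) (DeployEvent, error) {
	var e DeployEvent
	err := json.Unmarshal(raw, &e)
	return e, err
}

func main() {
	ev := DeployEvent{
		Type:      AnomalyDetected,
		Release:   "sample-app",
		Namespace: "default",
		Revision:  2,
		Metric:    "error_rate",
		Value:     0.82,
		Threshold: 0.05,
		Timestamp: time.Now().UTC(),
	}
	raw, err := Encode(ev)
	if err != nil {
		fmt.Println(err)
		return
	}
	decoded, err := Decode(raw)
	fmt.Println(string(raw))
	fmt.Println("anomalous:", decoded.IsAnomalous(), "err:", err)
}
```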

Roadmap

Slack alert integration with diagnosis summary
pgvector store for production (currently in-memory for dev)
Grafana dashboard auto-provisioning via API
Multi-cluster support
Web UI for incident timeline
OpenTelemetry traces through the agent loop

Contributing

PRs and issues welcome. See CONTRIBUTING.md.

1. Fork the repo
2. Create a feature branch (git checkout -b feat/my-feature)
3. Make changes, then run make test && make lint
4. Submit a PR that describes what you changed and why

License

MIT — see LICENSE.

Built with Go, Claude API, and a healthy fear of 3am pages.

Documentation · Architecture · Agent Tools · Local Setup
