DeployIQ

AI-powered Kubernetes deployment pipeline that autonomously detects failures, diagnoses root causes with an LLM agent, and rolls back — before your on-call engineer gets paged.


Deploy → Watch → Detect → Diagnose → Rollback
  │        │        │          │          │
Helm     Kafka   Prometheus  Claude    Helm SDK
 SDK    Events   Anomalies  + pgvector  <30s

What Is DeployIQ?

DeployIQ is a production-grade deployment automation system built in Go. You push code; DeployIQ deploys it via Helm and watches for failures through Kafka event streaming. When something goes wrong, an LLM agent powered by Claude iteratively calls real Kubernetes tools to diagnose the root cause, then executes a rollback automatically.

No more waking someone up at 3am for an ImagePullBackOff.


Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          DeployIQ System                            │
│                                                                     │
│  ┌──────────┐    ┌─────────────┐    ┌────────────────────────────┐ │
│  │  CLI     │    │  REST API   │    │    LLM Diagnostic Agent    │ │
│  │ (cobra)  │───▶│  (chi)      │───▶│                            │ │
│  │          │    │             │    │  ┌──────────────────────┐  │ │
│  │ deploy   │    │ /deploy     │    │  │  Claude API          │  │ │
│  │ rollback │    │ /diagnose   │    │  │  + Tool-Use Loop     │  │ │
│  │ status   │    │ /incidents  │    │  │                      │  │ │
│  │ incidents│    │ /rollback   │    │  │  query_metrics   ──▶ │  │ │
│  └──────────┘    └─────────────┘    │  │  read_logs       ──▶ │  │ │
│                                     │  │  search_incidents──▶ │  │ │
│  ┌──────────┐    ┌─────────────┐    │  │  get_deploy_info ──▶ │  │ │
│  │  Helm    │    │   Kafka     │    │  │  execute_rollback──▶ │  │ │
│  │  SDK     │    │  Streaming  │    │  └──────────────────────┘  │ │
│  │          │    │             │    │             │               │ │
│  │ Deploy() │    │ DeployStart │    │             ▼               │ │
│  │ Rollback │    │ HealthFail  │───▶│      pgvector RAG          │ │
│  │ History()│    │ Anomaly     │    │  (past incident lookup)    │ │
│  └──────────┘    └─────────────┘    └────────────────────────────┘ │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Observability Stack                                         │  │
│  │  Prometheus ──▶ Grafana    Alertmanager ──▶ Slack           │  │
│  │  6 custom metrics          alerting rules                   │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Infrastructure (Terraform)                                  │  │
│  │  kind / EKS  │  kube-prometheus-stack  │  Strimzi  │  pgvec │  │
│  └──────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

The Agent Loop

When an anomaly is detected, the agent runs an iterative tool-use loop. Each step calls a real backend: Prometheus, the Kubernetes API, or the pgvector incident store.

Alert received (error_rate=0.82, threshold=0.05)
  │
  ▼
① query_metrics  ──▶  Prometheus: error_rate=82%, p99=4.2s
  │
  ▼
② read_logs      ──▶  K8s pod logs: NullPointerException in PaymentProcessor
  │
  ▼
③ search_past_incidents ──▶  pgvector cosine sim: similar incident (0.91) resolved by rollback
  │
  ▼
④ get_deployment_info   ──▶  K8s: 0/3 pods ready, revision 2 deployed 4m ago
  │
  ▼
⑤ Structured diagnosis:
   {
     "severity": "critical",
     "root_cause": "Null pointer in PaymentProcessor.process() — code regression in v2",
     "recommended_action": "Rollback to revision 1 immediately",
     "rollback_performed": true
   }
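
The loop above can be sketched in Go as follows. This is a minimal illustration, not the project's agent.go: the Model interface, the Step and ToolCall types, and the scriptedModel stand-in are all hypothetical names, and tool results are reduced to plain strings. It shows the two properties the diagram implies: tool results are fed back to the model on every turn, and the loop is bounded by an iteration cap.

```go
package main

import (
	"errors"
	"fmt"
)

// ToolCall is a request from the model to run one named tool.
type ToolCall struct {
	Name string
	Args map[string]string
}

// Step is one model turn: either a tool call or a final diagnosis.
type Step struct {
	Call      *ToolCall
	Diagnosis string // set only when Call is nil
}

// Model abstracts the LLM. history holds the tool results gathered so far.
type Model interface {
	Next(history []string) Step
}

// Tool queries some backend and returns text for the model to read.
type Tool func(args map[string]string) (string, error)

// ErrMaxIterations is returned when the cap is hit before a diagnosis.
var ErrMaxIterations = errors.New("agent: iteration limit reached without a diagnosis")

// Diagnose alternates model turns and tool executions until the model
// returns a diagnosis or maxIter turns have been used.
func Diagnose(m Model, tools map[string]Tool, maxIter int) (string, error) {
	var history []string
	for i := 0; i < maxIter; i++ {
		step := m.Next(history)
		if step.Call == nil {
			return step.Diagnosis, nil
		}
		tool, ok := tools[step.Call.Name]
		if !ok {
			// Report the problem to the model instead of aborting.
			history = append(history, "error: unknown tool "+step.Call.Name)
			continue
		}
		out, err := tool(step.Call.Args)
		if err != nil {
			out = "error: " + err.Error()
		}
		history = append(history, step.Call.Name+": "+out)
	}
	return "", ErrMaxIterations
}

// scriptedModel is a stand-in that requests a fixed tool sequence and
// then returns a canned diagnosis.
type scriptedModel struct{ plan []string }

func (s scriptedModel) Next(history []string) Step {
	if len(history) < len(s.plan) {
		return Step{Call: &ToolCall{Name: s.plan[len(history)]}}
	}
	return Step{Diagnosis: fmt.Sprintf("diagnosis based on %d tool results", len(history))}
}

func main() {
	tools := map[string]Tool{
		"query_metrics": func(map[string]string) (string, error) { return "error_rate=0.82", nil },
		"read_logs":     func(map[string]string) (string, error) { return "NullPointerException", nil },
	}
	m := scriptedModel{plan: []string{"query_metrics", "read_logs"}}
	diag, err := Diagnose(m, tools, 5)
	fmt.Println(diag, err)
}
```

Feeding tool errors back into the history, rather than returning early, lets the model choose a different tool on the next turn; the cap keeps a confused model from looping indefinitely.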

Features

Feature                        Details
🚀 Helm-native deploys         Helm SDK, no shell-out: install, upgrade, rollback, history
🤖 LLM diagnostic agent        Claude API with 5 tools, iterative loop (max 5 iterations)
🧠 RAG over past incidents     pgvector cosine similarity: diagnoses grounded in incident history
📡 Real-time event streaming   Kafka: typed events, consumer groups, graceful shutdown
📊 Custom observability        6 Prometheus metrics, auto-provisioned Grafana dashboards
🔄 Auto-rollback               Agent-triggered or manual via API, <30s mean time to rollback
🏗️ Full IaC                    Terraform modules for K8s, monitoring, Kafka, PostgreSQL
🐳 Production Docker           Multi-stage build, <20MB alpine image, multi-arch (amd64/arm64)
CI/CD pipeline                 GitHub Actions: test + lint + build + 70% coverage gate + GHCR push
🔌 Works without API key       StubLLMClient drives a realistic tool-use sequence for local dev

Quick Start

Prerequisites: Go 1.22+, Docker, kind, helm, kubectl

# 1. Clone
git clone https://github.com/dhruv7539/deployiq.git
cd deployiq

# 2. Start local services (Postgres, Kafka, Prometheus, Grafana)
make local-up

# 3. Build
make build

# 4. Run the server (works without API key — uses stub LLM)
./bin/deployiq-server

Open:


Try It End-to-End

# Deploy a sample app to your local kind cluster
helm install sample-app ./helm/sample-app --namespace default

# Inject a failure (bad image)
kubectl set image deployment/sample-app app=nginx:does-not-exist

# Watch DeployIQ detect and diagnose it
curl -X POST http://localhost:8080/api/v1/agent/diagnose \
  -H "Content-Type: application/json" \
  -d '{
    "release": "sample-app",
    "namespace": "default",
    "metric": "error_rate",
    "value": 1.0,
    "threshold": 0.05,
    "anomaly_type": "ImagePullBackOff"
  }'

# Check the stored incident
curl http://localhost:8080/api/v1/incidents | jq .

# Rollback
curl -X POST http://localhost:8080/api/v1/releases/default/sample-app/rollback \
  -H "Content-Type: application/json" \
  -d '{"revision": 1, "reason": "agent-diagnosed failure"}'

# Verify Prometheus counter incremented
curl -s http://localhost:8080/metrics | grep deployiq_agent

CLI Usage

# Deploy
./bin/deployiq deploy \
  --release api-server \
  --chart ./helm/sample-app \
  --namespace production \
  --image ghcr.io/myorg/api:v1.2.3

# Rollback
./bin/deployiq rollback --release api-server --revision 3

# Check deployment status
./bin/deployiq status --release api-server --namespace production

# List recent incidents
./bin/deployiq incidents list

# Run agent diagnosis manually
./bin/deployiq agent diagnose --deployment api-server --namespace production

API Reference

Method  Endpoint                                          Description
GET     /healthz                                          Liveness probe
GET     /readyz                                           Readiness probe
GET     /metrics                                          Prometheus metrics
GET     /api/v1/deployments?namespace=default             List deployments
GET     /api/v1/deployments/{namespace}/{name}            Get deployment
POST    /api/v1/deployments                               Trigger deploy
GET     /api/v1/incidents                                 List recent incidents
GET     /api/v1/incidents/{id}                            Get incident
POST    /api/v1/incidents/{id}/resolve                    Resolve incident
POST    /api/v1/agent/diagnose                            Run LLM diagnosis
GET     /api/v1/releases/{namespace}/{release}/history    Helm history
POST    /api/v1/releases/{namespace}/{release}/rollback   Rollback release

Configuration

Configure DeployIQ through deployiq.yaml or environment variables:

server:
  host: "0.0.0.0"
  port: 8080

database:
  host: localhost
  port: 5432
  user: deployiq
  password: deployiq
  name: deployiq

kafka:
  brokers: ["localhost:9092"]
  topic: "deployiq.events"
  group_id: "deployiq-consumer"

prometheus:
  url: "http://localhost:9090"

llm:
  provider: anthropic
  model: "claude-sonnet-4-20250514"
  api_key: ""          # or set ANTHROPIC_API_KEY env var

k8s:
  kubeconfig: ""       # defaults to ~/.kube/config
  in_cluster: false

Environment variables:

# Required for real LLM diagnosis
export ANTHROPIC_API_KEY=sk-ant-...

# Optional — Slack alerts
export SLACK_WEBHOOK_URL=https://hooks.slack.com/...
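
The api_key comment and the stub fallback suggest a resolution step like the one sketched below. This is an assumption about behavior, not the project's config package: LLMConfig, ResolveAPIKey, and ClientKind are hypothetical names, and the precedence shown (file value first, then ANTHROPIC_API_KEY) is a guess that the real Viper setup may reverse.

```go
package main

import (
	"fmt"
	"os"
)

// LLMConfig holds the llm section of deployiq.yaml.
type LLMConfig struct {
	Provider string
	Model    string
	APIKey   string
}

// ResolveAPIKey returns the key from the config file if set, otherwise
// the value of ANTHROPIC_API_KEY looked up through getenv.
// The precedence is an assumption made for this sketch.
func ResolveAPIKey(cfg LLMConfig, getenv func(string) string) string {
	if cfg.APIKey != "" {
		return cfg.APIKey
	}
	return getenv("ANTHROPIC_API_KEY")
}

// ClientKind reports which client a server could construct for a key:
// the stub when no key is available, the real client otherwise.
func ClientKind(apiKey string) string {
	if apiKey == "" {
		return "stub"
	}
	return "anthropic"
}

func main() {
	cfg := LLMConfig{Provider: "anthropic", Model: "claude-sonnet-4-20250514"}
	key := ResolveAPIKey(cfg, os.Getenv)
	fmt.Println("client:", ClientKind(key))
}
```

Passing getenv as a parameter keeps the function testable without touching the process environment.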

Custom Prometheus Metrics

Metric                                 Type       Labels                  Description
deployiq_deploys_total                 Counter    namespace, status       Total deploys by outcome
deployiq_deploy_duration_seconds       Histogram  namespace               Deploy duration
deployiq_rollbacks_total               Counter    namespace, reason       Total rollbacks
deployiq_agent_diagnoses_total         Counter    severity                Agent diagnoses by severity
deployiq_health_check_failures_total   Counter    namespace, deployment   Health check failures
deployiq_active_deployments            Gauge      namespace               Currently active deployments
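
To show what a labeled counter from the table looks like on /metrics, here is a dependency-free sketch. DeployIQ uses prometheus/client_golang; the CounterVec type below is a deliberately simplified stand-in written for this example (it is not the library's type of the same name) and it only handles label values that need no special escaping.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// CounterVec is a tiny stand-in for a labeled Prometheus counter.
type CounterVec struct {
	mu     sync.Mutex
	name   string
	labels []string
	values map[string]float64 // keyed by the rendered label set
}

// NewCounterVec creates a counter with the given label names.
func NewCounterVec(name string, labels ...string) *CounterVec {
	return &CounterVec{name: name, labels: labels, values: map[string]float64{}}
}

// Inc adds 1 to the series identified by labelValues, which must be
// given in the same order as the label names.
func (c *CounterVec) Inc(labelValues ...string) {
	if len(labelValues) != len(c.labels) {
		panic("label count mismatch")
	}
	pairs := make([]string, len(c.labels))
	for i, l := range c.labels {
		pairs[i] = fmt.Sprintf("%s=%q", l, labelValues[i])
	}
	key := strings.Join(pairs, ",")
	c.mu.Lock()
	c.values[key]++
	c.mu.Unlock()
}

// Render returns all series in Prometheus text exposition format,
// sorted by label set so the output is stable.
func (c *CounterVec) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.values))
	for k := range c.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s} %g\n", c.name, k, c.values[k])
	}
	return b.String()
}

func main() {
	rollbacks := NewCounterVec("deployiq_rollbacks_total", "namespace", "reason")
	rollbacks.Inc("default", "agent")
	rollbacks.Inc("default", "agent")
	rollbacks.Inc("production", "manual")
	fmt.Print(rollbacks.Render())
}
```

Each distinct combination of label values becomes its own time series, which is why labels such as namespace and reason should stay low-cardinality.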

Project Structure

deployiq/
├── cmd/
│   ├── deployiq/          # CLI entrypoint (cobra commands)
│   └── server/            # API server entrypoint
├── internal/
│   ├── agent/             # LLM agent, tool definitions, RAG retriever
│   │   ├── agent.go       # Core diagnose loop + Anthropic client
│   │   ├── tools.go       # 5 tool implementations (real K8s calls)
│   │   ├── rag.go         # pgvector similarity retrieval
│   │   └── prompts.go     # System prompt + alert prompt builder
│   ├── api/               # HTTP router, handlers, middleware
│   ├── config/            # Viper config loading
│   ├── deployer/          # Helm SDK deploy/rollback + health checker
│   ├── incidents/         # PostgreSQL + pgvector incident store
│   ├── k8s/               # Kubernetes client wrapper (client-go)
│   ├── notifier/          # Slack notifications
│   ├── observability/     # Prometheus metrics + Grafana client
│   └── streaming/         # Kafka producer/consumer
├── helm/
│   ├── deployiq/          # DeployIQ server Helm chart
│   └── sample-app/        # Sample app for testing
├── terraform/
│   ├── modules/
│   │   ├── k8s-cluster/   # kind cluster
│   │   ├── monitoring/    # kube-prometheus-stack
│   │   ├── kafka/         # Strimzi operator
│   │   └── database/      # PostgreSQL + pgvector
│   └── environments/
│       ├── local/         # kind-based local setup
│       └── cloud/         # EKS-based cloud setup
├── dashboards/            # Grafana dashboard JSON
├── alerting/              # Prometheus alerting rules
├── docs/                  # Architecture, agent tools, setup docs
├── scripts/               # Setup, seed, demo scripts
├── docker-compose.yml     # Local dev stack
├── Dockerfile             # Multi-stage build (<20MB)
└── Makefile               # build, test, lint, local-up, seed

Tech Stack

Layer             Technology
Language          Go 1.22, CGO disabled, static binaries
CLI               cobra + viper
HTTP              go-chi/chi v5
Kubernetes        client-go + helm.sh/helm/v3 SDK
LLM               anthropic-sdk-go v1.30 (function calling)
Embeddings        OpenAI text-embedding-3-small (1536-dim)
Vector DB         pgvector (IVFFlat, cosine ops, 100 lists)
Database driver   pgx/v5
Event streaming   segmentio/kafka-go
Observability     prometheus/client_golang v1.23
Dashboards        Grafana HTTP API
IaC               Terraform 1.7+ with Helm + Kubernetes providers
Containers        Docker multi-stage, alpine:3.19 final, <20MB
CI/CD             GitHub Actions, GHCR, multi-arch (amd64/arm64)
Local K8s         kind (Kubernetes in Docker)
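
The vector lookup behind the incident search ranks stored embeddings by cosine similarity to the query embedding. The sketch below shows that ranking as an exact, in-memory brute-force scan; in DeployIQ the work is done inside PostgreSQL by pgvector, whose IVFFlat index performs an approximate search instead. Incident, Match, and TopK are hypothetical names for illustration, and the two-dimensional vectors are toy values standing in for 1536-dimensional embeddings.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"sort"
)

// CosineSimilarity returns dot(a, b) / (|a| * |b|).
func CosineSimilarity(a, b []float64) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, errors.New("vectors must be non-empty and the same length")
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("cosine similarity is undefined for a zero vector")
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB)), nil
}

// Incident is a stored incident with its embedding.
type Incident struct {
	ID        string
	Embedding []float64
}

// Match pairs an incident ID with its similarity to the query.
type Match struct {
	ID    string
	Score float64
}

// TopK scores every incident against query and returns the k best
// matches, highest similarity first.
func TopK(query []float64, incidents []Incident, k int) ([]Match, error) {
	matches := make([]Match, 0, len(incidents))
	for _, inc := range incidents {
		score, err := CosineSimilarity(query, inc.Embedding)
		if err != nil {
			return nil, fmt.Errorf("incident %s: %w", inc.ID, err)
		}
		matches = append(matches, Match{ID: inc.ID, Score: score})
	}
	sort.Slice(matches, func(i, j int) bool { return matches[i].Score > matches[j].Score })
	if k < 0 {
		k = 0
	}
	if k < len(matches) {
		matches = matches[:k]
	}
	return matches, nil
}

func main() {
	incidents := []Incident{
		{ID: "image-pull-failure", Embedding: []float64{0.1, 0.9}},
		{ID: "payment-npe", Embedding: []float64{0.9, 0.2}},
	}
	top, err := TopK([]float64{1, 0.1}, incidents, 1)
	fmt.Println(top, err)
}
```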

Development

make build        # compile CLI + server to ./bin/
make test         # run all tests with race detector
make lint         # golangci-lint
make local-up     # start docker-compose services
make local-down   # stop docker-compose services
make seed         # seed sample incidents into pgvector
make docker-build # build Docker image

Running tests without external dependencies:

# All tests use stubs — no API key, no DB, no K8s cluster needed
go test ./... -race -cover

How It Works Without an API Key

DeployIQ ships with a StubLLMClient that drives a realistic 4-step tool-use sequence against your real cluster:

StubLLM → query_metrics (hits real Prometheus)
        → read_logs     (pulls real pod logs from K8s)
        → search_past_incidents (searches real incident store)
        → get_deployment_info  (reads real deployment state)
        → structured JSON diagnosis

Every tool call hits your actual cluster. Only the final reasoning is canned. Swap in ANTHROPIC_API_KEY for real Claude reasoning over the data.


Kafka Event Schema

type DeployEvent struct {
    // Type is one of DeployStarted, DeployCompleted, HealthCheckFailed,
    // AnomalyDetected, or RollbackInitiated.
    Type      EventType
    Release   string
    Namespace string
    Revision  int
    Image     string
    Metric    string
    Value     float64
    Threshold float64
    Timestamp time.Time
}

Roadmap

  • Slack alert integration with diagnosis summary
  • pgvector store for production (currently in-memory for dev)
  • Grafana dashboard auto-provisioning via API
  • Multi-cluster support
  • Web UI for incident timeline
  • OpenTelemetry traces through the agent loop

Contributing

PRs and issues welcome. See CONTRIBUTING.md.

  1. Fork the repo
  2. Create a feature branch (git checkout -b feat/my-feature)
  3. Make changes, run make test && make lint
  4. Submit a PR — describe what you changed and why

License

MIT — see LICENSE.


Built with Go, Claude API, and a healthy fear of 3am pages.

Documentation · Architecture · Agent Tools · Local Setup
