Implement provisioner auto-scaling and capacity guidelines #8

Description

Problem

During the Sept 30 workshop, provisioner capacity (6 replicas for the default org) became a bottleneck when ~10 users deployed workspaces simultaneously. Provisioners are on the critical path for workspace create/delete/update operations, so insufficient capacity causes timeouts and a poor user experience.

Context

Current State:

  • Default org: 6 replicas @ 500m CPU / 512 MB memory each
  • Experimental org: 2 replicas
  • Demo org: 2 replicas
  • Manual scaling required before workshops

Current Limitations:

  • Terraform runs are single-threaded (1 provisioner = 1 concurrent operation)
  • Each workspace create/delete/update occupies 1 provisioner
  • No auto-scaling based on queue depth
  • No clear guidelines on when to scale

Requirements

Capacity Planning Guidelines

  • Document scaling recommendations (replica counts are for the default org):
    • <10 concurrent users: 6 replicas (current default)
    • 10-15 concurrent users: 8 replicas
    • 15-20 concurrent users: 10 replicas
    • 20-30 concurrent users: 12-15 replicas
  • Add to pre-workshop checklist (Create pre-workshop validation checklist and runbook #4)
  • Add to workshop planning guide

Manual Scaling Procedures

  • Document commands for scaling each org's provisioners (a verification sketch follows this list):
    kubectl scale deployment coder-provisioner-default -n coder --replicas=10
    kubectl scale deployment coder-provisioner-experimental -n coder --replicas=4
    kubectl scale deployment coder-provisioner-demo -n coder --replicas=4
  • Add to incident runbook (Create pre-workshop validation checklist and runbook #4) ✅ (already added)
  • Create pre-workshop scaling checklist item
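
A quick way to confirm a scale-out completed (a sketch only; deployment names as in the commands above):

  # Wait until all new provisioner replicas are Ready
  kubectl rollout status deployment/coder-provisioner-default -n coder
  # Compare desired vs. ready replica counts across provisioner deployments
  kubectl get deployments -n coder | grep provisioner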

Auto-Scaling Implementation (Long-term)

  • Investigate Horizontal Pod Autoscaler (HPA) for provisioners (see the HPA example under Implementation Notes)
  • Define custom metrics for provisioner queue depth
  • Implement HPA based on:
    • Provisioner queue depth
    • CPU/memory utilization
    • Active Terraform jobs
  • Test auto-scaling behavior under load
  • Document auto-scaling configuration in Terraform

Resource Limit Optimization

  • Evaluate if 500m CPU / 512 MB is sufficient
  • Monitor for OOMKilled or CPU throttling events
  • Consider increasing to 1 CPU / 1 GB if needed (see the sketch after this list)
  • Document resource limit adjustment procedure
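
One way to check for memory/CPU pressure and to apply the bump is sketched below; since the provisioner deployments are managed as code, the new limits should be persisted in Terraform/Helm rather than patched ad hoc:

  # List provisioner pods, then inspect one for an OOMKilled last state
  kubectl get pods -n coder | grep provisioner
  kubectl describe pod <provisioner-pod-name> -n coder | grep -A 3 "Last State"

  # Rough CPU/memory usage per pod (requires metrics-server)
  kubectl top pods -n coder

  # Temporary bump to 1 CPU / 1Gi while the IaC change lands
  kubectl set resources deployment coder-provisioner-default -n coder \
    --requests=cpu=1,memory=1Gi --limits=cpu=1,memory=1Gi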

Monitoring & Alerting

  • Add metrics for:
    • Provisioner queue depth
    • Provisioner job duration (p50, p95, p99)
    • Provisioner failure rate
    • Number of active provisioner replicas
  • Alert when (example rule sketched after this list):
    • Queue depth > 5 jobs for >2 minutes
    • Provisioner failure rate > 5%
    • Average job duration > 5 minutes
  • Add to monitoring dashboard (Implement comprehensive resource monitoring and alerting #6)
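
As a starting point, the queue-depth alert could look like the Prometheus rule below; provisioner_jobs_queued is a placeholder metric name to be replaced by whatever queue-depth metric is actually exported (see #6):

groups:
- name: coder-provisioner-capacity
  rules:
  - alert: ProvisionerQueueBacklog
    # provisioner_jobs_queued is a placeholder; swap in the real exported metric
    expr: provisioner_jobs_queued > 5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Provisioner job queue depth has been above 5 for more than 2 minutes"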

Success Criteria

  • Clear scaling guidelines available for workshop planning
  • Manual scaling can be performed in <2 minutes
  • (Long-term) Auto-scaling triggers before users experience delays
  • Zero workspace timeouts due to provisioner capacity during workshops
  • Provisioner resource usage optimized (no OOMKills)

Implementation Notes

HPA Example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-provisioner-default-hpa
  namespace: coder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-provisioner-default
  minReplicas: 6
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
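
Resource-based scaling like this relies on the cluster's metrics API (metrics-server or equivalent). Once applied, the HPA's behavior can be observed with standard kubectl commands; the file name below is arbitrary:

  kubectl apply -f coder-provisioner-default-hpa.yaml
  kubectl get hpa coder-provisioner-default-hpa -n coder --watch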

Queue Depth Metric (requires custom implementation):

  • Coder API may expose provisioner job queue
  • Export as Prometheus metric
  • Use for HPA scaling decisions (see the sketch below)
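
If such a metric were exported (for example through prometheus-adapter), it could be added to the metrics list of the HPA example above. A sketch only; provisioner_jobs_queued is the same placeholder metric name used in the alerting example:

  # Additional entry for the HPA's spec.metrics list, assuming the placeholder
  # metric provisioner_jobs_queued is served via the external metrics API
  - type: External
    external:
      metric:
        name: provisioner_jobs_queued
      target:
        type: AverageValue
        averageValue: "1"   # target roughly one queued job per replica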

Related

Sept 30 Workshop Postmortem
Incident Runbook - High Resource Contention
Incident Runbook - Provisioner Failures
#1 (Storage optimization)
#6 (Monitoring and alerting)
