Problem
During the Sept 30 workshop, provisioner capacity (6 replicas for the default org) became a bottleneck when ~10 users deployed workspaces simultaneously. Provisioners are critical for workspace create/delete/update operations, and insufficient capacity causes timeouts and a poor user experience.
Context
Current State:
- Default org: 6 replicas @ 500m CPU / 512 MB memory each
- Experimental org: 2 replicas
- Demo org: 2 replicas
- Manual scaling required before workshops
Current Limitations:
- Terraform runs are single-threaded (1 provisioner = 1 concurrent operation)
- Each workspace create/delete/update occupies 1 provisioner
- No auto-scaling based on queue depth
- No clear guidelines on when to scale
Requirements
Capacity Planning Guidelines
- Document scaling recommendations:
  - <10 concurrent users: 6 replicas (current default)
  - 10-15 concurrent users: 8 replicas
  - 15-20 concurrent users: 10 replicas
  - 20-30 concurrent users: 12-15 replicas
- Add to pre-workshop checklist (Create pre-workshop validation checklist and runbook #4)
- Add to workshop planning guide
Manual Scaling Procedures
- Document commands for scaling each org's provisioners (see the verification sketch after this list):
  kubectl scale deployment coder-provisioner-default -n coder --replicas=10
  kubectl scale deployment coder-provisioner-experimental -n coder --replicas=4
  kubectl scale deployment coder-provisioner-demo -n coder --replicas=4
- Add to incident runbook (Create pre-workshop validation checklist and runbook #4) ✅ (already added)
- Create pre-workshop scaling checklist item
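For the checklist item, scaling plus a quick readiness check could look like the following sketch (deployment names are taken from the commands above; adjust per environment):

# Scale the default org's provisioners ahead of a workshop
kubectl scale deployment coder-provisioner-default -n coder --replicas=10

# Block until the new replicas are ready; fails if the rollout stalls
kubectl rollout status deployment/coder-provisioner-default -n coder --timeout=120s

# Confirm desired vs. ready counts for all provisioner deployments
kubectl get deployments -n coder | grep provisioner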
Auto-Scaling Implementation (Long-term)
- Investigate Horizontal Pod Autoscaler (HPA) for provisioners
- Define custom metrics for provisioner queue depth
- Implement HPA based on:
  - Provisioner queue depth
  - CPU/memory utilization
  - Active Terraform jobs
- Test auto-scaling behavior under load
- Document auto-scaling configuration in Terraform
Resource Limit Optimization
- Evaluate if 500m CPU / 512 MB is sufficient
- Monitor for OOMKilled or CPU throttling events
- Consider increasing to 1 CPU / 1 GB if needed
- Document the resource limit adjustment procedure (a kubectl sketch follows this list)
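A sketch of how the check and adjustment could look with plain kubectl (deployment name taken from the scaling commands above; if the deployments are managed by Helm or Terraform, persist the change there instead, since a direct edit will be reverted on the next apply):

# Look for OOMKills and restarts on the provisioner pods
kubectl get pods -n coder | grep provisioner
kubectl describe pods -n coder | grep -iE "oomkilled|restart count"

# Raise requests/limits in place to the proposed 1 CPU / 1 Gi
kubectl set resources deployment coder-provisioner-default -n coder \
  --requests=cpu=1,memory=1Gi --limits=cpu=1,memory=1Gi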
Monitoring & Alerting
- Add metrics for:
  - Provisioner queue depth
  - Provisioner job duration (p50, p95, p99)
  - Provisioner failure rate
  - Number of active provisioner replicas
- Alert when (a sample rule sketch follows this list):
  - Queue depth > 5 jobs for >2 minutes
  - Provisioner failure rate > 5%
  - Average job duration > 5 minutes
- Add to monitoring dashboard (Implement comprehensive resource monitoring and alerting #6)
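A sketch of the first two alert conditions as a PrometheusRule, assuming the Prometheus Operator is in use; the metric names (coder_provisioner_queue_depth, coder_provisioner_jobs_failed_total, coder_provisioner_jobs_total) are hypothetical and must be replaced with whatever the exporter actually publishes:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coder-provisioner-alerts
  namespace: coder
spec:
  groups:
  - name: coder-provisioners
    rules:
    - alert: ProvisionerQueueBacklog
      expr: coder_provisioner_queue_depth > 5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Provisioner queue depth has exceeded 5 jobs for more than 2 minutes
    - alert: ProvisionerHighFailureRate
      expr: sum(rate(coder_provisioner_jobs_failed_total[10m])) / sum(rate(coder_provisioner_jobs_total[10m])) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: More than 5% of provisioner jobs failed over the last 10 minutes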
Success Criteria
- Clear scaling guidelines available for workshop planning
- Manual scaling can be performed in <2 minutes
- (Long-term) Auto-scaling triggers before users experience delays
- Zero workspace timeouts due to provisioner capacity during workshops
- Provisioner resource usage optimized (no OOMKills)
Implementation Notes
HPA Example:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-provisioner-default-hpa
  namespace: coder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-provisioner-default
  minReplicas: 6
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
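Because each provisioner runs only one Terraform job at a time, an aggressive scale-down risks terminating a replica mid-job. HPA v2 supports a behavior block to slow scale-down; a sketch, added under spec: in the manifest above:

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # hold for 10 minutes of low load before shrinking
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120              # remove at most one replica every 2 minutes

Note that the HPA still picks pods to remove without knowing which are idle, so graceful shutdown handling in the provisioner (or a generous terminationGracePeriodSeconds) is worth verifying before enabling scale-down.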
Queue Depth Metric (requires custom implementation):
- The Coder API may expose the provisioner job queue
- Export it as a Prometheus metric
- Use it for HPA scaling decisions (see the sketch below)
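Once the queue depth is exported and surfaced through the Kubernetes custom/external metrics API (e.g. via prometheus-adapter or KEDA), the HPA above could scale on it directly. A sketch of an additional entry under spec.metrics, using a hypothetical metric name:

  - type: External
    external:
      metric:
        name: coder_provisioner_queue_depth   # hypothetical; must match the exported metric
      target:
        type: AverageValue
        averageValue: "1"                     # target ~1 queued job per replica before scaling out

With an AverageValue target, the HPA divides the metric by the current replica count, so the number of provisioners tracks the backlog roughly one-to-one.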
Related
- Sept 30 Workshop Postmortem
- Incident Runbook - High Resource Contention
- Incident Runbook - Provisioner Failures
- #1 (Storage optimization)
- #6 (Monitoring and alerting)