Problem
During the Sept 30 workshop, provisioner capacity (6 replicas for the default org) became a bottleneck when ~10 users deployed workspaces simultaneously. Provisioners are critical for workspace create/delete/update operations, and insufficient capacity causes timeouts and a poor user experience.
Context
Current State:
- Default org: 6 replicas @ 500m CPU / 512 MB memory each
- Experimental org: 2 replicas
- Demo org: 2 replicas
- Manual scaling required before workshops
Current Limitations:
- Terraform runs are single-threaded (1 provisioner = 1 concurrent operation)
- Each workspace create/delete/update occupies 1 provisioner
- No auto-scaling based on queue depth
- No clear guidelines on when to scale
Requirements
Capacity Planning Guidelines
- Document scaling recommendations:
  - <10 concurrent users: 6 replicas (current default)
  - 10-15 concurrent users: 8 replicas
  - 15-20 concurrent users: 10 replicas
  - 20-30 concurrent users: 12-15 replicas
- Add to pre-workshop checklist (Create pre-workshop validation checklist and runbook #4)
- Add to workshop planning guide
Manual Scaling Procedures
- Document commands for scaling each org's provisioners (see the verification sketch after this list):
  kubectl scale deployment coder-provisioner-default -n coder --replicas=10
  kubectl scale deployment coder-provisioner-experimental -n coder --replicas=4
  kubectl scale deployment coder-provisioner-demo -n coder --replicas=4
- Add to incident runbook (Create pre-workshop validation checklist and runbook #4) ✅ (already added)
- Create pre-workshop scaling checklist item
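For the checklist item, scaling plus a quick readiness check could look like the following sketch (deployment names are taken from the commands above; adjust per environment):

# Scale the default org's provisioners ahead of a workshop
kubectl scale deployment coder-provisioner-default -n coder --replicas=10

# Block until the new replicas are ready; fails if the rollout stalls
kubectl rollout status deployment/coder-provisioner-default -n coder --timeout=120s

# Confirm desired vs. ready counts for all provisioner deployments
kubectl get deployments -n coder | grep provisioner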
Auto-Scaling Implementation (Long-term)
- Investigate Horizontal Pod Autoscaler (HPA) for provisioners
- Define custom metrics for provisioner queue depth
- Implement HPA based on:
  - Provisioner queue depth
  - CPU/memory utilization
  - Active Terraform jobs
- Test auto-scaling behavior under load
- Document auto-scaling configuration in Terraform
Resource Limit Optimization
- Evaluate if 500m CPU / 512 MB is sufficient
- Monitor for OOMKilled or CPU throttling events
- Consider increasing to 1 CPU / 1 GB if needed
- Document the resource limit adjustment procedure (a kubectl sketch follows this list)
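A sketch of how the check and adjustment could look with plain kubectl (deployment name taken from the scaling commands above; if the deployments are managed by Helm or Terraform, persist the change there instead, since a direct edit will be reverted on the next apply):

# Look for OOMKills and restarts on the provisioner pods
kubectl get pods -n coder | grep provisioner
kubectl describe pods -n coder | grep -iE "oomkilled|restart count"

# Raise requests/limits in place to the proposed 1 CPU / 1 Gi
kubectl set resources deployment coder-provisioner-default -n coder \
  --requests=cpu=1,memory=1Gi --limits=cpu=1,memory=1Gi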
Monitoring & Alerting
- Add metrics for:
  - Provisioner queue depth
  - Provisioner job duration (p50, p95, p99)
  - Provisioner failure rate
  - Number of active provisioner replicas
- Alert when (a sample rule sketch follows this list):
  - Queue depth > 5 jobs for >2 minutes
  - Provisioner failure rate > 5%
  - Average job duration > 5 minutes
- Add to monitoring dashboard (Implement comprehensive resource monitoring and alerting #6)
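A sketch of the first two alert conditions as a PrometheusRule, assuming the Prometheus Operator is in use; the metric names (coder_provisioner_queue_depth, coder_provisioner_jobs_failed_total, coder_provisioner_jobs_total) are hypothetical and must be replaced with whatever the exporter actually publishes:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coder-provisioner-alerts
  namespace: coder
spec:
  groups:
  - name: coder-provisioners
    rules:
    - alert: ProvisionerQueueBacklog
      expr: coder_provisioner_queue_depth > 5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Provisioner queue depth has exceeded 5 jobs for more than 2 minutes
    - alert: ProvisionerHighFailureRate
      expr: sum(rate(coder_provisioner_jobs_failed_total[10m])) / sum(rate(coder_provisioner_jobs_total[10m])) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: More than 5% of provisioner jobs failed over the last 10 minutes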
Success Criteria
- Clear scaling guidelines available for workshop planning
- Manual scaling can be performed in <2 minutes
- (Long-term) Auto-scaling triggers before users experience delays
- Zero workspace timeouts due to provisioner capacity during workshops
- Provisioner resource usage optimized (no OOMKills)
Implementation Notes
HPA Example:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coder-provisioner-default-hpa
  namespace: coder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coder-provisioner-default
  minReplicas: 6
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
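Because each provisioner runs only one Terraform job at a time, an aggressive scale-down risks terminating a replica mid-job. HPA v2 supports a behavior block to slow scale-down; a sketch, added under spec: in the manifest above:

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # hold for 10 minutes of low load before shrinking
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120              # remove at most one replica every 2 minutes

Note that the HPA still picks pods to remove without knowing which are idle, so graceful shutdown handling in the provisioner (or a generous terminationGracePeriodSeconds) is worth verifying before enabling scale-down.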
Queue Depth Metric (requires custom implementation):
- The Coder API may expose the provisioner job queue
- Export it as a Prometheus metric
- Use it for HPA scaling decisions (see the sketch below)
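Once the queue depth is exported and surfaced through the Kubernetes custom/external metrics API (e.g. via prometheus-adapter or KEDA), the HPA above could scale on it directly. A sketch of an additional entry under spec.metrics, using a hypothetical metric name:

  - type: External
    external:
      metric:
        name: coder_provisioner_queue_depth   # hypothetical; must match the exported metric
      target:
        type: AverageValue
        averageValue: "1"                     # target ~1 queued job per replica before scaling out

With an AverageValue target, the HPA divides the metric by the current replica count, so the number of provisioners tracks the backlog roughly one-to-one.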
Related
- Sept 30 Workshop Postmortem
- Incident Runbook - High Resource Contention
- Incident Runbook - Provisioner Failures
- #1 (Storage optimization)
- #6 (Monitoring and alerting)