-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
Problem
The Sept 30 workshop had no proactive alerting for resource contention, storage capacity, or other critical thresholds. Issues were only discovered when users were already impacted.
Requirements
Metrics to Monitor
- Ephemeral volume storage capacity per node (alert at 70%, 85%, 95%)
- Concurrent workspace count
- Workspace restart/failure rate
- Image pull times across all clusters
- LiteLLM authentication key expiration (alert at 7 days, 3 days, 1 day)
- Subdomain routing success rate by region
- Node resource utilization (CPU, memory, disk I/O)
Alerting
- Configure alert destinations (Slack, PagerDuty, email)
- Define escalation paths for different severity levels
- Create alert runbook for each metric
- Test alerting during pre-workshop validation
Dashboards
- Real-time workshop dashboard showing key metrics
- Historical trend analysis for month-over-month comparison
- Per-region cluster health dashboard
Success Criteria
- All critical metrics have alerting configured
- Alerts fire before user-visible impact
- Dashboard visible during workshops for real-time monitoring
- Zero incidents caused by metrics that could have been alerted on
Related
Sept 30 Workshop Postmortem
#1 (Ephemeral volume storage)
#2 (Image management)
#3 (LiteLLM key management)
Metadata
Metadata
Assignees
Labels
No labels