-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
Problem
During the Sept 30 Agentic Workshop, limited node storage capacity for ephemeral volumes caused workspaces to die and restart when ~10 users simultaneously ran workloads. Self-healing mechanisms wiped user progress.
Root Cause
Node storage capacity planning did not account for concurrent workspace workloads competing for ephemeral volume resources.
Requirements
- Audit current ephemeral volume allocation per node
- Calculate storage requirements for target concurrent workspace count
- Implement storage capacity monitoring and alerting
- Define resource limits per workspace to prevent single workspace from exhausting node storage
- Test with realistic concurrent user load (use monthly workshops for validation)
- Document storage capacity planning methodology
Success Criteria
- Support 20+ concurrent workspaces without storage contention
- Alerting triggers before reaching critical storage thresholds
- No workspace restarts due to storage issues during monthly workshops
Related
Sept 30 Workshop Postmortem
Metadata
Metadata
Assignees
Labels
No labels