Open
Description
Problem
The Sept 30 Agentic Workshop hit multiple preventable issues that pre-event validation could have caught.
Requirements
Pre-Workshop Checklist
- Verify LiteLLM authentication keys are valid and have more than 7 days until expiration
- Validate image consistency across all clusters (control plane, Oregon, London)
- Check ephemeral volume storage capacity on all nodes
- Test subdomain routing across all regions
- Verify resource limits and quotas are configured correctly
- Run smoke test: deploy test workspace in each region and validate full lifecycle
- Confirm monitoring and alerting are operational
- Document expected concurrent user count for capacity planning
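Two of the checks above (key expiration margin and image consistency) lend themselves to automation. The following is a minimal sketch only, not an existing tool: the function names, the input shapes, and the sample data are hypothetical, and it assumes the key expiration timestamps and the per-cluster image lists have already been collected by whatever means the deployment provides.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the checklist: keys need more than 7 days of validity.
MIN_KEY_VALIDITY = timedelta(days=7)


def key_has_enough_validity(expires_at, now=None, minimum=MIN_KEY_VALIDITY):
    """Return True if the key expires more than `minimum` after `now`."""
    now = now or datetime.now(timezone.utc)
    return expires_at - now > minimum


def find_image_mismatches(images_by_cluster):
    """Return {component: {cluster: image}} for every component whose image
    differs between clusters or is missing from at least one cluster."""
    components = set()
    for images in images_by_cluster.values():
        components.update(images)

    mismatches = {}
    for component in sorted(components):
        seen = {
            cluster: images.get(component)  # None marks a missing component
            for cluster, images in images_by_cluster.items()
        }
        if len(set(seen.values())) > 1:
            mismatches[component] = seen
    return mismatches


if __name__ == "__main__":
    # Hypothetical sample data; real values would come from the clusters.
    sample = {
        "control-plane": {"workspace": "example/workspace:1.4.2"},
        "oregon": {"workspace": "example/workspace:1.4.2"},
        "london": {"workspace": "example/workspace:1.4.1"},
    }
    for component, per_cluster in find_image_mismatches(sample).items():
        print(f"MISMATCH {component}: {per_cluster}")
```

Keeping the checks as pure functions over already-collected data means they can be unit tested without cluster access, and the collection step can be swapped out per environment.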
Incident Runbook
- Document steps to diagnose workspace restart issues
- Document steps to verify image consistency across clusters
- Document emergency key rotation procedure for LiteLLM
- Document how to identify and resolve resource contention
- Include contact information and escalation paths
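For the resource contention item, the runbook could include a small helper that flags nodes whose requested resources are close to what the node can allocate. This is a sketch under stated assumptions: the node dictionaries, the resource names, and the 90% threshold are all hypothetical, and the numbers would need to be gathered from the clusters by other means.

```python
def find_contended_nodes(nodes, threshold=0.9):
    """Return (node_name, resource, utilization) tuples for every resource
    whose requested/allocatable ratio is at or above `threshold`,
    sorted from most to least contended."""
    flagged = []
    for node in nodes:
        for resource, allocatable in node["allocatable"].items():
            if allocatable <= 0:
                continue  # skip resources the node does not actually offer
            requested = node["requested"].get(resource, 0)
            utilization = requested / allocatable
            if utilization >= threshold:
                flagged.append((node["name"], resource, round(utilization, 2)))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample: CPU in millicores, memory in MiB.
    sample_nodes = [
        {
            "name": "node-a",
            "allocatable": {"cpu_millicores": 8000, "memory_mib": 32768},
            "requested": {"cpu_millicores": 7600, "memory_mib": 16384},
        },
        {
            "name": "node-b",
            "allocatable": {"cpu_millicores": 8000, "memory_mib": 32768},
            "requested": {"cpu_millicores": 2000, "memory_mib": 8192},
        },
    ]
    for name, resource, utilization in find_contended_nodes(sample_nodes):
        print(f"{name}: {resource} at {utilization:.0%} of allocatable")
```

The threshold is a parameter so the runbook can tighten it ahead of a workshop with a high expected concurrent user count.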
Success Criteria
- Pre-workshop checklist can be completed in under 30 minutes
- Checklist catches the issues seen in the Sept 30 workshop
- Runbook enables rapid diagnosis and resolution during incidents
Related
Sept 30 Workshop Postmortem