Create pre-workshop validation checklist and runbook #4

@blink-so

Description

Problem

The Sept 30 Agentic Workshop encountered multiple preventable issues that could have been caught with pre-event validation.

Requirements

Pre-Workshop Checklist

  • Verify LiteLLM authentication keys are valid and have more than 7 days until expiration
  • Validate image consistency across all clusters (control plane, Oregon, London)
  • Check ephemeral volume storage capacity on all nodes
  • Test subdomain routing across all regions
  • Verify resource limits and quotas are configured correctly
  • Run smoke test: deploy test workspace in each region and validate full lifecycle
  • Confirm monitoring and alerting are operational
  • Document expected concurrent user count for capacity planning
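
Two of the checks above (key expiration and image consistency) reduce to simple comparisons once the raw data has been collected. The following is a minimal sketch under stated assumptions, not the team's actual tooling: the function names, the cluster labels, and the input shapes are hypothetical, and how the expiry timestamp and image references are fetched from LiteLLM or the clusters is left out entirely.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the checklist: more than 7 days until expiration.
MIN_DAYS_REMAINING = 7


def key_has_enough_runway(expires_at: datetime, now: datetime,
                          min_days: int = MIN_DAYS_REMAINING) -> bool:
    """Return True if the key expires more than `min_days` after `now`."""
    return expires_at - now > timedelta(days=min_days)


def find_image_drift(images_by_cluster: dict[str, str]) -> dict[str, list[str]]:
    """Group clusters by the image reference they run.

    Returns an empty dict when every cluster runs the same image,
    otherwise a mapping of image reference -> clusters running it.
    """
    groups: dict[str, list[str]] = {}
    for cluster, image in images_by_cluster.items():
        groups.setdefault(image, []).append(cluster)
    return groups if len(groups) > 1 else {}


if __name__ == "__main__":
    # Illustrative values only; real data would come from the key store
    # and from each cluster's deployment spec.
    now = datetime.now(timezone.utc)
    print("key ok:", key_has_enough_runway(now + timedelta(days=30), now))
    print("drift:", find_image_drift({
        "control-plane": "workspace:v2",  # hypothetical image tags
        "oregon": "workspace:v2",
        "london": "workspace:v1",
    }))
```

Keeping the comparison logic separate from data collection means the same functions can be exercised with fixed inputs before the event and reused in the incident runbook.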

Incident Runbook

  • Document steps to diagnose workspace restart issues
  • Document steps to verify image consistency across clusters
  • Document emergency key rotation procedure for LiteLLM
  • Document how to identify and resolve resource contention
  • Include contact information and escalation paths

Success Criteria

  • Pre-workshop checklist can be completed in under 30 minutes
  • Checklist catches the issues seen in the Sept 30 workshop
  • Runbook enables rapid diagnosis and resolution during incidents
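
One way to make the 30-minute criterion measurable is to run the checklist items through a small harness that times each one. This is a hedged sketch only: `run_checklist`, `summarize`, and `CheckResult` are hypothetical names, and each check is assumed to be a zero-argument callable that returns a truthy value on success.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Budget taken from the success criteria: under 30 minutes in total.
TIME_BUDGET_SECONDS = 30 * 60


@dataclass
class CheckResult:
    name: str
    passed: bool
    seconds: float


def run_checklist(checks: dict[str, Callable[[], bool]],
                  clock: Callable[[], float] = time.monotonic) -> list[CheckResult]:
    """Run each check in order, recording pass/fail and elapsed time.

    A check that raises is recorded as failed so one broken check
    does not prevent the remaining checks from running.
    """
    results = []
    for name, check in checks.items():
        start = clock()
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append(CheckResult(name, passed, clock() - start))
    return results


def summarize(results: list[CheckResult],
              budget: float = TIME_BUDGET_SECONDS) -> dict:
    """Report failed checks and whether the total time fits the budget."""
    total = sum(r.seconds for r in results)
    return {
        "failed": [r.name for r in results if not r.passed],
        "total_seconds": total,
        "within_budget": total < budget,
    }
```

The `clock` parameter exists so the harness can be tested with a fake clock; in normal use the default `time.monotonic` is sufficient.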

Related

Sept 30 Workshop Postmortem
