Skip to content

Implement comprehensive resource monitoring and alerting #6

@blink-so

Description

@blink-so

Problem

The Sept 30 workshop had no proactive alerting for resource contention, storage capacity, or other critical thresholds. Issues were only discovered when users were already impacted.

Requirements

Metrics to Monitor

  • Ephemeral volume storage capacity per node (alert at 70%, 85%, 95%)
  • Concurrent workspace count
  • Workspace restart/failure rate
  • Image pull times across all clusters
  • LiteLLM authentication key expiration (alert at 7 days, 3 days, 1 day)
  • Subdomain routing success rate by region
  • Node resource utilization (CPU, memory, disk I/O)

Alerting

  • Configure alert destinations (Slack, PagerDuty, email)
  • Define escalation paths for different severity levels
  • Create alert runbook for each metric
  • Test alerting during pre-workshop validation

Dashboards

  • Real-time workshop dashboard showing key metrics
  • Historical trend analysis for month-over-month comparison
  • Per-region cluster health dashboard

Success Criteria

  • All critical metrics have alerting configured
  • Alerts fire before user-visible impact
  • Dashboard visible during workshops for real-time monitoring
  • Zero incidents caused by metrics that could have been alerted on

Related

Sept 30 Workshop Postmortem
#1 (Ephemeral volume storage)
#2 (Image management)
#3 (LiteLLM key management)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions