Skip to content

Feature: Add additional Prometheus metrics and Grafana dashboard (#207)#496

Merged
filthyrake merged 2 commits intodevfrom
feature/207-additional-metrics
Jan 3, 2026
Merged

Feature: Add additional Prometheus metrics and Grafana dashboard (#207)#496
filthyrake merged 2 commits intodevfrom
feature/207-additional-metrics

Conversation

@filthyrake
Copy link
Copy Markdown
Owner

Summary

  • Add 5 new Prometheus metrics for enhanced observability:
    • HTTP_REQUESTS_IN_PROGRESS gauge (low-cardinality by API name)
    • VIDEOS_WATCH_TIME_SECONDS_TOTAL counter
    • WORKER_JOBS_COMPLETED_TOTAL counter (by worker_name)
    • WORKER_HEARTBEAT_AGE_SECONDS gauge (by worker_name)
    • STORAGE_VIDEOS_BYTES gauge with periodic reconciliation
  • Implement pure ASGI HTTPMetricsMiddleware for 6x better performance than BaseHTTPMiddleware
  • Add endpoint path normalization to prevent label cardinality explosion
  • Instrument existing but unused HTTP and transcoding metrics
  • Add Grafana dashboard JSON with panels for API, transcoding, workers, storage, and playback

Technical Details

  • Low Cardinality: Uses api label (admin/worker/public) instead of full endpoint paths
  • Worker Labels: Uses human-readable worker_name instead of UUID for lower cardinality
  • Background Updates: Heartbeat ages updated every 30s by background task (no DB query on /metrics endpoint)
  • Storage Reconciliation: Incremental tracking with filesystem scan every 6 hours to correct drift

Test plan

  • All 35 metrics tests pass
  • Lint check passes (ruff)
  • Verify middleware registers correctly on app startup
  • Test Grafana dashboard import works
  • Verify metrics appear in Prometheus after deployment

Closes #207

🤖 Generated with Claude Code

filthyrake and others added 2 commits January 3, 2026 14:27
Add 5 new Prometheus metrics for enhanced observability:
- HTTP_REQUESTS_IN_PROGRESS gauge (low-cardinality by API name)
- VIDEOS_WATCH_TIME_SECONDS_TOTAL counter
- WORKER_JOBS_COMPLETED_TOTAL counter (by worker_name)
- WORKER_HEARTBEAT_AGE_SECONDS gauge (by worker_name)
- STORAGE_VIDEOS_BYTES gauge with periodic reconciliation

Implementation highlights:
- Pure ASGI HTTPMetricsMiddleware for 6x better performance
- Endpoint path normalization to prevent cardinality explosion
- Background task updates heartbeat ages every 30s (no DB query on /metrics)
- Storage reconciliation scans filesystem every 6 hours
- Instrument existing but unused HTTP and transcoding metrics

Also includes:
- Grafana dashboard JSON with panels for API, transcoding, workers, storage
- Tests for all new metrics and middleware

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Critical fixes:
- Add `api` label to HTTP_REQUESTS_TOTAL for Grafana dashboard compatibility
- Fix storage reconciliation with timeout, symlink protection, partial failure handling
- Add database retry logic (fetch_all_with_retry) to background task
- Fix storage metric for overwritten segments (track net change)

High priority fixes:
- Add LRU cache to normalize_endpoint() for 95%+ allocation reduction
- Replace _metrics.clear() with selective label removal to avoid race conditions
- Add worker name label sanitization to prevent label injection
- Add background task health metrics (errors, last_success, duration)

Medium priority improvements:
- Improve normalize_endpoint with UUID and slug pattern detection
- Make reconciliation interval configurable via VLOG_STORAGE_RECONCILIATION_INTERVAL
- Add VLOG_STORAGE_SCAN_TIMEOUT and VLOG_STORAGE_SCAN_MAX_FILES configs
- Add comprehensive tests for new features

New metrics added:
- BACKGROUND_TASK_ERRORS_TOTAL
- BACKGROUND_TASK_LAST_SUCCESS
- BACKGROUND_TASK_DURATION_SECONDS
- STORAGE_RECONCILIATION_STATUS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@filthyrake filthyrake merged commit 4efbc6a into dev Jan 3, 2026
5 checks passed
@filthyrake filthyrake deleted the feature/207-additional-metrics branch January 3, 2026 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prometheus metrics endpoint

1 participant