Add operational improvements: metrics, backups, log rotation (Issue #414, PR #3)#424

Merged: filthyrake merged 2 commits into dev from feature/414-operational-improvements on Dec 27, 2025

Conversation

@filthyrake
Owner

Summary

This is the third and final PR addressing the infrastructure review (Issue #414). It completes the remaining operational improvements.

Prometheus Metrics

  • Add prometheus-client dependency
  • Create comprehensive metrics module (api/metrics.py) with:
    • HTTP request metrics (total, duration)
    • Video metrics (count by status, uploads)
    • Transcoding metrics (jobs, duration, queue size)
    • Worker metrics (count, heartbeats)
    • Re-encode queue metrics
    • Database metrics (connections, retries)
    • Redis metrics (operations, circuit breaker)
  • Add /metrics endpoint to Admin API (port 9001)
  • Add /api/metrics endpoint to Worker API (port 9002)

Automated Database Backups

  • Add k8s/backup-cronjob.yaml for PostgreSQL backups:
    • Daily at 2:00 AM UTC
    • pg_dump with compression
    • 7-day retention with automatic cleanup
    • Non-root security context with seccompProfile
    • Stores backups on NAS storage
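The 7-day retention rule can be illustrated with a short Python sketch. This is not the CronJob's actual cleanup step, which is not shown here and presumably runs in a shell script; the function name `prune_old_backups`, the `*.dump` file pattern, and the use of file modification time as the age signal are assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 7  # matches the 7-day retention described above


def prune_old_backups(
    backup_dir: Path,
    retention_days: int = RETENTION_DAYS,
    pattern: str = "*.dump",  # assumption: backups share one file extension
    now: Optional[float] = None,
) -> List[Path]:
    """Delete backup files older than the retention window; return what was removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds in a day
    removed = []
    for path in sorted(backup_dir.glob(pattern)):
        # Age is judged by modification time (an assumption of this sketch).
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Passing `now` explicitly keeps the function easy to test without waiting for real time to pass.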

Audit Log Rotation

  • Add VLOG_AUDIT_LOG_MAX_BYTES config (default: 10MB)
  • Add VLOG_AUDIT_LOG_BACKUP_COUNT config (default: 5)
  • Replace FileHandler with RotatingFileHandler
  • Prevents unbounded log growth
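A minimal sketch of the rotation setup, assuming the two settings are read directly from environment variables and that "10MB" means 10 * 1024 * 1024 bytes. The helper name `build_audit_handler` is hypothetical; the real wiring lives in the project's config and audit modules and may differ.

```python
import logging
import os
from logging.handlers import RotatingFileHandler

DEFAULT_MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" interpreted as 10 MiB
DEFAULT_BACKUP_COUNT = 5


def build_audit_handler(log_path: str) -> RotatingFileHandler:
    """Create a size-based rotating handler for the audit log (hypothetical helper)."""
    max_bytes = int(os.environ.get("VLOG_AUDIT_LOG_MAX_BYTES", DEFAULT_MAX_BYTES))
    backup_count = int(os.environ.get("VLOG_AUDIT_LOG_BACKUP_COUNT", DEFAULT_BACKUP_COUNT))
    # Once the file would exceed max_bytes it is renamed to <log_path>.1,
    # older files shift to .2, .3, ... and anything beyond backup_count is dropped.
    return RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backup_count)
```

With the defaults, disk usage for the audit log is bounded at roughly six files of 10MB each: the active log plus five rotated backups.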

Issue #414 progress: complete.

Test plan

  • Verify metrics endpoint returns valid Prometheus format: curl http://localhost:9001/metrics
  • Verify worker API metrics: curl http://localhost:9002/api/metrics
  • Validate backup CronJob manifest: kubectl apply -f k8s/backup-cronjob.yaml --dry-run=client
  • Verify audit log rotation config loads correctly
  • CI tests pass
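For the first two checks, the curl output could be piped through a rough format check like the one below. It is a deliberately loose, standard-library-only sketch of the Prometheus text exposition format, not a full parser and not the test added in this PR; the function name `looks_like_prometheus_text` is hypothetical.

```python
import re

METRIC_NAME = r"[a-zA-Z_:][a-zA-Z0-9_:]*"
# "# HELP <name> <text>" and "# TYPE <name> <type>" metadata lines.
COMMENT_RE = re.compile(rf"^# (HELP|TYPE) {METRIC_NAME}( .*)?$")
# "<name>{labels} <value> [timestamp]"; the label block is matched loosely.
SAMPLE_RE = re.compile(rf"^({METRIC_NAME})(\{{.*\}})?\s+(\S+)(\s+-?\d+)?$")


def looks_like_prometheus_text(body: str) -> bool:
    """Return True if every line resembles the text format and at least one sample exists."""
    samples = 0
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Plain comments are allowed; HELP/TYPE lines must be well-formed.
            if line.startswith(("# HELP", "# TYPE")) and not COMMENT_RE.match(line):
                return False
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            return False
        try:
            float(match.group(3))  # accepts NaN, +Inf, -Inf as well
        except ValueError:
            return False
        samples += 1
    return samples > 0
```

A stricter check would parse the label block and validate metric types, which a dedicated Prometheus parser handles better than a regex.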

🤖 Generated with Claude Code

filthyrake and others added 2 commits December 27, 2025 14:38
Add operational improvements: metrics, backups, log rotation (Issue #414, PR #3)

This is the third and final PR addressing the infrastructure review (Issue #414).

## Prometheus Metrics
- Add prometheus-client dependency to pyproject.toml
- Create api/metrics.py with comprehensive metrics:
  - HTTP request metrics (requests total, duration)
  - Video metrics (total videos, uploads)
  - Transcoding metrics (jobs total/active, duration, queue size)
  - Worker metrics (total workers, heartbeats)
  - Re-encode queue metrics
  - Database metrics (connections, retries, query duration)
  - Redis metrics (operations, circuit breaker state)
  - Storage and playback metrics
- Add /metrics endpoint to Admin API (port 9001)
- Add /api/metrics endpoint to Worker API (port 9002)

## Automated Database Backups
- Add k8s/backup-cronjob.yaml for PostgreSQL backups
  - Runs daily at 2:00 AM UTC
  - Uses pg_dump with compression
  - 7-day retention with automatic cleanup
  - Proper security context (non-root, seccompProfile)
  - Mounts NAS storage for backup destination

## Audit Log Rotation
- Add AUDIT_LOG_MAX_BYTES config (default: 10MB)
- Add AUDIT_LOG_BACKUP_COUNT config (default: 5 backups)
- Replace FileHandler with RotatingFileHandler in api/audit.py
- Prevents unbounded log growth

## Issue #414 Checklist Completion
- [x] PR #1: Security scanning (Trivy, pip-audit, .dockerignore, multi-stage builds)
- [x] PR #2: Kubernetes security (pinned images, seccompProfile, NetworkPolicy)
- [x] PR #3: Operational improvements (this PR)
  - [x] Prometheus metrics endpoints
  - [x] Automated database backups
  - [x] Audit log rotation
  - [x] Docker build/push workflow (already existed)
  - [x] Database connection pooling (already existed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fixes critical and important issues identified in code review:

## Backup Script Fixes (Critical)
- Remove double compression (pg_dump custom format already compresses)
- Add `set -eo pipefail` for proper error handling
- Add backup integrity verification with `pg_restore --list`
- Change file extension from .sql.gz to .dump
- Remove corrupted backup file on verification failure
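The verify-then-discard behavior described above can be sketched in Python as follows. The actual fix lives in the CronJob's shell script, which is not reproduced here; the function name `verify_backup` and the injectable `run` parameter are assumptions added to make the sketch testable without PostgreSQL installed.

```python
import subprocess
from pathlib import Path
from typing import Callable


def verify_backup(
    dump_path: Path,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if pg_restore can list the archive; delete the file otherwise."""
    # `pg_restore --list` reads the archive's table of contents without
    # restoring anything, so a non-zero exit suggests a corrupt or truncated dump.
    result = run(
        ["pg_restore", "--list", str(dump_path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    if result.returncode != 0:
        dump_path.unlink(missing_ok=True)  # remove the corrupted backup
        return False
    return True
```

Listing the archive confirms it is readable, which is weaker than a full restore test but cheap enough to run after every backup.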

## Metrics Endpoint Fixes (Important)
- Standardize endpoint paths: both Admin and Worker APIs now use `/metrics`
- Document `/metrics` in AdminAuthMiddleware allowed paths list
- Note: `/metrics` already bypasses auth (not under /api/* path)
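The auth note can be read as the following rule, sketched here under assumptions: that the middleware only enforces authentication for paths under `/api/`, and that the allowed paths list is an explicit exemption set. The names `PROTECTED_PREFIX`, `ALLOWED_PATHS`, and `requires_auth` are hypothetical and do not come from AdminAuthMiddleware itself.

```python
PROTECTED_PREFIX = "/api/"  # assumption: auth is enforced only under this prefix
ALLOWED_PATHS = frozenset({"/metrics"})  # documented exemption; hypothetical container


def requires_auth(path: str) -> bool:
    """Decide whether a request path needs authentication (illustrative only)."""
    if path in ALLOWED_PATHS:
        return False
    return path.startswith(PROTECTED_PREFIX)
```

Under this reading, listing `/metrics` explicitly is documentation rather than a behavior change, since the path never matched the protected prefix. It also shows why the earlier `/api/metrics` path on the Worker API would have behaved differently under the same rule.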

## Test Coverage
- Add tests/test_metrics.py with 12 tests covering:
  - Metrics module functionality
  - Prometheus format validation
  - Metric definitions (counters, gauges, histograms)
  - Audit log rotation configuration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@filthyrake filthyrake merged commit 58a8154 into dev Dec 27, 2025
8 of 10 checks passed
@filthyrake filthyrake deleted the feature/414-operational-improvements branch December 27, 2025 23:20