observability: parameterize worker-status alert log group per env (CS-11107) by lukemelia · Pull Request #4796 · cardstack/boxel

lukemelia · 2026-05-12T21:41:36Z

Summary

The same provisioning/alerting/worker-status-group.json ships to both staging and production Grafana, but had the staging log group (ecs-boxel-worker-staging, account 680542703984) hardcoded into both Worker Reap and Worker Startup rules. Prod Grafana queries via the prod task role hit ResourceNotFoundException on every evaluation — once per minute × 3 retries × 2 rules — flooding the alert pipeline (and SMTP retries on top of that, tracked separately).
Replace the hardcoded values with ${WORKER_LOG_GROUP_NAME} and ${WORKER_LOG_GROUP_ACCOUNT_ID}. apply-alerting.sh already supports ${VAR} substitution from the shell environment, and its header explicitly calls out log-group / account-id parameterization as the motivating use case.
Per-env values are set in each apply workflow's job-level env: block (staging → ecs-boxel-worker-staging / 680542703984; production → ecs-boxel-worker-production / 120317779495). Hardcoded rather than SSM-sourced: the names follow the deterministic ecs-boxel-worker-<env> convention and the account-id is already implicit in AWS_ROLE_ARN.
docker-compose.yml gets the same two vars with staging defaults so the local Grafana's file provisioning parses cleanly (local has no AWS creds, so the rule never actually evaluates).
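The substitution described above can be sketched as follows. This is a minimal illustration, not the contents of apply-alerting.sh: the JSON fragment, the `render` helper, and the `us-east-1` region are assumptions made for the example, and `sed` stands in for whatever `${VAR}` expansion the real script performs so the sketch needs only POSIX tools.

```shell
#!/bin/sh
set -eu

# Hypothetical fragment of a rule's log-group reference. The real
# worker-status-group.json is not shown here; field layout and region are assumed.
template='{"logGroups":[{"name":"${WORKER_LOG_GROUP_NAME}","accountId":"${WORKER_LOG_GROUP_ACCOUNT_ID}","arn":"arn:aws:logs:us-east-1:${WORKER_LOG_GROUP_ACCOUNT_ID}:log-group:${WORKER_LOG_GROUP_NAME}:*"}]}'

# render NAME ACCOUNT_ID: replace both placeholders in the template.
# Aborts if either argument is missing or empty.
render() {
  name="${1:?log group name required}"
  account="${2:?account id required}"
  printf '%s\n' "$template" |
    sed -e "s|\${WORKER_LOG_GROUP_NAME}|${name}|g" \
        -e "s|\${WORKER_LOG_GROUP_ACCOUNT_ID}|${account}|g"
}

# Per-env values as listed in the PR description.
staging_json="$(render ecs-boxel-worker-staging 680542703984)"
production_json="$(render ecs-boxel-worker-production 120317779495)"

printf '%s\n%s\n' "$staging_json" "$production_json"
```

Rendering both environments from one template makes it easy to confirm that no staging value leaks into the production output before anything is applied.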

Test plan

./scripts/lint.sh clean (JSON parses, manifests valid, shellcheck clean)
envsubst dry-run with staging values yields name=ecs-boxel-worker-staging, accountId=680542703984, correct ARN
envsubst dry-run with production values yields name=ecs-boxel-worker-production, accountId=120317779495, correct ARN
YAML env blocks parse cleanly (yq -e)
shellcheck on apply.sh + apply-alerting.sh clean
Stage-1 (merge): observability-apply-staging.yml runs against staging. Confirm both rules still evaluate cleanly there (they did before; this should be a no-op for staging).
Stage-2 (manual dispatch from main): observability-apply-production.yml. Tail prod Grafana logs for ≥1h after apply and verify no ResourceNotFoundException for rule_uid=berxlrqsnzoxse or eerxlrqsghzwgd.
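The Stage-2 log check could be scripted along these lines. This is a sketch under assumptions: the log line format and the `count_rule_errors` helper are hypothetical, and it presumes the prod Grafana logs have already been saved to a file by some other means. Only the exception name and the two rule UIDs come from the test plan.

```shell
#!/bin/sh
set -eu

# count_rule_errors FILE: print how many lines mention ResourceNotFoundException
# for either of the two worker-status rule UIDs.
count_rule_errors() {
  grep -F 'ResourceNotFoundException' "$1" |
    grep -cE 'rule_uid=(berxlrqsnzoxse|eerxlrqsghzwgd)' || true
}

# Hypothetical sample log; real Grafana log lines may be shaped differently.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
level=error msg="query failed" rule_uid=berxlrqsnzoxse error="ResourceNotFoundException: log group not found"
level=info msg="rule evaluated" rule_uid=eerxlrqsghzwgd state=Normal
level=error msg="query failed" rule_uid=someotherrule error="ResourceNotFoundException: log group not found"
EOF

errors="$(count_rule_errors "$sample_log")"
echo "matching errors: $errors"
rm -f "$sample_log"
```

For the verification in the test plan, the count over the hour after the production apply should be 0.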

🤖 Generated with Claude Code

The same provisioning/alerting/worker-status-group.json ships to both staging and production Grafana, but the file had the staging log group (ecs-boxel-worker-staging, account 680542703984) hardcoded into both the Worker Reap and Worker Startup rules. Prod Grafana queries with the prod task role hit ResourceNotFoundException on every evaluation (once per minute × 3 retries × 2 rules), flooding the alert pipeline.

Replace the hardcoded values with ${WORKER_LOG_GROUP_NAME} and ${WORKER_LOG_GROUP_ACCOUNT_ID}. The existing apply-alerting.sh already supports ${VAR} substitution from the shell environment, and its header explicitly calls out log-group / account-id parameterization as the motivating use case.

Per-env values are set in the job-level env: block of each apply workflow (staging → ecs-boxel-worker-staging / 680542703984; production → ecs-boxel-worker-production / 120317779495). Hardcoded in the workflow rather than fetched from SSM: the names follow the deterministic ecs-boxel-worker-<env> convention and the account-id is already implicit in AWS_ROLE_ARN.

docker-compose.yml gets the same two vars with staging defaults so the local Grafana's file provisioning parses cleanly (local has no AWS creds, so the rule never evaluates).

Fixes CS-11107.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
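Because the values are hardcoded on the grounds that the account-id is already implicit in AWS_ROLE_ARN, a small consistency check could guard against the two drifting apart. The sketch below is illustrative only: the role name in the ARN and the `account_from_arn` helper are hypothetical, and nothing in this PR indicates that such a check exists in the workflows. It relies on the standard IAM ARN layout, where the account ID is the fifth colon-separated field.

```shell
#!/bin/sh
set -eu

# account_from_arn ARN: print the account-id field of an ARN
# (arn:partition:service:region:account-id:resource).
account_from_arn() {
  printf '%s\n' "$1" | cut -d: -f5
}

# Hypothetical values; the real AWS_ROLE_ARN is not shown in this PR.
AWS_ROLE_ARN='arn:aws:iam::120317779495:role/example-observability-apply'
WORKER_LOG_GROUP_ACCOUNT_ID='120317779495'

role_account="$(account_from_arn "$AWS_ROLE_ARN")"
if [ "$role_account" = "$WORKER_LOG_GROUP_ACCOUNT_ID" ]; then
  check_result="match"
else
  check_result="mismatch"
fi
echo "account check: $check_result"
```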

github-actions · 2026-05-12T21:42:11Z

Observability diff (vs staging)

No dashboard / folder changes detected against the staging Grafana.

(Run: https://github.com/cardstack/boxel/actions/runs/25763899826)

github-actions · 2026-05-12T22:03:13Z

Host Test Results

1 files 1 suites 1h 43m 41s ⏱️
2 658 tests 2 643 ✅ 15 💤 0 ❌
2 677 runs 2 662 ✅ 15 💤 0 ❌

Results for commit af590ca.

Realm Server Test Results

1 files 1 suites 11m 36s ⏱️
1 334 tests 1 334 ✅ 0 💤 0 ❌
1 413 runs 1 413 ✅ 0 💤 0 ❌

Results for commit af590ca.

lukemelia marked this pull request as ready for review May 12, 2026 21:42

lukemelia requested review from a team and backspace May 12, 2026 22:01

habdelra approved these changes May 12, 2026

View reviewed changes

lukemelia merged commit abc476a into main May 13, 2026
79 of 80 checks passed

lukemelia merged 1 commit into main from cs-11107-fix-worker-reap-worker-startup-alert-rules-querying-staging
