observability: parameterize worker-status alert log group per env (CS-11107) #4796

Merged: lukemelia merged 1 commit into main from cs-11107-fix-worker-reap-worker-startup-alert-rules-querying-staging on May 13, 2026

@lukemelia
Contributor

Summary

  • The same provisioning/alerting/worker-status-group.json ships to both staging and production Grafana, but had the staging log group (ecs-boxel-worker-staging, account 680542703984) hardcoded into both Worker Reap and Worker Startup rules. Prod Grafana queries via the prod task role hit ResourceNotFoundException on every evaluation — once per minute × 3 retries × 2 rules — flooding the alert pipeline (and SMTP retries on top of that, tracked separately).
  • Replace the hardcoded values with ${WORKER_LOG_GROUP_NAME} and ${WORKER_LOG_GROUP_ACCOUNT_ID}. apply-alerting.sh already supports ${VAR} substitution from the shell environment, and its header explicitly calls out log-group / account-id parameterization as the motivating use case.
  • Per-env values are set in each apply workflow's job-level env: block (staging → ecs-boxel-worker-staging / 680542703984; production → ecs-boxel-worker-production / 120317779495). Hardcoded rather than SSM-sourced: the names follow the deterministic ecs-boxel-worker-<env> convention and the account-id is already implicit in AWS_ROLE_ARN.
  • docker-compose.yml gets the same two vars with staging defaults so the local Grafana's file provisioning parses cleanly (local has no AWS creds, so the rule never actually evaluates).
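
As an illustration of the substitution described in the bullets above, here is a minimal Python sketch. It is not the code in apply-alerting.sh; it only mimics `${VAR}` expansion using the standard-library `string.Template`. The JSON fragment, its field names (`logGroups`, `name`, `accountId`, `arn`), and the `us-east-1` region are assumptions made for illustration. Only the two variable names and the per-env values come from this PR.

```python
import json
from string import Template

# Hypothetical rule fragment: the real worker-status-group.json layout may differ.
# The region (us-east-1) is an assumption; the PR does not state it.
RULE_FRAGMENT = """
{
  "logGroups": [
    {
      "name": "${WORKER_LOG_GROUP_NAME}",
      "accountId": "${WORKER_LOG_GROUP_ACCOUNT_ID}",
      "arn": "arn:aws:logs:us-east-1:${WORKER_LOG_GROUP_ACCOUNT_ID}:log-group:${WORKER_LOG_GROUP_NAME}:*"
    }
  ]
}
"""

# Per-env values as stated in this PR's summary.
ENVIRONMENTS = {
    "staging": {
        "WORKER_LOG_GROUP_NAME": "ecs-boxel-worker-staging",
        "WORKER_LOG_GROUP_ACCOUNT_ID": "680542703984",
    },
    "production": {
        "WORKER_LOG_GROUP_NAME": "ecs-boxel-worker-production",
        "WORKER_LOG_GROUP_ACCOUNT_ID": "120317779495",
    },
}


def render(template_text, env):
    """Expand ${VAR} placeholders and parse the result as JSON.

    Template.substitute raises KeyError when a placeholder has no value,
    so an unset variable fails loudly instead of shipping a literal ${VAR}.
    """
    return json.loads(Template(template_text).substitute(env))


for env_name, env_vars in ENVIRONMENTS.items():
    group = render(RULE_FRAGMENT, env_vars)["logGroups"][0]
    print(env_name, group["name"], group["accountId"], group["arn"])
```

Failing on a missing variable is a deliberate choice in this sketch: a rule file that still contains a literal placeholder would reproduce the same class of lookup failure this PR is fixing.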

Fixes CS-11107.

Test plan

  • ./scripts/lint.sh clean (JSON parses, manifests valid, shellcheck clean)
  • envsubst dry-run with staging values yields name=ecs-boxel-worker-staging, accountId=680542703984, correct ARN
  • envsubst dry-run with production values yields name=ecs-boxel-worker-production, accountId=120317779495, correct ARN
  • YAML env blocks parse cleanly (yq -e)
  • shellcheck on apply.sh + apply-alerting.sh clean
  • Stage-1 (merge): observability-apply-staging.yml runs against staging. Confirm both rules still evaluate cleanly there (they did before; this should be a no-op for staging).
  • Stage-2 (manual dispatch from main): observability-apply-production.yml. Tail prod Grafana logs for ≥1h after apply and verify no ResourceNotFoundException for rule_uid=berxlrqsnzoxse or eerxlrqsghzwgd.
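
The Stage-2 check above could be scripted along these lines. This is a sketch under assumptions, not a tool that exists in the repo: it assumes each relevant log line contains both the string `ResourceNotFoundException` and the rule UID in the form `rule_uid=<uid>`, as the test plan's wording suggests. The function name `find_failures` is hypothetical.

```python
from typing import Iterable, List, Sequence

# Rule UIDs named in the test plan above.
RULE_UIDS = ("berxlrqsnzoxse", "eerxlrqsghzwgd")
ERROR_MARKER = "ResourceNotFoundException"


def find_failures(lines: Iterable[str], rule_uids: Sequence[str] = RULE_UIDS) -> List[str]:
    """Return log lines that report the error for one of the given rule UIDs.

    Assumes the UID appears as `rule_uid=<uid>` on the same line as the
    error; adjust the match if the real log format differs.
    """
    hits = []
    for line in lines:
        if ERROR_MARKER not in line:
            continue
        if any(f"rule_uid={uid}" in line for uid in rule_uids):
            hits.append(line.rstrip("\n"))
    return hits
```

Feeding it the lines captured during the one-hour tail and confirming that it returns an empty list would correspond to the pass condition in the test plan.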

🤖 Generated with Claude Code

The same provisioning/alerting/worker-status-group.json ships to both
staging and production Grafana, but the file had the staging log group
(ecs-boxel-worker-staging, account 680542703984) hardcoded into both
the Worker Reap and Worker Startup rules. Prod Grafana queries with
the prod task role hit ResourceNotFoundException on every evaluation
(once per minute × 3 retries × 2 rules), flooding the alert pipeline.

Replace the hardcoded values with ${WORKER_LOG_GROUP_NAME} and
${WORKER_LOG_GROUP_ACCOUNT_ID} — the existing apply-alerting.sh
already supports ${VAR} substitution from the shell environment, and
its header explicitly calls out log-group / account-id parameterization
as the motivating use case. Per-env values are set in the job-level
env: block of each apply workflow (staging → ecs-boxel-worker-staging /
680542703984; production → ecs-boxel-worker-production / 120317779495).

Hardcoded in the workflow rather than fetched from SSM: the names follow
the deterministic ecs-boxel-worker-<env> convention and the account-id
is already implicit in AWS_ROLE_ARN. docker-compose.yml gets the same
two vars with staging defaults so the local Grafana's file provisioning
parses cleanly (local has no AWS creds, so the rule never evaluates).

Fixes CS-11107.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Observability diff (vs staging)

No dashboard / folder changes detected against the staging Grafana.

(Run: https://github.com/cardstack/boxel/actions/runs/25763899826)

lukemelia marked this pull request as ready for review on May 12, 2026, 21:42
lukemelia requested review from a team and backspace on May 12, 2026, 22:01
@github-actions
Contributor

github-actions Bot commented May 12, 2026

Host Test Results

Files: 1, suites: 1, duration: 1h 43m 41s
Tests: 2 658 total, 2 643 passed, 15 skipped, 0 failed
Runs: 2 677 total, 2 662 passed, 15 skipped, 0 failed

Results for commit af590ca.

Realm Server Test Results

Files: 1, suites: 1, duration: 11m 36s
Tests: 1 334 total, 1 334 passed, 0 skipped, 0 failed
Runs: 1 413 total, 1 413 passed, 0 skipped, 0 failed

Results for commit af590ca.

lukemelia merged commit abc476a into main on May 13, 2026
79 of 80 checks passed
