observability: parameterize worker-status alert log group per env (CS-11107) #4796
Merged
lukemelia merged 1 commit on May 13, 2026
The same provisioning/alerting/worker-status-group.json ships to both
staging and production Grafana, but the file had the staging log group
(ecs-boxel-worker-staging, account 680542703984) hardcoded into both
the Worker Reap and Worker Startup rules. Prod Grafana queries with
the prod task role hit ResourceNotFoundException on every evaluation
(once per minute × 3 retries × 2 rules), flooding the alert pipeline.
Replace the hardcoded values with ${WORKER_LOG_GROUP_NAME} and
${WORKER_LOG_GROUP_ACCOUNT_ID} — the existing apply-alerting.sh
already supports ${VAR} substitution from the shell environment, and
its header explicitly calls out log-group / account-id parameterization
as the motivating use case. Per-env values are set in the job-level
env: block of each apply workflow (staging → ecs-boxel-worker-staging /
680542703984; production → ecs-boxel-worker-production / 120317779495).
Hardcoded in the workflow rather than fetched from SSM: the names follow
the deterministic ecs-boxel-worker-<env> convention and the account-id
is already implicit in AWS_ROLE_ARN. docker-compose.yml gets the same
two vars with staging defaults so the local Grafana's file provisioning
parses cleanly (local has no AWS creds, so the rule never evaluates).
Fixes CS-11107.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observability diff (vs staging): No dashboard / folder changes detected against the staging Grafana. (Run: https://github.com/cardstack/boxel/actions/runs/25763899826)
habdelra approved these changes on May 12, 2026
Summary
- `provisioning/alerting/worker-status-group.json` ships to both staging and production Grafana, but had the staging log group (`ecs-boxel-worker-staging`, account `680542703984`) hardcoded into both the `Worker Reap` and `Worker Startup` rules. Prod Grafana queries via the prod task role hit `ResourceNotFoundException` on every evaluation (once per minute × 3 retries × 2 rules), flooding the alert pipeline (and SMTP retries on top of that, tracked separately).
- Replace the hardcoded values with `${WORKER_LOG_GROUP_NAME}` and `${WORKER_LOG_GROUP_ACCOUNT_ID}`. `apply-alerting.sh` already supports `${VAR}` substitution from the shell environment, and its header explicitly calls out log-group / account-id parameterization as the motivating use case.
- Per-env values are set in the job-level `env:` block of each apply workflow (staging → `ecs-boxel-worker-staging` / `680542703984`; production → `ecs-boxel-worker-production` / `120317779495`). Hardcoded rather than SSM-sourced: the names follow the deterministic `ecs-boxel-worker-<env>` convention and the account-id is already implicit in `AWS_ROLE_ARN`.
- `docker-compose.yml` gets the same two vars with staging defaults so the local Grafana's file provisioning parses cleanly (local has no AWS creds, so the rule never actually evaluates).

Fixes CS-11107.
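For illustration only, the per-env substitution described above can be sketched in plain bash. The JSON fragment, the `us-east-1` region in the ARN, and the `render_for_env` helper are assumptions made for this example; the real rule file layout and the actual substitution logic in `apply-alerting.sh` are not shown in this PR description and may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for one logGroups entry in worker-status-group.json.
# The field layout and the us-east-1 region are assumptions, not the real file.
template='{"name":"${WORKER_LOG_GROUP_NAME}","accountId":"${WORKER_LOG_GROUP_ACCOUNT_ID}","arn":"arn:aws:logs:us-east-1:${WORKER_LOG_GROUP_ACCOUNT_ID}:log-group:${WORKER_LOG_GROUP_NAME}:*"}'

# render_for_env ENV
# Prints the fragment with the per-env values listed in this PR.
# (Hypothetical helper; it mimics ${VAR} substitution with bash string
# replacement instead of calling apply-alerting.sh.)
render_for_env() {
  local env_name="$1" account_id
  case "$env_name" in
    staging)    account_id=680542703984 ;;
    production) account_id=120317779495 ;;
    *) echo "unknown environment: $env_name" >&2; return 1 ;;
  esac

  # Log group names follow the ecs-boxel-worker-<env> convention.
  local group_name="ecs-boxel-worker-${env_name}"

  # Literal placeholder strings; quoting them in the pattern below
  # disables glob interpretation.
  local name_ph='${WORKER_LOG_GROUP_NAME}'
  local acct_ph='${WORKER_LOG_GROUP_ACCOUNT_ID}'

  local out=$template
  out=${out//"$name_ph"/$group_name}
  out=${out//"$acct_ph"/$account_id}
  printf '%s\n' "$out"
}

render_for_env staging
render_for_env production
```

Keeping the name derivation and the account lookup in one place mirrors the reasoning for hardcoding the values: both are deterministic per environment, so no SSM lookup is needed to render the file.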
Test plan
- `./scripts/lint.sh` clean (JSON parses, manifests valid, shellcheck clean)
- `envsubst` dry-run with staging values yields `name=ecs-boxel-worker-staging`, `accountId=680542703984`, correct ARN
- `envsubst` dry-run with production values yields `name=ecs-boxel-worker-production`, `accountId=120317779495`, correct ARN
- […] `yq -e`)
- `shellcheck` on `apply.sh` + `apply-alerting.sh` clean
- `observability-apply-staging.yml` runs against staging. Confirm both rules still evaluate cleanly there (they did before; this should be a no-op for staging).
- […] `main`): `observability-apply-production.yml`. Tail prod Grafana logs for ≥1h after apply and verify no `ResourceNotFoundException` for `rule_uid=berxlrqsnzoxse` or `eerxlrqsghzwgd`.

🤖 Generated with Claude Code
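As a rough sketch of how the dry-run checks in this test plan could be automated, the function below inspects already-rendered output, rejects any leftover `${VAR}` placeholder, and confirms the expected log group name and account ID are present. The `check_rendered` helper and the sample JSON strings are hypothetical; they are not part of `lint.sh` or `apply-alerting.sh`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_rendered RENDERED NAME ACCOUNT_ID
# Hypothetical helper: returns non-zero if RENDERED still contains a
# ${VAR} placeholder or does not mention the expected values.
check_rendered() {
  local rendered="$1" name="$2" account_id="$3"

  # Any surviving ${IDENTIFIER} means a variable was not substituted.
  if grep -Eq '\$\{[A-Za-z_][A-Za-z0-9_]*\}' <<<"$rendered"; then
    echo "unresolved placeholder left in output" >&2
    return 1
  fi

  # Match the quoted values only, so JSON whitespace does not matter.
  if ! grep -qF "\"${name}\"" <<<"$rendered"; then
    echo "expected log group name ${name} not found" >&2
    return 1
  fi
  if ! grep -qF "\"${account_id}\"" <<<"$rendered"; then
    echo "expected account id ${account_id} not found" >&2
    return 1
  fi
}

# Sample inputs (made up for this example, not taken from the real rule file).
good='{"name":"ecs-boxel-worker-production","accountId":"120317779495"}'
bad='{"name":"${WORKER_LOG_GROUP_NAME}","accountId":"120317779495"}'

check_rendered "$good" ecs-boxel-worker-production 120317779495 && echo "production: ok"
check_rendered "$bad" ecs-boxel-worker-production 120317779495 || echo "unsubstituted template: rejected"
```

A check like this only covers the rendering step; the post-apply verification (tailing prod Grafana logs for `ResourceNotFoundException`) still has to happen against the live environment.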