feat(freshness-monitor): Phase 3 Lambda — every-15min absence-driven artifact probe #335

Merged
cipher813 merged 1 commit into main from feat/freshness-monitor-lambda on May 27, 2026

Conversation

@cipher813
Owner

Summary

Phase 3 of the artifact-freshness-monitor arc (plan doc at ~/Development/alpha-engine-docs/private/artifact-freshness-monitor-260527.md; ROADMAP P1 entry in alpha-engine-config #342 + status update in #344). The absence-driven complement to flow-doctor / SF Catch / substrate-health-check (all event-driven). Closes the silent absence-of-artifact bug class — the 2026-05-17→27 pit_parity.json (load-bearing artifact silently absent for 11 days) + the sibling factor-profiles orphan + missing-signals.json incidents.

Depends on:

Changes

infrastructure/lambdas/freshness-monitor/ (new Lambda):

  • index.py — EventBridge cron handler (every 15min):

    1. load_registry(s3, REGISTRY_BUCKET, REGISTRY_KEY) — fetches _freshness_monitor/ARTIFACT_REGISTRY.yaml from S3, merges defaults block into each entry, instantiates ArtifactSpec.
    2. Walks every spec, calls alpha_engine_lib.artifact_freshness.check_freshness per row, isolates per-spec exceptions (_check_one wraps with a probe_failed fallback so a bad row doesn't sink the pass).
    3. Emits _freshness_monitor/check_results.json (dashboard surface in Phase 5) + _freshness_monitor/heartbeat.json (self-heartbeat per plan §3 invariant 9 — the monitor monitors itself).
    4. For misses past SLA (state ∈ {missing, stale, probe_failed}), routes to alpha_engine_lib.alerts.publish with dedup_key=resolve_dedup_key(spec, now) — collapses 4×/hour retries to one alert per cycle per artifact.
    5. probe_failed always escalates to severity=critical regardless of spec — the monitor itself is broken; operator must know (plan §3 invariant 6).
    6. OBSERVE-mode gate: MNEMON_FRESHNESS_MONITOR_ENABLED env var (default unset = false) suppresses alerts but emits results. Phase 6 cutover flips via aws lambda update-function-configuration without redeploying — mirrors mnemon 0.7.0rc4 pattern from 2026-05-24.
  • requirements.txt — pinned to alpha-engine-lib@v0.40.0 + pyyaml.

  • iam-policy.json — Logs + Telegram SSM params + alpha-engine-alerts SNS publish + S3 HeadObject/GetObject on alpha-engine-research + S3 PutObject scoped to _freshness_monitor/ + _alerts/_dedup/.

  • deploy.sh — --bootstrap (IAM role + Lambda + EventBridge cron rule cron(*/15 * * * ? *) + cron permission), code-update path, registry upload from local alpha-engine-config clone to S3. Validates registry locally via alpha-engine-config/scripts/validate_artifact_registry.py BEFORE upload — malformed YAML never reaches S3. Mirrors the sf-telegram-notifier deploy.sh shape; managed outside CFN per the existing operator-deployed-only Lambda pattern.

  • test_handler.py — 12 unit tests covering:

    • load_registry (defaults merge, per-entry override, ISO-string date coercion, missing-artifacts raise)
    • Handler OBSERVE-mode does not alert but emits heartbeat
    • Handler alerts-enabled fires alerts with resolved dedup_key
    • probe_failed routes to critical severity regardless of spec
    • Per-spec exception (e.g., unsupported placeholder) classified as probe_failed without sinking the rest of the pass
    • Env-flip cutover (OBSERVE → production) without code change
    • _maybe_alert per-state coverage (fresh skip, within-SLA-grace skip, missing-past-SLA fires, probe_failed bumps severity)
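Steps 1, 2, and 6 of the handler can be sketched as follows. This is an illustration under stated assumptions, not the code in index.py: the registry layout (a `defaults` mapping plus an `artifacts` list), the `id` field, the helper names `merge_defaults`, `check_all`, and `alerts_enabled`, and the exact parsing of the env var are all hypothetical. `probe` stands in for `check_freshness`, whose real signature is not shown in this PR.

```python
import os
from typing import Any, Callable, Mapping


def merge_defaults(doc: Mapping[str, Any]) -> list[dict[str, Any]]:
    """Overlay each registry entry on the shared defaults block.

    `doc` is the already-parsed YAML mapping (assumed layout:
    a `defaults` mapping plus an `artifacts` list).
    """
    artifacts = doc.get("artifacts")
    if not artifacts:
        raise ValueError("registry has no 'artifacts' entries")
    defaults = doc.get("defaults") or {}
    # Per-entry values win over the defaults.
    return [{**defaults, **entry} for entry in artifacts]


def check_all(
    specs: list[dict[str, Any]],
    probe: Callable[[dict[str, Any]], str],
) -> list[dict[str, Any]]:
    """Run `probe` on every spec; a raising spec is recorded as probe_failed."""
    results = []
    for spec in specs:
        try:
            state = probe(spec)
        except Exception as exc:  # isolate: one bad row must not sink the pass
            results.append(
                {"id": spec.get("id"), "state": "probe_failed", "error": repr(exc)}
            )
        else:
            results.append({"id": spec.get("id"), "state": state})
    return results


def alerts_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """OBSERVE-mode gate: unset, or anything other than 'true', means no alerts."""
    flag = env.get("MNEMON_FRESHNESS_MONITOR_ENABLED", "")
    return flag.strip().lower() == "true"
```

Passing the parsed mapping and the probe in as arguments keeps the sketch exercisable without S3 or the lib substrate.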

Deploy plan (post-merge, operator-driven)

# First-time bootstrap (creates IAM role + Lambda + EventBridge cron + uploads registry)
bash infrastructure/lambdas/freshness-monitor/deploy.sh --bootstrap

# Subsequent code or registry updates
bash infrastructure/lambdas/freshness-monitor/deploy.sh

Bootstrap default state: MNEMON_FRESHNESS_MONITOR_ENABLED=false (OBSERVE). Cutover (Phase 6) flips via:

aws lambda update-function-configuration \
  --function-name alpha-engine-freshness-monitor \
  --environment 'Variables={LOG_LEVEL=INFO,MNEMON_FRESHNESS_MONITOR_ENABLED=true}'
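Note that `--environment` sets the function's whole variable map rather than patching a single key, which is why the command above restates LOG_LEVEL. If more variables are added later, a read-merge-write flip may be safer. A minimal sketch, assuming a boto3 Lambda client is passed in; `flip_flag` is a hypothetical helper and is not part of deploy.sh.

```python
from typing import Any


def flip_flag(lambda_client: Any, function_name: str, enabled: bool) -> dict[str, str]:
    """Set MNEMON_FRESHNESS_MONITOR_ENABLED while preserving other variables."""
    cfg = lambda_client.get_function_configuration(FunctionName=function_name)
    # Copy the current variables so the update does not drop any of them.
    variables = dict(cfg.get("Environment", {}).get("Variables", {}))
    variables["MNEMON_FRESHNESS_MONITOR_ENABLED"] = "true" if enabled else "false"
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables
```

Typical use would be `flip_flag(boto3.client("lambda"), "alpha-engine-freshness-monitor", True)`; the same call with `False` returns the monitor to OBSERVE mode.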

Soak plan (Phase 6)

≥2 weekly cycles in OBSERVE mode. Earliest cutover is ~2026-06-13 if this PR lands before Sat 5/30, so that the 5/30 and 6/6 Saturday SFs exercise the probe under realistic load; more realistically ~2026-06-20.

During soak: dashboard surface (Phase 5) populates with real check_results; operator validates alert payloads against known-good Saturday firings (no false alarms on holidays, recovery-SF substitution working, dedup window collapsing retry storms correctly).
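The retry-storm collapse that the soak validates can be illustrated with a small sketch. It is not the lib's `resolve_dedup_key`: the key format (artifact id plus UTC date, i.e. an assumed daily cycle), the `dedup_key` function, and the in-memory `AlertDeduper` class are hypothetical stand-ins chosen only to show the idea.

```python
from datetime import datetime, timezone


def dedup_key(artifact_id: str, now: datetime) -> str:
    """Hypothetical key: one value per artifact per UTC calendar day.

    `now` is expected to be timezone-aware.
    """
    return f"{artifact_id}:{now.astimezone(timezone.utc):%Y-%m-%d}"


class AlertDeduper:
    """Remembers keys already alerted on (in memory, for illustration only)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_send(self, key: str) -> bool:
        if key in self._seen:
            return False  # already alerted in this cycle
        self._seen.add(key)
        return True
```

Under these assumptions, four 15-minute firings for the same missing artifact within one hour share a key, so only the first one would alert.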

Acceptance (plan §7)

  • pytest test_handler.py — 12 passing
  • Post-deploy: first cron firing within 15min emits _freshness_monitor/heartbeat.json to S3
  • Post-deploy: simulated pit_parity-class silent failure (mock: rename the expected S3 key) fires Telegram with correct dedup key within ~15min
  • Post-deploy: NYSE-holiday weekday (next is 2026-06-19 Juneteenth, a Friday) produces zero alerts for weekday_sf artifacts despite cron firing
  • Phase 5 dashboard PR consumes check_results.json
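The holiday criterion above amounts to an "is an artifact expected today?" gate. A minimal sketch, assuming a hard-coded holiday set; `NYSE_HOLIDAYS` and `weekday_sf_expected` are hypothetical names, and the real calendar logic lives in the lib substrate and may differ.

```python
from datetime import date

# Illustrative subset only; a real calendar would list every market holiday.
NYSE_HOLIDAYS = frozenset({date(2026, 6, 19)})  # Juneteenth


def weekday_sf_expected(day: date, holidays: frozenset[date] = NYSE_HOLIDAYS) -> bool:
    """A weekday_sf artifact is expected only on non-holiday weekdays."""
    is_weekday = day.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and day not in holidays
```

When the gate returns False, the probe has nothing to compare against, so a cron firing on that day should produce no alert for the artifact.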

Test plan

Composes with

  • alpha-engine-lib #83 — the artifact_freshness substrate (ArtifactSpec, check_freshness, resolve_dedup_key)
  • alpha-engine-config #344 — the registry SoT
  • [[feedback_no_silent_fails]] — operationalizes the rule for the absence-of-artifact case
  • [[feedback_sota_institutional_default_no_shortcuts]] — registry-as-config (Google SRE Ch. 4 SLI/SLO pattern) + S3-loaded substrate over hardcoded specs
  • [[feedback_lift_invariants_to_chokepoint_after_second_recurrence]] — Phase 4 CI-guard cascade ships next, mirroring the test_schema_contract.py chokepoint pattern across the 4 producing repos

🤖 Generated with Claude Code

…artifact probe

Phase 3 of the artifact-freshness-monitor arc (plan doc at
~/Development/alpha-engine-docs/private/artifact-freshness-monitor-260527.md;
ROADMAP P1 entry in alpha-engine-config #342 / #344). The
absence-driven complement to flow-doctor / SF Catch /
substrate-health-check (all event-driven). Closes the silent
absence-of-artifact bug class — 2026-05-17→27 pit_parity.json + the
sibling factor-profiles orphan + missing-signals.json incidents.

Surface (infrastructure/lambdas/freshness-monitor/):

- index.py — EventBridge cron handler (every 15min):
  1. load_registry(s3, REGISTRY_BUCKET, REGISTRY_KEY) — fetches
     ARTIFACT_REGISTRY.yaml from S3, merges defaults block into each
     entry, instantiates ArtifactSpec (per alpha-engine-config #344).
  2. Walks every spec, calls alpha_engine_lib.artifact_freshness.
     check_freshness per row, isolates per-spec exceptions so a bad
     row doesn't sink the whole pass.
  3. Emits _freshness_monitor/check_results.json (dashboard surface
     in Phase 5) + _freshness_monitor/heartbeat.json (self-heartbeat
     per plan §3 invariant 9 — the monitor monitors itself;
     substrate-health-check daily watches the heartbeat).
  4. For misses past SLA (state ∈ {missing, stale, probe_failed}),
     routes to alpha_engine_lib.alerts.publish with
     dedup_key=resolve_dedup_key(spec, now) — collapses 4×/hour
     retries to one alert per cycle per artifact.
  5. probe_failed always escalates to severity=critical regardless
     of spec — the monitor itself is broken; operator must know
     (plan §3 invariant 6).
  6. OBSERVE-mode gate: MNEMON_FRESHNESS_MONITOR_ENABLED env var
     (default unset = false) suppresses alerts but emits results.
     Phase 6 cutover flips via aws lambda update-function-configuration
     without redeploying — mirrors mnemon 0.7.0rc4 pattern from
     2026-05-24.

- requirements.txt — pinned to alpha-engine-lib@v0.40.0 (substrate
  introduced in lib #83) + pyyaml.

- iam-policy.json — Logs + Telegram SSM params + alpha-engine-alerts
  SNS publish + S3 HeadObject/GetObject on alpha-engine-research +
  S3 PutObject scoped to _freshness_monitor/ + _alerts/_dedup/.

- deploy.sh — bootstrap (IAM role + Lambda + EventBridge cron
  rule + cron permission), code-update path, registry upload from
  local alpha-engine-config clone to S3. Validates registry locally
  via alpha-engine-config/scripts/validate_artifact_registry.py
  BEFORE upload — malformed YAML never reaches S3. Mirrors the
  sf-telegram-notifier deploy.sh shape. Managed outside CFN per
  same rationale as sf-telegram-notifier / spot-orphan-reaper /
  changelog-cloudwatch-mirror.

- test_handler.py — 12 unit tests covering:
  * load_registry (defaults merge, per-entry override, ISO-string
    date coercion, missing-artifacts raise)
  * Handler OBSERVE-mode does not alert but emits heartbeat
  * Handler alerts-enabled fires alerts with resolved dedup_key
  * probe_failed routes to critical severity regardless of spec
  * Per-spec exception (e.g., unsupported placeholder) classified as
    probe_failed without sinking the rest of the pass
  * Env-flip cutover (OBSERVE → production) without code change
  * _maybe_alert per-state coverage (fresh skip, within-SLA-grace
    skip, missing-past-SLA fires, probe_failed bumps severity)

Phase 6 cutover: ≥2 weekly cycles in OBSERVE mode (earliest cutover
~2026-06-13 if this PR + #344 land before Sat 5/30; more realistically
~2026-06-20). Acceptance criteria per plan §7: simulated pit_parity-
class silent failure fires Telegram with correct dedup key within
~15min; NYSE-holiday Monday produces zero alerts despite cron firing;
failed Saturday SF + successful recovery-SF in same window produces
zero alerts (substitution working); per plan §11 risk register.

Composes with alpha-engine-lib #83 (substrate, v0.40.0) +
alpha-engine-config #344 (registry SoT + PR-time validator).
Phase 4 (CI guards across 4 producing repos) + Phase 5 (dashboard
surface) ship in follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit b6a2b18 into main May 27, 2026
1 check passed
@cipher813 cipher813 deleted the feat/freshness-monitor-lambda branch May 27, 2026 21:44
cipher813 added a commit that referenced this pull request May 27, 2026
…e deps (#337)

Problem: PR #335 placed the preflight test step (`python3 -m pytest
test_handler.py`) BEFORE the pip install step. The test imports
index.py which imports yaml + boto3 + alpha_engine_lib.{alerts,
artifact_freshness} at module load — none of which are guaranteed
in the caller's bare python. Brian's run hit ModuleNotFoundError
on yaml because his shell python lacked pyyaml. Same root cause
applied to the registry validator (also imports yaml).

Fix: restructure step ordering so deps install before consumers
run:

  0a. Syntax-check handler (no imports — works on bare python)
  0b. Verify ae-config clone present (path check only)
  1.  Install runtime deps into $PKG via pip --target
  1a. Validate registry with PYTHONPATH=$PKG (yaml available)
  1b. Install pytest into separate $TEST_DEPS dir; run handler
      tests with PYTHONPATH=$PKG:$TEST_DEPS (all deps available;
      pytest NOT bundled into Lambda zip)
  1c. Copy handler + zip Lambda package (unchanged)

The $TEST_DEPS separate dir keeps pytest + its transitive deps out
of the final Lambda zip — only runtime deps from requirements.txt
end up in the deployable artifact. Both temp dirs cleaned via the
existing trap.
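Step 0a works on a bare interpreter because a syntax check only parses the file and never executes its imports. A minimal Python sketch of that property; `syntax_ok` is a hypothetical helper, and deploy.sh may implement the check differently.

```python
import ast


def syntax_ok(source: str, filename: str = "index.py") -> bool:
    """Return True if `source` parses; import statements are not executed."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError:
        return False
    return True
```

A file that imports yaml or alpha_engine_lib passes this check even when neither package is installed, which is why the step can run before pip.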

End-to-end dry-run verified:
  index.py syntax OK
  ✅ ARTIFACT_REGISTRY.yaml validated: 48 artifacts + 27 grandfathered paths
  12 passed in 0.08s
  Packaged function.zip (22498727 bytes)

Why this matters: deploy.sh is the only path operators have to ship
this Lambda; if the preflight fails on Brian's machine the entire
arc's Phase 6 deploy is blocked. This is a one-character-deep bug
class (`if [[ -f test_handler.py ]]; then python3 -m pytest`
assumes deps available) that bit because PR #335 mirrored the
sf-telegram-notifier deploy.sh pattern without noting that
sf-telegram-notifier's test stubs all its 3rd-party imports via
sys.modules whereas this Lambda's test exercises real lib substrate.
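For contrast, the sys.modules stubbing approach described above can be sketched as follows. This is a generic illustration of the technique, not the sf-telegram-notifier test code; `make_stub` and `load_with_stubs` are hypothetical helpers.

```python
import importlib
import sys
import types
from unittest import mock


def make_stub(name: str, **attrs: object) -> types.ModuleType:
    """Build an empty module object carrying only the given attributes."""
    module = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(module, key, value)
    return module


def load_with_stubs(stubs: dict[str, types.ModuleType], target: str) -> types.ModuleType:
    """Import `target` while `stubs` shadow the real third-party modules."""
    # patch.dict restores sys.modules on exit, so the stubs do not leak.
    with mock.patch.dict(sys.modules, stubs):
        return importlib.import_module(target)
```

A test written this way needs none of the third-party packages installed, whereas a test that exercises the real lib substrate needs them on PYTHONPATH before pytest runs.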

Composes with the Phase 6 OBSERVE-mode deploy that's the only
remaining item on the artifact-freshness-monitor arc.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>