feat(freshness-monitor): Phase 3 Lambda - every-15min absence-driven artifact probe #335
Merged
…artifact probe

Phase 3 of the artifact-freshness-monitor arc (plan doc at
~/Development/alpha-engine-docs/private/artifact-freshness-monitor-260527.md;
ROADMAP P1 entry in alpha-engine-config #342 / #344). The absence-driven
complement to flow-doctor / SF Catch / substrate-health-check (all
event-driven). Closes the silent absence-of-artifact bug class: the
2026-05-17→27 pit_parity.json + the sibling factor-profiles orphan +
missing-signals.json incidents.

Surface (infrastructure/lambdas/freshness-monitor/):

- index.py: EventBridge cron handler (every 15min):
  1. load_registry(s3, REGISTRY_BUCKET, REGISTRY_KEY) fetches
     ARTIFACT_REGISTRY.yaml from S3, merges the defaults block into each
     entry, and instantiates ArtifactSpec (per alpha-engine-config #344).
  2. Walks every spec, calls
     alpha_engine_lib.artifact_freshness.check_freshness per row, and
     isolates per-spec exceptions so a bad row doesn't sink the whole pass.
  3. Emits _freshness_monitor/check_results.json (dashboard surface in
     Phase 5) + _freshness_monitor/heartbeat.json (self-heartbeat per plan
     §3 invariant 9: the monitor monitors itself; substrate-health-check
     daily watches the heartbeat).
  4. For misses past SLA (state ∈ {missing, stale, probe_failed}), routes
     to alpha_engine_lib.alerts.publish with
     dedup_key=resolve_dedup_key(spec, now), which collapses 4×/hour
     retries to one alert per cycle per artifact.
  5. probe_failed always escalates to severity=critical regardless of
     spec: the monitor itself is broken; operator must know (plan §3
     invariant 6).
  6. OBSERVE-mode gate: MNEMON_FRESHNESS_MONITOR_ENABLED env var (default
     unset = false) suppresses alerts but emits results. Phase 6 cutover
     flips via aws lambda update-function-configuration without
     redeploying; mirrors mnemon 0.7.0rc4 pattern from 2026-05-24.
- requirements.txt: pinned to alpha-engine-lib@v0.40.0 (substrate
  introduced in lib #83) + pyyaml.
- iam-policy.json: Logs + Telegram SSM params + alpha-engine-alerts SNS
  publish + S3 HeadObject/GetObject on alpha-engine-research + S3 PutObject
  scoped to _freshness_monitor/ + _alerts/_dedup/.
- deploy.sh: bootstrap (IAM role + Lambda + EventBridge cron rule + cron
  permission), code-update path, registry upload from local
  alpha-engine-config clone to S3. Validates registry locally via
  alpha-engine-config/scripts/validate_artifact_registry.py BEFORE upload,
  so malformed YAML never reaches S3. Mirrors the sf-telegram-notifier
  deploy.sh shape. Managed outside CFN per same rationale as
  sf-telegram-notifier / spot-orphan-reaper / changelog-cloudwatch-mirror.
- test_handler.py: 12 unit tests covering:
  * load_registry (defaults merge, per-entry override, ISO-string date
    coercion, missing-artifacts raise)
  * Handler OBSERVE-mode does not alert but emits heartbeat
  * Handler alerts-enabled fires alerts with resolved dedup_key
  * probe_failed routes to critical severity regardless of spec
  * Per-spec exception (e.g., unsupported placeholder) classified as
    probe_failed without sinking the rest of the pass
  * Env-flip cutover (OBSERVE → production) without code change
  * _maybe_alert per-state coverage (fresh skip, within-SLA-grace skip,
    missing-past-SLA fires, probe_failed bumps severity)

Phase 6 cutover: ≥2 weekly cycles in OBSERVE mode (earliest cutover
~2026-06-13 if this PR + #344 land before Sat 5/30; more realistically
~2026-06-20).

Acceptance criteria per plan §7: simulated pit_parity-class silent failure
fires Telegram with correct dedup key within ~15min; NYSE-holiday Monday
produces zero alerts despite cron firing; failed Saturday SF + successful
recovery-SF in same window produces zero alerts (substitution working); per
plan §11 risk register.

Composes with alpha-engine-lib #83 (substrate, v0.40.0) +
alpha-engine-config #344 (registry SoT + PR-time validator). Phase 4 (CI
guards across 4 producing repos) + Phase 5 (dashboard surface) ship in
follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
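
The handler steps described in the commit message above (defaults merge, per-spec exception isolation, probe_failed escalation, and the OBSERVE-mode gate) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in index.py: the registry shape (a `defaults` block plus an `artifacts` list), the `probe` callable standing in for `check_freshness`, and the helper names `merge_defaults`, `check_one`, `alerts_enabled`, and `run_pass` are all hypothetical. Only the env var name and the state names come from the description, and SLA-grace handling is left out.

```python
import os
from typing import Callable, Optional

# States that the description says route to an alert once past SLA.
ALERTABLE_STATES = {"missing", "stale", "probe_failed"}


def merge_defaults(registry: dict) -> list:
    """Merge the defaults block into each entry; per-entry keys win."""
    if "artifacts" not in registry:
        raise ValueError("registry has no 'artifacts' block")
    defaults = registry.get("defaults") or {}
    return [{**defaults, **entry} for entry in registry["artifacts"]]


def check_one(spec: dict, probe: Callable[[dict], str]) -> dict:
    """Probe one spec; an exception becomes probe_failed instead of propagating."""
    try:
        state, error = probe(spec), None
    except Exception as exc:  # a bad row must not sink the rest of the pass
        state, error = "probe_failed", repr(exc)
    # probe_failed always escalates to critical, whatever the spec says.
    if state == "probe_failed":
        severity = "critical"
    else:
        severity = spec.get("severity", "warning")
    return {"id": spec.get("id"), "state": state, "severity": severity, "error": error}


def alerts_enabled(env: Optional[dict] = None) -> bool:
    """OBSERVE-mode gate: unset, or anything other than 'true', means no alerts."""
    env = os.environ if env is None else env
    return env.get("MNEMON_FRESHNESS_MONITOR_ENABLED", "false").strip().lower() == "true"


def run_pass(registry: dict, probe: Callable[[dict], str], env: Optional[dict] = None):
    """Check every spec, always return results, and list alerts only when enabled."""
    results = [check_one(spec, probe) for spec in merge_defaults(registry)]
    if not alerts_enabled(env):
        return results, []  # OBSERVE mode: results are emitted, alerts suppressed
    return results, [r for r in results if r["state"] in ALERTABLE_STATES]


if __name__ == "__main__":
    demo_registry = {
        "defaults": {"severity": "warning"},
        "artifacts": [{"id": "a"}, {"id": "b"}, {"id": "c", "severity": "info"}],
    }

    def demo_probe(spec: dict) -> str:
        if spec["id"] == "b":
            raise RuntimeError("unsupported placeholder")
        return "fresh" if spec["id"] == "a" else "missing"

    for mode in ({}, {"MNEMON_FRESHNESS_MONITOR_ENABLED": "true"}):
        results, to_alert = run_pass(demo_registry, demo_probe, env=mode)
        print([r["state"] for r in results], [r["id"] for r in to_alert])
```

Passing `env` explicitly keeps the gate testable without touching the process environment, which is one way the env-flip cutover case could be exercised in a unit test.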
cipher813 added a commit that referenced this pull request on May 27, 2026
…e deps (#337)

Problem: PR #335 placed the preflight test step (`python3 -m pytest test_handler.py`) BEFORE the pip install step. The test imports index.py, which imports yaml + boto3 + alpha_engine_lib.{alerts, artifact_freshness} at module load, none of which are guaranteed in the caller's bare python. Brian's run hit ModuleNotFoundError on yaml because his shell python lacked pyyaml. Same root cause applied to the registry validator (also imports yaml).

Fix: restructure step ordering so deps install before consumers run:

  0a. Syntax-check handler (no imports; works on bare python)
  0b. Verify ae-config clone present (path check only)
  1.  Install runtime deps into $PKG via pip --target
  1a. Validate registry with PYTHONPATH=$PKG (yaml available)
  1b. Install pytest into separate $TEST_DEPS dir; run handler tests with PYTHONPATH=$PKG:$TEST_DEPS (all deps available; pytest NOT bundled into Lambda zip)
  1c. Copy handler + zip Lambda package (unchanged)

The $TEST_DEPS separate dir keeps pytest + its transitive deps out of the final Lambda zip; only runtime deps from requirements.txt end up in the deployable artifact. Both temp dirs cleaned via the existing trap.

End-to-end dry-run verified:

  index.py syntax OK
  ✅ ARTIFACT_REGISTRY.yaml validated: 48 artifacts + 27 grandfathered paths
  12 passed in 0.08s
  Packaged function.zip (22498727 bytes)

Why this matters: deploy.sh is the only path operators have to ship this Lambda; if the preflight fails on Brian's machine the entire arc's Phase 6 deploy is blocked. This is a one-character-deep bug class (`if [[ -f test_handler.py ]]; then python3 -m pytest` assumes deps available) that bit because PR #335 mirrored the sf-telegram-notifier deploy.sh pattern without noting that sf-telegram-notifier's test stubs all its 3rd-party imports via sys.modules, whereas this Lambda's test exercises real lib substrate.

Composes with the Phase 6 OBSERVE-mode deploy that's the only remaining item on the artifact-freshness-monitor arc.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
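
The distinction drawn above, between a test that stubs its third-party imports via sys.modules and one that exercises real dependencies, can be illustrated with a small sketch. The module name `fake_vendor_sdk` and the helpers `install_stub` and `can_import` are hypothetical stand-ins, not taken from either Lambda. The point is only that a stubbed import succeeds on a bare interpreter, whereas an unstubbed one needs the package installed before the test runs.

```python
import importlib
import sys
import types


def install_stub(name: str, **attrs) -> types.ModuleType:
    """Register a fake module so a later `import name` works without installing it."""
    stub = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(stub, key, value)
    sys.modules[name] = stub
    return stub


def can_import(name: str) -> bool:
    """Return True if `name` is importable in this interpreter."""
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


# Hypothetical dependency name that is assumed not to be installed.
DEP = "fake_vendor_sdk"

before = can_import(DEP)  # bare interpreter: the import fails
install_stub(DEP, safe_load=lambda text: {"stubbed": True})
after = can_import(DEP)  # stub registered: the import now succeeds

import fake_vendor_sdk  # resolved from sys.modules, not from site-packages

print(before, after, fake_vendor_sdk.safe_load("ignored"))
```

A test built this way never notices a missing package. A test that imports the real library, as the message says this Lambda's does, only passes once the dependencies are on the path.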
Summary
Phase 3 of the artifact-freshness-monitor arc (plan doc at `~/Development/alpha-engine-docs/private/artifact-freshness-monitor-260527.md`; ROADMAP P1 entry in alpha-engine-config #342 + status update in #344). The absence-driven complement to flow-doctor / SF Catch / substrate-health-check (all event-driven). Closes the silent absence-of-artifact bug class: the 2026-05-17→27 `pit_parity.json` (load-bearing artifact silently absent for 11 days) + the sibling factor-profiles orphan + missing-signals.json incidents.

Depends on:
- `artifact_freshness` substrate
- `ARTIFACT_REGISTRY.yaml` SoT + PR-time validator

Changes
`infrastructure/lambdas/freshness-monitor/` (new Lambda):
- `index.py`: EventBridge cron handler (every 15min):
  1. `load_registry(s3, REGISTRY_BUCKET, REGISTRY_KEY)` fetches `_freshness_monitor/ARTIFACT_REGISTRY.yaml` from S3, merges the `defaults` block into each entry, and instantiates `ArtifactSpec`.
  2. Walks every spec, calls `alpha_engine_lib.artifact_freshness.check_freshness` per row, and isolates per-spec exceptions (`_check_one` wraps with a probe_failed fallback so a bad row doesn't sink the pass).
  3. Emits `_freshness_monitor/check_results.json` (dashboard surface in Phase 5) + `_freshness_monitor/heartbeat.json` (self-heartbeat per plan §3 invariant 9: the monitor monitors itself).
  4. For misses past SLA (`state ∈ {missing, stale, probe_failed}`), routes to `alpha_engine_lib.alerts.publish` with `dedup_key=resolve_dedup_key(spec, now)`, which collapses 4×/hour retries to one alert per cycle per artifact.
  5. `probe_failed` always escalates to `severity=critical` regardless of spec: the monitor itself is broken; operator must know (plan §3 invariant 6).
  6. OBSERVE-mode gate: `MNEMON_FRESHNESS_MONITOR_ENABLED` env var (default unset = `false`) suppresses alerts but emits results. Phase 6 cutover flips via `aws lambda update-function-configuration` without redeploying; mirrors mnemon 0.7.0rc4 pattern from 2026-05-24.
- `requirements.txt`: pinned to `alpha-engine-lib@v0.40.0` + `pyyaml`.
- `iam-policy.json`: Logs + Telegram SSM params + `alpha-engine-alerts` SNS publish + S3 HeadObject/GetObject on `alpha-engine-research` + S3 PutObject scoped to `_freshness_monitor/` + `_alerts/_dedup/`.
- `deploy.sh`: `--bootstrap` (IAM role + Lambda + EventBridge cron `cron(*/15 * * * ? *)` + cron permission), code-update path, registry upload from local `alpha-engine-config` clone to S3. Validates registry locally via `alpha-engine-config/scripts/validate_artifact_registry.py` BEFORE upload, so malformed YAML never reaches S3. Mirrors the `sf-telegram-notifier` deploy.sh shape; managed outside CFN per the existing operator-deployed-only Lambda pattern.
- `test_handler.py`: 12 unit tests covering:
  - `load_registry` (defaults merge, per-entry override, ISO-string date coercion, missing-artifacts raise)
  - `_maybe_alert` per-state coverage (fresh skip, within-SLA-grace skip, missing-past-SLA fires, probe_failed bumps severity)

Deploy plan (post-merge, operator-driven)
Bootstrap default state: `MNEMON_FRESHNESS_MONITOR_ENABLED=false` (OBSERVE). Cutover (Phase 6) flips via:

    aws lambda update-function-configuration \
      --function-name alpha-engine-freshness-monitor \
      --environment 'Variables={LOG_LEVEL=INFO,MNEMON_FRESHNESS_MONITOR_ENABLED=true}'

Soak plan (Phase 6)
≥2 weekly cycles in OBSERVE mode. Earliest cutover ~2026-06-13 if this PR lands before Sat 5/30 (so the next 5/30 + 6/6 Saturday SFs exercise the probe under realistic load); more realistically ~2026-06-20.
During soak: dashboard surface (Phase 5) populates with real check_results; operator validates alert payloads against known-good Saturday firings (no false alarms on holidays, recovery-SF substitution working, dedup window collapsing retry storms correctly).
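
The dedup behaviour the soak is meant to validate (four probes per hour collapsing into one alert per cycle per artifact) can be illustrated with a hypothetical key scheme. The real key comes from `resolve_dedup_key(spec, now)` in alpha-engine-lib, whose format is not described here. The sketch below assumes, purely for illustration, a key built from the artifact id and the UTC calendar date, and it uses an in-memory set in place of whatever persistent dedup store the library uses. `hypothetical_dedup_key` and `AlertDeduper` are made-up names.

```python
from datetime import datetime, timedelta, timezone


def hypothetical_dedup_key(artifact_id: str, now: datetime) -> str:
    """Illustrative key: one per artifact per UTC calendar day (an assumption)."""
    return f"{artifact_id}:{now.astimezone(timezone.utc).date().isoformat()}"


class AlertDeduper:
    """Fires an alert only the first time a given dedup key is seen."""

    def __init__(self) -> None:
        self._seen = set()

    def should_fire(self, key: str) -> bool:
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


# Simulate one hour of 15-minute probes that all find the same artifact missing.
start = datetime(2026, 6, 6, 14, 0, tzinfo=timezone.utc)
probe_times = [start + timedelta(minutes=15 * i) for i in range(4)]

deduper = AlertDeduper()
fired = [
    deduper.should_fire(hypothetical_dedup_key("pit_parity", t)) for t in probe_times
]
print(fired)  # [True, False, False, False]
```

Under this assumed scheme a retry storm within one cycle yields a single alert, while the same artifact missing on a later date produces a new key and fires again.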
Acceptance (plan §7)
- `pytest test_handler.py`: 12 passing
- … `_freshness_monitor/heartbeat.json` to S3
- Simulated `pit_parity`-class silent failure (mock: rename the expected S3 key) fires Telegram with correct dedup key within ~15min
- … `check_results.json`

Test plan
- `pytest infrastructure/lambdas/freshness-monitor/test_handler.py -v`: 12 passing
- `bash infrastructure/lambdas/freshness-monitor/deploy.sh --dry-run` (after the merge of "feat(schema): signals.schema.json market_regime to 3-class - Phase 1C" #344 is in the operator's local checkout)
- `--smoke` flag invokes once and confirms heartbeat lands

Composes with
- `artifact_freshness` substrate (`ArtifactSpec`, `check_freshness`, `resolve_dedup_key`)
- `[[feedback_no_silent_fails]]`: operationalizes the rule for the absence-of-artifact case
- `[[feedback_sota_institutional_default_no_shortcuts]]`: registry-as-config (Google SRE Ch. 4 SLI/SLO pattern) + S3-loaded substrate over hardcoded specs
- `[[feedback_lift_invariants_to_chokepoint_after_second_recurrence]]`: Phase 4 CI-guard cascade ships next, mirroring the `test_schema_contract.py` chokepoint pattern across the 4 producing repos

🤖 Generated with Claude Code