fix(lambda): COPY validators/ into Phase 2 image — closes 2-day canary rollback chain (since PR #254)#274
Merged
Conversation
…y rollback chain since PR #254 The Phase 2 Lambda CI deploy has been failing on every push since 2026-05-18 18:20Z. PR #254 (per-collector value-range validation, merged 5/16) added top-level imports from validators.price_validator import (...) to collectors/alternative.py + collectors/fundamentals.py, but did NOT add `COPY validators/` to the Dockerfile. Every push that touched a deploy.yml-triggering path since then built a fresh image, pushed v88, ran the canary, hit `No module named 'validators'` at module load, and rolled back to v87. 10 consecutive failed deploys (5/18-18:20Z through 5/20-00:25Z) sat unnoticed because the canary correctly rolled back so prod (v87) was unaffected — but the latent break blocked ANY new code from ever reaching `live`. Surfaced by the Wave-3 PR3-wave-2 deploy (#273) when Brian saw the rollback in the CI log. Fix: one-line `COPY validators/ ${LAMBDA_TASK_ROOT}/validators/` next to the other application-code COPY directives. Regression guard: new tests/test_dockerfile_copies_match_deployed _imports.py with two assertions — 1. `test_dockerfile_copies_validators_for_collectors_imports` — explicit pin for the validators/ case so the failure surfaces as a clear test name if a future refactor drops the COPY. 2. `test_every_toplevel_local_import_in_lambda_code_is_dockerfile _copied` — generic ast-scan over every deployed Python file (lambda/, weekly_collector.py, polygon_client.py, collectors/, store/, validators/). Module-scope `from <local_pkg> import ...` / `import <local_pkg>` where <local_pkg> is a directory with __init__.py at the repo root MUST appear in the Dockerfile's COPY directives — or in the explicitly-non-deployed allowlist (`tests`, `builders`, `infrastructure`, `rag`, `features`). New top-level imports of un-COPY'd packages fail in CI now, not in the post-merge canary. Suite 1399 → 1401 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
P0 hotfix. Phase 2 Lambda CI deploy has been auto-rolling back to v87 on every push since 2026-05-18 18:20Z. 10 consecutive failed deploys sat unnoticed because the canary correctly rolls back, so prod (v87) was unaffected — but no new code has reached `live` for 2 days.
Root cause
PR #254 (per-collector value-range validation, merged 2026-05-16) added top-level imports:
```python
from validators.price_validator import (...)
```
to `collectors/alternative.py:61` + `collectors/fundamentals.py:70` — but did NOT add `COPY validators/` to the Dockerfile. So every `deploy.yml`-triggering push since #254 builds a fresh image where the Lambda fails at module load with `No module named 'validators'` and the canary rolls back to v87.
Discovery
Surfaced by my PR #273 (Wave-3 PR3-wave-2) deploy — the canary log showed the rollback in plain text:
```
Alias 'live' -> version 88
Running canary (dry_run=true)...
WARNING: Canary returned 'No module named 'validators''
Rolling back...
Rolled back to version 87
Error: Process completed with exit code 1.
```
Cross-checked via `gh run list --workflow=deploy.yml` — all 10 deploys since 2026-05-18 18:20Z failed; last success was 2026-05-17 01:44Z. The breaking commit is `2c1128e` (PR #254).
Fix
One Dockerfile line:
```dockerfile
COPY validators/ ${LAMBDA_TASK_ROOT}/validators/
```
Regression guard
New `tests/test_dockerfile_copies_match_deployed_imports.py` — 2 assertions:
Production impact
Zero impact to running Lambda. The canary rolled back every time, so v87 has been serving production this whole time. The latent break only blocked any new code from reaching `live` — which means:
Once this PR merges and the canary passes, those 4+ PRs' code will reach `live` for the first time on the next deploy.
Suite
1399 → 1401 green (+2 regression tests).
ROADMAP follow-up (per the new audit-findings rule)
Filing a P1 entry: investigate whether the canary rollback should also Telegram/email so the operator notices same-day instead of 2 days later. The current behaviour (silent rollback) is by-design-safe but failure-correlated-observability-blind in this exact way.
🤖 Generated with Claude Code