Skip to content

fix(lambda): COPY validators/ into Phase 2 image — closes 2-day canary rollback chain (since PR #254)#274

Merged
cipher813 merged 1 commit into
mainfrom
fix/lambda-dockerfile-copy-validators
May 20, 2026
Merged

fix(lambda): COPY validators/ into Phase 2 image — closes 2-day canary rollback chain (since PR #254)#274
cipher813 merged 1 commit into
mainfrom
fix/lambda-dockerfile-copy-validators

Conversation

@cipher813
Copy link
Copy Markdown
Owner

P0 hotfix. Phase 2 Lambda CI deploy has been auto-rolling back to v87 on every push since 2026-05-18 18:20Z. 10 consecutive failed deploys sat unnoticed because the canary correctly rolls back, so prod (v87) was unaffected — but no new code has reached `live` for 2 days.

Root cause

PR #254 (per-collector value-range validation, merged 2026-05-16) added top-level imports:

```python
from validators.price_validator import (...)
```

to `collectors/alternative.py:61` + `collectors/fundamentals.py:70` — but did NOT add `COPY validators/` to the Dockerfile. So every `deploy.yml`-triggering push since #254 builds a fresh image where the Lambda fails at module load with `No module named 'validators'` and the canary rolls back to v87.

Discovery

Surfaced by my PR #273 (Wave-3 PR3-wave-2) deploy — the canary log showed the rollback in plain text:

```
Alias 'live' -> version 88
Running canary (dry_run=true)...
WARNING: Canary returned 'No module named 'validators''
Rolling back...
Rolled back to version 87
Error: Process completed with exit code 1.
```

Cross-checked via `gh run list --workflow=deploy.yml` — all 10 deploys since 2026-05-18 18:20Z failed; last success was 2026-05-17 01:44Z. The breaking commit is `2c1128e` (PR #254).

Fix

One Dockerfile line:

```dockerfile
COPY validators/ ${LAMBDA_TASK_ROOT}/validators/
```

Regression guard

New `tests/test_dockerfile_copies_match_deployed_imports.py` — 2 assertions:

  1. `test_dockerfile_copies_validators_for_collectors_imports` — explicit pin for the validators/ case (clear test name if a future refactor drops the COPY).
  2. `test_every_toplevel_local_import_in_lambda_code_is_dockerfile_copied` — generic ast-scan over every deployed Python file (`lambda/`, `weekly_collector.py`, `polygon_client.py`, `collectors/`, `store/`, `validators/`). Module-scope `from <local_pkg> import ...` / `import <local_pkg>` where `<local_pkg>` resolves to a local directory MUST appear in the Dockerfile's COPY directives — or in the explicit non-deployed allowlist (`tests`, `builders`, `infrastructure`, `rag`, `features`). Catches this regression class at PR time, not in the post-merge canary.

Production impact

Zero impact to running Lambda. The canary rolled back every time, so v87 has been serving production this whole time. The latent break only blocked any new code from reaching `live` — which means:

Once this PR merges and the canary passes, those 4+ PRs' code will reach `live` for the first time on the next deploy.

Suite

1399 → 1401 green (+2 regression tests).

ROADMAP follow-up (per the new audit-findings rule)

Filing a P1 entry: investigate whether the canary rollback should also Telegram/email so the operator notices same-day instead of 2 days later. The current behaviour (silent rollback) is by-design-safe but failure-correlated-observability-blind in this exact way.

🤖 Generated with Claude Code

…y rollback chain since PR #254

The Phase 2 Lambda CI deploy has been failing on every push since
2026-05-18 18:20Z. PR #254 (per-collector value-range validation,
merged 5/16) added top-level imports

  from validators.price_validator import (...)

to collectors/alternative.py + collectors/fundamentals.py, but did NOT
add `COPY validators/` to the Dockerfile. Every push that touched a
deploy.yml-triggering path since then built a fresh image, pushed v88,
ran the canary, hit `No module named 'validators'` at module load,
and rolled back to v87. 10 consecutive failed deploys (5/18-18:20Z
through 5/20-00:25Z) sat unnoticed because the canary correctly rolled
back so prod (v87) was unaffected — but the latent break blocked ANY
new code from ever reaching `live`. Surfaced by the Wave-3 PR3-wave-2
deploy (#273) when Brian saw the rollback in the CI log.

Fix: one-line `COPY validators/ ${LAMBDA_TASK_ROOT}/validators/` next
to the other application-code COPY directives.

Regression guard: new tests/test_dockerfile_copies_match_deployed
_imports.py with two assertions —

1. `test_dockerfile_copies_validators_for_collectors_imports` —
   explicit pin for the validators/ case so the failure surfaces as
   a clear test name if a future refactor drops the COPY.

2. `test_every_toplevel_local_import_in_lambda_code_is_dockerfile
   _copied` — generic ast-scan over every deployed Python file
   (lambda/, weekly_collector.py, polygon_client.py, collectors/,
   store/, validators/). Module-scope `from <local_pkg> import ...`
   / `import <local_pkg>` where <local_pkg> is a directory with
   __init__.py at the repo root MUST appear in the Dockerfile's
   COPY directives — or in the explicitly-non-deployed allowlist
   (`tests`, `builders`, `infrastructure`, `rag`, `features`). New
   top-level imports of un-COPY'd packages fail in CI now, not in
   the post-merge canary.

Suite 1399 → 1401 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 2400aae into main May 20, 2026
1 check passed
@cipher813 cipher813 deleted the fix/lambda-dockerfile-copy-validators branch May 20, 2026 01:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant