fix(sf-daily): export FLOW_DOCTOR_ENABLED=1 in every SSM command block (#210)

Merged
cipher813 merged 1 commit into main from fix/sf-flow-doctor-enabled-on-ssm-commands on May 11, 2026

@cipher813
Owner

Summary

  • Add "export FLOW_DOCTOR_ENABLED=1" as the second line of every SSM RunShellScript block in step_function_daily.json, right after the set -eo pipefail line added in #209 ("fix(sf-daily): set -eo pipefail on every SSM RunShellScript block"). Affected states: CheckTradingDay, MorningEnrich, RunMorningPlanner, RunDaemon.
  • Extend test_sf_ssm_pipefail_wiring.py with test_weekday_sf_ssm_blocks_export_flow_doctor_enabled to pin the flag in every command array.
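A minimal sketch of what such a wiring check could look like, not the test shipped in this PR. The sample definition, the nesting under Parameters, and the helper names (iter_command_arrays, blocks_missing_flag) are assumptions for illustration; only the two leading command lines and the env-file path come from this PR.

```python
import json

# Hypothetical, trimmed-down stand-in for step_function_daily.json.
# The real state layout may differ; only the first two command lines are
# taken from this PR.
SAMPLE_DEFINITION = json.loads("""
{
  "States": {
    "MorningEnrich": {
      "Type": "Task",
      "Parameters": {
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {
          "commands": [
            "set -eo pipefail",
            "export FLOW_DOCTOR_ENABLED=1",
            "source /home/ec2-user/.alpha-engine.env"
          ]
        }
      }
    }
  }
}
""")


def iter_command_arrays(node, path=""):
    """Yield (path, commands) for every list stored under a 'commands' key."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}"
            if key == "commands" and isinstance(value, list):
                yield child, value
            else:
                yield from iter_command_arrays(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_command_arrays(item, f"{path}[{index}]")


def blocks_missing_flag(definition, flag_line="export FLOW_DOCTOR_ENABLED=1"):
    """Return the paths of command arrays whose second line is not the flag."""
    return [
        path
        for path, commands in iter_command_arrays(definition)
        if len(commands) < 2 or commands[1] != flag_line
    ]
```

A pytest-style test would load the real JSON file and assert that blocks_missing_flag(...) returns an empty list. Walking the whole document rather than naming states means a newly added SSM block is covered without touching the test.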

Why

alpha_engine_lib.logging.setup_logging only attaches FlowDoctorHandler when FLOW_DOCTOR_ENABLED=1 is set in env (lib logging.py:138). Otherwise ERROR-level logs go only to stdout — never entering flow-doctor's dispatch pipeline (email + GitHub issue + S3 changelog).
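The gating described above can be sketched as follows. This is an illustration under stated assumptions, not the code in alpha_engine_lib: StubFlowDoctorHandler and setup_logging_sketch are hypothetical names, and the stub only records ERROR records instead of dispatching them.

```python
import logging
import os
import sys


class StubFlowDoctorHandler(logging.Handler):
    """Hypothetical stand-in for FlowDoctorHandler: collects ERROR records."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def setup_logging_sketch(name, env=None):
    """Attach the escalation handler only when FLOW_DOCTOR_ENABLED=1 is set."""
    env = os.environ if env is None else env
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Every record always reaches stdout.
    logger.addHandler(logging.StreamHandler(sys.stdout))
    # Without the flag, ERROR records stop at stdout and are never escalated.
    if env.get("FLOW_DOCTOR_ENABLED") == "1":
        logger.addHandler(StubFlowDoctorHandler())
    return logger
```

With an empty env mapping the logger ends up with the stdout handler only, which matches the silent behaviour seen on the trading instance.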

I just confirmed via SSM that the trading instance's /home/ec2-user/.alpha-engine.env has ANTHROPIC_API_KEY + FLOW_DOCTOR_GITHUB_TOKEN but does NOT have FLOW_DOCTOR_ENABLED.

Regression timeline

FLOW_DOCTOR_ENABLED=1 was originally injected by two systemd units:

  • alpha-engine/infrastructure/systemd/alpha-engine-morning.service:19
  • alpha-engine/infrastructure/systemd/alpha-engine-daemon.service:22

Both via Environment=FLOW_DOCTOR_ENABLED=1.

Per alpha-engine/CLAUDE.md these units were disabled on 2026-05-05 when the morning planner + daemon migrated from boot-triggered systemd to SF-triggered SSM. The alpha-engine-daemon unit is still systemd-managed (the SF does systemctl restart), so the daemon process inherits FLOW_DOCTOR_ENABLED=1 ✓. But MorningEnrich + RunMorningPlanner became direct SSM RunShellScript commands that only source .alpha-engine.env, and the env file never gained the flag ✗.

Smoking gun

On 2026-05-11 MorningEnrich emitted two ERROR-level logs from weekly_collector:

  1. "Pre-MorningEnrich constituents refresh failed" (RuntimeError sector-mapping-incomplete traceback)
  2. "Weekly collection finished with non-ok status=failed — exiting 1 to halt the pipeline. Per-collector statuses: {'constituents_preflight': 'error'}"

Neither fired flow-doctor. Zero flow-doctor entries exist in the 624-entry S3 changelog corpus since the s3 sink was wired ~2026-05-01 — same regression hiding the past ~2 weeks of silent MorningEnrich failures.

Test plan

  • pytest tests/test_sf_ssm_pipefail_wiring.py — 4/4 pass (3 existing + 1 new)
  • Full suite: pytest tests/ — 715 passed, 1 skipped
  • JSON syntax verified

Stacked on #209

This PR's diff is on top of #209 (set -eo pipefail). If #209 merges first, this rebases cleanly. Both touch the same SF JSON and the same wiring test file. Recommended merge order: #207 → #208 → #209 → this PR (#210).

Scope deliberately narrow

Saturday + EOD SF SSM command arrays NOT modified here. Those run on different instances (Saturday spot, dashboard) where module-side flow-doctor wiring may not be uniformly present — setting FLOW_DOCTOR_ENABLED=1 for a module whose setup_logging call doesn't pass flow_doctor_yaml raises RuntimeError at startup. A separate audit PR can extend coverage once each module's setup_logging call site is confirmed.
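The failure mode that motivates the narrow scope can be sketched like this. The function name resolve_flow_doctor_config and the error text are hypothetical; the only behaviour taken from this PR is that enabling the flag without a flow_doctor_yaml argument raises RuntimeError at startup.

```python
def resolve_flow_doctor_config(env, flow_doctor_yaml=None):
    """Return the YAML path to use, or None when flow-doctor is disabled.

    Raises RuntimeError when the flag is on but the call site passed no
    flow_doctor_yaml, mirroring the startup failure described above.
    """
    if env.get("FLOW_DOCTOR_ENABLED") != "1":
        return None  # flag off: flow-doctor is never wired up
    if flow_doctor_yaml is None:
        raise RuntimeError(
            "FLOW_DOCTOR_ENABLED=1 but no flow_doctor_yaml was passed"
        )
    return flow_doctor_yaml
```

Under this model, exporting the flag in a Saturday or EOD command array would turn a module with an unaudited call site from quietly unescalated into failing at startup, which is why each call site is checked first.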

🤖 Generated with Claude Code

`alpha_engine_lib.logging.setup_logging` only attaches FlowDoctorHandler
when FLOW_DOCTOR_ENABLED=1 is set in env. Otherwise ERROR-level logs
go only to stdout — they never enter flow-doctor's dispatch pipeline
(email + GitHub issue + S3 changelog).

The trading instance's `/home/ec2-user/.alpha-engine.env` has
ANTHROPIC_API_KEY + FLOW_DOCTOR_GITHUB_TOKEN but does NOT have
FLOW_DOCTOR_ENABLED. This was a silent regression from the
2026-05-05 systemd→SSM migration: the disabled-but-retained systemd
units (`alpha-engine-morning.service`, `alpha-engine-daemon.service`)
had the flag baked in via `Environment=FLOW_DOCTOR_ENABLED=1`. SSM
`RunShellScript` only sources `.alpha-engine.env`, which never gained
the var. The daemon's systemd unit still applies (RunDaemon =
`systemctl restart …`), but MorningEnrich + RunMorningPlanner run as
direct SSM commands and ran without flow-doctor.

On 2026-05-11 MorningEnrich emitted two ERROR logs from
weekly_collector ('Pre-MorningEnrich constituents refresh failed' +
'Weekly collection finished with non-ok status=failed') and
flow-doctor never escalated. Zero flow-doctor entries exist in the
624-entry S3 changelog corpus since the s3 sink was wired ~2026-05-01,
which is the same regression hiding the past ~2 weeks of silent
MorningEnrich failures.

Fix: `export FLOW_DOCTOR_ENABLED=1` as the second line of every SSM
command array in step_function_daily.json (right after `set -eo
pipefail`). Keeps the contract version-controlled and survives
env-file drift or instance rebuilds.

New wiring test pins this. Saturday + EOD SFs not modified here —
they run on different instances with different module wiring; audit
deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 force-pushed the fix/sf-flow-doctor-enabled-on-ssm-commands branch from 567d0fd to b4aab91 on May 11, 2026 14:12
cipher813 merged commit 2fb8200 into main on May 11, 2026
1 check passed
cipher813 deleted the fix/sf-flow-doctor-enabled-on-ssm-commands branch on May 11, 2026 14:14