Skip to content

Split per-run activity + early biosim_run_id DB write (convergence PR2.5)#52

Merged
jcschaff merged 4 commits into
mainfrom
feature/runs-split-activity-early-runid
Jun 4, 2026
Merged

Split per-run activity + early biosim_run_id DB write (convergence PR2.5)#52
jcschaff merged 4 commits into
mainfrom
feature/runs-split-activity-early-runid

Conversation

@jcschaff

@jcschaff jcschaff commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

PR2.5 of docs/simulation-runs-convergence-plan.md — the follow-up the original PR2 plan called for, finally landed. See docs/workflows-architecture.md Shape C for the diagram.

Summary

The monolithic submit_biosim_simulation_run_activity bundled submit + poll + HDF5 + save in one activity, so OmexSimWorkflow could not see biosimulations_run_id until the whole thing returned. PR2.5 splits it into two activities and uses the gap between them to write the run id to the database early, via the already-canonical Temporal pattern: an activity writing to MongoDB.

Activity split (in biosim_runs/activities.py)

  • submit_biosim_run_activity — cache-check + submit; returns biosimulations_run_id immediately, plus the full cached_workflow_run on cache hit (so the workflow can skip poll).
  • poll_biosim_run_activity — poll until terminal + HDF5 (with 404-lag retries) + save BiosimulatorWorkflowRun; returns the saved record.

Workflow wiring

  • OmexSimWorkflowInput gains optional submission_run_id: str | None = None. When set (only SimulationRunWorkflow sets it; the verify path leaves it None), the child fires update_run_status_activity right after submit, writing biosimulations_run_id onto the matching SimulationRunRecord.
  • OmexSimWorkflowOutput gains biosimulations_run_id (set after submit; exposed via the workflow's existing @workflow.query). Set even when the run later ends FAILED, so callers see the id either way.
  • UpdateRunStatusInput.status is now Optional — the early write carries only the run id; the terminal write carries status.
  • Worker + test fixtures: register the split pair in place of the retired monolithic.

Hybrid status read

  • GET /simulations/{processing_id} now reads SimulationRunRecords for the processing_id and fills biosimulations_run_id per job when the workflow query lacks it. Workflow query owns live status; DB owns the early run id metadata. Non-fatal on DB failure.

Verify path: untouched semantics

The verify path (OmexVerifyWorkflow → OmexSimWorkflow → activities) passes submission_run_id=None, so the early DB write is skipped — verification behavior is byte-equivalent. The submit/poll split is functionally equivalent to the old monolithic end-to-end.

Testing

  • ruff + mypy --strict clean.
  • All simulations unit tests pass (router, runs-query, parsing).
  • integration_local/test_simulations_run.py (the mock-backed e2e) passes — exercises the new submit/poll split + early write end-to-end.
  • Full pytest -m "not integration": 76 passed, 3 failed. The 3 failures are all pre-existing/environmental, not PR2.5 regressions:
    • test_slurm_service::test_slurm_job_submitUnboundLocalError, needs SSH creds.
    • test_sim_workflow — real-network simdata 404 (passes on retry, also failed on main earlier).
    • test_omex_verify_workflow_GCS — real-network biosimulations.org currently returning FAILED for fresh submissions (verified the integration-marked sibling test_simulation_run_workflow fails on main today with the same pattern).
  • PR2.5's verify-path code is structurally equivalent to main's monolithic (no submission_run_id → no early write; submit/poll have the same semantics).

🤖 Generated with Claude Code

jcschaff and others added 4 commits June 4, 2026 10:14
…(PR2.5)

The monolithic submit_biosim_simulation_run_activity bundled submit + poll + HDF5
+ save in one call, so OmexSimWorkflow could not see biosimulations_run_id until
the whole thing returned. Split it into:
  - submit_biosim_run_activity: cache-check + submit, returns biosim_run_id
    immediately (plus cached_workflow_run on cache hit).
  - poll_biosim_run_activity: poll + HDF5 + save BiosimulatorWorkflowRun.

This gives the workflow a moment to fire an activity-based DB write between submit
and poll. OmexSimWorkflowInput gained optional submission_run_id; when set, the
child calls update_run_status_activity right after submit to write
biosimulations_run_id onto the SimulationRunRecord -- well before poll completes.
SimulationRunWorkflow now passes each job_id as submission_run_id; the verification
path leaves it None so its behavior is unchanged.

OmexSimWorkflowOutput now exposes biosimulations_run_id (set after submit) on
its @workflow.query, so callers see the id even when the run later fails. The
status endpoint GET /simulations/{processing_id} does a hybrid read: workflow
query for live status, SimulationRunRecord for biosim_run_id.

UpdateRunStatusInput.status is now Optional (the early write carries only the run
id; the terminal write carries status). worker_main and temporal_fixtures now
register submit_biosim_run_activity + poll_biosim_run_activity in place of the
retired monolithic.

Docs updated: workflows-architecture.md Shape C marked implemented;
simulation-runs-convergence-plan.md PR2 deviation section records the follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- SubmitBiosimRunActivityOutput: drop the redundant `cached` flag (it was
  always == `cached_workflow_run is not None`). Promote `biosimulations_run_id`
  from `Optional[str]` to `str` since the activity always returns one or raises.
- biosim_runs/workflows.py: drop the now-unreachable `if biosimulations_run_id
  is None` guard; simplify the cache-hit check to `if cached_workflow_run is
  not None`. Inline the `_submit` and `_poll` private helpers into
  OmexSimWorkflow.run -- they were near-identical thin wrappers whose
  try/except only re-logged what the activity itself already logs.
- biosim_runs/activities.py: idiomatic `cached = cached_runs[0] if cached_runs
  else None` instead of `len(cached_runs) > 0 and cached_runs[0]...`.
- UpdateRunStatusInput: add a model_validator requiring at least one of
  `status` or `biosimulations_run_id` is set, so degenerate all-None calls
  fail loudly instead of silently bumping only `updated`.
- simulations/router.py: collapse the hybrid-merge two-pass into one --
  `id_by_run.get(job.job_id)` replaces the `in id_by_run` membership check
  plus the None-filter in the dict comprehension.

No behavior change beyond the new validator surface. ruff + mypy clean;
27 targeted tests pass (router, runs_query, biosim_service parse,
integration_local e2e).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top correctness finding from the code review of PR #52: the router started
SimulationRunWorkflow BEFORE inserting the per-job SimulationRunRecord rows.
A child OmexSimWorkflow that hit the cache could fire its early
update_run_status_activity in ~milliseconds -- and since update_simulation_run
uses Mongo update_one with no upsert, the call would silently match zero
documents and the biosimulations_run_id would be lost. The whole point of the
PR2.5 split (early DB visibility) was silently broken on cache-hit paths.

Fix: invert the order in run_simulations -- check Temporal availability first
(503 early if down), then insert all CREATED rows, then start the workflow.
If start_workflow raises after inserts succeed, mark each row FAILED in a
best-effort cleanup so they don't linger as CREATED forever (no workflow
will ever move them out of that state). Return 503 to the caller.

Tests: two new router tests
- test_run_simulations_inserts_before_starting_workflow: asserts the call
  order is ["insert", "insert", "start"] for a 2-simulator submission.
- test_run_simulations_marks_records_failed_when_start_workflow_raises:
  makes start_workflow raise; asserts both records were inserted and then
  updated with status="FAILED".

ruff + mypy clean; 27 simulations/integration_local tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#4 (in-flight SimulationRunWorkflow loses biosim_run_id on replay):
SimulationRunWorkflow now falls back to the nested
output.biosimulator_workflow_run.biosim_run.id when the new
output.biosimulations_run_id field is None. Child OmexSimWorkflowOutput
payloads recorded under pre-PR2.5 code lack the new top-level field;
pydantic defaults it to None on replay. The fallback uses the canonical
id from the historical BiosimulatorWorkflowRun so the parent still
records the external run id even mid-deploy.

#3 (Temporal NonDeterministicWorkflowError for in-flight OmexSimWorkflow):
The activity-sequence change inside OmexSimWorkflow (1 monolithic ->
2 split + optional 1 early-write) is structurally incompatible with
histories recorded by the old code. An @activity.defn(name=...) alias
can't fix it -- the *number* of scheduled activities also differs. The
Temporal-correct workflow.patched() pattern is deferred (substantial
churn for a one-time transition); the pragmatic mitigation is a worker
drain before rollout.

Docs:
- docs/workflows-architecture.md gains a "Deploy considerations"
  section explaining the NDE risk, the drain procedure, the deferred
  workflow.patched alternative, and what replay-safety nets the
  PR2.5 code itself provides.
- backend/CLAUDE.md's Deploy section gets a callout pointing to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff jcschaff merged commit a61f6b3 into main Jun 4, 2026
5 checks passed
@jcschaff jcschaff deleted the feature/runs-split-activity-early-runid branch June 4, 2026 16:38
jcschaff added a commit that referenced this pull request Jun 4, 2026
Adds a "Known issues from the PR #52 code review" subsection to
docs/workflows-architecture.md's Open items, cataloguing the 8 medium /
lower-severity findings not addressed by PR #53. Each entry names the
trigger, impact, and a concrete fix sketch so future work has a
starting point.

Also retires the "Hybrid status read for PR2.5" open item since that
fix shipped as part of PR #53 itself.

Medium:  #13 (initial get_sim_run 404), #8 (submit timeout), #6 (HDF5
         retry off-by-one).
Lower:   #7 (cache-hit stale workflow_id), #15 (updated bump on replay),
         #11 (lazy ImportError not caught), #12 (CancelledError captured),
         #9 (length=100 cap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff added a commit that referenced this pull request Jun 5, 2026
…findings

Cleanup: four high-severity findings from PR #52 review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant