Split per-run activity + early biosim_run_id DB write (convergence PR2.5) by jcschaff · Pull Request #52 · biosimulations/platform

jcschaff · 2026-06-04T14:15:26Z

PR2.5 of docs/simulation-runs-convergence-plan.md — the follow-up the original PR2 plan called for, finally landed. See docs/workflows-architecture.md Shape C for the diagram.

Summary

The monolithic submit_biosim_simulation_run_activity bundled submit + poll + HDF5 + save in one activity, so OmexSimWorkflow could not see biosimulations_run_id until the whole thing returned. PR2.5 splits it into two activities and uses the gap between them to write the run id to the database early, via the already-canonical Temporal pattern: an activity writing to MongoDB.

Activity split (in `biosim_runs/activities.py`)

submit_biosim_run_activity — cache-check + submit; returns biosimulations_run_id immediately, plus the full cached_workflow_run on cache hit (so the workflow can skip poll).
poll_biosim_run_activity — poll until terminal + HDF5 (with 404-lag retries) + save BiosimulatorWorkflowRun; returns the saved record.

Workflow wiring

OmexSimWorkflowInput gains optional submission_run_id: str | None = None. When set (only SimulationRunWorkflow sets it; the verify path leaves it None), the child fires update_run_status_activity right after submit, writing biosimulations_run_id onto the matching SimulationRunRecord.
OmexSimWorkflowOutput gains biosimulations_run_id (set after submit; exposed via the workflow's existing @workflow.query). Set even when the run later ends FAILED, so callers see the id either way.
UpdateRunStatusInput.status is now Optional — the early write carries only the run id; the terminal write carries status.
Worker + test fixtures: register the split pair in place of the retired monolithic.

Hybrid status read

GET /simulations/{processing_id} now reads SimulationRunRecords for the processing_id and fills biosimulations_run_id per job when the workflow query lacks it. Workflow query owns live status; DB owns the early run id metadata. Non-fatal on DB failure.

Verify path: untouched semantics

The verify path (OmexVerifyWorkflow → OmexSimWorkflow → activities) passes submission_run_id=None, so the early DB write is skipped — verification behavior is byte-equivalent. The submit/poll split is functionally equivalent to the old monolithic end-to-end.

Testing

ruff + mypy --strict clean.
All simulations unit tests pass (router, runs-query, parsing).
integration_local/test_simulations_run.py (the mock-backed e2e) passes — exercises the new submit/poll split + early write end-to-end.
Full pytest -m "not integration": 76 passed, 3 failed. The 3 failures are all pre-existing/environmental, not PR2.5 regressions:
- test_slurm_service::test_slurm_job_submit — UnboundLocalError, needs SSH creds.
- test_sim_workflow — real-network simdata 404 (passes on retry, also failed on main earlier).
- test_omex_verify_workflow_GCS — real-network biosimulations.org currently returning FAILED for fresh submissions (verified the integration-marked sibling test_simulation_run_workflow fails on main today with the same pattern).
PR2.5's verify-path code is structurally equivalent to main's monolithic (no submission_run_id → no early write; submit/poll have the same semantics).

🤖 Generated with Claude Code

…(PR2.5) The monolithic submit_biosim_simulation_run_activity bundled submit + poll + HDF5 + save in one call, so OmexSimWorkflow could not see biosimulations_run_id until the whole thing returned. Split it into: - submit_biosim_run_activity: cache-check + submit, returns biosim_run_id immediately (plus cached_workflow_run on cache hit). - poll_biosim_run_activity: poll + HDF5 + save BiosimulatorWorkflowRun. This gives the workflow a moment to fire an activity-based DB write between submit and poll. OmexSimWorkflowInput gained optional submission_run_id; when set, the child calls update_run_status_activity right after submit to write biosimulations_run_id onto the SimulationRunRecord -- well before poll completes. SimulationRunWorkflow now passes each job_id as submission_run_id; the verification path leaves it None so its behavior is unchanged. OmexSimWorkflowOutput now exposes biosimulations_run_id (set after submit) on its @workflow.query, so callers see the id even when the run later fails. The status endpoint GET /simulations/{processing_id} does a hybrid read: workflow query for live status, SimulationRunRecord for biosim_run_id. UpdateRunStatusInput.status is now Optional (the early write carries only the run id; the terminal write carries status). worker_main and temporal_fixtures now register submit_biosim_run_activity + poll_biosim_run_activity in place of the retired monolithic. Docs updated: workflows-architecture.md Shape C marked implemented; simulation-runs-convergence-plan.md PR2 deviation section records the follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- SubmitBiosimRunActivityOutput: drop the redundant `cached` flag (it was always == `cached_workflow_run is not None`). Promote `biosimulations_run_id` from `Optional[str]` to `str` since the activity always returns one or raises. - biosim_runs/workflows.py: drop the now-unreachable `if biosimulations_run_id is None` guard; simplify the cache-hit check to `if cached_workflow_run is not None`. Inline the `_submit` and `_poll` private helpers into OmexSimWorkflow.run -- they were near-identical thin wrappers whose try/except only re-logged what the activity itself already logs. - biosim_runs/activities.py: idiomatic `cached = cached_runs[0] if cached_runs else None` instead of `len(cached_runs) > 0 and cached_runs[0]...`. - UpdateRunStatusInput: add a model_validator requiring at least one of `status` or `biosimulations_run_id` is set, so degenerate all-None calls fail loudly instead of silently bumping only `updated`. - simulations/router.py: collapse the hybrid-merge two-pass into one -- `id_by_run.get(job.job_id)` replaces the `in id_by_run` membership check plus the None-filter in the dict comprehension. No behavior change beyond the new validator surface. ruff + mypy clean; 27 targeted tests pass (router, runs_query, biosim_service parse, integration_local e2e). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Top correctness finding from the code review of PR #52: the router started SimulationRunWorkflow BEFORE inserting the per-job SimulationRunRecord rows. A child OmexSimWorkflow that hit the cache could fire its early update_run_status_activity in ~milliseconds -- and since update_simulation_run uses Mongo update_one with no upsert, the call would silently match zero documents and the biosimulations_run_id would be lost. The whole point of the PR2.5 split (early DB visibility) was silently broken on cache-hit paths. Fix: invert the order in run_simulations -- check Temporal availability first (503 early if down), then insert all CREATED rows, then start the workflow. If start_workflow raises after inserts succeed, mark each row FAILED in a best-effort cleanup so they don't linger as CREATED forever (no workflow will ever move them out of that state). Return 503 to the caller. Tests: two new router tests - test_run_simulations_inserts_before_starting_workflow: asserts the call order is ["insert", "insert", "start"] for a 2-simulator submission. - test_run_simulations_marks_records_failed_when_start_workflow_raises: makes start_workflow raise; asserts both records were inserted and then updated with status="FAILED". ruff + mypy clean; 27 simulations/integration_local tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#4 (in-flight SimulationRunWorkflow loses biosim_run_id on replay): SimulationRunWorkflow now falls back to the nested output.biosimulator_workflow_run.biosim_run.id when the new output.biosimulations_run_id field is None. Child OmexSimWorkflowOutput payloads recorded under pre-PR2.5 code lack the new top-level field; pydantic defaults it to None on replay. The fallback uses the canonical id from the historical BiosimulatorWorkflowRun so the parent still records the external run id even mid-deploy. #3 (Temporal NonDeterministicWorkflowError for in-flight OmexSimWorkflow): The activity-sequence change inside OmexSimWorkflow (1 monolithic -> 2 split + optional 1 early-write) is structurally incompatible with histories recorded by the old code. An @activity.defn(name=...) alias can't fix it -- the *number* of scheduled activities also differs. The Temporal-correct workflow.patched() pattern is deferred (substantial churn for a one-time transition); the pragmatic mitigation is a worker drain before rollout. Docs: - docs/workflows-architecture.md gains a "Deploy considerations" section explaining the NDE risk, the drain procedure, the deferred workflow.patched alternative, and what replay-safety nets the PR2.5 code itself provides. - backend/CLAUDE.md's Deploy section gets a callout pointing to it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a "Known issues from the PR #52 code review" subsection to docs/workflows-architecture.md's Open items, cataloguing the 8 medium / lower-severity findings not addressed by PR #53. Each entry names the trigger, impact, and a concrete fix sketch so future work has a starting point. Also retires the "Hybrid status read for PR2.5" open item since that fix shipped as part of PR #53 itself. Medium: #13 (initial get_sim_run 404), #8 (submit timeout), #6 (HDF5 retry off-by-one). Lower: #7 (cache-hit stale workflow_id), #15 (updated bump on replay), #11 (lazy ImportError not caught), #12 (CancelledError captured), #9 (length=100 cap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…findings Cleanup: four high-severity findings from PR #52 review

jcschaff and others added 4 commits June 4, 2026 10:14

jcschaff merged commit a61f6b3 into main Jun 4, 2026
5 checks passed

jcschaff deleted the feature/runs-split-activity-early-runid branch June 4, 2026 16:38

jcschaff mentioned this pull request Jun 4, 2026

Cleanup: four high-severity findings from PR #52 review #53

Merged

jcschaff added a commit that referenced this pull request Jun 5, 2026

Merge pull request #53 from biosimulations/feature/runs-cleanup-high-…

96ee556

…findings Cleanup: four high-severity findings from PR #52 review

This was referenced Jun 5, 2026

Docs: post-PR #53 status + frontend wire-up reality on main #54

Merged

Bump backend to 0.5.0 and repoint prod overlays #56

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split per-run activity + early biosim_run_id DB write (convergence PR2.5)#52

Split per-run activity + early biosim_run_id DB write (convergence PR2.5)#52
jcschaff merged 4 commits into
mainfrom
feature/runs-split-activity-early-runid

jcschaff commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jcschaff commented Jun 4, 2026

Summary

Activity split (in biosim_runs/activities.py)

Workflow wiring

Hybrid status read

Verify path: untouched semantics

Testing

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Activity split (in `biosim_runs/activities.py`)