Split per-run activity + early biosim_run_id DB write (convergence PR2.5)#52
Merged
Merged
Conversation
…(PR2.5)
The monolithic submit_biosim_simulation_run_activity bundled submit + poll + HDF5
+ save in one call, so OmexSimWorkflow could not see biosimulations_run_id until
the whole thing returned. Split it into:
- submit_biosim_run_activity: cache-check + submit, returns biosim_run_id
immediately (plus cached_workflow_run on cache hit).
- poll_biosim_run_activity: poll + HDF5 + save BiosimulatorWorkflowRun.
This gives the workflow a moment to fire an activity-based DB write between submit
and poll. OmexSimWorkflowInput gained optional submission_run_id; when set, the
child calls update_run_status_activity right after submit to write
biosimulations_run_id onto the SimulationRunRecord -- well before poll completes.
SimulationRunWorkflow now passes each job_id as submission_run_id; the verification
path leaves it None so its behavior is unchanged.
OmexSimWorkflowOutput now exposes biosimulations_run_id (set after submit) on
its @workflow.query, so callers see the id even when the run later fails. The
status endpoint GET /simulations/{processing_id} does a hybrid read: workflow
query for live status, SimulationRunRecord for biosim_run_id.
UpdateRunStatusInput.status is now Optional (the early write carries only the run
id; the terminal write carries status). worker_main and temporal_fixtures now
register submit_biosim_run_activity + poll_biosim_run_activity in place of the
retired monolithic.
Docs updated: workflows-architecture.md Shape C marked implemented;
simulation-runs-convergence-plan.md PR2 deviation section records the follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- SubmitBiosimRunActivityOutput: drop the redundant `cached` flag (it was always == `cached_workflow_run is not None`). Promote `biosimulations_run_id` from `Optional[str]` to `str` since the activity always returns one or raises. - biosim_runs/workflows.py: drop the now-unreachable `if biosimulations_run_id is None` guard; simplify the cache-hit check to `if cached_workflow_run is not None`. Inline the `_submit` and `_poll` private helpers into OmexSimWorkflow.run -- they were near-identical thin wrappers whose try/except only re-logged what the activity itself already logs. - biosim_runs/activities.py: idiomatic `cached = cached_runs[0] if cached_runs else None` instead of `len(cached_runs) > 0 and cached_runs[0]...`. - UpdateRunStatusInput: add a model_validator requiring at least one of `status` or `biosimulations_run_id` is set, so degenerate all-None calls fail loudly instead of silently bumping only `updated`. - simulations/router.py: collapse the hybrid-merge two-pass into one -- `id_by_run.get(job.job_id)` replaces the `in id_by_run` membership check plus the None-filter in the dict comprehension. No behavior change beyond the new validator surface. ruff + mypy clean; 27 targeted tests pass (router, runs_query, biosim_service parse, integration_local e2e). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top correctness finding from the code review of PR #52: the router started SimulationRunWorkflow BEFORE inserting the per-job SimulationRunRecord rows. A child OmexSimWorkflow that hit the cache could fire its early update_run_status_activity in ~milliseconds -- and since update_simulation_run uses Mongo update_one with no upsert, the call would silently match zero documents and the biosimulations_run_id would be lost. The whole point of the PR2.5 split (early DB visibility) was silently broken on cache-hit paths. Fix: invert the order in run_simulations -- check Temporal availability first (503 early if down), then insert all CREATED rows, then start the workflow. If start_workflow raises after inserts succeed, mark each row FAILED in a best-effort cleanup so they don't linger as CREATED forever (no workflow will ever move them out of that state). Return 503 to the caller. Tests: two new router tests - test_run_simulations_inserts_before_starting_workflow: asserts the call order is ["insert", "insert", "start"] for a 2-simulator submission. - test_run_simulations_marks_records_failed_when_start_workflow_raises: makes start_workflow raise; asserts both records were inserted and then updated with status="FAILED". ruff + mypy clean; 27 simulations/integration_local tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#4 (in-flight SimulationRunWorkflow loses biosim_run_id on replay): SimulationRunWorkflow now falls back to the nested output.biosimulator_workflow_run.biosim_run.id when the new output.biosimulations_run_id field is None. Child OmexSimWorkflowOutput payloads recorded under pre-PR2.5 code lack the new top-level field; pydantic defaults it to None on replay. The fallback uses the canonical id from the historical BiosimulatorWorkflowRun so the parent still records the external run id even mid-deploy. #3 (Temporal NonDeterministicWorkflowError for in-flight OmexSimWorkflow): The activity-sequence change inside OmexSimWorkflow (1 monolithic -> 2 split + optional 1 early-write) is structurally incompatible with histories recorded by the old code. An @activity.defn(name=...) alias can't fix it -- the *number* of scheduled activities also differs. The Temporal-correct workflow.patched() pattern is deferred (substantial churn for a one-time transition); the pragmatic mitigation is a worker drain before rollout. Docs: - docs/workflows-architecture.md gains a "Deploy considerations" section explaining the NDE risk, the drain procedure, the deferred workflow.patched alternative, and what replay-safety nets the PR2.5 code itself provides. - backend/CLAUDE.md's Deploy section gets a callout pointing to it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff
added a commit
that referenced
this pull request
Jun 4, 2026
Adds a "Known issues from the PR #52 code review" subsection to docs/workflows-architecture.md's Open items, cataloguing the 8 medium / lower-severity findings not addressed by PR #53. Each entry names the trigger, impact, and a concrete fix sketch so future work has a starting point. Also retires the "Hybrid status read for PR2.5" open item since that fix shipped as part of PR #53 itself. Medium: #13 (initial get_sim_run 404), #8 (submit timeout), #6 (HDF5 retry off-by-one). Lower: #7 (cache-hit stale workflow_id), #15 (updated bump on replay), #11 (lazy ImportError not caught), #12 (CancelledError captured), #9 (length=100 cap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff
added a commit
that referenced
this pull request
Jun 5, 2026
…findings Cleanup: four high-severity findings from PR #52 review
This was referenced Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR2.5 of
docs/simulation-runs-convergence-plan.md— the follow-up the original PR2 plan called for, finally landed. Seedocs/workflows-architecture.mdShape C for the diagram.Summary
The monolithic
submit_biosim_simulation_run_activitybundled submit + poll + HDF5 + save in one activity, soOmexSimWorkflowcould not seebiosimulations_run_iduntil the whole thing returned. PR2.5 splits it into two activities and uses the gap between them to write the run id to the database early, via the already-canonical Temporal pattern: an activity writing to MongoDB.Activity split (in
biosim_runs/activities.py)submit_biosim_run_activity— cache-check + submit; returnsbiosimulations_run_idimmediately, plus the fullcached_workflow_runon cache hit (so the workflow can skip poll).poll_biosim_run_activity— poll until terminal + HDF5 (with 404-lag retries) + saveBiosimulatorWorkflowRun; returns the saved record.Workflow wiring
OmexSimWorkflowInputgains optionalsubmission_run_id: str | None = None. When set (onlySimulationRunWorkflowsets it; the verify path leaves itNone), the child firesupdate_run_status_activityright after submit, writingbiosimulations_run_idonto the matchingSimulationRunRecord.OmexSimWorkflowOutputgainsbiosimulations_run_id(set after submit; exposed via the workflow's existing@workflow.query). Set even when the run later endsFAILED, so callers see the id either way.UpdateRunStatusInput.statusis nowOptional— the early write carries only the run id; the terminal write carries status.Hybrid status read
GET /simulations/{processing_id}now readsSimulationRunRecords for the processing_id and fillsbiosimulations_run_idper job when the workflow query lacks it. Workflow query owns live status; DB owns the early run id metadata. Non-fatal on DB failure.Verify path: untouched semantics
The verify path (
OmexVerifyWorkflow → OmexSimWorkflow → activities) passessubmission_run_id=None, so the early DB write is skipped — verification behavior is byte-equivalent. The submit/poll split is functionally equivalent to the old monolithic end-to-end.Testing
ruff+mypy --strictclean.integration_local/test_simulations_run.py(the mock-backed e2e) passes — exercises the new submit/poll split + early write end-to-end.pytest -m "not integration": 76 passed, 3 failed. The 3 failures are all pre-existing/environmental, not PR2.5 regressions:test_slurm_service::test_slurm_job_submit—UnboundLocalError, needs SSH creds.test_sim_workflow— real-network simdata 404 (passes on retry, also failed on main earlier).test_omex_verify_workflow_GCS— real-network biosimulations.org currently returningFAILEDfor fresh submissions (verified the integration-marked siblingtest_simulation_run_workflowfails on main today with the same pattern).submission_run_id→ no early write; submit/poll have the same semantics).🤖 Generated with Claude Code