Google: Fix CloudComposer Dag-run empty-window success #67052
Reorder the example Dag so each CloudComposerTriggerDAGRunOperator runs before its sibling CloudComposerDAGRunSensor, and have the sensors pull the freshly minted run id from XCom via the new templated composer_dag_run_id field.

The previous "execution_range=[now - 1d, now]" form was evaluated at Dag-parse time -- before the Composer environment existed -- so the empty-window fix from PR apache#67052 would have caused the sensor to wait forever, the same way it did when PR apache#61046 was reverted.

Adds a defensive timeout on the Dag-run sensors so any future regression back into the windowed code path fails CI fast instead of hanging the worker.

Leaves the external-task sensors on the legacy windowed path deliberately; the analogous bug in _check_task_instances_states is tracked at apache#67051, and the inline note flags that the example must be restructured at the same time when that issue lands.

Adds unit coverage for: the new template_fields entry, Jinja rendering of an xcom_pull template, run-id-path success/wait/precedence in poke(), a parse-time source guard against re-introducing datetime.now(), and the matching trigger paths (run-id branch precedence + polling).

Ships a local simulation harness (dev/simulate_composer_system_test.py) that mocks all GCP surfaces and drives the example via dag.test() end-to-end so contributors without GCP credentials can reproduce the proof.
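The parse-time pitfall described in this commit message can be illustrated with a small, Airflow-free sketch. The helper names (`build_window`, `in_window`) and the timestamps are hypothetical stand-ins, not the provider's API; the half-hour gap is taken from the discussion in this PR. The only point is that a window computed when the Dag file is parsed is frozen, so a run created later can never fall inside it.

```python
from datetime import datetime, timedelta, timezone


def build_window(now):
    """Mimic execution_range=[now - 1d, now], computed once at parse time."""
    return (now - timedelta(days=1), now)


def in_window(run_time, window):
    """Return True when run_time lies inside the closed window."""
    start, end = window
    return start <= run_time <= end


# Hypothetical timeline: the Dag file is parsed at 12:00 UTC and the
# Composer Dag run is only triggered about half an hour later.
parse_time = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
trigger_time = parse_time + timedelta(minutes=30)

# The window is frozen at parse time, so its upper bound is already in
# the past by the time the run exists.
frozen_window = build_window(parse_time)
run_outside_window = not in_window(trigger_time, frozen_window)
```

Under these assumptions the triggered run falls outside the frozen window, so a sensor that (correctly) refuses to succeed on an empty window would keep waiting. Binding the sensor to the run id produced at run time avoids the problem without touching the window logic.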
Hi @shahar1 -- addressed the system-test concern.
Why PR #61046 had to be reverted (revert commit 14995f1):
The example patch in the earlier revision of this PR tried to side-step the issue with execution_range=[datetime.now() - timedelta(1), datetime.now()], but that's evaluated at Dag-parse time -- roughly half an hour before the sensor pokes and before the Composer env exists. Same hang mode, just hidden.
What this commit changes:
Verification I can run locally without GCP credentials:
Simulation result: Both the sync dag_run_sensor and the deferred defer_dag_run_sensor matched on the trigger ops' run id via the composer_dag_run_id branch.
@VladaZakharova @MaksYermak -- would appreciate a confirmation pass against real Composer infra when convenient, since the simulation can only validate the control flow, not the actual REST API contracts.
Drafted-by: Claude Code (Opus 4.7 1M); reviewed by @Vamsi-klu before posting
Force-pushed: 97b5fa6 to 66518ca
Hi @shahar1 @Vamsi-klu I ran the unit tests and local simulation for this PR. Environment:
Unit Test Results:
Local Simulation (
Note: I don't have GCP credentials to run the actual system tests against real Composer infrastructure. The unit tests and local simulation validate the control flow, but real GCP validation is still needed.
OK: dag.test() completed for composer -- final state: success
the sync `dag_run_sensor` and deferred `defer_dag_run_sensor` both
matched on dag_run_id='manual__simulated' via the composer_dag_run_id branch.
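The `composer_dag_run_id` branch that this output refers to can be pictured roughly as follows. This is a sketch under stated assumptions, not the provider's code: `run_matches` and the dict keys (`dag_run_id`, `state`, `execution_date`) are hypothetical, and the precedence of the run id over the window is inferred from the "range-vs-id precedence" tests mentioned in this PR.

```python
from datetime import datetime, timezone


def run_matches(runs, allowed_states, run_id=None, window=None):
    """Decide whether a single poke should report success.

    Assumption: when run_id is given it takes precedence and the
    window is ignored entirely.
    """
    if run_id is not None:
        for run in runs:
            if run["dag_run_id"] == run_id:
                return run["state"] in allowed_states
        # The run is not visible yet: keep waiting.
        return False
    if window is None:
        return False
    start, end = window
    in_window = [r for r in runs if start <= r["execution_date"] <= end]
    # An empty window must not count as success.
    return bool(in_window) and all(r["state"] in allowed_states for r in in_window)


runs = [
    {
        "dag_run_id": "manual__simulated",
        "state": "success",
        "execution_date": datetime(2025, 1, 1, 12, 30, tzinfo=timezone.utc),
    }
]
# A stale window that does not contain the run; it is ignored because
# a run id is supplied.
stale_window = (
    datetime(2024, 12, 31, 12, 0, tzinfo=timezone.utc),
    datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
)
matched = run_matches(runs, {"success"}, run_id="manual__simulated", window=stale_window)
```

With the run id supplied, the stale window no longer matters; without it, the same inputs would keep the sensor waiting.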
@nnguyen168 could you please run the existing System Test for this operator and share a screenshot of the latest run?
When I ask for the system tests to be run, I always mean against a real GCP instance; simulating against a mock isn't helpful (with the possible exception of the few GCP services that have emulators).
Fix `CloudComposerDAGRunSensor` and `CloudComposerDAGRunTrigger` so they only succeed when at least one Dag run falls inside the requested execution window and every in-window run is in an allowed state.

Why this change is needed:

What changes in behavior:
- `composer_dag_run_id` behavior is unchanged

What is covered:
- unit coverage for the `composer_dag_run_id` branch (template_fields, Jinja rendering, range-vs-id precedence, parse-time source guard, trigger branch precedence, trigger polling)
- the example Dag binds each `CloudComposerDAGRunSensor` to the `dag_run_id` minted by an upstream `CloudComposerTriggerDAGRunOperator` (via XCom on the newly templated `composer_dag_run_id` field). The example no longer relies on `execution_range`, which was previously evaluated at parse time before the Composer env existed -- the failure mode that caused PR #61046 (Fixed CloudComposerDAGRunSensor to return False when no runs exist in execution_range) to be reverted
- a local simulation harness (`dev/simulate_composer_system_test.py`) that mocks all GCP surfaces and drives the example via `dag.test()` end-to-end, including the deferred sensor's trigger inline, so contributors without GCP credentials can reproduce the proof

Follow-up:
- the analogous `CloudComposerExternalTaskSensor` and trigger anti-pattern is tracked in #67051 (CloudComposerExternalTaskSensor succeeds when all task instances are outside execution_range); the example's external-task sensors are deliberately left on the legacy windowed path with an inline note so they can be reordered at the same time

closes: #57512
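The success rule stated in this description (at least one run inside the window, and every in-window run in an allowed state) can be sketched in plain Python. The function name `window_succeeds` and the dict keys are hypothetical and chosen only for illustration; the actual sensor and trigger code may be structured differently.

```python
from datetime import datetime, timezone


def window_succeeds(runs, start, end, allowed_states):
    """Return True only for a non-empty window whose runs are all allowed."""
    in_window = [r for r in runs if start <= r["execution_date"] <= end]
    if not in_window:
        # Explicit guard: all() over an empty list is True in Python,
        # which is exactly the vacuous success this rule rules out.
        return False
    return all(r["state"] in allowed_states for r in in_window)


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
end = datetime(2025, 1, 2, tzinfo=timezone.utc)

inside_ok = {"execution_date": datetime(2025, 1, 1, 6, tzinfo=timezone.utc), "state": "success"}
inside_failed = {"execution_date": datetime(2025, 1, 1, 9, tzinfo=timezone.utc), "state": "failed"}
outside_ok = {"execution_date": datetime(2025, 1, 3, tzinfo=timezone.utc), "state": "success"}

empty_window_result = window_succeeds([outside_ok], start, end, {"success"})
mixed_window_result = window_succeeds([inside_ok, inside_failed], start, end, {"success"})
clean_window_result = window_succeeds([inside_ok, outside_ok], start, end, {"success"})
```

Runs outside the window are ignored when judging states, but they also do not count toward the "at least one run" requirement, so a window with only out-of-range runs keeps the sensor waiting.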
Was generative AI tooling used to co-author this PR?
Generated-by: Codex (GPT-5) and Claude Code (Opus 4.7 1M) following the guidelines