Fix in-process execution API startup for triggerer by shaealh · Pull Request #65993 · apache/airflow

shaealh · 2026-04-28T03:40:51Z

The triggerer uses the in-process Execution API through a2wsgi, but the app lifespan startup was scheduled on the background loop without waiting for it to complete. That could let triggerer startup continue while the Execution API was not fully initialized, and any startup failure could be hidden until later request handling.

This change waits for the in-process Execution API lifespan to complete before returning the transport, and surfaces startup timeout/failure immediately.
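
For illustration, here is a minimal standard-library sketch of the difference between scheduling a startup coroutine and waiting for it. `fake_startup` and the background-loop setup are hypothetical stand-ins for the real lifespan and the loop a2wsgi runs; this is not the Airflow code.

```python
import asyncio
import threading

# Hypothetical stand-in for the event loop that runs in a background thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

started = threading.Event()


async def fake_startup(fail: bool = False) -> None:
    """Hypothetical stand-in for the app lifespan startup."""
    await asyncio.sleep(0.05)
    if fail:
        raise RuntimeError("startup failed")
    started.set()


# Scheduling alone returns immediately; startup may still be in progress.
future = asyncio.run_coroutine_threadsafe(fake_startup(), loop)

# Waiting on the returned future blocks the caller until startup completes.
future.result(timeout=5)
assert started.is_set()

# A startup failure is re-raised in the calling thread by .result().
failing = asyncio.run_coroutine_threadsafe(fake_startup(fail=True), loop)
try:
    failing.result(timeout=5)
except RuntimeError as err:
    startup_error = str(err)
```

If `.result()` is never called, the exception stays stored on the future, which is how a startup failure can go unnoticed until a later request.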

Tests:

uv run ruff check airflow-core/src/airflow/api_fastapi/execution_api/app.py airflow-core/tests/unit/api_fastapi/execution_api/test_app.py
AIRFLOW_HOME=/tmp/airflow-65945-test uv run pytest airflow-core/tests/unit/api_fastapi/execution_api/test_app.py -q --with-db-init

kaxil · 2026-04-29T12:25:38Z

+        try:
+            asyncio.run_coroutine_threadsafe(start_lifespan(self._cm, self.app), middleware.loop).result(
+                timeout=30
+            )
+        except TimeoutError as err:
+            raise RuntimeError("Timed out while starting the in-process execution API lifespan") from err


This fix improves observability (loud failure instead of silent hang) but does not actually resolve #65945. The reporter's stack trace shows their custom SlurmSharedFileTaskHandler.format crashing inside svcs.register_factory's log.debug() call -- the lifespan startup raises because of a buggy user log handler. After this PR merges they'll see a clear RuntimeError propagated from transport, but the triggerer will still fail to start; the user-side handler bug is the actual cause.

Two follow-ups worth considering: (a) call this out explicitly in the PR description so the reporter understands what behavior to expect post-merge; (b) think about whether the log.debug call inside the svcs/lifespan setup should be made more defensive (or routed through a logger that bypasses the user task handler) so a single broken user handler doesn't take down triggerer startup.

kaxil · 2026-04-29T12:25:39Z

-        asyncio.run_coroutine_threadsafe(start_lifespan(self._cm, self.app), middleware.loop)
+        try:
+            asyncio.run_coroutine_threadsafe(start_lifespan(self._cm, self.app), middleware.loop).result(
+                timeout=30


Hardcoded magic number. Lift to a module-level constant (e.g. _LIFESPAN_STARTUP_TIMEOUT_SECONDS = 30) so it's discoverable and tunable, and include the value in the RuntimeError message on line 367 so operators have a hint when a slow real-world startup gets clipped (f"Timed out after {_LIFESPAN_STARTUP_TIMEOUT_SECONDS}s ...").

Has 30s been sanity-checked against the slowest realistic deployment (cold filesystems, large dag_bag warmup, Postgres connection setup on a sleepy network)? If not, it might be worth bumping it or making it configurable.
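
A rough sketch of what that could look like, assuming a helper around the existing call. `wait_for_startup` is a hypothetical name, and the constant name is only the one suggested above. The timeout is caught under an alias so the behaviour is the same on Python 3.10.

```python
import asyncio
import threading
from concurrent.futures import TimeoutError as FutureTimeoutError

# Hypothetical module-level constant, named as suggested in the comment above.
_LIFESPAN_STARTUP_TIMEOUT_SECONDS = 30


def wait_for_startup(coro, loop, timeout=_LIFESPAN_STARTUP_TIMEOUT_SECONDS):
    """Run ``coro`` on ``loop`` (running in another thread) and wait for it."""
    future = asyncio.run_coroutine_threadsafe(coro, loop)
    try:
        return future.result(timeout=timeout)
    except FutureTimeoutError as err:
        # Ask the loop to cancel the coroutine that is still pending.
        future.cancel()
        raise RuntimeError(
            f"Timed out after {timeout}s while starting the in-process execution API lifespan"
        ) from err


# Usage: a deliberately slow coroutine with a short timeout.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

try:
    wait_for_startup(asyncio.sleep(1), loop, timeout=0.05)
except RuntimeError as err:
    message = str(err)
```

Passing the timeout as a parameter also lets a test drive the timeout path without waiting the full 30 seconds.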

kaxil · 2026-04-29T12:25:39Z


 import json
 import time
+from concurrent.futures import TimeoutError


This shadows the builtin TimeoutError. In Python 3.11+ they're the same class, but Airflow still supports 3.10 where concurrent.futures.TimeoutError and the builtin are distinct. Anywhere later in this module that does except TimeoutError will now silently mean concurrent.futures.TimeoutError.

Prefer from concurrent.futures import TimeoutError as FutureTimeoutError, or move the import inside transport() since that's the only use site.
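
A small sketch of the aliased import, using only the standard library. `result_or_none` is a hypothetical helper for illustration.

```python
import concurrent.futures
from concurrent.futures import TimeoutError as FutureTimeoutError


def result_or_none(future, timeout):
    """Return the future's result, or None if it is not ready in time."""
    try:
        return future.result(timeout=timeout)
    except FutureTimeoutError:
        return None


# The builtin TimeoutError name is left untouched, so any other
# ``except TimeoutError`` in the module keeps its usual meaning.
pending = concurrent.futures.Future()  # never completed, so result() times out
outcome = result_or_none(pending, timeout=0.01)

done = concurrent.futures.Future()
done.set_result("ready")
ready = result_or_none(done, timeout=0.01)
```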

kaxil · 2026-04-29T12:25:39Z

@@ -358,7 +359,12 @@ async def start_lifespan(cm: AsyncExitStack, app: FastAPI):

        self._cm = AsyncExitStack()


Pre-existing, but visible here: self._cm is set up to register the lifespan's exit handlers, yet I don't see anywhere in InProcessExecutionAPI that calls self._cm.aclose(). That means the lifespan's shutdown code (svcs registry teardown, anything yielded back from registered factories) never runs.

Since you're already touching this block, worth confirming whether the missing teardown is intentional (process exits anyway) or a separate latent bug worth filing.
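
To make the concern concrete, here is a minimal sketch assuming the lifespan is entered through `AsyncExitStack.enter_async_context`. The `lifespan` below is a hypothetical stand-in that only records events.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

events = []


@asynccontextmanager
async def lifespan(app):
    """Hypothetical lifespan that records startup and shutdown."""
    events.append("startup")
    yield
    events.append("shutdown")


async def main():
    cm = AsyncExitStack()
    await cm.enter_async_context(lifespan(app=None))
    # Without this call, the code after ``yield`` never runs.
    await cm.aclose()


asyncio.run(main())
```

With the `aclose()` call removed, `events` would contain only `"startup"`.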

kaxil · 2026-04-29T12:25:39Z

+        api._app = FastAPI(lifespan=lifespan)
+
+        api.transport
+


Stylistic nit: accessing a property purely for its side effect reads like a typo to a future reader. Prefer _ = api.transport or assert api.transport is not None to make the intent explicit.

kaxil · 2026-04-29T12:25:39Z

+    def test_transport_waits_for_lifespan_startup(self):
+        entered_lifespan = threading.Event()
+
+        @asynccontextmanager
+        async def lifespan(app):
+            await asyncio.sleep(0.05)
+            app.state.lifespan_called = True
+            entered_lifespan.set()
+            yield
+
+        api = InProcessExecutionAPI()
+        api._app = FastAPI(lifespan=lifespan)
+
+        api.transport
+
+        assert entered_lifespan.is_set()
+        assert api.app.state.lifespan_called
+
+    def test_transport_surfaces_lifespan_startup_errors(self):
+        @asynccontextmanager
+        async def lifespan(app):
+            raise RuntimeError("lifespan failed")
+            yield
+
+        api = InProcessExecutionAPI()
+        api._app = FastAPI(lifespan=lifespan)
+
+        with pytest.raises(RuntimeError, match="lifespan failed"):
+            api.transport
+


Both tests bypass api.app (the real cached_property that wires up dependency_overrides, dag_bag, and the actual Execution API routes at app.py:325-342) by assigning api._app = FastAPI(lifespan=lifespan) directly. That isolates the timeout/exception logic cleanly, which is fine for a unit test, but it doesn't exercise the real Execution API lifespan -- which is the thing that actually broke in #65945.

Consider also adding an integration-style test that uses the real api.app and asserts startup completes, plus one that injects a slow lifespan to drive the 30s timeout path end-to-end.

wjddn279 · 2026-05-03T07:03:28Z

@shaealh
This does not fix issue #65945. The stack trace the user reported was not produced by an error; it is just stack_info output:

https://github.com/hynek/svcs/blob/1a54b88cc8371f651d0f270c5ff3c21bb2671532/src/svcs/_core.py#L294-L303

The same logs can also be observed when the api_server starts up, and they are not a problem.
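
As a general illustration of the mechanism (standard-library logging only, with a hypothetical logger name, not the svcs code itself), a debug call with `stack_info=True` prints a stack listing even though no exception occurred:

```python
import io
import logging

stream = io.StringIO()
log = logging.getLogger("stack_info_demo")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(logging.StreamHandler(stream))


def register():
    # No exception is active here; stack_info=True only records the call site.
    log.debug("registered factory", stack_info=True)


register()
output = stream.getvalue()
print(output)
```

The captured text starts with the message, followed by a `Stack (most recent call last):` section that looks like a traceback but does not indicate a failure.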

potiuk · 2026-05-11T01:24:25Z

@shaealh — There are 6 unresolved review thread(s) on this PR from @kaxil. Could you either push a fix or reply in each thread explaining why the feedback doesn't apply? Once you believe the feedback is addressed, mark the thread as resolved so the reviewer isn't re-pinged needlessly. Thanks!

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

potiuk · 2026-05-18T09:13:52Z

@shaealh — There are 6 unresolved review thread(s) on this PR from @kaxil. Could you either push a fix or reply in each thread explaining why the feedback doesn't apply? Once you believe the feedback is addressed, mark the thread as resolved so the reviewer isn't re-pinged needlessly. Thanks!

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Fix in-process execution API startup for triggerer

893b38e

boring-cyborg bot added the area:API and area:task-sdk labels Apr 28, 2026

shaealh marked this pull request as ready for review April 28, 2026 03:42

shaealh requested review from amoghrajesh, ashb and kaxil as code owners April 28, 2026 03:42

shaealh mentioned this pull request Apr 28, 2026

Triggerer not starting #65945

Open

2 tasks

kaxil reviewed Apr 29, 2026

shaealh wants to merge 1 commit into apache:main from shaealh:shaealh/65945
