Avoid psutil lookup race when starting supervised task subprocesses by DaveT1991 · Pull Request #64995 · apache/airflow

DaveT1991 · 2026-04-10T04:46:12Z

Summary

Avoid a startup race in the task supervisor when starting forked task subprocesses.

WatchedSubprocess.start() currently wraps the freshly forked child with psutil.Process(pid) immediately. On some systems this can raise psutil.NoSuchProcess even though the child is the direct supervised process, which aborts task startup before the task instance is marked running and later shows up as a queued/failed mismatch.

This change uses a lightweight handle backed by os.waitpid/os.kill for directly forked children instead of performing an immediate psutil lookup, and adds a regression test to cover that startup path.
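For readers who want a picture of what such a handle could look like, here is a minimal sketch. It is not the code from this PR: the class name `ForkedChildHandle` and its methods are hypothetical, and it assumes a POSIX system and a child forked directly by the calling process.

```python
from __future__ import annotations

import os


class ForkedChildHandle:
    """Hypothetical minimal handle for a child that this process forked itself."""

    def __init__(self, pid: int) -> None:
        self.pid = pid
        self.returncode: int | None = None

    def poll(self) -> int | None:
        """Return the exit code if the child has exited, else None (non-blocking)."""
        if self.returncode is None:
            try:
                reaped_pid, status = os.waitpid(self.pid, os.WNOHANG)
            except ChildProcessError:
                # Someone else already reaped the child; the real status is unknown.
                self.returncode = -1
                return self.returncode
            if reaped_pid == self.pid:
                self.returncode = os.waitstatus_to_exitcode(status)
        return self.returncode

    def wait(self) -> int:
        """Block until the child exits and return its exit code."""
        if self.returncode is None:
            _, status = os.waitpid(self.pid, 0)
            self.returncode = os.waitstatus_to_exitcode(status)
        return self.returncode

    def send_signal(self, sig: int) -> None:
        # Only signal while the child is unreaped: until we wait() on it,
        # the kernel cannot hand its pid to another process.
        if self.poll() is None:
            os.kill(self.pid, sig)
```

The design relies on the parent being the only process that reaps the child. Unlike psutil, it has no way to notice pid reuse once the child has been reaped elsewhere.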

Closes #64974

Testing

Added regression coverage in task-sdk/tests/task_sdk/execution_time/test_supervisor.py
git diff --check

I could not run the Python test suite locally in this environment because python is not installed here.

potiuk · 2026-04-10T10:01:12Z

Nice. @kaxil @ashb @amoghrajesh -> that looks indeed plausible and the solution looks good as well. WDYT? That would explain a flurry of random task instance status not matching actually running tasks - there were quite a few reports about that one.

ashb

Mmmm I don't like the look of this -- the other advantage psutil has is pid cycle/reuse detection.

Let me examine the issue too
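As background on what pid reuse detection involves, the general idea is to remember when the process started and to treat a pid with a different start time as a different process. The sketch below is Linux-only and is not psutil's implementation; the function names are hypothetical, and it reads field 22 (`starttime`) of `/proc/<pid>/stat`.

```python
from __future__ import annotations

import os


def read_start_ticks(pid: int, procfs_path: str = "/proc") -> int:
    """Return the process start time (clock ticks since boot) from <procfs>/<pid>/stat."""
    with open(f"{procfs_path}/{pid}/stat") as f:
        data = f.read()
    # The command name (field 2) is parenthesised and may contain spaces, so
    # split after the last ')'; what follows begins at field 3.
    fields_from_3 = data.rsplit(")", 1)[1].split()
    return int(fields_from_3[19])  # field 22: starttime


def is_same_process(pid: int, expected_start_ticks: int) -> bool:
    """True only if pid exists and still has the start time recorded earlier."""
    try:
        return read_start_ticks(pid) == expected_start_ticks
    except (FileNotFoundError, ProcessLookupError):
        return False
```

A supervisor would record the start time once, right after the fork, and call `is_same_process` before sending any signal.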

ashb

I am not convinced this is the error -- given that psutil reads from the /proc filesystem on Linux, and that comes directly from kernel memory, my understanding is that this cannot be racy. Once the process pid exists, its /proc entry is updated and available.

I'm going to need more evidence that this is actually a race. We used psutil because it's safer than just killing a pid.

potiuk · 2026-04-10T10:36:48Z

> I'm going to need more evidence that this is actually a race. We used psutil because it's safer than just killing a pid.

Yes. From a quick search, I saw similar issues raised and similar races reported as happening at scale. We likely need to look closer, including at how similar issues were solved elsewhere.

But the issue is real - we have seen at least a few reports from our users that in Airflow 3 there are tasks reported as failed that are actually still running in the background - which would definitely match the pattern described here.

potiuk · 2026-04-10T10:40:24Z

And yes - I agree that solution likely could be better if we find that this kind of race is really the root cause.

ashb · 2026-04-10T10:50:47Z

https://github.com/giampaolo/psutil/blob/v7.1.0/psutil/_pslinux.py#L1645-L1651:

            self._raise_if_zombie()
            # /proc/PID directory may still exist, but the files within
            # it may not, indicating the process is gone, see:
            # https://github.com/giampaolo/psutil/issues/2418
            if not os.path.exists(f"{self._procfs_path}/{pid}/stat"):
                raise NoSuchProcess(pid, name) from err
            raise

I think this is where the exception is coming from.

I wonder if this could be caused by something in kube/containerd, such as a procfs proxy causing a delay. I couldn't find any other reports about this.

If that is the case I think I'd rather we do something like

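# psutil raises NoSuchProcess when the pid cannot be found
try:
    proc = psutil.Process(pid)
except psutil.NoSuchProcess:
    # Give procfs a moment to catch up, then retry once
    time.sleep(0.1)
    proc = psutil.Process(pid)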

However I'm still skeptical this is actually what's happening. I can find no other reports anywhere saying a pid can exist without /proc/pid/stat existing. If this kind of race was possible we can't be the first to hit it.
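One way to gather the evidence asked for here is a small diagnostic that forks repeatedly and checks whether the `stat` file is visible as soon as `fork()` returns. This is only a sketch: the function name and iteration count are arbitrary, and a zero count on one machine does not rule out the behaviour on a different kernel or container runtime.

```python
import os
import signal
import time


def count_missing_stat_after_fork(iterations: int = 50, procfs_path: str = "/proc") -> int:
    """Fork repeatedly; count how often <procfs>/<pid>/stat is absent right after fork()."""
    missing = 0
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            # Child: stay alive until the parent kills it.
            try:
                time.sleep(60)
            finally:
                os._exit(0)
        try:
            if not os.path.exists(f"{procfs_path}/{pid}/stat"):
                missing += 1
        finally:
            # Always clean up the child so no sleeping processes are left behind.
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
    return missing
```

Running this inside the affected environment (for example, the same Kubernetes pod image) would show whether the lookup can really miss a live child there.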

ashb · 2026-04-10T11:13:05Z

What might be worth doing is catching the error and then reading the data from the stdout/stderr sockets when it does occur -- it's possible something is written on one of those.

However even then, if the process has died its procfs entry should stay around until something wait()s it!
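The point about exited children lingering until they are reaped can be illustrated with a short POSIX sketch. The helper names are hypothetical; `os.waitid` with `WNOWAIT` is used only to wait for the child to exit without reaping it, so the zombie state can be observed.

```python
from __future__ import annotations

import os


def pid_exists(pid: int) -> bool:
    """Signal 0 performs the existence check without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the pid exists but belongs to another user
    return True


def visible_before_and_after_reap() -> tuple[bool, bool]:
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately
    # Block until the child has exited, but leave it unreaped (a zombie).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before = pid_exists(pid)
    os.waitpid(pid, 0)  # reap: the kernel may now release the pid
    after = pid_exists(pid)
    return before, after
```

On a typical Linux system the first value is true and the second is false, barring immediate pid reuse.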

potiuk · 2026-04-10T12:18:36Z

However I'm still skeptical this is actually what's happening. I can find no other reports anywhere saying a pid can exist without /proc/pid/stat existing. If this kind of race was possible we can't be the first to hit it.

I am also looking for those.

DaveT1991 · 2026-04-10T13:27:15Z

Thanks for the feedback @ashb. I've updated the approach based on your suggestion — instead of replacing psutil, the new version simply retries psutil.Process(pid) after a 100ms sleep when NoSuchProcess is raised, which preserves pid-reuse detection.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot · 2026-04-10T23:30:44Z

task-sdk/tests/task_sdk/execution_time/test_supervisor.py

+        monkeypatch.setattr("airflow.sdk.execution_time.supervisor.psutil.Process", flaky_process)
+        monkeypatch.setattr("airflow.sdk.execution_time.supervisor.time.sleep", lambda _: None)


The monkeypatch targets here use a dotted-string that looks like a nested attribute path ("airflow.sdk.execution_time.supervisor.psutil.Process" / ".time.sleep"). With pytest's monkeypatch, the string form imports everything up to the last dot as a module, so this will try to import airflow.sdk.execution_time.supervisor.psutil and airflow.sdk.execution_time.supervisor.time (which don’t exist) and the test will error. Patch the attributes on the already-imported supervisor module objects instead (e.g. import airflow.sdk.execution_time.supervisor and set supervisor.psutil.Process / supervisor.time.sleep).
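Whether or not the dotted-string form resolves in a given pytest version, patching the attribute on an already-imported object sidesteps the question. The sketch below shows that object-based style with the stdlib `unittest.mock.patch.object` rather than pytest's `monkeypatch` fixture. The `supervisor` namespace and `start_with_backoff` are stand-ins invented for the example, not the real module.

```python
from __future__ import annotations

import time
import types
from unittest import mock

# Hypothetical stand-in for the imported supervisor module, which does `import time`.
supervisor = types.SimpleNamespace(time=time)


def start_with_backoff() -> str:
    """Stand-in for code under test that sleeps before retrying."""
    supervisor.time.sleep(0.1)
    return "started"


def run_without_sleeping() -> list[float]:
    """Run the code under test with sleep replaced by a recorder."""
    recorded: list[float] = []
    # supervisor.time is the shared `time` module object, so this swaps
    # time.sleep everywhere for the duration of the with-block only.
    with mock.patch.object(supervisor.time, "sleep", recorded.append):
        start_with_backoff()
    return recorded
```

With pytest, the equivalent object-based call would be `monkeypatch.setattr(supervisor.time, "sleep", lambda _: None)`.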

Copilot · 2026-04-10T23:30:44Z

task-sdk/src/airflow/sdk/execution_time/supervisor.py

+def _psutil_process(pid: int) -> psutil.Process:
+    try:
+        return psutil.Process(pid)
+    except psutil.NoSuchProcess:
+        time.sleep(0.1)
+        return psutil.Process(pid)


The PR description says this change replaces the immediate psutil lookup with an os.waitpid/os.kill-backed lightweight handle, but the implementation here still wraps the child PID in psutil.Process (with a retry). Either update the PR description to match the actual approach, or implement the described non-psutil handle so reviewers/users aren’t misled about the behavior and dependencies.

Copilot · 2026-04-10T23:30:44Z

task-sdk/src/airflow/sdk/execution_time/supervisor.py

+    except psutil.NoSuchProcess:
+        time.sleep(0.1)
+        return psutil.Process(pid)


time.sleep(0.1) is a new hard-coded retry delay. Given this module already centralizes timing knobs as constants (e.g. MIN_HEARTBEAT_INTERVAL, SOCKET_CLEANUP_TIMEOUT), consider extracting this delay (and possibly retry count/timeout) into a named constant or config-backed value so it’s easier to tune and reason about if the race needs longer retries on some platforms.
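A sketch of what that extraction could look like follows. The constant names and the `lookup_with_retry` helper are hypothetical and do not exist in `supervisor.py`. The lookup callable and exception type are passed in so the example runs without psutil installed.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical tuning knobs; the names are illustrative only.
PROCESS_LOOKUP_RETRIES = 3
PROCESS_LOOKUP_RETRY_DELAY = 0.1  # seconds


def lookup_with_retry(
    lookup: Callable[[int], T],
    pid: int,
    not_found: type[BaseException],
    *,
    retries: int = PROCESS_LOOKUP_RETRIES,
    delay: float = PROCESS_LOOKUP_RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call lookup(pid), retrying up to `retries` times when `not_found` is raised."""
    for attempt in range(retries + 1):
        try:
            return lookup(pid)
        except not_found:
            if attempt == retries:
                raise  # out of retries: surface the original error
            sleep(delay)
    raise AssertionError("unreachable")
```

With psutil available, the call would be `lookup_with_retry(psutil.Process, pid, psutil.NoSuchProcess)`. Injecting `sleep` also lets tests avoid patching `time.sleep` at all.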

kaxil · 2026-04-11T00:30:44Z

When you resolve a Copilot comment, either reply explaining why it is invalid or push a commit that fixes it.

ashb · 2026-04-13T12:40:06Z

Note however: I am still very very skeptical this is the fix

Handle supervisor fork startup race

6b77452

DaveT1991 requested review from amoghrajesh, ashb and kaxil as code owners April 10, 2026 04:46

boring-cyborg bot added the area:task-sdk label Apr 10, 2026

ashb reviewed Apr 10, 2026

ashb requested changes Apr 10, 2026

Address review: retry psutil.Process on NoSuchProcess instead of replacing psutil

1511772

DaveT1991 requested a review from ashb April 10, 2026 17:19

kaxil requested a review from Copilot April 10, 2026 19:55

Copilot AI reviewed Apr 10, 2026

kaxil requested a review from Copilot April 10, 2026 23:26

Copilot started reviewing on behalf of kaxil April 10, 2026 23:27

Copilot AI reviewed Apr 10, 2026

Merge branch 'main' into fix/supervisor-fork-start-race

10a2a0b
