Avoid psutil lookup race when starting supervised task subprocesses #64995

Open
DaveT1991 wants to merge 3 commits into apache:main from DaveT1991:fix/supervisor-fork-start-race

Conversation

@DaveT1991
Contributor

Summary

Avoid a startup race in the task supervisor when starting forked task subprocesses.

WatchedSubprocess.start() currently wraps the freshly forked child with psutil.Process(pid) immediately. On some systems this can raise psutil.NoSuchProcess even though the child is the direct supervised process, which aborts task startup before the task instance is marked running and later shows up as a queued/failed mismatch.

This change uses a lightweight handle backed by os.waitpid/os.kill for directly forked children instead of performing an immediate psutil lookup, and adds a regression test to cover that startup path.
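
For illustration, a handle of the kind described above might look like the following minimal sketch. `ForkedChildHandle` and `run_demo` are hypothetical names, not the classes in this PR, and the sketch assumes a POSIX system with Python 3.9+ (for `os.waitstatus_to_exitcode`). It relies on the fact that a child's pid cannot be recycled until its parent has reaped it, so signalling by pid is only attempted while the child is still unreaped.

```python
import os
import signal
from typing import Optional


class ForkedChildHandle:
    """Hypothetical handle for a child that this process forked itself."""

    def __init__(self, pid: int) -> None:
        self.pid = pid
        self.returncode: Optional[int] = None

    def poll(self) -> Optional[int]:
        """Return the exit code if the child has exited, else None (non-blocking)."""
        if self.returncode is None:
            waited_pid, status = os.waitpid(self.pid, os.WNOHANG)
            if waited_pid == self.pid:
                self.returncode = os.waitstatus_to_exitcode(status)
        return self.returncode

    def wait(self) -> int:
        """Block until the child exits and return its exit code."""
        if self.returncode is None:
            _, status = os.waitpid(self.pid, 0)
            self.returncode = os.waitstatus_to_exitcode(status)
        return self.returncode

    def send_signal(self, sig: int = signal.SIGTERM) -> None:
        """Signal the child only while it is unreaped, so the pid is still ours."""
        if self.returncode is None:
            os.kill(self.pid, sig)


def run_demo() -> int:
    """Fork a child that exits with code 7 and collect it through the handle."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately without running parent cleanup
    return ForkedChildHandle(pid).wait()
```

Such a handle offers none of psutil's extra checks for arbitrary pids; it is only reasonable for a direct child that the caller will reap itself.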

Closes #64974

Testing

  • Added regression coverage in task-sdk/tests/task_sdk/execution_time/test_supervisor.py
  • git diff --check

I could not run the Python test suite locally because Python is not installed in this environment.

@potiuk
Member

potiuk commented Apr 10, 2026

Nice. @kaxil @ashb @amoghrajesh -> that indeed looks plausible, and the solution looks good as well. WDYT? It would explain the flurry of reports where the task instance status does not match tasks that are actually running; there were quite a few of those.

Member

@ashb ashb left a comment

Mmmm I don't like the look of this -- the other advantage psutil has is pid cycle/reuse detection.

Let me examine the issue too

Member

@ashb ashb left a comment

I am not convinced this is the error -- psutil reads from the /proc filesystem on Linux, which comes directly from kernel memory, so my understanding is that this cannot be racy. Once the process pid exists, /proc is updated and available.

I'm going to need more evidence that this is actually a race. We used psutil because it's safer than just killing a pid.

@potiuk
Member

potiuk commented Apr 10, 2026

I'm going to need more evidence that this is actually a race. We used psutil because it's safer than just killing a pid.

Yes. In a quick search I saw similar issues raised and similar races reported at scale; we likely need to look closer and see how similar issues were solved elsewhere.

But the issue is real - we have seen at least a few reports from our users that in Airflow 3 there are tasks reported as failed that are actually still running in the background - which would definitely match the pattern described here.

@potiuk
Member

potiuk commented Apr 10, 2026

And yes - I agree that the solution could likely be better if we find that this kind of race really is the root cause.

@ashb
Member

ashb commented Apr 10, 2026

https://github.com/giampaolo/psutil/blob/v7.1.0/psutil/_pslinux.py#L1645-L1651:

            self._raise_if_zombie()
            # /proc/PID directory may still exist, but the files within
            # it may not, indicating the process is gone, see:
            # https://github.com/giampaolo/psutil/issues/2418
            if not os.path.exists(f"{self._procfs_path}/{pid}/stat"):
                raise NoSuchProcess(pid, name) from err
            raise

I think this is where the exception is coming from.

I wonder if this could be caused by something in kube/containerd, such as a procfs proxy causing a delay. I couldn't find any other reports about this.

If that is the case, I think I'd rather we do something like:

try:
    proc = psutil.Process(pid)
except psutil.NoSuchProcess:
    # Give procfs a moment to catch up, then retry once
    time.sleep(0.1)
    proc = psutil.Process(pid)

However I'm still skeptical this is actually what's happening. I can find no other reports anywhere saying a pid can exist without /proc/pid/stat existing. If this kind of race was possible we can't be the first to hit it.

@ashb
Member

ashb commented Apr 10, 2026

What might be worth doing is catching the error and then reading the data from the stdout/stderr sockets -- it's possible something is written to one of those.

However, even then, if the process has died its procfs entry should stay around until something wait()s it!
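
The expectation that an exited child stays visible until it is waited on can be checked with a short POSIX-only sketch. `probe_exited_child` is a hypothetical helper; it assumes `os.waitid` with `WNOWAIT` is available (as on Linux) and Python 3.9+. The `/proc` check is only meaningful where procfs is mounted, so treat that value as informational.

```python
import os
from typing import Tuple


def probe_exited_child() -> Tuple[bool, bool, int]:
    """Fork a child that exits at once and inspect it before reaping it.

    Returns (signalable_before_reap, proc_stat_exists_before_reap, exit_code).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Block until the child has exited, but leave it unreaped (a zombie).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    try:
        os.kill(pid, 0)  # signal 0 only checks that the pid still exists
        signalable = True
    except ProcessLookupError:
        signalable = False

    # Linux-specific: the procfs entry of an unreaped child.
    proc_stat_exists = os.path.exists(f"/proc/{pid}/stat")

    # Reap the child; only after this can the kernel release the pid.
    _, status = os.waitpid(pid, 0)
    return signalable, proc_stat_exists, os.waitstatus_to_exitcode(status)
```

This only exercises a plain local fork; it says nothing about container runtimes or proxied procfs mounts, which is the open question in this thread.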

@potiuk
Member

potiuk commented Apr 10, 2026

However I'm still skeptical this is actually what's happening. I can find no other reports anywhere saying a pid can exist without /proc/pid/stat existing. If this kind of race was possible we can't be the first to hit it.

I am also looking for those.

@DaveT1991
Contributor Author

Thanks for the feedback @ashb. I've updated the approach based on your suggestion: instead of replacing psutil, the new version retries psutil.Process(pid) once after a 100 ms sleep when NoSuchProcess is raised, which preserves pid-reuse detection.

@DaveT1991 DaveT1991 requested a review from ashb April 10, 2026 17:19
@kaxil kaxil requested a review from Copilot April 10, 2026 19:55
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +556 to +557
monkeypatch.setattr("airflow.sdk.execution_time.supervisor.psutil.Process", flaky_process)
monkeypatch.setattr("airflow.sdk.execution_time.supervisor.time.sleep", lambda _: None)

Copilot AI Apr 10, 2026

The monkeypatch targets here use a dotted-string that looks like a nested attribute path ("airflow.sdk.execution_time.supervisor.psutil.Process" / ".time.sleep"). With pytest's monkeypatch, the string form imports everything up to the last dot as a module, so this will try to import airflow.sdk.execution_time.supervisor.psutil and airflow.sdk.execution_time.supervisor.time (which don’t exist) and the test will error. Patch the attributes on the already-imported supervisor module objects instead (e.g. import airflow.sdk.execution_time.supervisor and set supervisor.psutil.Process / supervisor.time.sleep).

Comment on lines +444 to +449
def _psutil_process(pid: int) -> psutil.Process:
    try:
        return psutil.Process(pid)
    except psutil.NoSuchProcess:
        time.sleep(0.1)
        return psutil.Process(pid)

Copilot AI Apr 10, 2026

The PR description says this change replaces the immediate psutil lookup with an os.waitpid/os.kill-backed lightweight handle, but the implementation here still wraps the child PID in psutil.Process (with a retry). Either update the PR description to match the actual approach, or implement the described non-psutil handle so reviewers/users aren’t misled about the behavior and dependencies.

Comment on lines +447 to +449
    except psutil.NoSuchProcess:
        time.sleep(0.1)
        return psutil.Process(pid)

Copilot AI Apr 10, 2026

time.sleep(0.1) is a new hard-coded retry delay. Given this module already centralizes timing knobs as constants (e.g. MIN_HEARTBEAT_INTERVAL, SOCKET_CLEANUP_TIMEOUT), consider extracting this delay (and possibly retry count/timeout) into a named constant or config-backed value so it’s easier to tune and reason about if the race needs longer retries on some platforms.
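
One way to act on this suggestion is sketched below, under stated assumptions. `PROCESS_LOOKUP_RETRY_DELAY`, `PROCESS_LOOKUP_MAX_ATTEMPTS`, and `lookup_with_retry` are hypothetical names, not constants or helpers from the supervisor module. The lookup callable and the exception type are passed in so the sketch runs without psutil installed.

```python
import time
from typing import Callable, Tuple, Type, TypeVar, Union

T = TypeVar("T")

# Hypothetical tuning knobs; the PR itself hard-codes a single 0.1 s retry.
PROCESS_LOOKUP_RETRY_DELAY = 0.1
PROCESS_LOOKUP_MAX_ATTEMPTS = 2


def lookup_with_retry(
    lookup: Callable[[int], T],
    pid: int,
    retry_on: Union[Type[BaseException], Tuple[Type[BaseException], ...]],
    *,
    attempts: int = PROCESS_LOOKUP_MAX_ATTEMPTS,
    delay: float = PROCESS_LOOKUP_RETRY_DELAY,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call lookup(pid), retrying on retry_on up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return lookup(pid)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay)
    raise AssertionError("unreachable")
```

With psutil available, the call would look like `lookup_with_retry(psutil.Process, pid, psutil.NoSuchProcess)`. Injecting `sleep` lets tests avoid real delays without patching `time.sleep`.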

@kaxil
Member

kaxil commented Apr 11, 2026

When you resolve comments from Copilot, either add a comment explaining why it is invalid or push a commit fixing it.

@ashb
Member

ashb commented Apr 13, 2026

Note however: I am still very very skeptical this is the fix

Development

Successfully merging this pull request may close these issues.

Task instance finished with state failed, but the task instance's state attribute is queued

5 participants