Fix ValueError when supervisor force-closes stuck sockets after timeout #67115

Merged
vatsrahul1001 merged 2 commits into apache:main from boschglobal:bugfix/supervisor-fail-during-socket-cleanup on May 19, 2026

@AutomationDev85 AutomationDev85 commented May 18, 2026

Overview

Some Edge workers intermittently fail with the following error:

2026-05-17T09:39:34.839980Z [error    ] Task execution failed          [airflow.providers.edge3.cli.worker] loc=worker.py:226
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/edge3/cli/worker.py", line 213, in _run_job_via_supervisor
    supervise(
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 2107, in supervise
    exit_code = process.wait()
                ^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 1062, in wait
    self._monitor_subprocess()
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 1127, in _monitor_subprocess
    alive = self._service_subprocess(max_wait_time=max_wait_time) is None
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/supervisor.py", line 791, in _service_subprocess
    events = self.selector.select(timeout=timeout)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/python/lib/python3.12/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: I/O operation on closed epoll object

Analysis showed that once the socket cleanup timeout fires and _cleanup_open_sockets() closes the selector, the monitor loop does not exit. On the next iteration it calls selector.select() on the already-closed epoll object, which raises the ValueError.
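
The failure mode can be reproduced with the standard library alone. The following is a minimal sketch, not the supervisor code: the helper name `select_after_close` is hypothetical, and it assumes a Linux host where `selectors.DefaultSelector` is epoll-backed. On such a host the returned message should match the last line of the traceback; other selector backends may behave differently.

```python
import selectors
import socket


def select_after_close() -> str:
    """Close a selector, then poll it again, as a loop that keeps running would."""
    sel = selectors.DefaultSelector()
    left, right = socket.socketpair()
    try:
        sel.register(left, selectors.EVENT_READ)
        sel.close()  # stands in for the forced cleanup closing the selector
        try:
            sel.select(timeout=0)  # the "next iteration" of the monitor loop
        except ValueError as exc:
            return str(exc)
        return "no error"
    finally:
        left.close()
        right.close()


if __name__ == "__main__":
    print(select_after_close())
```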

Details of change:

  • Add a break to exit the monitor loop immediately after the forced socket cleanup, preventing any further calls to selector.select() on the closed selector.
  • Call _open_sockets.clear() inside _cleanup_open_sockets() to keep the socket registry consistent with the selector state after cleanup.
  • Adapt the unit test accordingly.
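
The shape of the change can be sketched as follows. This is an illustration under stated assumptions, not the actual supervisor implementation. Only the names `_open_sockets` and `_cleanup_open_sockets()` come from the description above. `ToySupervisor`, `track`, `monitor`, `cleanup_timeout`, and `selector_closed` are hypothetical.

```python
import selectors
import socket
import time


class ToySupervisor:
    """Hypothetical stand-in for the supervisor; only the loop shape matters."""

    def __init__(self, cleanup_timeout: float = 0.05) -> None:
        self.selector = selectors.DefaultSelector()
        self._open_sockets: dict[socket.socket, str] = {}
        self.cleanup_timeout = cleanup_timeout  # hypothetical setting
        self.selector_closed = False

    def track(self, sock: socket.socket, name: str) -> None:
        self._open_sockets[sock] = name
        self.selector.register(sock, selectors.EVENT_READ)

    def _cleanup_open_sockets(self) -> None:
        # Force-close the selector and every socket still tracked.
        self.selector.close()
        self.selector_closed = True
        for sock in list(self._open_sockets):
            sock.close()
        # Keep the registry consistent with the (now closed) selector.
        self._open_sockets.clear()

    def monitor(self) -> int:
        """Poll until the cleanup timeout fires; return the number of polls."""
        polls = 0
        deadline = time.monotonic() + self.cleanup_timeout
        while True:
            self.selector.select(timeout=0.01)
            polls += 1
            if time.monotonic() >= deadline:
                self._cleanup_open_sockets()
                break  # without this, the next select() hits a closed selector
        return polls


if __name__ == "__main__":
    a, b = socket.socketpair()
    supervisor = ToySupervisor()
    supervisor.track(a, "stdout")
    print("polls before forced cleanup:", supervisor.monitor())
    b.close()
```

Breaking out right after the cleanup means the loop never reaches another `select()` call, and clearing the registry ensures that any later check of `_open_sockets` does not see sockets that are already closed.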

@jscheffl jscheffl added this to the Airflow 3.2.2 milestone May 18, 2026
@jscheffl jscheffl added the type:bug-fix label May 18, 2026
@AutomationDev85 AutomationDev85 force-pushed the bugfix/supervisor-fail-during-socket-cleanup branch from 7a58932 to 9465ed7 Compare May 19, 2026 06:45
@vatsrahul1001 vatsrahul1001 added the ready for maintainer review and backport-to-v3-2-test labels May 19, 2026

@amoghrajesh amoghrajesh left a comment

Non-blocking nit, otherwise LGTM.

Comment thread task-sdk/tests/task_sdk/execution_time/test_supervisor.py Outdated

ashb commented May 19, 2026

Interesting. I sort of hoped that the stuck socket timeout never fired.

@vatsrahul1001 vatsrahul1001 merged commit 9bb5ff3 into apache:main May 19, 2026
112 of 113 checks passed
@github-actions

Backport successfully created: v3-2-test

Note: As of Merging PRs targeted for Airflow 3.X, the committer who merges the PR is responsible for backporting PRs that are bug fixes (generally speaking) to the maintenance branches.

If in doubt, please ask in the #release-management Slack channel.

| Status | Branch | Result |
|--------|-----------|---------|
|        | v3-2-test | PR Link |

vatsrahul1001 pushed a commit that referenced this pull request May 19, 2026
… after timeout (#67115) (#67162)

* Fix ValueError when supervisor force-closes stuck sockets after timeout

* Improve mock socket spec

---------
(cherry picked from commit 9bb5ff3)

Co-authored-by: AutomationDev85 <96178949+AutomationDev85@users.noreply.github.com>
vatsrahul1001 pushed a commit that referenced this pull request May 20, 2026
… after timeout (#67115) (#67162)

* Fix ValueError when supervisor force-closes stuck sockets after timeout

* Improve mock socket spec

---------
(cherry picked from commit 9bb5ff3)

Co-authored-by: AutomationDev85 <96178949+AutomationDev85@users.noreply.github.com>
vatsrahul1001 pushed a commit that referenced this pull request May 21, 2026
… after timeout (#67115) (#67162)

* Fix ValueError when supervisor force-closes stuck sockets after timeout

* Improve mock socket spec

---------
(cherry picked from commit 9bb5ff3)

Co-authored-by: AutomationDev85 <96178949+AutomationDev85@users.noreply.github.com>

Labels

area:task-sdk, backport-to-v3-2-test, ready for maintainer review, type:bug-fix

5 participants