Conversation

@wjszlachta-man
Contributor

What changes were proposed in this pull request?

On glibc-based Linux systems, select() can only monitor file descriptors whose numbers are less than FD_SETSIZE (1024).

This is an unreasonably low limit for many modern applications.

This PR replaces select.select() with select.poll() when running on a POSIX OS.
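
A minimal sketch of the approach (illustrative only, not the exact PR code):

import select

def wait_readable(rfile, timeout_secs=1):
    # select.select() raises ValueError once any monitored fd is >= FD_SETSIZE;
    # select.poll() accepts arbitrary fd numbers.
    if hasattr(select, "poll"):  # poll() is available on POSIX systems
        poller = select.poll()
        poller.register(rfile, select.POLLIN)
        # Note: poll() takes its timeout in milliseconds, select() in seconds.
        return bool(poller.poll(timeout_secs * 1000))
    r, _, _ = select.select([rfile], [], [], timeout_secs)
    return bool(r)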

Why are the changes needed?

When running via pyspark, we frequently observe:

Exception occurred during processing of request from ('127.0.0.1', 46334)
Traceback (most recent call last):
  File "/usr/lib/python3.11/socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.11/socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.11/socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.11/socketserver.py", line 755, in __init__
    self.handle()
  File "/usr/lib/python3.11/site-packages/pyspark/accumulators.py", line 293, in handle
    poll(authenticate_and_accum_updates)
  File "/usr/lib/python3.11/site-packages/pyspark/accumulators.py", line 266, in poll
    r, _, _ = select.select([self.rfile], [], [], 1)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()

On POSIX systems poll() should be used instead of select().
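
A small POSIX-only repro sketch (assumes the hard fd limit permits raising the soft limit above 1100):

import os
import resource
import select
import socket

# The soft fd limit is often 1024; raise it so we can create a high fd number.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))

sock = socket.socket()
high_fd = os.dup2(sock.fileno(), 1100)  # any fd number >= FD_SETSIZE will do
try:
    select.select([high_fd], [], [], 0)
except ValueError as exc:
    print(exc)  # filedescriptor out of range in select()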

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests. We have also been running this change (combined with py4j/py4j#560) on our YARN cluster (Linux) since April 2025.

Was this patch authored or co-authored using generative AI tooling?

No

@wjszlachta-man
Contributor Author

This is identical to #50774 (which was never reviewed and closed by the bot), but rebased against current master.

@wjszlachta-man
Contributor Author

@HyukjinKwon is this something you could maybe review (in combination with py4j/py4j#560)?

We needed to implement this change to allow us to run 1000+ executors without running into the "filedescriptor out of range in select()" error.

@HyukjinKwon
Member

Can we have an environment variable to fall back?

@wjszlachta-man
Contributor Author

@HyukjinKwon you can now use the PYSPARK_FORCE_SELECT environment variable to fall back to select.select().
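
Illustrative gating only (not the exact patch):

import os
import select

# Prefer poll() where it exists, unless PYSPARK_FORCE_SELECT forces the
# legacy select() path.
use_poll = hasattr(select, "poll") and "PYSPARK_FORCE_SELECT" not in os.environ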

@wjszlachta-man
Contributor Author

@HyukjinKwon @gaogaotiantian - I see you now merged #53388

Can you merge a similar change into python/pyspark/accumulators.py, as per this PR?

@wjszlachta-man
Contributor Author

wjszlachta-man commented Dec 9, 2025

See traceback:

Exception occurred during processing of request from ('127.0.0.1', 46334)
Traceback (most recent call last):
  File "/usr/lib/python3.11/socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.11/socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.11/socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.11/socketserver.py", line 755, in __init__
    self.handle()
  File "/usr/lib/python3.11/site-packages/pyspark/accumulators.py", line 293, in handle
    poll(authenticate_and_accum_updates)
  File "/usr/lib/python3.11/site-packages/pyspark/accumulators.py", line 266, in poll
    r, _, _ = select.select([self.rfile], [], [], 1)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()

@gaogaotiantian
Contributor

Could you make sure the CI passes? I think it's a linter issue. Also, it has some conflicts now.

@wjszlachta-man
Contributor Author

Sure - let me rebase... (although the conflict is because of #53388 - not sure why the duplicate?)

Can we have an environment variable to fall back?

Do you still want an environment variable to fall back to? I see you don't have one in #53388.

@gaogaotiantian
Contributor

Sure - let me rebase... (although the conflict is because of #53388 - not sure why the duplicate?)

That would be my fault. I did not realize this PR existed when I fixed the worker. Whether we should have an envvar fallback is a decision for @HyukjinKwon. Personally, I think it's okay to just replace the old mechanism. We would need a whole config path for the fallback to work. With the heavily exercised CI, I think we can validate this local-scope change pretty well.

@wjszlachta-man
Contributor Author

Just updated the branch - this should fix the conflict, and overall it is similar to your changes in python/pyspark/daemon.py.

Removed the fallback - if we have one, it should cover both daemon.py and accumulators.py (personally I think it's redundant and we should always use poll() if available).

As per your PR, I updated the poll() timeout to 1000 - I missed that it is in milliseconds (unlike select()) in my original commit 👍
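
The unit difference, as a standalone sketch (not the PR code):

import select
import socket

rsock, wsock = socket.socketpair()
timeout_secs = 1

# select() takes its timeout in seconds (floats allowed)...
r, _, _ = select.select([rsock], [], [], timeout_secs)

# ...while poll() takes it in milliseconds.
poller = select.poll()
poller.register(rsock, select.POLLIN)
events = poller.poll(timeout_secs * 1000)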

@HyukjinKwon
Member

Does this LGTM, @gaogaotiantian?

@wjszlachta-man
Contributor Author

wjszlachta-man commented Dec 10, 2025

@gaogaotiantian let me know what you think about the latest commit - it will check for errors in both accumulators.py and daemon.py.

I also removed the try/except around:

try:
    ready_fds = select.select([0, listen_sock], [], [], 1)[0]
except select.error as ex:
    if ex[0] == EINTR:
        continue
    else:
        raise

in daemon.py, as this is old code only needed for Python <3.5 (see https://peps.python.org/pep-0475/ - from Python 3.5 onward, select.select() automatically retries the system call on EINTR).

Considering python_requires=">=3.9" as of Spark>=4, it should be safe to remove.
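
For reference, PEP 475 means a bare call now suffices (a sketch of why the loop is dead code, not the final daemon.py, which now uses poll()):

import select

# Python >= 3.5 retries select.select() automatically when the underlying
# system call is interrupted by a signal (EINTR), recomputing the timeout.
# (listen_sock as in the quoted daemon.py snippet.)
ready_fds = select.select([0, listen_sock], [], [], 1)[0]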

@wjszlachta-man
Contributor Author

@gaogaotiantian happy to merge in this form?

Let me know if it's OK, and I can propagate similar changes to py4j/py4j#560.

@gaogaotiantian
Contributor

Yes, I think similar changes should be done for py4j.

Comment on lines +187 to +188
# Could be POLLERR or POLLNVAL (select would raise in this case).
raise PySparkRuntimeError(f"Polling error - event {event} on fd {fd}")
Contributor

Does this introduce any behavior change? What's the original behavior? Can we improve this error message to make it more user-friendly?

Contributor

Hmm, the original behavior is probably to raise a Python built-in exception. To be honest, if we need to raise this exception, the situation is pretty bad - it would be an unexpected networking issue. I don't think we have coverage here.

Anyway, because this is a Python exception raised from the worker side, the driver will always see a PythonException. The traceback might be different - but this is already a super rare situation and I don't think users will be relying on this.

On the other hand, yes, there's room for improvement in the exception message.

Contributor Author

@wjszlachta-man wjszlachta-man Dec 16, 2025

The original behaviour would be select() raising OSError in situations where the poll() events POLLERR/POLLNVAL now raise PySparkRuntimeError.

I agree users shouldn't rely on that behaviour, so I thought using PySparkRuntimeError here makes the most sense.
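
To illustrate the behavioural difference, a standalone sketch (not the PR code):

import select
import socket

rsock, wsock = socket.socketpair()
poller = select.poll()
poller.register(rsock, select.POLLIN)
wsock.send(b"x")  # make rsock readable so poll() returns an event

# select() raises OSError when handed a bad fd; poll() instead reports
# POLLERR/POLLNVAL as events that the caller must check explicitly.
for fd, event in poller.poll(1000):
    if event & (select.POLLERR | select.POLLNVAL):
        raise RuntimeError(f"Polling error - event {event} on fd {fd}")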

Contributor Author

@allisonwang-db agree with @gaogaotiantian that the exception here should be very rare and due to networking issues - can you think of a better-worded error message?

Contributor

@allisonwang-db allisonwang-db Dec 17, 2025

Most PySparkRuntimeErrors are user-facing exceptions with proper error classes and actionable error messages. If this is a rare low-level system issue, it's better to keep the original exception's (OSError) error message. WDYT?

Contributor Author

Personally, I wouldn't make it OSError, as this is not a system call error (which is why I went for RuntimeError instead).

But honestly I'm not too fussed and am happy to change it to whatever you or the Spark committers find appropriate, as your feel for the codebase is better than mine (thanks for reviewing it!).

My focus here is to merge the fix that removes the dependency on select.select(), as this is something we have been patching internally for quite some time now to allow our researchers to run with a large number of executors. The problem is even worse if you load a lot of shared libraries (for example via ctypes), which consume fds below 1024, so you can hit FD_SETSIZE with fewer than 1000 executors.

Contributor

I don't like OSError either. I think the general rule is to make all known exceptions a Spark error. We have the ability to add an error class in the future. Again, on the driver side this is just a PythonException - also with no errorClass, because that's not supported yet.

@HyukjinKwon
Member

Merged to master.
