
Conversation


@ueshin ueshin commented Oct 27, 2025

What changes were proposed in this pull request?

Uses a different error message when kill-on-idle-timeout is triggered.

Why are the changes needed?

Currently, the error message on kill-on-idle-timeout is the same as the one shown when the Python worker crashes.

```py
>>> from pyspark.sql.functions import udf
>>> import time
>>>
>>> @udf
... def f(x):
...   time.sleep(2)
...   return str(x)
...
>>> spark.conf.set("spark.sql.execution.pyspark.udf.idleTimeoutSeconds", "1s")
>>> spark.conf.set("spark.sql.execution.pyspark.udf.killOnIdleTimeout", "true")
>>>
>>> spark.range(1).select(f("id")).show()
25/10/27 16:31:16 WARN PythonUDFWithNamedArgumentsRunner: Idle timeout reached for Python worker (timeout: 1 seconds). No data received from the worker process: handle.map(_.isAlive) = Some(true), channel.isConnected = true, channel.isBlocking = false, selector.isOpen = true, selectionKey.isValid = true, selectionKey.interestOps = 1, hasInputs = false
25/10/27 16:31:16 WARN PythonUDFWithNamedArgumentsRunner: Terminating Python worker process due to idle timeout (timeout: 1 seconds)
25/10/27 16:31:16 ERROR Executor: Exception in task 15.0 in stage 0.0 (TID 15)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). Consider setting 'spark.sql.execution.pyspark.udf.faulthandler.enabled' or 'spark.python.worker.faulthandler.enabled' configuration to 'true' for the better Python traceback.
...
```

It should show a different message to distinguish the cause:

```py
25/10/27 16:34:55 WARN PythonUDFWithNamedArgumentsRunner: Idle timeout reached for Python worker (timeout: 1 seconds). No data received from the worker process: handle.map(_.isAlive) = Some(true), channel.isConnected = true, channel.isBlocking = false, selector.isOpen = true, selectionKey.isValid = true, selectionKey.interestOps = 1, hasInputs = false
25/10/27 16:34:55 WARN PythonUDFWithNamedArgumentsRunner: Terminating Python worker process due to idle timeout (timeout: 1 seconds)
25/10/27 16:34:55 ERROR Executor: Exception in task 15.0 in stage 0.0 (TID 15)
org.apache.spark.api.python.PythonWorkerException: Python worker process terminated due to idle timeout (timeout: 1 seconds)
...
```
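
For driver-side code that wants to react to this specific failure mode, one option is to match on the new exception's class name in the surfaced error text. A minimal sketch, assuming an active `spark` session and the same configs as the repro above; the Python-side exception type that wraps the JVM error can vary, so this matches on the string form rather than a concrete class:

```py
from pyspark.sql.functions import udf
import time

@udf
def slow(x):
    time.sleep(2)  # sleeps past the 1-second idle timeout configured below
    return str(x)

spark.conf.set("spark.sql.execution.pyspark.udf.idleTimeoutSeconds", "1s")
spark.conf.set("spark.sql.execution.pyspark.udf.killOnIdleTimeout", "true")

try:
    spark.range(1).select(slow("id")).show()
except Exception as e:
    # With this change, the idle-timeout kill surfaces as
    # org.apache.spark.api.python.PythonWorkerException rather than the
    # generic "Python worker exited unexpectedly (crashed)" SparkException.
    if "PythonWorkerException" in str(e):
        print("Python worker was killed on idle timeout")
    else:
        raise
```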

Does this PR introduce any user-facing change?

Yes, the error raised on kill-on-idle-timeout is different: `org.apache.spark.api.python.PythonWorkerException` with an idle-timeout message, instead of the generic `SparkException` for a crashed worker.

How was this patch tested?

Modified the related tests.
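
For illustration only, a regression check in this spirit might assert the new message text. This is a hypothetical sketch, not the actual modified test; the test class name is made up:

```py
import time
import unittest

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf


class IdleTimeoutKillMessageTest(unittest.TestCase):
    # Hypothetical check: the error surfaced on kill-on-idle-timeout should
    # mention the idle timeout, not a generic worker crash.
    def test_idle_timeout_kill_message(self):
        spark = SparkSession.builder.master("local[1]").getOrCreate()
        spark.conf.set("spark.sql.execution.pyspark.udf.idleTimeoutSeconds", "1s")
        spark.conf.set("spark.sql.execution.pyspark.udf.killOnIdleTimeout", "true")

        @udf
        def slow(x):
            time.sleep(2)
            return str(x)

        with self.assertRaises(Exception) as ctx:
            spark.range(1).select(slow("id")).show()
        self.assertIn("idle timeout", str(ctx.exception))
```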

Was this patch authored or co-authored using generative AI tooling?

No.


ueshin commented Oct 28, 2025

@allisonwang-db Could you also take a look to see if this message sounds good? Thanks.

@allisonwang-db allisonwang-db (Contributor) left a comment


This is great! This will help users differentiate between different Python worker crash reasons.

@ueshin ueshin changed the title from "[SPARK-54047][PYTHON] Use a difference error message when kill-on-idle-timeout" to "[SPARK-54047][PYTHON] Use a difference error when kill-on-idle-timeout" on Oct 28, 2025

ueshin commented Oct 28, 2025

Thanks! merging to master.

@ueshin ueshin closed this in 3773e3f Oct 28, 2025
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025

Closes apache#52749 from ueshin/issues/SPARK-54047/kill_on_idle_timeout.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
