
[SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler #43600

Closed · wants to merge 1 commit

Conversation

@HyukjinKwon (Member) commented Oct 31, 2023

What changes were proposed in this pull request?

This PR improves the `spark.python.worker.faulthandler.enabled` feature by catching `IOException` instead of the narrower `EOFException`.

Why are the changes needed?

Exceptions such as `java.net.SocketException: Connection reset` can happen when the worker dies unexpectedly. We should catch all IO exceptions there, not just EOF.
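The rationale mirrors Python's own exception hierarchy: `ConnectionResetError` is a subclass of `OSError`, just as `java.net.SocketException` and `java.io.EOFException` are both subclasses of `java.io.IOException` on the JVM side. A minimal Python sketch of the same "catch the broader class" idea (the `read_length`/`safe_read` helpers are hypothetical, for illustration only):

```python
# ConnectionResetError and BrokenPipeError both extend OSError, just as
# SocketException and EOFException both extend java.io.IOException.
assert issubclass(ConnectionResetError, OSError)
assert issubclass(BrokenPipeError, OSError)


def raise_reset():
    # Simulates what the worker socket raises when the peer crashes.
    raise ConnectionResetError("Connection reset")


def safe_read(raiser):
    """Hypothetical read of a length prefix from a worker socket."""
    try:
        raiser()
        return "ok"
    except OSError as e:  # broad catch, analogous to catching IOException
        return f"worker died: {type(e).__name__}"


print(safe_read(lambda: None))      # normal path
print(safe_read(raise_reset))       # connection-reset path is now handled
```

Catching only the narrow class (here, only `BrokenPipeError`) would let the connection-reset case escape as an unhandled error, which is exactly the failure mode the PR fixes.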

Does this PR introduce any user-facing change?

Yes, but only in special cases: the change takes effect when the worker dies unexpectedly during its initialization.

How was this patch tested?

I tested this with Spark Connect:

$ cat <<EOT >> malformed_daemon.py
import ctypes

from pyspark import daemon
from pyspark import TaskContext


def raise_segfault():
    ctypes.string_at(0)


# Raise a segmentation fault during worker initialization.
TaskContext._getOrCreate = raise_segfault


if __name__ == '__main__':
    daemon.manager()
EOT
./sbin/start-connect-server.sh --conf spark.python.daemon.module=malformed_daemon --conf spark.python.worker.faulthandler.enabled=true --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
./bin/pyspark --remote "sc://localhost:15002"
from pyspark.sql.functions import udf
spark.addArtifact("malformed_daemon.py", pyfile=True)
spark.range(1).select(udf(lambda x: x)("id")).collect()
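The feature being tested builds on Python's standard `faulthandler` module, which installs low-level signal handlers so that even a hard crash like the segfault above dumps a Python traceback. A minimal standalone sketch (writing an on-demand dump to a temporary file rather than actually crashing the process):

```python
import faulthandler
import tempfile

# Enable the handler: on SIGSEGV, SIGFPE, etc., a C-level handler
# writes every thread's Python traceback before the process dies.
faulthandler.enable()
assert faulthandler.is_enabled()

# dump_traceback() produces the same output on demand, without a crash.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
    f.seek(0)
    dump = f.read()

print(dump.splitlines()[0])  # e.g. "Current thread 0x... (most recent call first):"
```

This is why the "After" output below includes the `Current thread ... (most recent call first)` block: the worker's faulthandler dump is forwarded to the JVM and attached to the error message.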

Before

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1710, in collect
    table, schema = self._to_table()
    ...
  File "/.../spark/python/pyspark/sql/connect/client/core.py", line 1575, in _handle_rpc_error
    raise convert_exception(
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 8 in stage 0.0 failed 1 times, most recent failure: Lost task 8.0 in stage 0.0 (TID 8) (192.168.123.102 executor driver): java.net.SocketException: Connection reset
	at ...
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

Driver stacktrace:

JVM stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 0.0 failed 1 times, most recent failure: Lost task 8.0 in stage 0.0 (TID 8) (192.168.123.102 executor driver): java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
	at ...
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(Thread.java:833)

After

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1710, in collect
    table, schema = self._to_table()
    ...
  File "/.../spark/python/pyspark/sql/connect/client/core.py", line 1575, in _handle_rpc_error
    raise convert_exception(
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4) (192.168.123.102 executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault

Current thread 0x00007ff85d338700 (most recent call first):
  File "/.../miniconda3/envs/python3.9/lib/python3.9/ctypes/__init__.py", line 525 in string_at
  File "/private/var/folders/0c/q8y15ybd3tn7sr2_jmbmftr80000gp/T/spark-397ac42b-c05b-4f50-a6b8-ede30254edc9/userFiles-fd70c41e-46b9-44ed-b781-f8dea10bcb4a/5ce3da24-912a-4207-af82-5dfc8a845714/malformed_daemon.py", line 8 in raise_segfault
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 1450 in main
  ...
  File "/.../miniconda3/envs/python3.9/lib/python3.9/runpy.py", line 197 in _run_module_as_main

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:550)
	at ...
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
	at org.apache.spark.sql.execution.python.BasePythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:92)
	... 30 more

Driver stacktrace:

JVM stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4) (192.168.123.102 executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault

Current thread 0x00007ff85d338700 (most recent call first):
  File "/.../miniconda3/envs/python3.9/lib/python3.9/ctypes/__init__.py", line 525 in string_at
...

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon (Member, Author):

cc @ueshin

@LuciferYang (Contributor) left a comment:

+1, LGTM

@dongjoon-hyun (Member) left a comment:

+1, LGTM.

@dongjoon-hyun (Member):

Although it looks unrelated, could you re-trigger the failed PySpark test, @HyukjinKwon?

Test: https://github.com/HyukjinKwon/spark/actions/runs/6705864326/job/18221201806

@HyukjinKwon (Member, Author):

Merged to master.

@ueshin (Member) commented Nov 1, 2023

Late LGTM.

HyukjinKwon added a commit that referenced this pull request Nov 3, 2023
…for Python execution in SQL

### What changes were proposed in this pull request?

This PR proposes to make `faulthandler` a runtime configuration so it can be turned on and off at runtime.

### Why are the changes needed?

The `faulthandler` feature in PySpark is especially useful for debugging errors that the regular Python interpreter cannot catch out of the box, such as segmentation faults; see also #43600. It would be very useful to make this a runtime configuration so it can be changed without restarting the shell.

### Does this PR introduce _any_ user-facing change?

Yes, users can now set `spark.sql.execution.pyspark.udf.faulthandler.enabled` at runtime to enable the `faulthandler` feature.
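A sketch of how a user would toggle this configuration at runtime; this is a config fragment that assumes an existing `SparkSession` named `spark` (e.g. in a pyspark shell), so it is shown for illustration only:

```python
# Assumes a live SparkSession bound to `spark`.
# The config key is the one introduced by this follow-up PR.
spark.conf.set("spark.sql.execution.pyspark.udf.faulthandler.enabled", "true")
# ... run a UDF; a crashing worker now dumps its Python traceback ...
spark.conf.set("spark.sql.execution.pyspark.udf.faulthandler.enabled", "false")
```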

### How was this patch tested?

Unit test added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43635 from HyukjinKwon/runtime-conf.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon deleted the more-segfault branch January 15, 2024 00:47