[SPARK-54456][PYTHON] Import worker module after fork to avoid deadlock #53166
Conversation
HyukjinKwon left a comment:
cc @ueshin
ueshin left a comment:
One QQ; otherwise LGTM.
```python
if len(sys.argv) > 1 and sys.argv[1].startswith("pyspark"):
    import importlib

    worker_module = importlib.import_module(sys.argv[1])
```
Just a QQ: Is it ok to call this multiple times? This worker function will be called on every UDF execution.
Yes, the imported module will be cached, just like the normal import keyword.
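For reference, a quick sketch of the caching behavior: `importlib.import_module` consults `sys.modules` first, so repeated calls after the first simply return the cached module object:

```python
import importlib
import sys

m1 = importlib.import_module("json")
m2 = importlib.import_module("json")

assert m1 is m2                    # repeated calls return the same object
assert sys.modules["json"] is m1   # cached in sys.modules, like `import json`
```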
Merged to master.
Closes apache#53166 from gaogaotiantian/move-worker-module-import.
Authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
We lazy-import the worker module after fork to avoid a potential deadlock caused by importing modules that spawn multiple threads.
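A minimal sketch of the pattern, with hypothetical names (`run_worker`, the `main` entry point); the actual change lives in the PySpark daemon code, and `os.fork` is POSIX-only:

```python
import importlib
import os
import sys

def run_worker(module_name):
    # Import the worker module only here, in the forked child, so modules
    # that may spawn threads are never imported in the parent before fork.
    worker_module = importlib.import_module(module_name)
    worker_module.main(sys.stdin.buffer, sys.stdout.buffer)  # hypothetical entry point

pid = os.fork()
if pid == 0:  # child process
    run_worker("pyspark.worker")
    os._exit(0)
```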
Why are the changes needed?
https://discuss.python.org/t/switching-default-multiprocessing-context-to-spawn-on-posix-as-well/21868
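As context for the discussion linked above, here is a minimal POSIX-only sketch (hypothetical code, not part of this PR) of the hazard: a background thread holds a lock when the process forks, and the child inherits the locked lock with no thread left to release it. The child uses an acquire timeout so the example reports the problem instead of actually deadlocking:

```python
import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    # Stands in for a library thread that holds a lock at fork time.
    with lock:
        time.sleep(5)

threading.Thread(target=hold_lock).start()
time.sleep(0.1)  # make sure the thread has acquired the lock

pid = os.fork()
if pid == 0:
    # Only the forking thread exists in the child, but the lock state was
    # copied while held; a plain lock.acquire() here would block forever.
    print("child acquired lock:", lock.acquire(timeout=1))  # False
    os._exit(0)
os.wait()
```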
It's impossible to do a thread-safe fork in CPython. CPython started issuing warnings in 3.12 and switched the default `multiprocessing` start method to "spawn" in 3.14.
It would be a huge effort for us to give up `fork` entirely, but we can try our best not to import random modules before fork by lazy-importing the worker module after fork. We already have some workers that import dangerous libraries like `pyarrow`, `plan_data_source_read` for example.
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI should pass.
Was this patch authored or co-authored using generative AI tooling?
No