[SPARK-3634] [PySpark] User's module should take precedence over system modules #2492
QA tests have started for PR 2492 at commit
Maybe my JIRA was misleadingly named; my motivation here is allowing users to specify versions of packages that take precedence over other versions of that same package that might be installed on the system, not overriding modules included in Python's standard library (although the ability to do that is a side effect of this change).
Merged build finished. Test PASSed.
Understood, this side effect is a bit dangerous. Third-party packages can appear in sys.path in any order, for example:

    >>> import sys
    >>> sys.path
    ['', '//anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7.egg', '//anaconda/lib/python2.7/site-packages/protobuf-2.5.0-py2.7.egg', '//anaconda/lib/python2.7/site-packages/msgpack_python-0.4.2-py2.7-macosx-10.5-x86_64.egg', '//anaconda/lib/python2.7/site-packages/setuptools-3.6-py2.7.egg', '/Users/daviesliu/work/spark/python/lib', '/Users/daviesliu/work/spark/python/lib/py4j-0.8.2.1-src.zip', '/Users/daviesliu/work/spark/python', '//anaconda/lib/python27.zip', '//anaconda/lib/python2.7', '//anaconda/lib/python2.7/plat-darwin', '//anaconda/lib/python2.7/plat-mac', '//anaconda/lib/python2.7/plat-mac/lib-scriptpackages', '//anaconda/lib/python2.7/lib-tk', '//anaconda/lib/python2.7/lib-old', '//anaconda/lib/python2.7/lib-dynload', '//anaconda/lib/python2.7/site-packages', '//anaconda/lib/python2.7/site-packages/PIL', '//anaconda/lib/python2.7/site-packages/runipy-0.1.0-py2.7.egg']

It's not easy to find a position that is before the third-party packages but after the standard modules.
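To illustrate the ordering problem described above, here is a minimal sketch that checks whether a search path has a single position lying after every standard-library entry and before every third-party entry. The helper names, the "site-packages" substring heuristic, and the stdlib_prefix parameter are assumptions made for illustration; they are not part of Spark or of this patch.

```python
def split_positions(search_path, stdlib_prefix):
    """Return (stdlib_indices, third_party_indices) for a sys.path-like list.

    Heuristic (an assumption): anything containing 'site-packages' is
    third-party; anything else under stdlib_prefix is standard library.
    """
    third_party = [i for i, p in enumerate(search_path) if "site-packages" in p]
    stdlib = [
        i for i, p in enumerate(search_path)
        if p.startswith(stdlib_prefix) and "site-packages" not in p
    ]
    return stdlib, third_party


def has_clean_insert_point(search_path, stdlib_prefix):
    """True if every stdlib entry comes before every third-party entry."""
    stdlib, third_party = split_positions(search_path, stdlib_prefix)
    if not stdlib or not third_party:
        return True
    return max(stdlib) < min(third_party)


# A shortened version of the sys.path shown in the comment above.
sample = [
    "",
    "//anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7.egg",
    "/Users/daviesliu/work/spark/python",
    "//anaconda/lib/python2.7",
    "//anaconda/lib/python2.7/lib-dynload",
    "//anaconda/lib/python2.7/site-packages",
]
print(has_clean_insert_point(sample, "//anaconda/lib/python2.7"))
```

Because .egg entries are placed ahead of the standard-library directories while plain site-packages comes after them, the check fails for this sample, which is the situation the comment describes.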
Are you worried about a user adding a Python module whose name conflicts with a built-in module, thereby shadowing it? I think this is a general Python problem that can occur even without |
I think it's fine to move on, and remove the comment about risk in the PR's description.
this is a nice addition. re danger, i'll add that the user is only able to endanger herself. +1 lgtm
sys.path.append(path)
if dirname not in sys.path:
    sys.path.append(dirname)
if filename.lower().endswith("zip") or filename.lower().endswith("egg"):
I think that spark.submit.pyFiles is allowed to contain .py files, too:
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
Will this new filtering by .zip and .egg prevent this from working?
The .py files will be put in root_dir and can be imported by name, so they should not be put in sys.path. This depends on spark-submit copying the .py file into root_dir locally.
Putting the base directory of a .py file into sys.path will bring other issues if there are other files in the same directory, such as copy.py.
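The filtering discussed here can be sketched roughly as follows. This is an illustrative helper only, not the code in the patch: the function name paths_to_insert and its parameters are hypothetical, and it assumes every submitted file has already been copied into root_dir.

```python
import os


def paths_to_insert(py_files, root_dir):
    """Return the sys.path entries needed for a comma-separated pyFiles value.

    Only .zip and .egg archives get their own entry. Plain .py files are
    skipped because they sit directly in root_dir, which is assumed to be
    on sys.path already, so they can be imported by name.
    """
    entries = []
    for path in py_files.split(","):
        if not path:
            continue
        filename = os.path.basename(path)
        if filename.lower().endswith(("zip", "egg")):
            entries.append(os.path.join(root_dir, filename))
    return entries


print(paths_to_insert("/a/deps.zip,/b/helper.py,/c/lib.egg", "/tmp/root"))
```

Skipping the parent directory of a .py file avoids pulling unrelated neighbours (such as a stray copy.py) onto the import path.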
Do we explicitly add root_dir to sys.path? I don't think we can always assume that the Python driver / worker are executed from inside of root_dir.
Aha, I see that we do add root_dir to the path in worker.py.
root_dir is already added into sys.path, see LINE 174
Ah, great. In that case, this PR looks good to me, so I'm going to merge it. Thanks!
Python modules added through addPyFile should take precedence over system modules.
This patch puts the paths of user-added modules at the front of sys.path (just after '').
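The effect of inserting just after the first entry, rather than appending, can be seen in a small standalone sketch. The helper add_user_path, the module name spark3634_demo, and the temporary directories are all made up for this illustration; this is not the patch itself.

```python
import importlib
import os
import sys
import tempfile


def add_user_path(path, search_path=None):
    """Place path at index 1, right after the first entry ('' or the script dir)."""
    search_path = sys.path if search_path is None else search_path
    if path in search_path:
        search_path.remove(path)
    search_path.insert(1, path)
    return search_path


def demo():
    """Create two copies of one module and report which copy gets imported."""
    base = tempfile.mkdtemp()
    system_dir = os.path.join(base, "system")
    user_dir = os.path.join(base, "user")
    for directory, label in ((system_dir, "system"), (user_dir, "user")):
        os.makedirs(directory)
        with open(os.path.join(directory, "spark3634_demo.py"), "w") as f:
            f.write("VERSION = %r\n" % label)

    # Simulate a system-wide copy that sits early in sys.path (like an egg).
    sys.path.insert(1, system_dir)
    # The user's copy goes in front of it; sys.path.append() would lose here.
    add_user_path(user_dir)
    try:
        importlib.invalidate_caches()
        sys.modules.pop("spark3634_demo", None)
        module = importlib.import_module("spark3634_demo")
        return module.VERSION
    finally:
        # Leave the interpreter's search path as we found it.
        sys.path.remove(user_dir)
        sys.path.remove(system_dir)
        sys.modules.pop("spark3634_demo", None)


print(demo())
```

With append, the copy under system_dir would be found first; inserting at index 1 lets the user's copy win while still leaving '' at the front.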