
[SPARK-3634] [PySpark] User's module should take precedence over system modules #2492

Closed
wants to merge 4 commits

Conversation

davies
Contributor

@davies davies commented Sep 22, 2014

Python modules added through addPyFile should take precedence over system modules.

This patch puts the paths of user-added modules at the front of sys.path (just after '').
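
A minimal sketch of that mechanism (plain Python, not the actual patch; the directory and file names are hypothetical):

import os
import sys

# Hypothetical directory where the files added via addPyFile() were downloaded.
user_files_dir = "/tmp/spark-files"

# sys.path.append(user_files_dir) would let any copy already installed in
# site-packages win. Inserting just after '' (the current directory) gives
# the user-added files precedence over system-installed packages instead.
if user_files_dir not in sys.path:
    sys.path.insert(1, user_files_dir)

# A .zip or .egg shipped by the user is inserted the same way:
user_egg = os.path.join(user_files_dir, "mypackage-0.2-py2.7.egg")
if user_egg not in sys.path:
    sys.path.insert(1, user_egg)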

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have started for PR 2492 at commit c16c392.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have started for PR 2492 at commit 6b0002f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2492 at commit 6b0002f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have started for PR 2492 at commit 6b0002f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have started for PR 2492 at commit f7ff4da.

  • This patch merges cleanly.

@JoshRosen
Contributor

BTW: it's a bit dangerous that a user can upload a new module to modify the default behavior of the system. Currently, it's hard to find the correct position to insert the user's module.

Maybe my JIRA was misleadingly named; my motivation here is allowing users to specify versions of packages that take precedence over other versions of that same package that might be installed on the system, not overriding modules included in Python's standard library (although the ability to do that is a side-effect of this change).

@SparkQA

SparkQA commented Sep 22, 2014

Tests timed out after a configured wait of 120m.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2492 at commit 6b0002f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2492 at commit f7ff4da.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have started for PR 2492 at commit f7ff4da.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have started for PR 2492 at commit 4a2af78.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2492 at commit f7ff4da.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging

@SparkQA

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2492 at commit 4a2af78.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2014

Merged build finished. Test PASSed.

@SparkQA

SparkQA commented Sep 22, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20668/

@davies
Contributor Author

davies commented Sep 23, 2014

Maybe my JIRA was misleadingly named; my motivation here is allowing users to specify versions of packages that take precedence over other versions of that same package that might be installed on the system, not overriding modules included in Python's standard library (although the ability to do that is a side-effect of this change).

Understood, but this side-effect is a bit dangerous. Third-party packages can appear in sys.path in any order, for example:

>>> import sys
>>> sys.path
['', '//anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7.egg', '//anaconda/lib/python2.7/site-packages/protobuf-2.5.0-py2.7.egg', '//anaconda/lib/python2.7/site-packages/msgpack_python-0.4.2-py2.7-macosx-10.5-x86_64.egg', '//anaconda/lib/python2.7/site-packages/setuptools-3.6-py2.7.egg', '/Users/daviesliu/work/spark/python/lib', '/Users/daviesliu/work/spark/python/lib/py4j-0.8.2.1-src.zip', '/Users/daviesliu/work/spark/python', '//anaconda/lib/python27.zip', '//anaconda/lib/python2.7', '//anaconda/lib/python2.7/plat-darwin', '//anaconda/lib/python2.7/plat-mac', '//anaconda/lib/python2.7/plat-mac/lib-scriptpackages', '//anaconda/lib/python2.7/lib-tk', '//anaconda/lib/python2.7/lib-old', '//anaconda/lib/python2.7/lib-dynload', '//anaconda/lib/python2.7/site-packages', '//anaconda/lib/python2.7/site-packages/PIL', '//anaconda/lib/python2.7/site-packages/runipy-0.1.0-py2.7.egg']

It's not easy to find a position that comes before the third-party packages but after the standard library modules.

@JoshRosen
Contributor

Understood, but this side-effect is a bit dangerous. Third-party packages can appear in sys.path in any order

Are you worried about a user adding a Python module whose name conflicts with a built-in module, thereby shadowing it? I think this is a general Python problem that can occur even without sys.path manipulation, which is why it's bad to have top-level modules that have the same name as built-in ones (and also why relative imports can be bad): http://www.evanjones.ca/python-name-clashes.html
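
A minimal illustration of that shadowing problem (plain Python, hypothetical file names, nothing Spark-specific):

import os
import sys
import tempfile

# A directory containing a module whose name clashes with the standard library.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "json.py"), "w") as f:
    f.write("VERSION = 'mine'\n")

# Put that directory near the front of sys.path, just after '', which is the
# position this patch uses for user-added files.
sys.path.insert(1, workdir)

# Assuming the standard-library json module has not been imported yet in this
# interpreter, the local json.py shadows it.
import json
print(json.VERSION)            # 'mine'
print(hasattr(json, "dumps"))  # False -- the stdlib json is hidden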

@davies
Contributor Author

davies commented Sep 23, 2014

I think it's fine to move on, and I'll remove the comment about the risk from the PR's description.

@mattf
Contributor

mattf commented Sep 24, 2014

This is a nice addition. Re: the danger, I'll add that the user is only able to endanger herself.

+1 lgtm

sys.path.append(path)
if dirname not in sys.path:
    sys.path.append(dirname)
if filename.lower().endswith("zip") or filename.lower().endswith("egg"):
Contributor

I think that spark.submit.pyFiles is allowed to contain .py files, too:

  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.

Will this new filtering by .zip and .egg prevent this from working?

Contributor Author

The .py files will be put in root_dir and can be imported by name, so they should not be put in sys.path. This depends on spark-submit copying the .py files into root_dir locally.

Putting the base directory of a .py file into sys.path would bring other issues if there are other files in the same directory, such as copy.py.
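
A small sketch of that behavior (hypothetical module and file names; assumes a running SparkContext sc): a plain .py file shipped with addPyFile lands in the files root directory, which is already on sys.path, so it is imported by its module name rather than by adding its own path.

# deps/mymodule.py contains, say:  def double(x): return 2 * x
sc.addPyFile("deps/mymodule.py")

def work(x):
    # On the executors the shipped file sits in the SparkFiles root directory,
    # which is already on sys.path, so a plain import by name works.
    import mymodule
    return mymodule.double(x)

print(sc.parallelize([1, 2, 3]).map(work).collect())  # [2, 4, 6]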

Contributor

Do we explicitly add root_dir to sys.path? I don't think we can always assume that the Python driver / worker are executed from inside of root_dir.

Contributor

Aha, I see that we do add root_dir to the path in worker.py.

Contributor Author

root_dir is already added to sys.path; see line 174.

Contributor

Ah, great. In that case, this PR looks good to me, so I'm going to merge it. Thanks!

@asfgit asfgit closed this in c854b9f Sep 24, 2014