-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5929] Pyspark: Register a pip requirements file with spark_context #4897
Conversation
@@ -65,8 +65,9 @@ class SparkContext(object): | |||
_python_includes = None # zip and egg files that need to be added to PYTHONPATH | |||
|
|||
def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None, | |||
environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, | |||
gateway=None, jsc=None, profiler_cls=BasicProfiler): | |||
requirementsFile=None, environment=None, batchSize=0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insert a parameter in the middle will break compatibility, we should put it in the end.
Good idea, this is a useful feature to have. Could you also a argument for spark-submit, (for example, --pip or --py-requirements). Also, could you add a test for it? |
Jenkins, Ok to test. |
Great, will get to that this week and add tests. Do you think bundling importlib with pyspark is reasonable? Or is finding another way to track down the local package the way to go? |
I think we can go with |
Changed importlib to import and moved most of the requirements up. I left the pip requirement in the method itself as to not require the system has pip unless that method is used. Also added a test that tests the import of a library not found on most systems https://github.com/buckheroux/QuadKey. |
Any traction on this? |
This could be very useful. One question; Does this behave with namespace packages? At first glance, it seems not. Regardless, this is a nice idea and I'd like to see it merged. |
@robcowie I wouldn't think so. Do you have a good example of a pypi package that I could try to install that has a namespace? I can try and support it if possible. |
Jenkins slow test please |
Test build #5 has finished for PR 4897 at commit
|
Just saw the test failure, looking into it now |
gateway=None, jsc=None, profiler_cls=BasicProfiler): | ||
environment=None, batchSize=0, serializer=PickleSerializer(), | ||
conf=None, gateway=None, jsc=None, profiler_cls=BasicProfiler, | ||
requirementsFile=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have two much parameters here, Can we just have addRequirementsFile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to match the pyFiles interface as closely as possible, although I do see where you are coming from. I'll re-evaluate having the requirementsFile kwarg after add a requirements file arg to the CLI.
@buckheroux I believe the test failure is not related. @JoshRosen is looking into a hot fix. |
@buckheroux Do you think that we should also support this in |
if not req.check_if_exists(): | ||
pip.main(['install', req.req.__str__()]) | ||
mod = __import__(req.name) | ||
mod_path = mod.__path__[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worried about this working across different types of modules (namespace packages etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed mod.path[0] does not work for namespace packages. Working on supporting them by bundling all paths in mod.path into a single tar.
@andrewor14 Sounds good. @davies It's probably a good idea to support a CLI arg for the requirements file. I'll work on this as well. |
@andrewor14 the test failure in the |
I've disabled shallow cloning in SlowSparkPullRequestBuilder, so let's try this again... |
jenkins slow test please |
Test build #6 has finished for PR 4897 at commit
|
Test build #1076 has finished for PR 4897 at commit
|
The tests fail because one of them attempts to pip install a package and doesn't have permissions to do so. Is there a way to enable that? Or just leave the pip installing piece untested? |
Looking to add support for namespaces, which I have tested locally and confirmed to work. I was also thinking of exposing an add_package function that would add local packages to the spark context. |
@davies removed the requirements file from the context constructor. I looked into add --py-requirements to spark-submit, but it looked a bit more in depth than I was thinking it would be. I think it makes sense to add that as a separate commit/PR. Also, when testing I import a package from pip, but the Spark Jenkins test errors because it doesn't allow for access to the global pypi server. Any thoughts? |
cc @JoshRosen who is more familiar with Jenkins |
For the test you could also add a local egg file to the repository and install this instead of relying on the external pypi server. |
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks! |
This is a pretty big pain when using pyspark (adding modules to workers) and should definitely be included. I'll open a new pull request and to push this through. |
Yeah, this is quite a useful feature 👍 |
Ships all packages in the requirements file by installing them locally via pip and then ships the packages to the workers via addPyFile