[SPARK-5929] Pyspark: Register a pip requirements file with spark_context #4897

Closed · buckhx wants to merge 13 commits

Conversation

@buckhx (Author) commented Mar 5, 2015

Ships all packages in the requirements file by installing them locally via pip and then distributing them to the workers via addPyFile.
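Roughly, the mechanism looks like the following (a hedged sketch of the idea, not the actual patch; the function name and the pip-via-subprocess approach are assumptions — the real diff excerpt under review follows below):

```python
import os
import shutil
import subprocess
import sys

def add_requirements_file(sc, path):
    """Sketch: install each requirement locally, then ship each
    installed package to the executors via sc.addPyFile."""
    with open(path) as f:
        reqs = [line.strip() for line in f
                if line.strip() and not line.startswith("#")]
    # Install locally first; invoking pip as a subprocess avoids
    # depending on pip's internal Python API.
    subprocess.check_call([sys.executable, "-m", "pip", "install"] + reqs)
    for req in reqs:
        # Assumes the requirement name is also the import name, which
        # the discussion below shows is not always true.
        name = req.split("==")[0]
        pkg_dir = __import__(name).__path__[0]
        # Zip the package directory; addPyFile accepts .zip archives.
        archive = shutil.make_archive(name, "zip",
                                      os.path.dirname(pkg_dir), name)
        sc.addPyFile(archive)
```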

    @@ -65,8 +65,9 @@ class SparkContext(object):
         _python_includes = None  # zip and egg files that need to be added to PYTHONPATH

         def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None,
    -                 environment=None, batchSize=0, serializer=PickleSerializer(), conf=None,
    -                 gateway=None, jsc=None, profiler_cls=BasicProfiler):
    +                 requirementsFile=None, environment=None, batchSize=0,
Contributor (review comment on the added line): Inserting a parameter in the middle will break compatibility; we should put it at the end.

@davies (Contributor) commented Mar 7, 2015

Good idea, this is a useful feature to have. Could you also add an argument for spark-submit (for example, --pip or --py-requirements)? Also, could you add a test for it?

@davies (Contributor) commented Mar 7, 2015

Jenkins, Ok to test.

@buckhx (Author) commented Mar 10, 2015

Great, will get to that this week and add tests.

Do you think bundling importlib with pyspark is reasonable? Or is finding another way to track down the local package the way to go?

@davies (Contributor) commented Mar 10, 2015

I think we can go with __import__; it's similar to importlib.
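For reference, the two spellings are equivalent for top-level module names; a minimal sketch:

```python
import importlib

# importlib.import_module returns the named module directly.
mod_a = importlib.import_module("json")

# __import__ is the builtin that the import statement uses under the
# hood; for a flat (non-dotted) name it returns the same module.
mod_b = __import__("json")

assert mod_a is mod_b
```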

@buckhx (Author) commented Apr 3, 2015

Changed importlib to __import__ and moved most of the requirements up. I left the pip import inside the method itself so as not to require pip on the system unless that method is used. Also added a test that imports a library not found on most systems: https://github.com/buckheroux/QuadKey.
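The shape of that test, roughly (a sketch; the method name, fixture, and the quadkey import name are assumptions, not the committed test):

```python
def test_requirements_file(self):
    # QuadKey is unlikely to be preinstalled, so a successful import on
    # the workers shows the requirements file was actually shipped.
    self.sc.addRequirementsFile(self.requirements_path)  # hypothetical API
    names = self.sc.parallelize(range(2), 2) \
                   .map(lambda _: __import__("quadkey").__name__) \
                   .collect()
    self.assertEqual(names, ["quadkey", "quadkey"])
```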

@buckhx (Author) commented Apr 20, 2015

Any traction on this?

@robcowie commented

This could be very useful. One question: does this behave with namespace packages? At first glance, it seems not; req.name will not necessarily be importable. I can't see a simple way around this, but I'll have a think.

Regardless, this is a nice idea and I'd like to see it merged.
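For context on why req.name may not be importable (the name mismatches below are real PyPI examples; the snippet is just illustrative):

```python
# A requirements file lists *distribution* names, which need not match
# *import* names, e.g. python-dateutil -> dateutil, beautifulsoup4 -> bs4,
# and zope.interface lives inside the "zope" namespace package.
try:
    __import__("python-dateutil")
except ImportError:
    print("distribution name is not an importable module name")
```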

@buckhx (Author) commented May 12, 2015

@robcowie I wouldn't think so. Do you have a good example of a pypi package that I could try to install that has a namespace? I can try and support it if possible.

@andrewor14 (Contributor) commented
Jenkins slow test please

@SparkQA commented Jul 13, 2015

Test build #5 has finished for PR 4897 at commit 76ff637.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@buckhx (Author) commented Jul 13, 2015

Just saw the test failure; looking into it now.

    -                 gateway=None, jsc=None, profiler_cls=BasicProfiler):
    +                 environment=None, batchSize=0, serializer=PickleSerializer(),
    +                 conf=None, gateway=None, jsc=None, profiler_cls=BasicProfiler,
    +                 requirementsFile=None):
Contributor (review comment): We already have too many parameters here. Can we just have addRequirementsFile?

Author (review comment): I was trying to match the pyFiles interface as closely as possible, although I do see where you are coming from. I'll re-evaluate having the requirementsFile kwarg after adding a requirements-file arg to the CLI.

@andrewor14 (Contributor) commented
@buckheroux I believe the test failure is not related. @JoshRosen is looking into a hot fix.

@davies (Contributor) commented Jul 13, 2015

@buckheroux Do you think that we should also support this in bin/spark-submit? For example, we could accept a .txt file (via --py-files) as a list of dependencies.

    # Excerpt under review: install the requirement if it is missing,
    # then locate the installed package on disk so it can be shipped.
    if not req.check_if_exists():
        pip.main(['install', str(req.req)])
    mod = __import__(req.name)
    mod_path = mod.__path__[0]
Author (review comment): Worried about this working across different types of modules (namespace packages, etc.).

Author (review comment): Indeed, mod.__path__[0] does not work for namespace packages. Working on supporting them by bundling all paths in mod.__path__ into a single tar.
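That bundling idea might look like this (a sketch under stated assumptions; the helper name is made up, and only the standard library is used):

```python
import os
import tarfile

def bundle_package_paths(mod, out_path):
    """Hypothetical helper: collect every directory on a (possibly
    namespace) package's __path__ into a single tar archive so the
    whole package can be shipped in one piece."""
    with tarfile.open(out_path, "w") as tar:
        for path in mod.__path__:
            # Keep the package name as the top-level directory so the
            # archive is importable once unpacked on the worker.
            tar.add(path, arcname=os.path.basename(path))
    return out_path
```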

@buckhx (Author) commented Jul 13, 2015

@andrewor14 Sounds good.

@davies It's probably a good idea to support a CLI arg for the requirements file. I'll work on this as well.

@JoshRosen (Contributor) commented

@andrewor14 The test failures in the dev/run-tests doctest are because we did a shallow clone of the Spark repository into the fresh Jenkins workspace. I think this is an ephemeral problem that will resolve itself after retesting. We could also disable or remove that test for now, or skip it in Jenkins.

@JoshRosen (Contributor) commented
I've disabled shallow cloning in SlowSparkPullRequestBuilder, so let's try this again...

@JoshRosen (Contributor) commented
jenkins slow test please

@SparkQA commented Jul 13, 2015

Test build #6 has finished for PR 4897 at commit 76ff637.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 15, 2015

Test build #1076 has finished for PR 4897 at commit 76ff637.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@buckhx (Author) commented Sep 25, 2015

The tests fail because one of them attempts to pip install a package and doesn't have permissions to do so. Is there a way to enable that? Or just leave the pip installing piece untested?

@buckhx (Author) commented Oct 2, 2015

Looking to add support for namespaces, which I have tested locally and confirmed to work. I was also thinking of exposing an add_package function that would add local packages to the spark context.
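A sketch of what such an add_package helper might look like (the name and behavior here are assumptions, not part of the patch):

```python
import os
import shutil

def add_package(sc, package_dir):
    """Hypothetical helper: zip a local package directory and register
    it with the SparkContext so executors can import it."""
    name = os.path.basename(package_dir.rstrip(os.sep))
    # shutil.make_archive writes <name>.zip containing the package
    # directory itself, which addPyFile can then distribute.
    archive = shutil.make_archive(name, "zip",
                                  os.path.dirname(package_dir), name)
    sc.addPyFile(archive)
```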

@buckhx (Author) commented Oct 21, 2015

@davies Removed the requirements file from the context constructor. I looked into adding --py-requirements to spark-submit, but it looked a bit more involved than I expected; I think it makes sense to add that as a separate commit/PR.

Also, when testing I install a package from pip, but the Spark Jenkins test errors because it doesn't allow access to the global PyPI server. Any thoughts?

@davies (Contributor) commented Oct 21, 2015

cc @JoshRosen who is more familiar with Jenkins

@DoHe commented Dec 2, 2015

For the test, you could also add a local egg file to the repository and install that instead of relying on the external PyPI server.

@rxin (Contributor) commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit closed this in 7b4452b on Dec 31, 2015
@buckhx (Author) commented Feb 19, 2016

This (adding modules to workers) is a pretty big pain when using pyspark and should definitely be included. I'll open a new pull request to push this through.

@jurriaan (Contributor) commented
Yeah, this is quite a useful feature 👍
