
[SPARK-5929][PYSPARK] Context addPyPackage and addPyRequirements #12398

Closed
wants to merge 24 commits

Conversation

@buckhx commented Apr 14, 2016

What changes were proposed in this pull request?

Context.addPyPackage()
Context.addPyRequirements()

Both methods take a package (or a set of requirements) available on the master and ship it to the workers at call time, instead of requiring packages to be installed manually on every worker.
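For illustration, here is a minimal usage sketch of the proposed API, assuming a SparkContext method named addPyPackage as described in this PR; the package mypkg and its triple function are invented for the example:

```python
# Hypothetical sketch of the API proposed in this PR; `mypkg` is an
# illustrative local package available only on the master.
from pyspark import SparkContext

import mypkg  # assumed to define mypkg.triple(x) -> 3 * x

sc = SparkContext(appName="addPyPackage-example")

# Ship the already-imported package to every worker, so the lambda below
# can import it remotely without any manual installation step.
sc.addPyPackage(mypkg)

trips = sc.parallelize(range(4)).map(lambda x: mypkg.triple(x))
print(trips.collect())  # [0, 3, 6, 9]
```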

How was this patch tested?

Unit tests are written, but I do not believe they accurately reflect a distributed environment: test_add_py_package still passes without actually using addPyPackage. The addRequirementsFile method requires internet access to reach the public PyPI server, so it won't work on the current Jenkins build system.

We have had this patch running at Palantir for about a year in production.

@holdenk (Contributor) commented Apr 14, 2016

Looking back at the predecessor PR, it seems @davies suggested adding support for specifying these at runtime with spark-submit, and it looks like the pip install permissions issue might still be present for testing.

@holdenk (Contributor) commented Apr 14, 2016

I think having this supported could be very useful :)

```python
import uuid

import pip

# Install any requirement from the file at `path` that isn't already satisfied.
for req in pip.req.parse_requirements(path, session=uuid.uuid1()):
    if not req.check_if_exists():
        pip.main(['install', str(req.req)])
```
Inline review comment (Contributor):

So it seems this can sometimes require elevated privileges, based on the issues with the previous Jenkins run. What if at startup we created a fixed temp directory per context, added it to our path with sys.path.insert(0, self.pipBase), and at install did something along the lines of pip.main(['install', req.req.__str__(), '--target', self.pipBase]), so that we don't need write permissions to the default pip target?
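For illustration, a minimal sketch of this --target idea, assuming the pre-10 pip API (pip.main, pip.req.parse_requirements) used elsewhere in this thread; the Context class and pipBase attribute are illustrative stand-ins, not the PR's actual implementation:

```python
import sys
import tempfile
import uuid

import pip


class Context(object):
    """Illustrative context that installs packages into a private directory."""

    def __init__(self):
        # Fixed per-context temp directory, prepended to the import path so
        # packages installed below are importable without root privileges.
        self.pipBase = tempfile.mkdtemp(prefix="spark-pip-")
        sys.path.insert(0, self.pipBase)

    def install_requirements(self, path):
        for req in pip.req.parse_requirements(path, session=uuid.uuid1()):
            if not req.check_if_exists():
                # Install into pipBase rather than the default site-packages,
                # which may not be writable on the Jenkins workers.
                pip.main(['install', str(req.req), '--target', self.pipBase])
```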

@holdenk (Contributor) commented Apr 14, 2016

Since @andrewor14 got Jenkins to run the earlier iteration of this PR, maybe you could do the same here?

@buckhx (Author) commented Apr 18, 2016

I could see the pipBase approach working.

The other testing blocker I have is that the context created in the test is able to use local packages: the test package can be imported on the workers without being distributed via addPyPackage. https://github.com/apache/spark/pull/12398/files#diff-388a156f4ce454fe98d7a99a0f7f0012R1950

Is there a way to get the mocked workers to not use the master python path?

The pip part is nice, but I think the meat of this PR is in addPyPackage.

I looked into adding a flag to the CLI, but went down a rabbit hole trying to figure it out. I'd advocate adding it to the CLI, but I think someone working in Spark core could get that integrated much more efficiently than I could.

@andrewor14 (Contributor) commented May 4, 2016

ok to test

@davies

@SparkQA commented May 4, 2016

Test build #2967 has finished for PR 12398 at commit ce9966e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```python
        self.assertSequenceEqual([0, 3, 6, 9], trips.collect())
    finally:
        shutil.rmtree(name)
```

Inline review comment (Contributor):

remove the extra empty line

@davies (Contributor) commented May 4, 2016

@buckhx These APIs seem useful. Could you also add an argument to bin/spark-submit (only for the requirements file)?

@buckhx (Author) commented May 9, 2016

Added an addPyPackage example to the docstring and fixed that test formatting. We're looking into adding the --py-requirements argument to spark-submit.

The pip API takes a file path via pip.req.parse_requirements (https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L64), but it looks like it might be possible to take a string and break it out line by line, with a mock file name, via pip.req.process_line and pip.req.preprocess (https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L110).

@davies (Contributor) commented May 9, 2016

@buckhx Even if the pip API only takes a file path, we could write the requirements out as a temporary file, I think.
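For illustration, a sketch of that temporary-file workaround, assuming requirements arrive as a list of strings (as addPyRequirements later takes) and the same pre-10 pip API; parse_requirement_strings is a made-up helper name:

```python
import os
import tempfile
import uuid

import pip


def parse_requirement_strings(reqs):
    """Parse requirement strings by writing them to a throwaway temp file."""
    fd, path = tempfile.mkstemp(prefix="requirements-", suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(reqs))
        # parse_requirements is lazy, so force the generator while the
        # temp file still exists.
        return list(pip.req.parse_requirements(path, session=uuid.uuid1()))
    finally:
        os.remove(path)
```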

Robert Kruszewski and others added 2 commits May 11, 2016 00:34
@buckhx (Author) commented May 11, 2016

@robert3005 added --py-requirements to spark-submit. I'm currently looking at passing the requirements as a list or a line-delimited string.

@buckhx buckhx changed the title [SPARK-5929][PYSPARK] Context addPyPackage and addRequirementsFile [SPARK-5929][PYSPARK] Context addPyPackage and addPyRequirements May 13, 2016
@buckhx (Author) commented May 13, 2016

Changed addRequirementsFile to addPyRequirements, which takes a list of requirement strings.
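For illustration, a hypothetical call using the renamed method; the pinned versions are invented:

```python
# Requirements are now passed as a list of strings rather than a file path.
sc.addPyRequirements([
    'requests==2.9.1',  # illustrative pins, not from the PR
    'six>=1.10.0',
])
```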

@buckhx (Author) commented Jun 6, 2016

@davies how's this looking?

@SparkQA commented Jun 6, 2016

Test build #3068 has finished for PR 12398 at commit 9c37e06.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Sep 7, 2016

@buckhx any interest in updating this against master?

@holdenk (Contributor) commented Oct 7, 2016

@buckhx Just following up to see if this is something you are still interested in working on?

@HyukjinKwon (Member) commented

ping @buckhx
