[WIP] [SPARK-1808] Route bin/pyspark through Spark submit #787
Conversation
The bin/pyspark script takes two pathways, depending on the application. If the application is a python file, the script passes the python file directly to Spark submit, which launches the python application as a sub-process within the JVM. If the application is the pyspark shell, the script invokes a special python script that in turn invokes Spark submit as a sub-process. The main benefit here is that Python is now the parent process (rather than Scala), so all keyboard signals are propagated to the python interpreter properly.

This divergence of code paths means Spark submit needs to launch two different kinds of python runners (in Scala). Currently, Spark submit invokes PythonRunner, which creates python sub-processes to run python applications. This does not apply to the shell, however, because there the parent process is already the python process running the REPL. This is why PythonRunner is split into PythonAppRunner (for launching applications) and PythonShellRunner (for launching the pyspark shell).

The new bin/pyspark has been tested locally to run both the REPL and python applications successfully through Spark submit. A big TODO at this point is to make sure the IPython case is not affected.
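To make the two pathways concrete, here is a rough Python sketch of the dispatch described above. It is only an illustration: the real bin/pyspark is a shell script, the `.py` suffix check is an assumption about how a python file is detected, and the exact arguments handed to spark-submit are not taken from this PR.

```python
# Rough sketch of the two pathways described above (illustration only; the
# real bin/pyspark is a shell script and its exact arguments may differ).
import subprocess
import sys

def dispatch(argv):
    if argv and argv[0].endswith(".py"):
        # Pathway 1: a python application. Hand the file (and its arguments)
        # straight to Spark submit, which runs it via PythonAppRunner.
        cmd = ["./bin/spark-submit"] + argv
    else:
        # Pathway 2: the pyspark shell. Run the REPL script instead; it will
        # launch Spark submit as *its* sub-process, so the Python interpreter
        # is the parent and receives keyboard signals first.
        cmd = [sys.executable, "python/pyspark/repl.py"] + argv
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(dispatch(sys.argv[1:]))
```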
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15019/

Git exception. Jenkins, test this please.

Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15023/
Previously, if bin/pyspark received an argument, it unconditionally interpreted it as a python file. This is not correct. As of this commit, all uses of bin/pyspark go through Spark submit and pass the arguments on correctly.
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15028/

Merged build triggered.

Merged build started.
This does not apply to running a python application with bin/pyspark, for instance.
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15029/

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15030/
Making big changes; re-opening in a bit. |
**Problem.** For `bin/pyspark`, there is currently no way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, `bin/pyspark` needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.

**Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user-facing Spark scripts consistent.

**Details.** `bin/pyspark` inherently handles two cases: (1) running python applications and (2) running the python shell. For (1), Spark submit already offers an existing code path to run python applications. When `bin/pyspark` is given a python file, we can simply pass the file directly to Spark submit. This is the simple case:

- `bin/pyspark` passes the python file to Spark submit, which invokes `PythonAppRunner`
- `PythonAppRunner` sets up the Py4j GatewayServer on the Java side
- `PythonAppRunner` runs the python file as a sub-process
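In case (1) the python application itself barely changes; it only needs to find the Py4j gateway that `PythonAppRunner` has already started. The sketch below is a hypothetical illustration: the `PYSPARK_GATEWAY_PORT` environment variable is how PySpark communicates this port today, but this PR does not spell out the mechanism, and the py4j calls follow the older `JavaGateway(GatewayClient(...))` constructor style of that era.

```python
# Hypothetical sketch: how a python application launched by PythonAppRunner
# could discover the already-running Py4j gateway. The environment variable
# name mirrors current PySpark; the PR itself may use a different mechanism.
import os

from py4j.java_gateway import GatewayClient, JavaGateway

def connect_to_gateway():
    # PythonAppRunner starts the GatewayServer on the Java side and passes
    # its port down to the python sub-process through the environment.
    port = int(os.environ["PYSPARK_GATEWAY_PORT"])
    return JavaGateway(GatewayClient(port=port), auto_convert=False)
```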
Case (2) is more involved. We cannot simply run the shell as another application and reuse the existing code path in Spark submit as in (1), because keyboard signals would not be propagated to the python interpreter properly, and dealing with each signal individually is cumbersome and likely not comprehensive. This PR therefore takes the approach of making Python the parent process instead, which allows all keyboard signals to be propagated to the python REPL first, and then to the JVM:

- `bin/pyspark` calls `python/pyspark/repl.py`
- `repl.py` calls Spark submit as a sub-process, which invokes `PythonShellRunner`
- `PythonShellRunner` sets up the Py4j GatewayServer on the Java side
- `repl.py` learns the Py4j gateway server port from `PythonShellRunner` through sockets (a sketch of this handshake follows the list)
- `repl.py` creates a SparkContext using this gateway server
- `repl.py` starts a REPL with this SparkContext
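The socket handshake is the delicate part of this list. Below is a minimal sketch of it that fills in details the PR does not specify: the `PYSPARK_CALLBACK_PORT` variable name, the `pyspark-shell` argument to spark-submit, and the 4-byte wire format are all made up for illustration, and the py4j calls again use the older `GatewayClient` style.

```python
# Minimal sketch of the shell path (case 2). The callback-port environment
# variable, the spark-submit arguments, and the wire format are assumptions;
# the PR only states that the gateway port is exchanged over a socket.
import os
import socket
import struct
import subprocess

from py4j.java_gateway import GatewayClient, JavaGateway, java_import
from pyspark import SparkContext

def launch_shell_context():
    # 1. Listen on an ephemeral local port so the JVM can report back to us.
    callback = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    callback.bind(("127.0.0.1", 0))
    callback.listen(1)

    # 2. Launch Spark submit as a child process. Python remains the parent,
    #    so keyboard signals reach the REPL before the JVM.
    env = dict(os.environ, PYSPARK_CALLBACK_PORT=str(callback.getsockname()[1]))
    subprocess.Popen(["./bin/spark-submit", "pyspark-shell"], env=env)

    # 3. PythonShellRunner starts its Py4j GatewayServer, connects back, and
    #    sends the gateway port (assumed here to be a big-endian 4-byte int).
    conn, _ = callback.accept()
    (gateway_port,) = struct.unpack("!i", conn.recv(4))

    # 4. Connect to the JVM through Py4j and build a SparkContext on top of
    #    the gateway, importing the JVM classes PySpark needs.
    gateway = JavaGateway(GatewayClient(port=gateway_port), auto_convert=False)
    java_import(gateway.jvm, "org.apache.spark.SparkConf")
    java_import(gateway.jvm, "org.apache.spark.api.java.*")
    return SparkContext(master="local[*]", appName="PySparkShell", gateway=gateway)
```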
**TODO.** Currently, the IPython case works only for the embedded shell, but not for the notebooks. We should make it work for all cases. Also, we need to update `bin/pyspark.cmd` as well so Windows doesn't get left behind.

Comments and feedback are most welcome.