[WIP] [SPARK-1808] Route bin/pyspark through Spark submit #787
Conversation
The bin/pyspark script takes two pathways, depending on the application. If the application is a python file, the script passes the python file directly to Spark submit, which launches the python application as a sub-process within the JVM. If the application is the pyspark shell, the script invokes a special python script that in turn invokes Spark submit as a sub-process. The main benefit here is that Python is now the parent process (rather than Scala), so all keyboard signals are propagated to the python interpreter properly.

This divergence of code paths means Spark submit needs to launch two different kinds of python runners (in Scala). Currently, Spark submit invokes PythonRunner, which creates python sub-processes to run python applications. This does not apply to the shell, however, because there the parent process is already the python process running the REPL. This is why PythonRunner is split into PythonAppRunner (for launching applications) and PythonShellRunner (for launching the pyspark shell).

The new bin/pyspark has been tested locally to run both the REPL and python applications successfully through Spark submit. A big TODO at this point is to make sure the IPython case is not affected.
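To make the two pathways concrete, here is a rough Python sketch of the dispatch described above. It is only an illustration: the real bin/pyspark is a shell script, the `.py` suffix check is an assumption about how a python file is detected, and the exact arguments handed to spark-submit are not taken from this PR.

```python
# Rough sketch of the two pathways described above (illustration only; the
# real bin/pyspark is a shell script and its exact arguments may differ).
import subprocess
import sys

def dispatch(argv):
    if argv and argv[0].endswith(".py"):
        # Pathway 1: a python application. Hand the file (and its arguments)
        # straight to Spark submit, which runs it via PythonAppRunner.
        cmd = ["./bin/spark-submit"] + argv
    else:
        # Pathway 2: the pyspark shell. Run the REPL script instead; it will
        # launch Spark submit as *its* sub-process, so the Python interpreter
        # is the parent and receives keyboard signals first.
        cmd = [sys.executable, "python/pyspark/repl.py"] + argv
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(dispatch(sys.argv[1:]))
```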
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15019/

Git exception. Jenkins, test this please.

Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15023/
Previously, if bin/pyspark received an argument, it unconditionally interpreted it as a python file. This is not correct. As of this commit, all uses of bin/pyspark go through Spark submit and pass the arguments on correctly.
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15028/

Merged build triggered.

Merged build started.
This does not apply to running a python application with bin/pyspark, for instance.
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15029/

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15030/
Making big changes; re-opening in a bit. |
**Problem.** For `bin/pyspark`, there is currently no way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, `bin/pyspark` needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.

**Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user-facing Spark scripts consistent.

**Details.** `bin/pyspark` inherently handles two cases: (1) running python applications and (2) running the python shell. For (1), Spark submit already offers an existing code path to run python applications. When `bin/pyspark` is given a python file, we can simply pass the file directly to Spark submit. This is the simple case:

- `bin/pyspark` passes the python file to Spark submit, which invokes `PythonAppRunner`
- `PythonAppRunner` sets up the Py4j GatewayServer on the Java side
- `PythonAppRunner` runs the python file as a sub-process
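In case (1) the python application itself barely changes; it only needs to find the Py4j gateway that `PythonAppRunner` has already started. The sketch below is a hypothetical illustration: the `PYSPARK_GATEWAY_PORT` environment variable is how PySpark communicates this port today, but this PR does not spell out the mechanism, and the py4j calls follow the older `JavaGateway(GatewayClient(...))` constructor style of that era.

```python
# Hypothetical sketch: how a python application launched by PythonAppRunner
# could discover the already-running Py4j gateway. The environment variable
# name mirrors current PySpark; the PR itself may use a different mechanism.
import os

from py4j.java_gateway import GatewayClient, JavaGateway

def connect_to_gateway():
    # PythonAppRunner starts the GatewayServer on the Java side and passes
    # its port down to the python sub-process through the environment.
    port = int(os.environ["PYSPARK_GATEWAY_PORT"])
    return JavaGateway(GatewayClient(port=port), auto_convert=False)
```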
Case (2) is more involved. We cannot simply run the shell as another application and reuse the existing code path in Spark submit as in (1), because keyboard signals would not be propagated to the python interpreter properly, and dealing with each signal individually is cumbersome and likely not comprehensive. This PR therefore takes the approach of making Python the parent process instead, which allows all keyboard signals to be propagated to the python REPL first, and then to the JVM:

- `bin/pyspark` calls `python/pyspark/repl.py`
- `repl.py` calls Spark submit as a sub-process, which invokes `PythonShellRunner`
- `PythonShellRunner` sets up the Py4j GatewayServer on the Java side
- `repl.py` learns the Py4j gateway server port from `PythonShellRunner` through sockets (a sketch of this handshake follows the list)
- `repl.py` creates a SparkContext using this gateway server
- `repl.py` starts a REPL with this SparkContext
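The socket handshake is the delicate part of this list. Below is a minimal sketch of it that fills in details the PR does not specify: the `PYSPARK_CALLBACK_PORT` variable name, the `pyspark-shell` argument to spark-submit, and the 4-byte wire format are all made up for illustration, and the py4j calls again use the older `GatewayClient` style.

```python
# Minimal sketch of the shell path (case 2). The callback-port environment
# variable, the spark-submit arguments, and the wire format are assumptions;
# the PR only states that the gateway port is exchanged over a socket.
import os
import socket
import struct
import subprocess

from py4j.java_gateway import GatewayClient, JavaGateway, java_import
from pyspark import SparkContext

def launch_shell_context():
    # 1. Listen on an ephemeral local port so the JVM can report back to us.
    callback = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    callback.bind(("127.0.0.1", 0))
    callback.listen(1)

    # 2. Launch Spark submit as a child process. Python remains the parent,
    #    so keyboard signals reach the REPL before the JVM.
    env = dict(os.environ, PYSPARK_CALLBACK_PORT=str(callback.getsockname()[1]))
    subprocess.Popen(["./bin/spark-submit", "pyspark-shell"], env=env)

    # 3. PythonShellRunner starts its Py4j GatewayServer, connects back, and
    #    sends the gateway port (assumed here to be a big-endian 4-byte int).
    conn, _ = callback.accept()
    (gateway_port,) = struct.unpack("!i", conn.recv(4))

    # 4. Connect to the JVM through Py4j and build a SparkContext on top of
    #    the gateway, importing the JVM classes PySpark needs.
    gateway = JavaGateway(GatewayClient(port=gateway_port), auto_convert=False)
    java_import(gateway.jvm, "org.apache.spark.SparkConf")
    java_import(gateway.jvm, "org.apache.spark.api.java.*")
    return SparkContext(master="local[*]", appName="PySparkShell", gateway=gateway)
```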
**TODO.** Currently, the IPython case works only for the embedded shell, but not for the notebooks. We should make it work for all cases. Also, we need to update `bin/pyspark.cmd` as well so Windows doesn't get left behind.

Comments and feedback are most welcome.