[SPARK-2313] PySpark pass port rather than stdin #3424

Closed
lvsoft wants to merge 3 commits into apache:master from lvsoft:feature/PySparkPassPortRatherThanSTDIN

Conversation

lvsoft commented Nov 24, 2014

This patch fixes [SPARK-2313].

It peeks at an available free port number and passes that port number to Py4j.Gateway for binding via a command-line argument.
The scan for a free port starts at a value derived from the PID by a modulo operation, which could avoid potential concurrency issues, for example when supporting multiple PySpark instances in the future. The port number that Py4J prints is also still parsed as a double check.
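
A minimal sketch of the port-peeking idea described above, not the patch's actual code. The base port, the range size, and the function name `peek_free_port` are all hypothetical, since the description does not state which modulus is used. Note that the probe socket is closed before the port is returned, so another program can still take the port before the gateway binds to it.

```python
import os
import socket

BASE_PORT = 20000   # hypothetical start of the scanned range
PORT_RANGE = 10000  # hypothetical modulus; the patch description does not give one


def peek_free_port(pid=None, host="127.0.0.1"):
    """Return a port that was free at probe time, scanning from a PID-derived offset."""
    pid = os.getpid() if pid is None else pid
    start = pid % PORT_RANGE
    for offset in range(PORT_RANGE):
        port = BASE_PORT + (start + offset) % PORT_RANGE
        probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            probe.bind((host, port))  # fails with OSError if the port is in use
        except OSError:
            continue
        finally:
            probe.close()  # the port is released here, before the caller uses it
        return port
    raise RuntimeError("no free port found in the scanned range")
```

Two processes with different PIDs start scanning at different offsets, which makes a collision less likely but does not rule it out.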

@AmplabJenkins

Can one of the admins verify this patch?

davies (Contributor) commented Nov 24, 2014

I think the motivation of SPARK-2313 is to remove the dependency on STDIN for returning the port to Python; simply replacing it with a socket may work (a domain socket may not work on Windows?). There is also a race condition: the peeked free port may be taken by another program before the gateway binds to it.

So, the approach would be:

  1. Bind a socket to a random port in Python.
  2. Pass that port to the JVM, which connects to it.
  3. The Java gateway binds to a random port.
  4. The JVM passes the gateway port back via the socket created in step 1.
  5. Python reads the port from the socket created in step 1, then closes it.
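
The five steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not Spark or Py4J code: the JVM side is replaced by a thread (`fake_jvm_gateway`, a hypothetical stand-in), and the wire format, a 4-byte big-endian integer, is an assumption made for the example.

```python
import socket
import struct
import threading


def fake_jvm_gateway(callback_port, result):
    """Hypothetical stand-in for the JVM side; the real gateway is Java code."""
    # Step 3: bind the gateway to port 0 so the OS assigns a free port.
    gateway = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    gateway.bind(("127.0.0.1", 0))
    gateway.listen(1)
    result["gateway"] = gateway
    # Steps 2 and 4: connect to the Python-side socket and send the gateway port.
    with socket.create_connection(("127.0.0.1", callback_port)) as conn:
        conn.sendall(struct.pack(">i", gateway.getsockname()[1]))


def launch_gateway(timeout=5.0):
    # Step 1: bind and listen on an OS-assigned port. The socket stays open,
    # so no other program can take the port in the meantime.
    callback = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    callback.bind(("127.0.0.1", 0))
    callback.listen(1)
    callback.settimeout(timeout)
    result = {}
    jvm = threading.Thread(
        target=fake_jvm_gateway, args=(callback.getsockname()[1], result)
    )
    jvm.start()
    try:
        # Step 5: accept the connection, read the port, then close the socket.
        conn, _ = callback.accept()
        with conn, conn.makefile("rb") as reader:
            data = reader.read(4)
    finally:
        callback.close()
        jvm.join()
    if len(data) != 4:
        raise RuntimeError("gateway closed the connection before sending its port")
    return struct.unpack(">i", data)[0], result["gateway"]
```

Because both sides bind to port 0 and keep their sockets open, neither port is ever "peeked" and released, which is what removes the race condition.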

lvsoft (Author) commented Nov 25, 2014

I think this is a better solution.
However, passing the port back via a socket would affect Py4J too.
Currently, stdin is the only method Py4J supports for passing back the port number.

asfgit pushed a commit that referenced this pull request Feb 16, 2015
…hon driver

This patch changes PySpark so that the GatewayServer's port is communicated back to the Python process that launches it over a local socket instead of a pipe. The old pipe-based approach was brittle and could fail if `spark-submit` printed unexpected output to stdout.

To accomplish this, I wrote a custom `PythonGatewayServer.main()` function to use in place of Py4J's `GatewayServer.main()`.

Closes #3424.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4603 from JoshRosen/SPARK-2313 and squashes the following commits:

6a7740b [Josh Rosen] Remove EchoOutputThread since it's no longer needed
0db501f [Josh Rosen] Use select() so that we don't block if GatewayServer dies.
9bdb4b6 [Josh Rosen] Handle case where getListeningPort returns -1
3fb7ed1 [Josh Rosen] Remove stdout=PIPE
2458934 [Josh Rosen] Use underscore to mark env var. as private
d12c95d [Josh Rosen] Use Logging and Utils.tryOrExit()
e5f9730 [Josh Rosen] Wrap everything in a giant try-block
2f70689 [Josh Rosen] Use stdin PIPE to share fate with driver
8bf956e [Josh Rosen] Initial cut at passing Py4J gateway port back to driver via socket

(cherry picked from commit 0cfda84)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in 0cfda84 Feb 16, 2015
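
The commit list above mentions using select() so that the Python side does not block if the GatewayServer dies. A rough sketch of that pattern follows; it is not the code from the merged commit. The child process is a small Python script standing in for the JVM, and `CHILD`, `wait_for_port`, and the port value 25333 are hypothetical names and values chosen for the example.

```python
import select
import socket
import struct
import subprocess
import sys

# Hypothetical stand-in for the JVM: connects back and writes one 4-byte int.
CHILD = (
    "import socket, struct, sys\n"
    "s = socket.create_connection(('127.0.0.1', int(sys.argv[1])))\n"
    "s.sendall(struct.pack('>i', 25333))\n"
    "s.close()\n"
)


def wait_for_port(child_cmd, poll_interval=0.5):
    """Launch the child and wait for it to report a port, without blocking forever."""
    callback = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    callback.bind(("127.0.0.1", 0))
    callback.listen(1)
    callback_port = callback.getsockname()[1]
    proc = subprocess.Popen(child_cmd + [str(callback_port)])
    try:
        while True:
            # Wake up periodically instead of blocking in accept().
            readable, _, _ = select.select([callback], [], [], poll_interval)
            if readable:
                break
            if proc.poll() is not None:
                raise RuntimeError("gateway process exited before sending its port")
        conn, _ = callback.accept()
        with conn, conn.makefile("rb") as reader:
            data = reader.read(4)
    finally:
        callback.close()
        proc.wait()
    if len(data) != 4:
        raise RuntimeError("gateway closed the connection before sending its port")
    return struct.unpack(">i", data)[0]
```

Calling `wait_for_port([sys.executable, "-c", CHILD])` returns the value the child sent, while a child that exits without connecting produces a `RuntimeError` instead of a hang.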