[SPARK-27992][PYTHON] Allow Python to join with connection thread to propagate errors #24834
Conversation
I think there might be a better way to propagate exceptions from the Python connection serving thread for these cases. Here I duplicated the …
From the discussion in #24677, regarding the …
```scala
 * This is the same as serveToStream, only it returns a server object that
 * can be used to sync in Python.
 */
private[spark] def serveToStreamWithSync(
```
This could be cleaned up and replace the existing `serveToStream`. It just returns the `SocketAuthServer` object as the third element in the Array, and it could be ignored if no synchronization is needed.
```scala
private[spark] class SocketFuncServer(
    authHelper: SocketAuthHelper,
    threadName: String,
    func: Socket => Unit) extends SocketAuthServer[Unit](authHelper, threadName) {
```
I don't think we need `SocketAuthServer.setupOneConnectionServer` if we have this too, so it could be cleaned up.
I removed `SocketAuthServer.setupOneConnectionServer` and replaced its usage with `SocketFuncServer`.
python/pyspark/sql/dataframe.py
Outdated
```python
# Collect list of un-ordered batches where last element is a list of correct order indices
results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
from pyspark.rdd import _create_local_socket
```
This should be cleaned up; the code below basically duplicates `_load_from_socket`.
I'm attaching error messages from …
To sum up the differences: (3) this PR is more similar to (1) in that it has a Py4JJavaError and the …
What do you guys think @HyukjinKwon @felixcheung @dvogelbacher?
Test build #106364 has finished for PR 24834 at commit
I really like the idea. This is much better than having to define a specific protocol for propagating errors like we currently do. From having a short look at the R code, it seems like R would also be affected by https://jira.apache.org/jira/browse/SPARK-27805, and using this same mechanism in R would fix it there, too?
python/pyspark/sql/dataframe.py
Outdated
```python
from pyspark.rdd import _create_local_socket
sock_file = _create_local_socket((port, auth_secret))
results = list(ArrowCollectSerializer().load_stream(sock_file))
jserver_obj.getResult()  # Join serving thread and raise any exceptions
```
We might want to have this in a `finally` clause, so that if we have an error during serialization (which might be caused by an exception in the JVM) we will still get the original exception.
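The suggested try/finally pattern can be sketched in plain Python. `FakeServer` and `load_with_join` are hypothetical stand-ins for the Py4J server handle and the loading code, not Spark's actual API:

```python
# A minimal sketch of the try/finally join pattern. FakeServer and
# load_with_join are illustrative stand-ins, not Spark's actual API.

class FakeServer:
    """Stands in for the Py4J server handle whose getResult() joins the
    serving thread and re-raises any JVM-side exception."""
    def __init__(self, error=None):
        self.error = error
        self.joined = False

    def getResult(self):
        self.joined = True
        if self.error is not None:
            raise self.error


def load_with_join(load_stream, server):
    """Join the serving thread even if deserialization fails, so the
    original server-side error is not masked by a secondary one."""
    try:
        return list(load_stream())
    finally:
        server.getResult()
```

Because the join happens in `finally`, a JVM-side error surfaces even when the Python-side deserialization fails first.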
Looks reasonable; agreed with @dvogelbacher's comment.
Thanks @dvogelbacher and @felixcheung, I will clean this up then and apply the same fix to …
python/pyspark/sql/dataframe.py
Outdated
```python
results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
from pyspark.rdd import _create_local_socket
sock_file = _create_local_socket((port, auth_secret))
results = list(ArrowCollectSerializer().load_stream(sock_file))
```
Yeah, looks the same as `_load_from_socket`.
python/pyspark/sql/dataframe.py
Outdated
```python
@@ -2200,10 +2200,13 @@ def _collectAsArrow(self):
        .. note:: Experimental.
        """
        with SCCallSiteSync(self._sc) as css:
            sock_info = self._jdf.collectAsArrowToPython()
            port, auth_secret, jserver_obj = self._jdf.collectAsArrowToPython()
```
Do you think it makes sense to make a `_serialize_from_jvm` (like `_serialize_to_jvm`)? This could be done separately.
Possibly; `_load_from_socket` is basically deserializing from the JVM.
@BryanCutler, I don't mind, but I don't feel strongly about backporting. If you think we should, we can.
The approach looks fine.
Force-pushed from f20a156 to ead8978.
```scala
    JavaUtils.closeQuietly(serverSocket)
    JavaUtils.closeQuietly(sock)
  }

  def serveToStream(
```
Moved this from `SocketAuthHelper` because it seemed more fitting here.
```diff
@@ -1389,7 +1389,9 @@ private[spark] object Utils extends Logging {
         originalThrowable = cause
         try {
           logError("Aborting task", originalThrowable)
-          TaskContext.get().markTaskFailed(originalThrowable)
+          if (TaskContext.get() != null) {
```
Using this utility here https://github.com/apache/spark/pull/24834/files#diff-0a67bc4d171abe4df8eb305b0f4123a2R184, where the task fails and completes before hitting the `catchBlock`, so `TaskContext.get()` returns null.
I cleaned up some things with …
Test build #106739 has finished for PR 24834 at commit
Test build #106740 has finished for PR 24834 at commit
Let me give some input within 3 days.
```scala
@@ -66,42 +87,45 @@ private[spark] abstract class SocketAuthServer[T](
  }

  /**
   * Create a socket server class and run user function on the socket in a background thread.
   * This is the same as calling SocketAuthServer.setupOneConnectionServer except it creates
```
Seems we don't have setupOneConnectionServer anymore.
Oops, good catch
""" | ||
port = sock_info[0] | ||
auth_secret = sock_info[1] | ||
sockfile, sock = local_connect_and_auth(port, auth_secret) | ||
# The RDD materialization time is unpredictable, if we set a timeout for socket reading | ||
# operation, it will very possibly fail. See SPARK-18281. | ||
sock.settimeout(None) | ||
return sockfile | ||
|
||
|
||
def _load_from_socket(sock_info, serializer): |
@BryanCutler, what does `sock_info` expect to be? It seems it can be both a 2-tuple and a 3-tuple (with server).
Ugh, yeah, I'm not too happy with this. Java returns a 3-tuple of (port, auth_secret, server), and most places only use the first 2, such as `_load_from_socket`. It gets a little confusing, so I thought it might be better to expand the values returned by Java for `serveToStream` etc., but it ended up with a lot of changes where the third value is ignored, like `port, auth_secret, _ = ...`, and I don't think it really made things clearer. I'll try to think of something better and maybe do a followup.
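One way out of the 2-tuple/3-tuple ambiguity would be a small normalizing helper on the Python side. `parse_sock_info` below is a hypothetical sketch, not something in this PR:

```python
def parse_sock_info(sock_info):
    """Normalize the (port, auth_secret[, server]) value returned from the
    JVM so callers always unpack three elements. Hypothetical helper."""
    if len(sock_info) == 3:
        port, auth_secret, server = sock_info
    else:
        port, auth_secret = sock_info
        server = None  # no server handle; caller cannot join the serving thread
    return port, auth_secret, server
```

Callers that don't need to join simply ignore the third element, and the tuple shape is handled in one place.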
R side, I will take a look after merging this in.
Looks good, except that I think it's now difficult to read `sock_info` .. https://github.com/apache/spark/pull/24834/files#r296549441
BTW, let's avoid refactoring the code together with a fix; it took me a while to understand the fix and track the changes.
Generally makes sense to me -- just a few comments on things which would help me follow this code when I occasionally look at it, more as a java developer that occasionally needs to work on the communication protocol.
One important thing -- please change the summary / description to replace "synchronize" with "join".
```scala
 * Create a socket server and run user function on the socket in a background thread.
 *
 * The socket server can only accept one connection, or close if no connection
 * in 15 seconds.
```
Please save this comment -- I guess move it to the class `SocketAuthServer`. In particular, it's helpful to note that this only accepts one connection; it's not a long-lived thing which is reused.
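The one-connection, bounded-wait behaviour that comment documents can be illustrated with a toy Python server. `one_shot_server` is illustrative only, not Spark's Scala implementation:

```python
import socket
import threading

def one_shot_server(handle, timeout=15.0):
    """Toy analog of a one-connection server: accepts a single connection
    or gives up after `timeout` seconds (illustrative, not Spark's code)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # OS picks a free port
    srv.listen(1)
    srv.settimeout(timeout)
    port = srv.getsockname()[1]

    def serve():
        try:
            conn, _ = srv.accept()  # only one connection is ever accepted
            with conn:
                handle(conn)
        finally:
            srv.close()  # the server is not reused afterwards

    t = threading.Thread(target=serve, daemon=True)
    t.start()
    return port, t
```

After the single connection is handled (or the accept times out), the listening socket is closed, matching the "not long-lived, not reused" point above.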
Yeah that is a useful comment, I didn't intend to take this out. I'll put it back in.
python/pyspark/rdd.py
Outdated
```python
@@ -159,7 +174,8 @@ class PyLocalIterable(object):
    """ Create a synchronous local iterable over a socket """

    def __init__(self, _sock_info, _serializer):
        self._sockfile = _create_local_socket(_sock_info)
        port, auth_secret, self.jserver_obj = _sock_info
```
Just a general comment, which might not make the most sense to address in this particular PR -- I'd find it really helpful if the Python code dealing with Java objects would annotate (somehow) the Java types. It's hard for me to figure out whether `jserver_obj` is a ServerSocket, a SocketAuthServer, a Py4JJavaServer, etc.
Sure, I can rename it to something more fitting, and I agree it should be clear from the name what the variable is.
Test build #106844 has finished for PR 24834 at commit
Thanks @HyukjinKwon and @squito for reviewing, I addressed your comments.
Merged to master, thanks all for reviewing!
…nection thread to propagate errors

### What changes were proposed in this pull request?

This PR proposes to backport #24834 with minimised changes, plus the tests added at #25594. #24834 was not backported before because it mainly targeted a better exception by propagating the exception from the JVM. However, that PR actually fixed another problem accidentally (see #25594 and [SPARK-28881](https://issues.apache.org/jira/browse/SPARK-28881)). This regression seems to have been introduced by #21546. The root cause is that in https://github.com/apache/spark/blob/23bed0d3c08e03085d3f0c3a7d457eedd30bd67f/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3370-L3384, `runJob` with `resultHandler` seems able to write partial output. The JVM throws an exception, but since the JVM exception is not propagated into the Python process, the Python process doesn't know whether an exception was thrown from the JVM (it just closes the socket), which results in the following:

```
./bin/pyspark --conf spark.driver.maxResultSize=1m
```

```python
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(10000000).toPandas()
```

```
Empty DataFrame
Columns: [id]
Index: []
```

With this change, the Python process catches exceptions from the JVM.

### Why are the changes needed?

It returns incorrect data, and potentially partial results, when an exception happens on the JVM side. This is a regression; the code works fine in Spark 2.3.3.

### Does this PR introduce any user-facing change?

Yes. The same repro as above now throws an exception as expected:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../pyspark/sql/dataframe.py", line 2122, in toPandas
    batches = self._collectAsArrow()
  File "/.../pyspark/sql/dataframe.py", line 2184, in _collectAsArrow
    jsocket_auth_server.getResult()  # Join serving thread and raise any exceptions
  File "/.../lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/.../pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o42.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (6.5 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
```

### How was this patch tested?

Manually as described above; a unit test was added.

Closes #25593 from HyukjinKwon/SPARK-27992.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?

Currently with `toLocalIterator()` and `toPandas()` with Arrow enabled, if the Spark job being run in the background serving thread errors, the error is caught and sent to Python through the PySpark serializer. This is not ideal because it only catches a SparkException, it won't handle an error that occurs in the serializer, and each method has to have its own special handling to propagate the error.

This PR instead returns the Python server object along with the serving port and authentication info, which allows the Python caller to join with the serving thread. During the call to join, the serving thread Future is completed either successfully or with an exception; in the latter case, the exception is propagated to Python through the Py4J call.

## How was this patch tested?

Existing tests
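The join mechanism described above can be sketched with a Future: the background serving thread completes it with either a result or an exception, and the caller's join re-raises the latter. `MiniAuthServer` below is a toy analog of this pattern, not Spark's actual `SocketAuthServer`:

```python
import concurrent.futures
import threading

class MiniAuthServer:
    """Toy analog of the pattern in this PR: run `func` in a background
    thread and record its outcome in a Future (illustrative only)."""
    def __init__(self, func):
        self._future = concurrent.futures.Future()
        threading.Thread(target=self._run, args=(func,), daemon=True).start()

    def _run(self, func):
        try:
            self._future.set_result(func())
        except Exception as e:
            # The Future carries the exception to whoever joins later.
            self._future.set_exception(e)

    def get_result(self):
        # Blocks like the Py4J getResult() call; re-raises any exception
        # raised by the background function.
        return self._future.result(timeout=5)
```

A caller that invokes `get_result()` after consuming the stream sees either the thread's result or its original exception, with no per-method error protocol.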