
[SPARK-27805][PYTHON] Propagate SparkExceptions during toPandas with arrow enabled #24677

Closed · wants to merge 5 commits into apache:master from dvogelbacher:dv/betterErrorMsgWhenUsingArrow

Conversation

@dvogelbacher (Contributor)

What changes were proposed in this pull request?

Similar to #24070, we now propagate SparkExceptions that are encountered during the collect in the Java process to the Python process.

Fixes https://jira.apache.org/jira/browse/SPARK-27805

How was this patch tested?

Added a new unit test

@dvogelbacher (Contributor, Author)

@BryanCutler @HyukjinKwon could you please take a look at this one as well?

@dvogelbacher dvogelbacher changed the title [SPARK-27805][PYTHON] Propagate SparkExceptions for toPandas with arrow enabled [SPARK-27805][PYTHON] Propagate SparkExceptions during toPandas with arrow enabled May 22, 2019
@BryanCutler (Member)

We should definitely have had a test for this, but I would think the error would occur in the call to self._jdf.collectAsArrowToPython() and then get propagated through Py4J, so we shouldn't need to do any special handling. I'll have to look into what's going on a little later.

@dvogelbacher (Contributor, Author)

collectAsArrowToPython will just return the socket info from PythonRDD.serveToStream("serve-Arrow"). The exception occurs during the runJob inside serveToStream, which is executed in a background thread. When the background thread encounters an exception, it closes the OutputStream.
The ArrowStreamSerializer in the Python process will then think that it has read all the batches, after which the ArrowCollectSerializer will try to read the batch order indices and throw an EOFError, because those were never written.

Also note that before #22275 (which introduced the batch order indices), this would not have resulted in any error on the Python side: we would just have dropped some partitions without throwing an error. Now we at least get an error, but not a very helpful one.
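
To make the failure mode concrete, here is a simplified, illustrative sketch of the serveToStream pattern described above (not the actual PythonRDD code; authentication and error logging are omitted, and all names here are assumptions). The key point is that the writer runs on a background thread and the stream is closed on success and failure alike, so the Python reader cannot tell a crashed job apart from a normal end of data:

```scala
import java.io.{BufferedOutputStream, OutputStream}
import java.net.ServerSocket

// Simplified sketch: serve the bytes produced by writeFunc over a local socket.
def serveToStream(threadName: String)(writeFunc: OutputStream => Unit): Int = {
  val serverSocket = new ServerSocket(0) // the port is handed to the Python process
  val thread = new Thread(threadName) {
    override def run(): Unit = {
      val socket = serverSocket.accept()
      val out = new BufferedOutputStream(socket.getOutputStream)
      try {
        writeFunc(out) // runJob happens in here; a failing job throws out of writeFunc
      } finally {
        out.close()    // closed on success *and* failure, so Python just sees EOF
        socket.close()
        serverSocket.close()
      }
    }
  }
  thread.setDaemon(true)
  thread.start()
  serverSocket.getLocalPort
}
```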

@dvogelbacher (Contributor, Author)

Do you have any more thoughts on this, @BryanCutler?

@BryanCutler (Member)

Yes, you are right, this is the same issue as toLocalIterator in #24070 and needs to be fixed. This is a real problem for branch-2.4, which, like you said, could cause toPandas to return a partial result without raising the error. @HyukjinKwon do you think it would make sense to patch branch-2.4 with a manual fix?

@BryanCutler (Member) left a review comment

Thanks for the fix @dvogelbacher, I had a few comments but it looks pretty good.

Review threads (outdated, resolved):
  • sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (3 threads)
  • python/pyspark/sql/tests/test_arrow.py
  • python/pyspark/serializers.py
@BryanCutler (Member)

ok to test

@SparkQA

SparkQA commented May 29, 2019

Test build #105882 has finished for PR 24677 at commit 4f57b7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 29, 2019

Test build #105883 has finished for PR 24677 at commit 4f57b7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

> Yes, you are right, this is the same issue as toLocalIterator in #24070 and needs to be fixed. This is a real problem for branch-2.4, which, like you said, could cause toPandas to return a partial result without raising the error. @HyukjinKwon do you think it would make sense to patch branch-2.4 with a manual fix?

this sounds important...

@SparkQA

SparkQA commented May 29, 2019

Test build #105915 has finished for PR 24677 at commit d9936d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 29, 2019

Test build #105916 has finished for PR 24677 at commit ccfeb9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member) left a review comment

This looks pretty good now @dvogelbacher, I'm just not sure whether it should write the error in a finally block and possibly re-throw the exception. Let me look into it a little more.


```diff
-      // After processing all partitions, end the stream and write batch order indices
+      // After processing all partitions, end the batch stream
       batchWriter.end()
```
@BryanCutler (Member)

I'm wondering if this and the code below should be in a finally block?

@dvogelbacher (Contributor, Author)

If we put it into a finally block but only catch SparkException, that would be wrong: if a different exception gets thrown, we would fall into case None, end the stream as if nothing happened, and only get partial, incorrect data on the Python side.
If we want to put this into a finally block, we should catch all exceptions, but I figured I'd do the same as in https://github.com/apache/spark/pull/24070/files#r279589039

It should be fine as is: if any exception that isn't a SparkException gets thrown, we will never reach this code. Instead the OutputStream just gets closed and we get an EOFError on the Python side (as we currently do for all exceptions).
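
For reference, a hedged sketch of the control flow being defended here, written as a free-standing function (runArrowJob and endBatchStream stand in for the runJob call and batchWriter.end() in the surrounding Dataset.scala; PythonRDD.writeUTF is Spark-internal, so this sketch only compiles inside Spark's own source tree). Only SparkException is caught; anything else escapes, the background thread closes the socket, and Python fails with an EOFError exactly as it did before this patch. The success branch, which writes the batch order indices, is sketched after the review snippet further down.

```scala
import java.io.DataOutputStream
import org.apache.spark.SparkException
import org.apache.spark.api.python.PythonRDD

def streamWithErrorPropagation(
    out: DataOutputStream,
    runArrowJob: () => Unit,
    endBatchStream: () => Unit): Unit = {
  var error: Option[SparkException] = None
  try {
    runArrowJob() // streams Arrow batches to Python as partitions complete
  } catch {
    // Deliberately narrow: a non-SparkException propagates out of this method,
    // the stream is closed unceremoniously, and Python raises EOFError as before.
    case e: SparkException => error = Some(e)
  }
  endBatchStream() // end the Arrow batch stream before the trailing sentinel
  error.foreach { e =>
    out.writeInt(-1)                      // failure sentinel instead of a batch count
    PythonRDD.writeUTF(e.getMessage, out) // message for Python to re-raise
  }
}
```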

@dvogelbacher (Contributor, Author)

Any more thoughts on this, @BryanCutler?

@BryanCutler (Member)

Yeah, I think this is fine

@BryanCutler (Member)

BryanCutler commented May 30, 2019

> this sounds important...

Yeah @felixcheung, it could be a nasty problem. I think the logs will show that the job had an error, but the application's Python script would continue to run with partial results. Let me verify this in branch-2.4. What are your thoughts on possibly backporting a fix?

After checking it out with branch-2.4, it is possible to get partial results in Python, but the JVM error is shown and is pretty obvious. It's unfortunate that this wasn't caught earlier, but I don't think it's worth the risk to backport right now.

@BryanCutler (Member) left a review comment

LGTM

@BryanCutler (Member)

Merged to master, thanks @dvogelbacher!

@felixcheung (Member)

> After checking it out with branch-2.4, it is possible to get partial results in Python, but the JVM error is shown and is pretty obvious. It's unfortunate that this wasn't caught earlier, but I don't think it's worth the risk to backport right now.

ah ok

```scala
      case Some(exception) =>
        // Signal failure and write error message
        out.writeInt(-1)
        PythonRDD.writeUTF(exception.getMessage, out)
```
@HyukjinKwon (Member)

Sorry for the late response, @BryanCutler.

Yea, I had the same question (#24677 (comment)). Thanks for the details at #24677 (comment). It would have been better if those were explained in a code comment, since at least two committers raised the same question :-).

> @HyukjinKwon do you think it would make sense to patch branch-2.4 with a manual fix?

Yea, I don't mind backporting it (though I don't feel strongly that we should).
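
For contrast with the failure branch quoted above, here is a sketch of what the success branch plausibly looks like (assumed shape, simplified: the actual change keeps a buffer of (partition index, batch index) pairs and derives the ordering from it). A non-negative count signals success and is followed by that many batch order indices, which the Python serializer uses to put the transferred batches back in order:

```scala
import java.io.DataOutputStream

// Hypothetical helper, not Spark code: write the success trailer.
def signalSuccess(out: DataOutputStream, batchOrder: Seq[Int]): Unit = {
  out.writeInt(batchOrder.length)  // count >= 0 distinguishes success from the -1 above
  batchOrder.foreach(out.writeInt) // one index per transferred Arrow batch
}
```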

@HyukjinKwon (Member) left a review comment

LGTM too, but I wonder if this is the only way.

@dvogelbacher would you mind if I ask you to check the normal collect() code path too? Dataset.collectToPython() looks like the same kind of issue.

@BryanCutler (Member)

> LGTM too, but I wonder if this is the only way.
> @dvogelbacher would you mind if I ask you to check the normal collect() code path too? Dataset.collectToPython() looks like the same kind of issue.

The collect() code path is different and doesn't have this exact problem, but it could possibly fail in another way. @HyukjinKwon @felixcheung @dvogelbacher let's continue the discussion in #24834, where I'm trying out another way to do this.

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
…arrow enabled

## What changes were proposed in this pull request?
Similar to apache#24070, we now propagate SparkExceptions that are encountered during the collect in the java process to the python process.

Fixes https://jira.apache.org/jira/browse/SPARK-27805

## How was this patch tested?
Added a new unit test

Closes apache#24677 from dvogelbacher/dv/betterErrorMsgWhenUsingArrow.

Authored-by: David Vogelbacher <dvogelbacher@palantir.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
dvogelbacher added a commit to palantir/spark that referenced this pull request Nov 25, 2019
…arrow enabled
bulldozer-bot bot pushed a commit to palantir/spark that referenced this pull request Nov 25, 2019