[SPARK-27805][PYTHON] Propagate SparkExceptions during toPandas with arrow enabled #24677
Conversation
@BryanCutler @HyukjinKwon could you please take a look at this one as well?
We should definitely have had a test for this, but I would think the error would occur in the call to
Also note that before #22275 (which introduced the batch order indices) this would not have resulted in any error on the Python side. We would have just dropped some partitions without throwing an error. Now at least we get an error, but it is not a very helpful one.
Do you have any more thoughts on this @BryanCutler ?
Yes, you are right, this is the same issue as
Thanks for the fix @dvogelbacher , I had a few comments but it looks pretty good.
ok to test

Test build #105882 has finished for PR 24677 at commit

Test build #105883 has finished for PR 24677 at commit

this sounds important...

Test build #105915 has finished for PR 24677 at commit

Test build #105916 has finished for PR 24677 at commit
This looks pretty good now @dvogelbacher , I'm just not sure if it should write the error in a `finally` block and possibly re-throw the exception. Let me look into it a little more.
```diff
- // After processing all partitions, end the stream and write batch order indices
+ // After processing all partitions, end the batch stream
  batchWriter.end()
```
I'm wondering if this and the code below should be in a `finally` block?
If we put it into a `finally` block but only catch `SparkException`, then that would be wrong: if a different exception gets thrown, we would go into `case None`, end the stream as if nothing happened, and only get partial, incorrect data on the Python side. If we want to put this into a `finally` block then we should catch all exceptions, but I figured I'd do the same as in https://github.com/apache/spark/pull/24070/files#r279589039

It should be fine as is: if any exception that isn't a `SparkException` gets thrown, then we will never reach this code. Instead the `OutputStream` just gets closed and we get an `EOFError` on the Python side (like we do right now for all exceptions).
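The distinction above can be illustrated with a minimal Python sketch of the reader side. This is not Spark's actual deserializer code, and `read_signal` is a hypothetical helper; it only shows why an abruptly closed stream surfaces as an `EOFError` rather than a propagated error message:

```python
import io
import struct

def read_signal(stream):
    """Read a 4-byte big-endian signal from the stream.

    A sketch of how a reader consumes what the JVM writer produced: if
    the writer died from a non-SparkException, the socket was simply
    closed mid-stream, so the reader hits a short read and raises
    EOFError instead of receiving any error frame.
    """
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream ended unexpectedly")
    return struct.unpack(">i", header)[0]

# A writer that died before writing anything: the reader only sees EOF.
truncated = io.BytesIO(b"")
try:
    read_signal(truncated)
except EOFError as err:
    print("reader saw:", err)
```

A well-formed stream, by contrast, always carries an explicit signal value that the reader can dispatch on.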
Any more thoughts on this @BryanCutler ?
Yeah, I think this is fine
Yeah @felixcheung , it could be a nasty problem. I think the logs will show the job had an error, but the application Python script would continue to run with partial results. Let me verify this in branch-2.4. What are your thoughts on possibly backporting a fix?

After checking it out with branch-2.4, it is possible to get partial results in Python, but the JVM error is shown and is pretty obvious. It's unfortunate that this wasn't caught earlier, but I don't think it's worth the risk to backport right now.
LGTM
merged to master, thanks @dvogelbacher !
ah ok
```scala
case Some(exception) =>
  // Signal failure and write error message
  out.writeInt(-1)
  PythonRDD.writeUTF(exception.getMessage, out)
```
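The `-1` sentinel plus a length-prefixed UTF-8 message is what the Python side has to decode. Below is a minimal round-trip sketch of that framing, assuming `PythonRDD.writeUTF` writes a 4-byte big-endian byte length followed by the UTF-8 bytes; the helper names are hypothetical, not Spark's actual functions:

```python
import io
import struct

def write_error_frame(stream, message: str) -> None:
    # Mirrors the Scala snippet above: writeInt(-1), then a
    # length-prefixed UTF-8 string (assumed writeUTF wire format).
    payload = message.encode("utf-8")
    stream.write(struct.pack(">i", -1))
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)

def read_frame(stream):
    # A non-negative signal is normal data; -1 means an error frame
    # follows, which we surface as a Python exception.
    signal = struct.unpack(">i", stream.read(4))[0]
    if signal == -1:
        length = struct.unpack(">i", stream.read(4))[0]
        raise RuntimeError(stream.read(length).decode("utf-8"))
    return signal

buf = io.BytesIO()
write_error_frame(buf, "Job aborted due to stage failure")
buf.seek(0)
try:
    read_frame(buf)
except RuntimeError as err:
    print("propagated:", err)
```

This is the shape of the fix: instead of the reader dying with a bare `EOFError`, the JVM's exception message travels over the stream and is re-raised with context.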
Sorry for the late response, @BryanCutler
Yea, I had the same question #24677 (comment). Thanks for the details at #24677 (comment). It would have been better if those were commented, since at least two committers raised the same question :-).
@HyukjinKwon do you think would it make sense to patch branch-2.4 with a manual fix?
Yea, I don't mind backporting it (though I don't strongly feel we should, either).
LGTM too but .. I wonder if this is the only way. @dvogelbacher would you mind if I ask you to check the normal `collect()` code path too? `Dataset.collectToPython()` looks like the same instance.
…arrow enabled

## What changes were proposed in this pull request?
Similar to apache#24070, we now propagate SparkExceptions that are encountered during the collect in the Java process to the Python process.
Fixes https://jira.apache.org/jira/browse/SPARK-27805

## How was this patch tested?
Added a new unit test

Closes apache#24677 from dvogelbacher/dv/betterErrorMsgWhenUsingArrow.
Authored-by: David Vogelbacher <dvogelbacher@palantir.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
…arrow enabled (#626) Cherry-pick of apache#24677
What changes were proposed in this pull request?
Similar to #24070, we now propagate SparkExceptions that are encountered during the collect in the Java process to the Python process.
Fixes https://jira.apache.org/jira/browse/SPARK-27805
How was this patch tested?
Added a new unit test