New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error #42818
Conversation
if e.code() in [ | ||
grpc.StatusCode.INTERNAL | ||
]: | ||
# This error happens if another RPC preempts this RPC, retry. | ||
msg_cursor_disconnected = "INVALID_CURSOR.DISCONNECTED" | ||
|
||
msg = str(e) | ||
if any(map(lambda x: x in msg, [msg_cursor_disconnected])): | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have some logic to convert the error messages from random RPC errors to spark understandable exceptions. I'm wondering if we should leverage this here as well.
pyspark.errors.exceptions.connect.convert()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do this in a follow up.
https://github.com/juliuszsompolski/apache-spark/actions/runs/6097172856/job/16544325699
flake |
https://github.com/juliuszsompolski/apache-spark/actions/runs/6097172856/job/16544375244
I believe that after the server crashed with
The client kept retrying the UNAVAILABLE error (as expected), until it ran out of retries (the fact that it took 612 s is consistent that retries should take about 10 minutes) So this is as expected (other than the server shouldn't be crashing with malloc memory corruption, but that's unrelated to this PR). |
...tor/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala
Outdated
Show resolved
Hide resolved
Merged to master and branch-3.5. |
…rror ### What changes were proposed in this pull request? Make INVALID_CURSOR.DISCONNECTED a retriable error. ### Why are the changes needed? This error can happen if two RPCs are racing to reattach to the query, and the client is still using the losing one. SPARK-44833 was a bug that exposed such a situation. That was fixed, but to be more robust, we can make this error retryable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests will be added in #42560 ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42818 from juliuszsompolski/SPARK-44835. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f13743d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Make INVALID_CURSOR.DISCONNECTED a retriable error.
Why are the changes needed?
This error can happen if two RPCs are racing to reattach to the query, and the client is still using the losing one. SPARK-44833 was a bug that exposed such a situation. That was fixed, but to be more robust, we can make this error retryable.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tests will be added in #42560
Was this patch authored or co-authored using generative AI tooling?
No.