
[SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error #42818

Closed
wants to merge 6 commits into master from juliuszsompolski:SPARK-44835

Conversation

juliuszsompolski
Contributor

What changes were proposed in this pull request?

Make INVALID_CURSOR.DISCONNECTED a retriable error.

Why are the changes needed?

This error can happen when two RPCs race to reattach to the query and the client is still using the losing one. SPARK-44833 was a bug that exposed such a situation. That bug was fixed, but to be more robust we can also make this error retriable.
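As a rough illustration of where such a classification plugs in, here is a minimal sketch of a client-side retry loop driven by an error classifier. The helper names (is_retriable, call_with_retries) and the backoff numbers are illustrative assumptions, not Spark Connect's actual API:

import time
import grpc

def is_retriable(e: grpc.RpcError) -> bool:
    # UNAVAILABLE was already retried before this PR; the change additionally
    # retries INTERNAL errors carrying the INVALID_CURSOR.DISCONNECTED class.
    if e.code() == grpc.StatusCode.UNAVAILABLE:
        return True
    if e.code() == grpc.StatusCode.INTERNAL:
        return "INVALID_CURSOR.DISCONNECTED" in str(e)
    return False

def call_with_retries(rpc, max_retries=15, initial_backoff=0.05,
                      multiplier=4.0, max_backoff=60.0):
    backoff = initial_backoff
    for attempt in range(max_retries + 1):
        try:
            return rpc()
        except grpc.RpcError as e:
            # Re-raise if the budget is exhausted or the error is permanent.
            if attempt == max_retries or not is_retriable(e):
                raise
            time.sleep(backoff)
            backoff = min(backoff * multiplier, max_backoff)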

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tests will be added in #42560

Was this patch authored or co-authored using generative AI tooling?

No.

@juliuszsompolski juliuszsompolski marked this pull request as ready for review September 5, 2023 16:42
Comment on lines 606 to 614
if e.code() in [
    grpc.StatusCode.INTERNAL
]:
    # This error happens if another RPC preempts this RPC, retry.
    msg_cursor_disconnected = "INVALID_CURSOR.DISCONNECTED"

    msg = str(e)
    if any(map(lambda x: x in msg, [msg_cursor_disconnected])):
        return True
Contributor
We have some logic to convert random RPC errors into Spark-understandable exceptions. I'm wondering if we should leverage this here as well.

pyspark.errors.exceptions.connect.convert()

Contributor

We can do this in a follow up.
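As a hypothetical sketch of that follow-up, the raw substring match could become a check on a typed exception's error class once the gRPC error has been converted. The converter step and the exact error-class string here are assumptions for illustration, not the final implementation:

from pyspark.errors import PySparkException

def is_cursor_disconnected(converted: Exception) -> bool:
    # Assumes the grpc.RpcError has already been turned into a typed
    # PySparkException by the converter mentioned above; matching on the
    # error class is more robust than substring-matching the raw message.
    return (
        isinstance(converted, PySparkException)
        and converted.getErrorClass() == "INVALID_CURSOR.DISCONNECTED"
    )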

@juliuszsompolski
Contributor Author

https://github.com/juliuszsompolski/apache-spark/actions/runs/6097172856/job/16544325699

[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.deploy.k8s.integrationtest.VolcanoSuite

flake

@juliuszsompolski
Contributor Author

juliuszsompolski commented Sep 6, 2023

https://github.com/juliuszsompolski/apache-spark/actions/runs/6097172856/job/16544375244

  test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... malloc(): unsorted double linked list corrupted
ERROR (612.473s)
  test_self_join (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... ERROR (611.358s)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
/__w/apache-spark/apache-spark/python/pyspark/context.py:657: RuntimeWarning: Unable to cleanly shutdown Spark JVM process. It is possible that the process has crashed, been killed or may also be in a zombie state.
  warnings.warn(
/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/reattach.py:219: UserWarning: ReleaseExecute failed with exception: Cannot invoke RPC on closed channel!.
  warnings.warn(f"ReleaseExecute failed with exception: {e}.")

======================================================================
ERROR [612.473s]: test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/tests/connect/test_parity_pandas_map.py", line 26, in test_other_than_dataframe_iter
    self.check_other_than_dataframe_iter()
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/tests/pandas/test_pandas_map.py", line 163, in check_other_than_dataframe_iter
    (self.spark.range(10, numPartitions=3).mapInPandas(bad_iter_elem, "a int").count())
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/dataframe.py", line 252, in count
    pdd = self.agg(_invoke_function("count", lit(1))).toPandas()
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/dataframe.py", line 1703, in toPandas
    return self._session.client.to_pandas(query)
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 873, in to_pandas
    table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1282, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(req):
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1263, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1502, in _handle_error
    self._handle_rpc_error(error)
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1542, in _handle_rpc_error
    raise SparkConnectGrpcException(str(rpc_error)) from None
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:38175: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:38175: Failed to connect to remote host: Connection refused {created_time:"2023-09-06T13:29:14.012172836+00:00", grpc_status:14}"
>

I believe that the server crashed with

  test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... malloc(): unsorted double linked list corrupted

The client then kept retrying the UNAVAILABLE error (as expected) until it ran out of retries; the fact that it took 612 s is consistent with the retry budget adding up to about 10 minutes.
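A quick check of that budget, assuming defaults of 15 retries, 50 ms initial backoff, a 4x multiplier, and a 60 s cap (these particular numbers are assumptions, not taken from this PR):

# Capped exponential backoff: 50 ms * 4^k, capped at 60 s, for 15 retries.
waits = [min(0.05 * 4**k, 60.0) for k in range(15)]
print(waits[:7])   # [0.05, 0.2, 0.8, 3.2, 12.8, 51.2, 60.0]
print(sum(waits))  # 608.25 -> roughly 10 minutes, matching the 612 s above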

So this is as expected (other than that the server shouldn't be crashing with malloc memory corruption, which is unrelated to this PR).

@HyukjinKwon HyukjinKwon changed the title [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error [SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error Sep 7, 2023
@HyukjinKwon
Member

Merged to master and branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Sep 7, 2023
[SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error

### What changes were proposed in this pull request?

Make INVALID_CURSOR.DISCONNECTED a retriable error.

### Why are the changes needed?

This error can happen when two RPCs race to reattach to the query and the client is still using the losing one. SPARK-44833 was a bug that exposed such a situation. That bug was fixed, but to be more robust we can also make this error retriable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests will be added in #42560

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42818 from juliuszsompolski/SPARK-44835.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f13743d)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>