
[SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error #42818

Closed
wants to merge 6 commits into master from juliuszsompolski:SPARK-44835

Conversation

juliuszsompolski
Contributor

What changes were proposed in this pull request?

Make INVALID_CURSOR.DISCONNECTED a retriable error.

Why are the changes needed?

This error can happen when two RPCs race to reattach to the query and the client is still using the losing one. SPARK-44833 was a bug that exposed such a situation. That bug was fixed, but to be more robust we can also make this error retriable.
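As a rough illustration of where such a classification plugs in, here is a minimal sketch of a client-side retry loop driven by an error classifier. The helper names (is_retriable, call_with_retries) and the backoff numbers are illustrative assumptions, not Spark Connect's actual API:

import time
import grpc

def is_retriable(e: grpc.RpcError) -> bool:
    # UNAVAILABLE was already retried before this PR; the change additionally
    # retries INTERNAL errors carrying the INVALID_CURSOR.DISCONNECTED class.
    if e.code() == grpc.StatusCode.UNAVAILABLE:
        return True
    if e.code() == grpc.StatusCode.INTERNAL:
        return "INVALID_CURSOR.DISCONNECTED" in str(e)
    return False

def call_with_retries(rpc, max_retries=15, initial_backoff=0.05,
                      multiplier=4.0, max_backoff=60.0):
    backoff = initial_backoff
    for attempt in range(max_retries + 1):
        try:
            return rpc()
        except grpc.RpcError as e:
            # Re-raise if the budget is exhausted or the error is permanent.
            if attempt == max_retries or not is_retriable(e):
                raise
            time.sleep(backoff)
            backoff = min(backoff * multiplier, max_backoff)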

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tests will be added in #42560

Was this patch authored or co-authored using generative AI tooling?

No.

@juliuszsompolski juliuszsompolski marked this pull request as ready for review September 5, 2023 16:42
Comment on lines 606 to 614
if e.code() in [
    grpc.StatusCode.INTERNAL
]:
    # This error happens if another RPC preempts this RPC, retry.
    msg_cursor_disconnected = "INVALID_CURSOR.DISCONNECTED"

    msg = str(e)
    if any(map(lambda x: x in msg, [msg_cursor_disconnected])):
        return True
Contributor
We have some logic to convert random RPC errors into Spark-understandable exceptions. I'm wondering if we should leverage this here as well.

pyspark.errors.exceptions.connect.convert()

Contributor

We can do this in a follow up.
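As a hypothetical sketch of that follow-up, the raw substring match could become a check on a typed exception's error class once the gRPC error has been converted. The converter step and the exact error-class string here are assumptions for illustration, not the final implementation:

from pyspark.errors import PySparkException

def is_cursor_disconnected(converted: Exception) -> bool:
    # Assumes the grpc.RpcError has already been turned into a typed
    # PySparkException by the converter mentioned above; matching on the
    # error class is more robust than substring-matching the raw message.
    return (
        isinstance(converted, PySparkException)
        and converted.getErrorClass() == "INVALID_CURSOR.DISCONNECTED"
    )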

@juliuszsompolski
Contributor Author

https://github.com/juliuszsompolski/apache-spark/actions/runs/6097172856/job/16544325699

[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.deploy.k8s.integrationtest.VolcanoSuite

flake

@juliuszsompolski
Contributor Author

juliuszsompolski commented Sep 6, 2023

https://github.com/juliuszsompolski/apache-spark/actions/runs/6097172856/job/16544375244

  test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... malloc(): unsorted double linked list corrupted
ERROR (612.473s)
  test_self_join (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... ERROR (611.358s)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
/__w/apache-spark/apache-spark/python/pyspark/context.py:657: RuntimeWarning: Unable to cleanly shutdown Spark JVM process. It is possible that the process has crashed, been killed or may also be in a zombie state.
  warnings.warn(
/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/reattach.py:219: UserWarning: ReleaseExecute failed with exception: Cannot invoke RPC on closed channel!.
  warnings.warn(f"ReleaseExecute failed with exception: {e}.")

======================================================================
ERROR [612.473s]: test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/tests/connect/test_parity_pandas_map.py", line 26, in test_other_than_dataframe_iter
    self.check_other_than_dataframe_iter()
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/tests/pandas/test_pandas_map.py", line 163, in check_other_than_dataframe_iter
    (self.spark.range(10, numPartitions=3).mapInPandas(bad_iter_elem, "a int").count())
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/dataframe.py", line 252, in count
    pdd = self.agg(_invoke_function("count", lit(1))).toPandas()
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/dataframe.py", line 1703, in toPandas
    return self._session.client.to_pandas(query)
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 873, in to_pandas
    table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1282, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(req):
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1263, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1502, in _handle_error
    self._handle_rpc_error(error)
  File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1542, in _handle_rpc_error
    raise SparkConnectGrpcException(str(rpc_error)) from None
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:38175: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:38175: Failed to connect to remote host: Connection refused {created_time:"2023-09-06T13:29:14.012172836+00:00", grpc_status:14}"
>

I believe that the server crashed with

  test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... malloc(): unsorted double linked list corrupted

The client then kept retrying the UNAVAILABLE error (as expected) until it ran out of retries; the fact that it took 612 s is consistent with the retry budget adding up to about 10 minutes.
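A quick check of that budget, assuming defaults of 15 retries, 50 ms initial backoff, a 4x multiplier, and a 60 s cap (these particular numbers are assumptions, not taken from this PR):

# Capped exponential backoff: 50 ms * 4^k, capped at 60 s, for 15 retries.
waits = [min(0.05 * 4**k, 60.0) for k in range(15)]
print(waits[:7])   # [0.05, 0.2, 0.8, 3.2, 12.8, 51.2, 60.0]
print(sum(waits))  # 608.25 -> roughly 10 minutes, matching the 612 s above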

So this is as expected (other than that the server shouldn't be crashing with malloc memory corruption, which is unrelated to this PR).

@HyukjinKwon HyukjinKwon changed the title [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error [SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error Sep 7, 2023
@HyukjinKwon
Member

Merged to master and branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Sep 7, 2023
[SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error

### What changes were proposed in this pull request?

Make INVALID_CURSOR.DISCONNECTED a retriable error.

### Why are the changes needed?

This error can happen when two RPCs race to reattach to the query and the client is still using the losing one. SPARK-44833 was a bug that exposed such a situation. That bug was fixed, but to be more robust we can also make this error retriable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests will be added in #42560

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42818 from juliuszsompolski/SPARK-44835.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f13743d)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>