
[SPARK-41586][SPARK-41598][PYTHON] Introduce PySpark errors package and error classes #39137

Closed
wants to merge 7 commits

Conversation

@itholic (Contributor) commented Dec 20, 2022

What changes were proposed in this pull request?

This PR proposes to introduce pyspark.errors and error classes to unify and improve the errors generated by PySpark under a single path.

This PR includes the changes below:

  • Add an error class for PySpark and its sub-error classes to error-classes.json.
  • Add PySparkErrors on the JVM side to leverage the existing error framework.
  • Add a new module: pyspark.errors.
  • Add new errors defined in pyspark.errors.errors that return a PySparkException by leveraging the new error classes.
  • Migrate the errors in pyspark/sql/functions.py to the new error classes.
  • Add tests for the migrated errors.
  • Add a test util check_error for testing errors by their error classes (see the sketch after this list).

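As a rough illustration of the last item, a test against the migrated F.window error could look like the sketch below. The check_error helper is re-implemented inline as a plausible stand-in; the accessor names (getErrorClass(), getMessageParameters()), the import path of PySparkException, and the message parameter keys are assumptions based on the examples in this description, not the PR's actual API.

import unittest

from pyspark.sql import SparkSession, functions as F
from pyspark.errors.exceptions import PySparkException  # module path as in the "After" example below


def check_error(exception, error_class, message_parameters):
    # Plausible stand-in for the check_error test util added by this PR: assert on
    # the error class and message parameters instead of the rendered message text.
    assert exception.getErrorClass() == error_class  # accessor names assumed
    assert exception.getMessageParameters() == message_parameters


class WindowArgumentTypeTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_window_duration_must_be_string(self):
        with self.assertRaises(PySparkException) as ctx:
            F.window("date", 5)  # windowDuration passed as an int instead of a str
        check_error(
            exception=ctx.exception,
            error_class="PYSPARK.NOT_A_STRING",
            message_parameters={"argName": "windowDuration", "argType": "int"},
        )
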
This is an initial PR introducing an error framework for PySpark, to facilitate error management and provide better, more consistent error messages to users.

While active work is being done on the SQL side to improve error messages, so far there has been no comparable work for PySpark.

So I'd like this PR to also initiate the error message improvement effort on the PySpark side.

Next steps after this PR include:

  • Migrate more Python built-in exceptions generated on the driver side into PySpark-specific errors.
  • Migrate the errors generated by Py4J into PySpark-specific errors.
  • Migrate the errors generated on the Python worker side into PySpark-specific errors.
  • Migrate more error tests into tests using checkError.
  • Currently all PySpark-specific errors are defined as the PySparkException class. As the number of PySpark-specific errors increases, it may be necessary to refine PySparkException into multiple categories.
  • Add documentation

Will add more items to the umbrella JIRA once the initial PR gets approved.

Why are the changes needed?

Centralizing error messages and introducing identifiable error classes provides the following benefits:

  • Errors are properly classified and searchable via their unique class names.
  • Reduces the cost of future maintenance for PySpark errors.
  • Provides consistent and actionable error messages to users.
  • Facilitates translating error messages into different languages.

Does this PR introduce any user-facing change?

Yes, but only for error messages. No API changes at all.

For example,

Before

>>> from pyspark.sql import functions as F
>>> F.window("date", 5)
Traceback (most recent call last):
...
TypeError: windowDuration should be provided as a string

After

Traceback (most recent call last):
...
pyspark.errors.exceptions.PySparkException: [PYSPARK.NOT_A_STRING]  Argument 'windowDuration' should be a string, got 'int'.

How was this patch tested?

By adding unit tests and manually running the static analysis from dev/lint-python.

@@ -1046,6 +1046,63 @@
"Protobuf type not yet supported: <protobufType>."
]
},
"PYSPARK" : {
Contributor Author:

Later, when the number of error classes becomes large enough to warrant categorization, they can be subdivided into several error classes starting with PYSPARK.

e.g. "PYSPARK_INVALID_TYPE", "PYSPARK_WRONG_NUM_ARGS" or something.

from pyspark.sql.utils import CapturedException


class PySparkException(CapturedException):
Contributor Author:

As mentioned in the PR description, currently all PySpark-specific errors are defined as the PySparkException class.
It might be necessary to refine PySparkException into multiple categories as the number of PySpark-specific errors increases.

Contributor Author (@itholic), Dec 20, 2022:

Currently, pyspark.sql.utils defines some classes to handle the errors from the Python worker and Py4J.
I plan to integrate them into the pyspark.errors package as a follow-up when I work on migrating the errors generated by the Python worker and Py4J.

Member:

I think the inheritance hierarchy here is a bit odd, since PySparkException isn't technically a CapturedException (captured from Py4J).

Contributor Author:

That makes sense, PySparkException should be higher up in the hierarchy than CapturedException.

Thanks for catching this! Let me update it.
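
For reference, a minimal sketch of the flipped hierarchy being discussed: PySparkException becomes the root of all PySpark-specific errors, and CapturedException (for errors captured from Py4J) subclasses it. The attribute and method names below are illustrative assumptions, not the exact shape of the updated PR.

class PySparkException(Exception):
    # Root of all PySpark-specific errors, whether raised purely on the Python
    # side or converted from a JVM exception.
    def __init__(self, message=None, error_class=None, message_parameters=None):
        super().__init__(message)
        self._error_class = error_class
        self._message_parameters = message_parameters

    def getErrorClass(self):
        return self._error_class

    def getMessageParameters(self):
        return self._message_parameters


class CapturedException(PySparkException):
    # Errors captured from Py4J keep the JVM origin so the JVM stacktrace can be
    # appended to the description when the debug conf is enabled (see the
    # __str__ discussion further down in this thread).
    def __init__(self, desc, stackTrace, cause=None, origin=None):
        super().__init__(desc)
        self.desc = desc
        self.stackTrace = stackTrace
        self.cause = cause
        self._origin = origin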

return spark._jvm.org.apache.spark.python.errors.PySparkErrors


def columnInListError(func_name: str) -> "PySparkException":
Contributor Author:

The names of the errors should be sufficiently descriptive of the error.

I would appreciate any comments to improve the naming.

Contributor Author (@itholic), Dec 20, 2022:

I follow the camelCase naming convention for error names to facilitate matching with the errors defined on the JVM side (sql/catalyst/src/main/scala/org/apache/spark/python/errors/PySparkErrors.scala).

We use snake_case according to #39137 (comment).

@itholic itholic marked this pull request as ready for review December 20, 2022 12:39
},
"NOT_COLUMN_OR_INTEGER" : {
"message" : [
"Argument `<argName>` should be a column or integer, got <argType>."
Contributor:

Dumb question: can an error message be parameterized?

Contributor Author:

Yeah, there is a framework on the JVM side to handle this logic.
The parameters from error-classes.json are handled by SparkThrowable and SparkThrowableHelper, and all exceptions should inherit SparkThrowable to leverage this centralized error message framework.
You can check the Guidelines for more detail :-)
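
Conceptually, the substitution the JVM framework performs boils down to something like the Python sketch below. It is a deliberate simplification of what SparkThrowableHelper does; the flat dictionary key and the parameter names are illustrative only.

import re

# Template copied from the error-classes.json hunk above; the flat
# "PYSPARK.NOT_COLUMN_OR_INTEGER" key is a simplification for this sketch.
ERROR_CLASSES = {
    "PYSPARK.NOT_COLUMN_OR_INTEGER": {
        "message": ["Argument `<argName>` should be a column or integer, got <argType>."],
    },
}


def get_error_message(error_class, message_parameters):
    # Look up the template and replace each <placeholder> with the value
    # supplied by the raising site.
    template = "\n".join(ERROR_CLASSES[error_class]["message"])
    body = re.sub(r"<(\w+)>", lambda m: str(message_parameters[m.group(1)]), template)
    return f"[{error_class}] {body}"


print(get_error_message("PYSPARK.NOT_COLUMN_OR_INTEGER",
                        {"argName": "numBuckets", "argType": "float"}))
# [PYSPARK.NOT_COLUMN_OR_INTEGER] Argument `numBuckets` should be a column or integer, got float.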

/**
* Object for grouping error messages from exceptions thrown by PySpark.
*/
private[python] object PySparkErrors {
Contributor:

Why do we need to touch the Scala side?

Contributor Author (@itholic), Dec 26, 2022:

Because we want to leverage the existing centralized error framework and its error classes on the JVM side.
You can refer to QueryExecutionErrors.scala for an example.

Contributor:

Wouldn't it be better to simply leverage the intent of the error classes rather than trying to push yet another link to the JVM? Why not just add an error class JSON file in PySpark?


spark = SparkSession._getActiveSessionOrCreate()
assert spark._jvm is not None
return spark._jvm.org.apache.spark.python.errors.PySparkErrors
Contributor:

If we are going to support Spark Connect, I guess we cannot invoke the JVM side in the Python client.

Contributor Author (@itholic), Dec 26, 2022:

Yup, that's correct.
So I think we should design the error handling logic for Spark Connect separately.
That is one of the plans I'm thinking about as a follow-up.
We may need to build a Python-specific error framework to cover cases such as Spark Connect.

Contributor Author:

And that's why the current CI is failing 😂
I'm taking a look at this one.

Contributor (@grundprinzip) left a comment:

Generally, this PR is great to have. However, I think it would be good to avoid the JVM dependency when it is only needed to read a JSON file.

Let's identify how we can capture the error classes JSON file as a build artifact from the main Spark build instead. If that is impossible, let's just add a new file and integrate it with the schema checking in Spark so that we're sure to always have the right format.

"PySparkException",
"columnInListError",
"higherOrderFunctionShouldReturnColumnError",
"notColumnError",
Contributor:

Exporting a function that returns an instance of an error seems weird and indicates that the constructor is not well designed.

from py4j.java_gateway import JavaObject, is_instance_of


class PySparkException(Exception):
Contributor:

Given that Spark Connect does not use the JVM, shouldn't the most abstract error class be one without a JVM dependency?


def column_in_list_error(func_name: str) -> "PySparkException":
pyspark_errors = _get_pyspark_errors()
e = pyspark_errors.columnInListError(func_name)
Contributor:

Invoking a function on a Scala object just to access a JSON file feels wrong to me.

debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
desc = self.desc
if debug_enabled:
desc = desc + "\n\nJVM stacktrace:\n%s" % self.stackTrace
Contributor:

Why is this a JVM backtrace?

@@ -172,7 +184,7 @@ def lit(col: Any) -> Column:
return col
elif isinstance(col, list):
if any(isinstance(c, Column) for c in col):
raise ValueError("lit does not allow a column in a list")
raise column_in_list_error(func_name="lit")
Contributor:

I find that readability is reduced with these function names.

@@ -73,39 +74,6 @@ def __init__(
self.cause = convert_exception(origin.getCause())
self._origin = origin

def __str__(self) -> str:
Contributor:

I think it might make sense to move this back. The captured exception is actually thrown from the JVM, and thus this is where a JVM backtrace is actually present. So it belongs here rather than in the parent class.

@itholic (Contributor Author) commented Dec 26, 2022

Thanks @grundprinzip for the review.
I agree with your comments and feel they are pretty reasonable.

Actually, I once submitted a PR that implemented the framework on the PySpark side (#39128), with no dependency on the JVM.

But I closed the previous one and re-opened this PR for the following reasons:

  1. I worried that it might not be easy to maintain if the rules on one side (PySpark vs. JVM) were changed arbitrarily in the future. So I wanted to manage all errors in a single error class file (error-classes.json) across the entire Apache Spark project to reduce the management cost.
  2. I thought there might be an advantage in that we could simply reuse an existing error class as-is, without adding a new one, when a similar error is already defined on the JVM side.
  3. Like the functions in functions.py, most of PySpark's functions leverage the JVM's logic, so it is assumed that the JVM is running at least once. So I thought the expected overhead of calling the error helpers implemented on the JVM side is acceptable.

But regardless of these reasons, I think all of your comments are also pretty reasonable.

So, could you take a rough look at the changes in the previous PR when you find some time?

If the approach of the previous PR, which implements separate logic on the PySpark side without relying on the JVM, feels more reasonable to you, let me reconsider the overall design.

also cc @HyukjinKwon FYI

@grundprinzip (Contributor):

  • I worried that it might not be easy to maintain if the rules on one side (PySpark vs. JVM) were changed arbitrarily in the future. So I wanted to manage all errors in a single error class file (error-classes.json) across the entire Apache Spark project to reduce the management cost.

I think it's fine to reuse the error-classes.json file. Let's figure out a way to bundle it as part of the PySpark build. If we can't do that, let's add a python-error-classes.json and verify its integrity in SparkThrowableSuite the same way as we do for error-classes.json.

  • I thought there might be an advantage in that we could simply reuse an existing error class as-is, without adding a new one, when a similar error is already defined on the JVM side.

I'm not sure how much of a benefit we get here. All server-side exceptions are thrown as SparkExceptions anyway and will be properly handled; this is where CapturedException comes in. For client-side exceptions, we should use conceptually the same approach, but in a language-idiomatic way.

  • Like the functions in functions.py, most of PySpark's functions leverage the JVM's logic, so it is assumed that the JVM is running at least once. So I thought the expected overhead of calling the error helpers implemented on the JVM side is acceptable.

The functions that are handled in this PR are purely client-side and should not require a trip to the JVM.

Since the primary purpose of SparkThrowableHelper is to provide a lookup into a JSON file and map parameters correctly, I feel this behavior should be replicated in all of the clients (PySpark, PySpark with Spark Connect, Scala for Spark Connect, etc.) while following the same message format.
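
For what it's worth, a client-side replication of that lookup could stay quite small. The sketch below assumes a hypothetical python-error-classes.json bundled inside the pyspark package (per the suggestion above) and a simplified flat key layout; none of these names come from the PR itself.

import json
import os


class PySparkException(Exception):
    # Minimal client-side error; a fuller version is sketched earlier in this thread.
    def __init__(self, message, error_class=None, message_parameters=None):
        super().__init__(message)
        self.error_class = error_class
        self.message_parameters = message_parameters


def _load_error_classes():
    # Hypothetical file shipped with the pyspark package, either copied from
    # error-classes.json at build time or maintained separately and validated
    # by SparkThrowableSuite, as discussed above.
    path = os.path.join(os.path.dirname(__file__), "python-error-classes.json")
    with open(path) as f:
        return json.load(f)


def new_error(error_class, **message_parameters):
    # Same lookup-and-substitute behavior as SparkThrowableHelper, but with no
    # JVM round trip, so it also works for Spark Connect clients.
    body = "\n".join(_load_error_classes()[error_class]["message"])
    for name, value in message_parameters.items():
        body = body.replace(f"<{name}>", str(value))
    return PySparkException(f"[{error_class}] {body}", error_class, message_parameters)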

@HyukjinKwon what is your opinion?

@itholic itholic marked this pull request as draft December 29, 2022 02:13
@itholic itholic closed this Jan 4, 2023
@itholic itholic deleted the pyspark_errors branch April 22, 2023 05:45