[SPARK-41586][SPARK-41598][PYTHON] Introduce PySpark errors package and error classes #39137
Conversation
@@ -1046,6 +1046,63 @@
       "Protobuf type not yet supported: <protobufType>."
     ]
   },
+  "PYSPARK" : {
Later, when the number of error classes becomes large enough to be categorized, it can be subdivided into several error classes starting with PYSPARK, e.g. "PYSPARK_INVALID_TYPE", "PYSPARK_WRONG_NUM_ARGS" or something.
python/pyspark/errors/exceptions.py
Outdated
from pyspark.sql.utils import CapturedException


class PySparkException(CapturedException):
As mentioned in the PR description, currently all PySpark-specific errors are defined as the `PySparkException` class. It might be necessary to refine `PySparkException` into multiple categories as the number of PySpark-specific errors increases in the future.
Currently, `pyspark.sql.utils` defines some classes to handle the errors from the Python worker and Py4J. I plan to integrate them with the `pyspark.errors` package as a follow-up when I work on migrating the errors generated by the Python worker and Py4J.
I think the inheritance hierarchy here is a bit odd, since `PySparkException` isn't technically a `CapturedException` (captured from Py4J).
That makes sense; `PySparkException` should be higher up in the hierarchy than `CapturedException`. Thanks for catching this! Let me update it.
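A minimal sketch of the corrected hierarchy (class bodies elided; the actual refactoring in the PR may differ):

```python
class PySparkException(Exception):
    """Base class for all PySpark-specific errors; no JVM involvement."""


class CapturedException(PySparkException):
    """An error captured from the JVM via Py4J, now a subtype of the base."""
```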
python/pyspark/errors/errors.py
Outdated
    return spark._jvm.org.apache.spark.python.errors.PySparkErrors


def columnInListError(func_name: str) -> "PySparkException":
The names of the errors should be sufficiently descriptive of the error. I would appreciate any comments to improve the naming.
I follow the camelCase naming rule for error names to facilitate matching with the errors defined on the JVM side (sql/catalyst/src/main/scala/org/apache/spark/python/errors/PySparkErrors.scala).

We use snake_case according to #39137 (comment).
  },
  "NOT_COLUMN_OR_INTEGER" : {
    "message" : [
      "Argument `<argName>` should be a column or integer, got <argType>."
Dumb question: can an error message be parameterized?
Yeah, there is a framework on the JVM side to handle this logic. These parameters from `error-classes.json` are constructed through `SparkThrowable` and `SparkThrowableHelper`. And all exceptions should inherit `SparkThrowable` to leverage this centralized error message framework. You can check the Guidelines for more detail :-)
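To make the parameterization concrete, here is a minimal Python sketch of how such a message template could be resolved. The real substitution lives in `SparkThrowableHelper` on the JVM side, so this is only an approximation:

```python
import json

# One entry from error-classes.json, inlined for the example.
ERROR_CLASSES = json.loads("""
{
  "NOT_COLUMN_OR_INTEGER": {
    "message": ["Argument `<argName>` should be a column or integer, got <argType>."]
  }
}
""")


def get_message(error_class: str, params: dict) -> str:
    # Join the message lines, then substitute each <param> placeholder.
    template = "".join(ERROR_CLASSES[error_class]["message"])
    for name, value in params.items():
        template = template.replace(f"<{name}>", str(value))
    return template


print(get_message("NOT_COLUMN_OR_INTEGER", {"argName": "col", "argType": "str"}))
# Argument `col` should be a column or integer, got str.
```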
/**
 * Object for grouping error messages from exceptions thrown by PySpark.
 */
private[python] object PySparkErrors {
Why do we need to touch the Scala side?
Because we want to leverage the existing centralized error framework and its error classes from the JVM side. You can refer to QueryExecutionErrors.scala for an example.
Wouldn't it be better to simply leverage the intent of the error classes than trying to push yet another link to the JVM? Why not just add an error class JSON file in PySpark?
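A sketch of that suggestion: ship the error classes as a JSON file inside the `pyspark` package and read it locally, with no JVM round trip (the file location here is an assumption):

```python
import json
from pathlib import Path

# Hypothetical location of a bundled error-classes file inside pyspark.
_ERROR_CLASSES_PATH = Path(__file__).parent / "error-classes.json"


def load_error_classes() -> dict:
    # Pure-Python read; would work the same under Spark Connect.
    return json.loads(_ERROR_CLASSES_PATH.read_text())
```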
    spark = SparkSession._getActiveSessionOrCreate()
    assert spark._jvm is not None
    return spark._jvm.org.apache.spark.python.errors.PySparkErrors
If we are going to support Spark Connect, I guess we cannot invoke the JVM side in the Python client.
Yup, that's correct. So I think we should design the error handling logic for Spark Connect separately. That is one of the plans I'm thinking about as a follow-up. We may need to build a Python-specific error framework to cover cases such as Spark Connect.
And that's why the current CI is failing 😂
I'm taking a look at this one.
Generally, this PR is great to have. However, I think it would be good to avoid the JVM dependency when it is only needed to read a JSON file.

Let's identify how we can capture the error classes JSON file as a build artifact from the main Spark build instead. If that is impossible, let's just add a new file and integrate it with the schema checking in Spark so that we're sure to always have the right format.
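A minimal sketch of the kind of schema check being suggested (the invariant below is an assumption about the format, not Spark's actual validator): every error class maps to an object whose "message" field is a list of strings.

```python
def validate_error_classes(classes: dict) -> None:
    # Assumed invariant: {"ERROR_CLASS": {"message": ["line", ...]}, ...}
    for name, entry in classes.items():
        if not isinstance(entry, dict):
            raise ValueError(f"{name}: entry must be an object")
        message = entry.get("message")
        if not (isinstance(message, list) and all(isinstance(m, str) for m in message)):
            raise ValueError(f"{name}: 'message' must be a list of strings")
```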
"PySparkException", | ||
"columnInListError", | ||
"higherOrderFunctionShouldReturnColumnError", | ||
"notColumnError", |
Exporting a function that returns an instance of an error seems weird and indicates that the constructor is not well designed.
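For illustration, a constructor-centric design along the lines the reviewer seems to be suggesting, where callers raise the exception directly instead of importing one factory function per error (the signature and names below are assumptions, not this PR's API):

```python
class PySparkException(Exception):
    """Sketch of a constructor-centric design (names are assumptions)."""

    def __init__(self, error_class: str, message_parameters: dict):
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(f"[{error_class}] {message_parameters}")


def lit_with_column_in_list() -> None:
    # Call site: raise directly, no per-error factory function to export.
    raise PySparkException(
        error_class="COLUMN_IN_LIST",
        message_parameters={"func_name": "lit"},
    )
```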
from py4j.java_gateway import JavaObject, is_instance_of


class PySparkException(Exception):
Given that Spark Connect does not use the JVM, shouldn't the most abstract error be one without a JVM dependency?
def column_in_list_error(func_name: str) -> "PySparkException":
    pyspark_errors = _get_pyspark_errors()
    e = pyspark_errors.columnInListError(func_name)
Invoking a function on a Scala object just to access a JSON file feels wrong to me.
debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
desc = self.desc
if debug_enabled:
    desc = desc + "\n\nJVM stacktrace:\n%s" % self.stackTrace
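For reference, the flag read above corresponds to a SQL conf; enabling it from Python would look like this (conf name per Spark's SQLConf, worth verifying for your Spark version):

```python
spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", "true")
```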
Why is this a JVM backtrace?
@@ -172,7 +184,7 @@ def lit(col: Any) -> Column:
         return col
     elif isinstance(col, list):
         if any(isinstance(c, Column) for c in col):
-            raise ValueError("lit does not allow a column in a list")
+            raise column_in_list_error(func_name="lit")
I find that readability is reduced with those function names.
@@ -73,39 +74,6 @@ def __init__(
         self.cause = convert_exception(origin.getCause())
         self._origin = origin

-    def __str__(self) -> str:
I think it might make sense to move this back. The captured exception is actually thrown from the JVM, and is thus the place where a JVM backtrace is actually present. So it belongs here much more than in the parent class.
Thanks @grundprinzip for the review. Actually, I once submitted a PR that implemented the framework on the PySpark side (#39128) with no dependency on the JVM. But I closed the previous one and re-opened this PR for the following reason:

But regardless of these reasons, I think all of your comments are also pretty reasonable. So, could you take a rough look at the changes of the previous PR when you find some time? If the approach of the previous PR, which implements separate logic on the PySpark side without relying on the JVM, feels more reasonable to you, let me reconsider the overall design. Also cc @HyukjinKwon FYI.
I think it's fine to reuse the

I'm not sure how much of a benefit we get here. For all server-side exceptions, they're thrown as SparkExceptions anyway and will be properly handled. This is where the

The functions that are handled in this PR are purely client side and should not require the VM traversal. Since the primary reason for

@HyukjinKwon what is your opinion?
What changes were proposed in this pull request?
This PR proposes to introduce `pyspark.errors` and error classes to unify and improve the errors generated by PySpark under a single path.

This PR includes the changes below:
- Add new error classes for PySpark to `error-classes.json`.
- Add `PySparkErrors` on the JVM side to leverage the existing error framework.
- Add the `pyspark.errors` package with `pyspark.errors.errors` functions that return a `PySparkException` by leveraging the new error classes.
- Apply the new errors to `pyspark/sql/functions.py`.
- Add `check_error` for testing errors by their error classes (see the test sketch below).

This is an initial PR for introducing an error framework for PySpark, to facilitate error management and provide better, more consistent error messages to users.
While such active work is being done on the SQL side to improve error messages, so far there has been no corresponding work in PySpark. So, I'd expect this PR to also initiate the effort on error message improvement for the PySpark side.
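As a sketch of the `check_error` testing idea referenced in the list above, a test might assert on the error class and parameters rather than on the raw message string (the error class name and the exact helper signature here are assumptions for illustration):

```python
import unittest

from pyspark.errors import PySparkException
from pyspark.sql.functions import col, lit


class LitErrorTest(unittest.TestCase):  # hypothetical test case
    def test_lit_rejects_column_in_list(self):
        with self.assertRaises(PySparkException) as ctx:
            lit([col("id")])
        # check_error is the helper added by this PR; its signature and the
        # error class name are assumptions, not the PR's exact API.
        self.check_error(
            exception=ctx.exception,
            error_class="COLUMN_IN_LIST",
            message_parameters={"func_name": "lit"},
        )
```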
Next up for this PR includes:
- Migrating the errors generated by the Python worker and Py4J into PySpark-specific errors.
- Expanding error testing with `checkError`.
- Refining the `PySparkException` class: as the number of PySpark-specific errors increases in the future, it may be necessary to further refine `PySparkException` into multiple categories.

Will add more items to the umbrella JIRA once the initial PR gets approved.
Why are the changes needed?
Centralizing error messages and introducing identified error classes provides the following benefits:
Does this PR introduce any user-facing change?
Yes, but only for error messages. No API changes at all.
For example,
Before
After
How was this patch tested?
By adding unit tests and manually running the static analysis via `dev/lint-python`.