
[SPARK-41586][SPARK-41598][PYTHON] Introduce PySpark errors package and error classes #39137

Closed
wants to merge 7 commits

Conversation

@itholic (Contributor) commented Dec 20, 2022

What changes were proposed in this pull request?

This PR proposes to introduce pyspark.errors and error classes to unify and improve the errors generated by PySpark under a single path.

This PR includes the changes below:

  • Add an error class for PySpark and its sub-error classes to error-classes.json.
  • Add PySparkErrors on the JVM side to leverage the existing error framework.
  • Add a new module: pyspark.errors.
  • Add new errors defined in pyspark.errors.errors that return a PySparkException by leveraging the new error classes.
  • Migrate the errors in pyspark/sql/functions.py to the new error classes.
  • Add tests for the migrated errors.
  • Add a test util check_error for testing errors by their error classes (see the sketch after this list).

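As a rough illustration of the last item, a test against the migrated F.window error could look like the sketch below. The check_error helper is re-implemented inline as a plausible stand-in; the accessor names (getErrorClass(), getMessageParameters()), the import path of PySparkException, and the message parameter keys are assumptions based on the examples in this description, not the PR's actual API.

import unittest

from pyspark.sql import SparkSession, functions as F
from pyspark.errors.exceptions import PySparkException  # module path as in the "After" example below


def check_error(exception, error_class, message_parameters):
    # Plausible stand-in for the check_error test util added by this PR: assert on
    # the error class and message parameters instead of the rendered message text.
    assert exception.getErrorClass() == error_class  # accessor names assumed
    assert exception.getMessageParameters() == message_parameters


class WindowArgumentTypeTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_window_duration_must_be_string(self):
        with self.assertRaises(PySparkException) as ctx:
            F.window("date", 5)  # windowDuration passed as an int instead of a str
        check_error(
            exception=ctx.exception,
            error_class="PYSPARK.NOT_A_STRING",
            message_parameters={"argName": "windowDuration", "argType": "int"},
        )
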
This is an initial PR introducing an error framework for PySpark, to facilitate error management and provide better, more consistent error messages to users.

While active work is being done on the SQL side to improve error messages, so far there has been no comparable work for PySpark.

So I'd like this PR to also initiate the error message improvement effort on the PySpark side.

Next steps after this PR include:

  • Migrate more Python built-in exceptions generated on the driver side into PySpark-specific errors.
  • Migrate the errors generated by Py4J into PySpark-specific errors.
  • Migrate the errors generated on the Python worker side into PySpark-specific errors.
  • Migrate more error tests into tests using checkError.
  • Currently all PySpark-specific errors are defined as the PySparkException class. As the number of PySpark-specific errors increases, it may be necessary to refine PySparkException into multiple categories.
  • Add documentation

Will add more items to the umbrella JIRA once the initial PR gets approved.

Why are the changes needed?

Centralizing error messages and introducing identifiable error classes provides the following benefits:

  • Errors are properly classified and searchable via their unique class names.
  • Reduces the cost of future maintenance for PySpark errors.
  • Provides consistent and actionable error messages to users.
  • Facilitates translating error messages into different languages.

Does this PR introduce any user-facing change?

Yes, but only for error messages. No API changes at all.

For example,

Before

>>> from pyspark.sql import functions as F
>>> F.window("date", 5)
Traceback (most recent call last):
...
TypeError: windowDuration should be provided as a string

After

Traceback (most recent call last):
...
pyspark.errors.exceptions.PySparkException: [PYSPARK.NOT_A_STRING]  Argument 'windowDuration' should be a string, got 'int'.

How was this patch tested?

By adding unit tests and manually running the static analysis from dev/lint-python.

@@ -1046,6 +1046,63 @@
"Protobuf type not yet supported: <protobufType>."
]
},
"PYSPARK" : {
Contributor Author:

Later, when the number of error classes becomes large enough to warrant categorization, they can be subdivided into several error classes starting with PYSPARK.

e.g. "PYSPARK_INVALID_TYPE", "PYSPARK_WRONG_NUM_ARGS" or something.

from pyspark.sql.utils import CapturedException


class PySparkException(CapturedException):
Contributor Author:

As mentioned in the PR description, currently all PySpark-specific errors are defined as the PySparkException class.
It might be necessary to refine PySparkException into multiple categories as the number of PySpark-specific errors increases.

Contributor Author (@itholic), Dec 20, 2022:

Currently, pyspark.sql.utils defines some classes to handle the errors from the Python worker and Py4J.
I plan to integrate them into the pyspark.errors package as a follow-up when I work on migrating the errors generated by the Python worker and Py4J.

Member:

I think the inheritance hierarchy here is a bit odd, since PySparkException isn't technically a CapturedException (captured from Py4J).

Contributor Author:

That makes sense, PySparkException should be higher up in the hierarchy than CapturedException.

Thanks for catching this! Let me update it.
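
For reference, a minimal sketch of the flipped hierarchy being discussed: PySparkException becomes the root of all PySpark-specific errors, and CapturedException (for errors captured from Py4J) subclasses it. The attribute and method names below are illustrative assumptions, not the exact shape of the updated PR.

class PySparkException(Exception):
    # Root of all PySpark-specific errors, whether raised purely on the Python
    # side or converted from a JVM exception.
    def __init__(self, message=None, error_class=None, message_parameters=None):
        super().__init__(message)
        self._error_class = error_class
        self._message_parameters = message_parameters

    def getErrorClass(self):
        return self._error_class

    def getMessageParameters(self):
        return self._message_parameters


class CapturedException(PySparkException):
    # Errors captured from Py4J keep the JVM origin so the JVM stacktrace can be
    # appended to the description when the debug conf is enabled (see the
    # __str__ discussion further down in this thread).
    def __init__(self, desc, stackTrace, cause=None, origin=None):
        super().__init__(desc)
        self.desc = desc
        self.stackTrace = stackTrace
        self.cause = cause
        self._origin = origin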

return spark._jvm.org.apache.spark.python.errors.PySparkErrors


def columnInListError(func_name: str) -> "PySparkException":
Contributor Author:

The names of the errors should be sufficiently descriptive of the error.

I would appreciate any comments to improve the naming.

Contributor Author (@itholic), Dec 20, 2022:

I follow the camelCase naming convention for error names to facilitate matching with the errors defined on the JVM side (sql/catalyst/src/main/scala/org/apache/spark/python/errors/PySparkErrors.scala).

We use snake_case according to #39137 (comment).

@itholic itholic marked this pull request as ready for review December 20, 2022 12:39
},
"NOT_COLUMN_OR_INTEGER" : {
"message" : [
"Argument `<argName>` should be a column or integer, got <argType>."
Contributor:

Dumb question: can an error message be parameterized?

Contributor Author:

Yeah, there is a framework on the JVM side to handle this logic.
The parameters from error-classes.json are handled by SparkThrowable and SparkThrowableHelper, and all exceptions should inherit SparkThrowable to leverage this centralized error message framework.
You can check the Guidelines for more detail :-)
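
Conceptually, the substitution the JVM framework performs boils down to something like the Python sketch below. It is a deliberate simplification of what SparkThrowableHelper does; the flat dictionary key and the parameter names are illustrative only.

import re

# Template copied from the error-classes.json hunk above; the flat
# "PYSPARK.NOT_COLUMN_OR_INTEGER" key is a simplification for this sketch.
ERROR_CLASSES = {
    "PYSPARK.NOT_COLUMN_OR_INTEGER": {
        "message": ["Argument `<argName>` should be a column or integer, got <argType>."],
    },
}


def get_error_message(error_class, message_parameters):
    # Look up the template and replace each <placeholder> with the value
    # supplied by the raising site.
    template = "\n".join(ERROR_CLASSES[error_class]["message"])
    body = re.sub(r"<(\w+)>", lambda m: str(message_parameters[m.group(1)]), template)
    return f"[{error_class}] {body}"


print(get_error_message("PYSPARK.NOT_COLUMN_OR_INTEGER",
                        {"argName": "numBuckets", "argType": "float"}))
# [PYSPARK.NOT_COLUMN_OR_INTEGER] Argument `numBuckets` should be a column or integer, got float.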

/**
* Object for grouping error messages from exceptions thrown by PySpark.
*/
private[python] object PySparkErrors {
Contributor:

Why do we need to touch the Scala side?

Contributor Author (@itholic), Dec 26, 2022:

Because we want to leverage the existing centralized error framework and its error classes on the JVM side.
You can refer to QueryExecutionErrors.scala for an example.

Contributor:

Wouldn't it be better to simply leverage the intent of the error classes rather than trying to push yet another link to the JVM? Why not just add an error class JSON file in PySpark?


spark = SparkSession._getActiveSessionOrCreate()
assert spark._jvm is not None
return spark._jvm.org.apache.spark.python.errors.PySparkErrors
Contributor:

If we are going to support Spark Connect, I guess we cannot invoke the JVM side in the Python client.

Contributor Author (@itholic), Dec 26, 2022:

Yup, that's correct.
So I think we should design the error handling logic for Spark Connect separately.
That is one of the plans I'm thinking about as a follow-up.
We may need to build a Python-specific error framework to cover cases such as Spark Connect.

Contributor Author:

And that's why the current CI is failing 😂
I'm taking a look at this one.

Contributor (@grundprinzip) left a comment:

Generally, this PR is great to have. However, I think it would be good to avoid the JVM dependency when it is only needed to read a JSON file.

Let's identify how we can capture the error classes JSON file as a build artifact from the main Spark build instead. If that is impossible, let's just add a new file and integrate it with the schema checking in Spark so that we're sure to always have the right format.

"PySparkException",
"columnInListError",
"higherOrderFunctionShouldReturnColumnError",
"notColumnError",
Contributor:

Exporting a function that returns an instance of an error seems weird and indicates that the constructor is not well designed.

from py4j.java_gateway import JavaObject, is_instance_of


class PySparkException(Exception):
Contributor:

Given that Spark Connect does not use the JVM, shouldn't the most abstract error class be one without a JVM dependency?


def column_in_list_error(func_name: str) -> "PySparkException":
pyspark_errors = _get_pyspark_errors()
e = pyspark_errors.columnInListError(func_name)
Contributor:

Invoking a function on a Scala object just to access a JSON file feels wrong to me.

debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
desc = self.desc
if debug_enabled:
desc = desc + "\n\nJVM stacktrace:\n%s" % self.stackTrace
Contributor:

Why is this a JVM backtrace?

@@ -172,7 +184,7 @@ def lit(col: Any) -> Column:
return col
elif isinstance(col, list):
if any(isinstance(c, Column) for c in col):
raise ValueError("lit does not allow a column in a list")
raise column_in_list_error(func_name="lit")
Contributor:

I find that readability is reduced with these function names.

@@ -73,39 +74,6 @@ def __init__(
self.cause = convert_exception(origin.getCause())
self._origin = origin

def __str__(self) -> str:
Contributor:

I think it might make sense to move this back. The captured exception is actually thrown from the JVM, and thus this is where a JVM backtrace is actually present. So it belongs here rather than in the parent class.

@itholic (Contributor Author) commented Dec 26, 2022

Thanks @grundprinzip for the review.
I agree with your comments and feel they are pretty reasonable.

Actually, I once submitted a PR that implemented the framework on the PySpark side (#39128), with no dependency on the JVM.

But I closed the previous one and re-opened this PR for the following reasons:

  1. I worried that it might not be easy to maintain if the rules on one side (PySpark vs. JVM) were changed arbitrarily in the future. So I wanted to manage all errors in a single error class file (error-classes.json) across the entire Apache Spark project to reduce the management cost.
  2. I thought there might be an advantage in that we could simply reuse an existing error class as-is, without adding a new one, when a similar error is already defined on the JVM side.
  3. Like the functions in functions.py, most of PySpark's functions leverage the JVM's logic, so it is assumed that the JVM is running at least once. So I thought the expected overhead of calling the error helpers implemented on the JVM side is acceptable.

But regardless of these reasons, I think all of your comments are also pretty reasonable.

So, could you take a rough look at the changes in the previous PR when you find some time?

If the approach of the previous PR, which implements separate logic on the PySpark side without relying on the JVM, feels more reasonable to you, let me reconsider the overall design.

also cc @HyukjinKwon FYI

@grundprinzip (Contributor):

  • I worried that it might not be easy to maintain if the rules on one side (PySpark vs. JVM) were changed arbitrarily in the future. So I wanted to manage all errors in a single error class file (error-classes.json) across the entire Apache Spark project to reduce the management cost.

I think it's fine to reuse the error-classes.json file. Let's figure out a way to bundle it as part of the PySpark build. If we can't do that, let's add a python-error-classes.json and verify its integrity in SparkThrowableSuite the same way as we do for error-classes.json.

  • I thought there might be an advantage in that we could simply reuse an existing error class as-is, without adding a new one, when a similar error is already defined on the JVM side.

I'm not sure how much of a benefit we get here. All server-side exceptions are thrown as SparkExceptions anyway and will be properly handled; this is where CapturedException comes in. For client-side exceptions, we should use conceptually the same approach, but in a language-idiomatic way.

  • Like the functions in functions.py, most of PySpark's functions leverage the JVM's logic, so it is assumed that the JVM is running at least once. So I thought the expected overhead of calling the error helpers implemented on the JVM side is acceptable.

The functions that are handled in this PR are purely client-side and should not require a trip to the JVM.

Since the primary purpose of SparkThrowableHelper is to provide a lookup into a JSON file and map parameters correctly, I feel this behavior should be replicated in all of the clients (PySpark, PySpark with Spark Connect, Scala for Spark Connect, etc.) while following the same message format.
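
For what it's worth, a client-side replication of that lookup could stay quite small. The sketch below assumes a hypothetical python-error-classes.json bundled inside the pyspark package (per the suggestion above) and a simplified flat key layout; none of these names come from the PR itself.

import json
import os


class PySparkException(Exception):
    # Minimal client-side error; a fuller version is sketched earlier in this thread.
    def __init__(self, message, error_class=None, message_parameters=None):
        super().__init__(message)
        self.error_class = error_class
        self.message_parameters = message_parameters


def _load_error_classes():
    # Hypothetical file shipped with the pyspark package, either copied from
    # error-classes.json at build time or maintained separately and validated
    # by SparkThrowableSuite, as discussed above.
    path = os.path.join(os.path.dirname(__file__), "python-error-classes.json")
    with open(path) as f:
        return json.load(f)


def new_error(error_class, **message_parameters):
    # Same lookup-and-substitute behavior as SparkThrowableHelper, but with no
    # JVM round trip, so it also works for Spark Connect clients.
    body = "\n".join(_load_error_classes()[error_class]["message"])
    for name, value in message_parameters.items():
        body = body.replace(f"<{name}>", str(value))
    return PySparkException(f"[{error_class}] {body}", error_class, message_parameters)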

@HyukjinKwon what is your opinion?

@itholic itholic marked this pull request as draft December 29, 2022 02:13
@itholic itholic closed this Jan 4, 2023
@itholic itholic deleted the pyspark_errors branch April 22, 2023 05:45