[SPARK-34920][CORE][SQL] Add error classes with SQLSTATE #32850
Conversation
Signed-off-by: Karen Feng <karen.feng@databricks.com>
"sqlState" : "40000", | ||
"messageFormatLines" : [ "Writing job aborted" ] | ||
} | ||
} |
Do you mean `sqlState` is always unique?
Not necessarily; `sqlState` can be re-used, especially for common error classes without known subclasses, like `42000` (syntax or semantic error). In those cases, the error class will be the disambiguator.
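For example, two error classes could share SQLSTATE `42000`, with the class name as the disambiguator. The class names below are illustrative, not taken from this PR:

```json
"INVALID_SYNTAX_ERROR" : {
  "sqlState" : "42000",
  "messageFormatLines" : [ "Invalid syntax: %s" ]
},
"UNRESOLVED_COLUMN_ERROR" : {
  "sqlState" : "42000",
  "messageFormatLines" : [ "Cannot resolve column '%s'" ]
}
```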
## Fields

All fields, excluding error messages, should be consistent across releases.
What do you mean by "consistent"? Can we not change the error message in a new release?
This sounds like a pretty strict requirement.
Error messages can (and probably will) change between releases; that's why I included the caveat of "excluding error messages." This is actually a major driver for this change, so we can improve the quality of error messages over time.

To clarify, I think it would be beneficial for our user base if we were consistent in error class and SQLSTATE across releases. With consistent error classes, users can build in known work-arounds or catch and re-throw known error types. With consistent SQLSTATEs, clients will also have predictable behavior.
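As a sketch of why stable error classes help callers: they can branch on the class instead of parsing message text, which may change between releases. The trait name `SparkError` follows the PR description; the field and class names below are illustrative, not taken verbatim from the PR.

```scala
// Illustrative trait: what a caller-visible error contract could look like.
trait SparkError {
  def errorClass: Option[String]
  def sqlState: Option[String]
}

// Stand-in exception carrying a stable error class alongside the message.
class DemoException(val errorClass: Option[String], val sqlState: Option[String])
  extends Exception("Found duplicate keys 'a'") with SparkError

def handle(e: Exception): String = e match {
  // Branch on the stable class, not on the (changeable) message text.
  case s: SparkError if s.errorClass.contains("DUPLICATE_KEY_ERROR") =>
    "known error: apply documented work-around"
  case _ =>
    "unknown error"
}

val result = handle(new DemoException(Some("DUPLICATE_KEY_ERROR"), Some("23000")))
```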
To throw an exception, do the following.

1. Check if an appropriate error class already exists in `error-class.json`.
I have a concern that this will be another burden, e.g. naming an error class and finding an existing one. I also worry that the number of error classes will grow quickly and become hard to maintain.
Today, the error messages thrown from Spark are distributed across the entire code base with no single source of truth; as a result, it is hard to audit error messages for quality and redundancy. With error classes, I'm hoping that we can improve the auditing process. However, I do recognize that the number of error classes will likely grow quickly, and that maintaining a high level of quality will require vigilant pruning. What do you think will help reduce the burden here?
Why do we need to load these error states from a JSON file instead of defining them statically in a companion object?
I think a JSON file is more general: it supports auditing, doc generation, and internationalization (each error message can have translations for different languages).
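The lookup the JSON file enables can be sketched as follows. A plain `Map` stands in for the parsed `error-class.json` (the real code would deserialize the file with a JSON library); the field names follow the snippets quoted in this thread.

```scala
// Stand-in for the parsed error-class.json.
case class ErrorInfo(sqlState: Option[String], messageFormatLines: Seq[String]) {
  // Concatenate the C-style format lines with newlines and substitute parameters.
  def format(parameters: Seq[String]): String =
    String.format(messageFormatLines.mkString("\n"), parameters: _*)
}

val errorClasses: Map[String, ErrorInfo] = Map(
  "DUPLICATE_KEY_ERROR" -> ErrorInfo(Some("23000"), Seq("Found duplicate keys '%s'")),
  "DIVIDE_BY_ZERO"      -> ErrorInfo(Some("22012"), Seq("divide by zero"))
)

// Look up an error class and render its message with parameters.
val msg = errorClasses("DUPLICATE_KEY_ERROR").format(Seq("id"))
```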
```scala
 * The error message is constructed by concatenating the lines with
 * linebreaks.
 */
case class ErrorInfo(sqlState: Option[String], messageFormatLines: Seq[String]) {
```
This is defined in core, but SQL state is a SQL concept?
I agree; SQLSTATE is a SQL concept. Off the top of my head, I'm not sure how to have this only be set in the `sql` component without increasing the complexity of the implementation:

- Create a JSON file that maps error classes to SQLSTATE in SQL, which may increase maintenance burden and decrease developer usability
- Create a `SqlErrorInfo(sqlState, messageFormatLines)` class that extends `ErrorInfo(messageFormatLines)` in the `sql` component, and parse the JSON file again to get the `sqlState` field

What do you think?
In addition to the implementation difficulty, I think it also makes sense to have `sqlState` be accessible from any type of Spark exception (regardless of the base component), given that we currently throw `SparkException`s (from the `core` component) during query execution. We could introduce a `SparkSqlError` type, but in that case we would also have to do a major refactor of the existing exception types, as well as establish a cleaner division between the `core` and `sql` components.
It's better to have a unified representation for all the errors. "error class" and "message" are required fields, and it's ok to have more optional fields for other purposes, like "sqlState" for JDBC compatibility.
Yup, maybe we can consider giving `sqlState` a more general name (e.g. `errorState`, `errorStateNumber`, etc.).
Postgres calls their SQLSTATEs error codes, but I'm not sure if it'd be safe for us to do the same thing, especially given that we could add Spark-specific error codes in the future that span across non-SQL functionality.
I prefer `sqlState`, to clearly indicate what it is, since SQL state is a standard.
"DUPLICATE_KEY_ERROR" : { | ||
"sqlState" : "23000", | ||
"messageFormatLines" : [ "Found duplicate keys '%s'" ] | ||
}, |
We can add `vendorCode` in the future, but I think we would need to determine how error codes should be defined: whether there should be a class hierarchy, whether they should be arbitrary, or something else (e.g. hashed from the error class). There are some clients that expect the vendor code to match Hive error code classes, given that they are thrown within a `HiveSQLException`.
```diff
@@ -316,7 +316,8 @@ object QueryParsingErrors {
   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
-    new ParseException(s"Found duplicate keys '$key'.", ctx)
+    // Found duplicate keys '$key'
+    new ParseException(errorClass = "DUPLICATE_KEY_ERROR", messageParameters = Seq(key), ctx)
```
Hard-coded strings are used here, which looks hard to maintain. This is why I use objects:
https://github.com/apache/spark/blob/89c6f523c832daf7cff5c1ba8aecfbcd5494d5af/core/src/main/scala/org/apache/spark/errors/coreErrors.scala
From a Scala-focused developer perspective, I agree that objects are easier to maintain. However, using string classes has a couple of benefits:

- The JSON file can be easily translated into an auto-generated documentation page (without tricks like reflection)
- Other clients (e.g. Python and R) can natively throw errors with error classes, creating a single easily-auditable source of truth

What do you think? For a happy medium, we could auto-generate Scala from JSON, but I think that may increase maintenance burden.
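The "happy medium" could look roughly like this: a hypothetical generator that turns the error-class names from the JSON file into a Scala object of string constants, so Scala callers get compile-time checks while the JSON file remains the source of truth. Purely illustrative, not code from this PR.

```scala
// Hypothetical input: class names extracted from the parsed JSON file.
val classNames = Seq("DIVIDE_BY_ZERO", "DUPLICATE_KEY_ERROR")

// Emit Scala source for an object of string constants.
val generated: String = classNames
  .map(name => "  val " + name + " = \"" + name + "\"")
  .mkString("object ErrorClasses {\n", "\n", "\n}")
```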
"DIVIDE_BY_ZERO" : { | ||
"sqlState" : "22012", | ||
"message" : [ "divide by zero" ] | ||
}, |
Just a comment; it would be nice to automatically generate a user's doc page (just like the PostgreSQL one) from this error state definition file.
The reasoning for having all of the error states in one JSON file is that it's relatively straightforward to generate docs pages from it. (It's also easier to use the same error states in Python/R.) I have code lined up for docs-page generation as a followup, but want to make sure this goes in first.
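A minimal sketch of that docs generation: render a Markdown table (in the spirit of the PostgreSQL error-codes appendix) from the error definitions. The tuples stand in for the parsed JSON file; this is not the actual followup code.

```scala
// Stand-in for the parsed error definitions: (error class, SQLSTATE, message format).
val definitions = Seq(
  ("DIVIDE_BY_ZERO", "22012", "divide by zero"),
  ("DUPLICATE_KEY_ERROR", "23000", "Found duplicate keys '%s'")
)

// Render one Markdown table row per error class, with a header row on top.
val docsPage: String =
  "| Error class | SQLSTATE | Message format |\n| --- | --- | --- |\n" +
    definitions.map { case (cls, state, fmt) => s"| $cls | $state | $fmt |" }.mkString("\n")
```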
```scala
 * @param message C-style message format compatible with printf.
 *                The error message is constructed by concatenating the lines with newlines.
 */
case class ErrorInfo(sqlState: Option[String], message: Seq[String]) {
```
If this is a public API, it's better to use a plain class, as `case class` exposes too many APIs (apply, unapply, copy, etc.).
I think we can make `ErrorInfo` a private API. These fields should be accessed from the exception type.
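That suggestion can be sketched as follows: `ErrorInfo` stays an internal detail (it would be private in Spark), and callers read its fields only through accessors on the exception type. The exception class and accessor names here are illustrative.

```scala
// Internal detail; in Spark this would be private to the errors package.
case class ErrorInfo(sqlState: Option[String], message: Seq[String])

// Callers see only the exception type and its accessors, not ErrorInfo itself.
class DemoSparkException(val errorClass: String, info: ErrorInfo)
  extends Exception(info.message.mkString("\n")) {
  def getErrorClass: String = errorClass
  def getSqlState: Option[String] = info.sqlState
}

val e = new DemoSparkException("DIVIDE_BY_ZERO", ErrorInfo(Some("22012"), Seq("divide by zero")))
```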
LGTM except some minor comments
Last question: how do other common modules use these errors? For example, spark-unsafe and spark-network-common.
Similar problems appear in the config framework as well: spark-unsafe and spark-network-common can only hardcode config names instead of using the config framework. I think we should create a new module to contain these basic infras, such as config and the error system, and have all other modules depend on this basic module.
thanks, merging to master!
### What changes were proposed in this pull request?

Unifies exceptions thrown from Spark under a single base trait `SparkError`, which unifies:

### Why are the changes needed?

### Does this PR introduce any user-facing change?

Yes, changes the ODBC experience by:

### How was this patch tested?

Unit tests, as well as local tests with PyODBC.