
[SPARK-45022][SQL] Provide context for dataset API errors #42740

Conversation

@peter-toth (Contributor) commented Aug 30, 2023

What changes were proposed in this pull request?

This PR captures the Dataset API methods called from user code, together with the call site in the user code, and adds this context to error messages.

For example, consider the following Spark app, SimpleApp.scala:

   1  import org.apache.spark.sql.SparkSession
   2  import org.apache.spark.sql.functions._
   3
   4  object SimpleApp {
   5    def main(args: Array[String]): Unit = {
   6      val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate()
   7      import spark.implicits._
   8
   9      val c = col("a") / col("b")
  10
  11      Seq((1, 0)).toDF("a", "b").select(c).show()
  12
  13      spark.stop()
  14    }
  15  }

After this PR the error message contains the error context (i.e. which Spark Dataset API was called, and from where in the user code) in the following form:

Exception in thread "main" org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== Dataset ==
"div" was called from SimpleApp$.main(SimpleApp.scala:9)

	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
	at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672)
...

This is similar to the context already provided for SQL queries:

org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== SQL(line 1, position 1) ==
a / b
^^^^^

	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
	at org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)
...

Please note that the stack trace in spark-shell doesn't contain meaningful elements:

scala> Thread.currentThread().getStackTrace.foreach(println)
java.base/java.lang.Thread.getStackTrace(Thread.java:1602)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:23)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:27)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:29)
$line15.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
$line15.$read$$iw$$iw$$iw$$iw.<init>(<console>:33)
$line15.$read$$iw$$iw$$iw.<init>(<console>:35)
$line15.$read$$iw$$iw.<init>(<console>:37)
$line15.$read$$iw.<init>(<console>:39)
$line15.$read.<init>(<console>:41)
$line15.$read$.<init>(<console>:45)
$line15.$read$.<clinit>(<console>)
$line15.$eval$.$print$lzycompute(<console>:7)
$line15.$eval$.$print(<console>:6)
$line15.$eval.$print(<console>)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
...

so this change doesn't help with that use case.

Why are the changes needed?

To provide more user-friendly errors.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Added new UTs to QueryExecutionAnsiErrorsSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@peter-toth peter-toth changed the title [SPARK-][SQL] Provide context for dataset API errors [SPARK-45022][SQL] Provide context for dataset API errors Aug 30, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from 628f70d to 5cff672 on August 30, 2023 15:32
@peter-toth peter-toth changed the title [SPARK-45022][SQL] Provide context for dataset API errors [WIP][SPARK-45022][SQL] Provide context for dataset API errors Aug 31, 2023
@@ -241,29 +249,35 @@ object functions
    * @since 1.3.0
    */
   @deprecated("Use approx_count_distinct", "2.1.0")
-  def approxCountDistinct(e: Column): Column = approx_count_distinct(e)
+  def approxCountDistinct(e: Column): Column = withAggregateFunction {
Contributor:
Since we need to touch a lot of functions anyway, here is a long-standing cleanup that was never finished: most of these functions can be implemented by creating an UnresolvedFunction. We don't need to create the function expression directly.
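
To illustrate the suggestion, a minimal sketch (the object and helper names are hypothetical, not Spark's actual private helpers; it assumes catalyst's UnresolvedFunction and the public Column constructor):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction

object FunctionsCleanupSketch {
  // Wrap an UnresolvedFunction in a Column and let the analyzer resolve the
  // name, instead of constructing the concrete function expression by hand.
  def withUnresolvedFunction(name: String, args: Column*): Column =
    new Column(UnresolvedFunction(Seq(name), args.map(_.expr), isDistinct = false))

  // approx_count_distinct then needs no direct reference to its aggregate expression:
  def approx_count_distinct(e: Column): Column =
    withUnresolvedFunction("approx_count_distinct", e)
}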

Contributor Author:
Ok, I will update the PR.

Contributor Author:
#42864 is merged now and this PR is ready for review.

@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch 2 times, most recently from 6d218e4 to b2d8750 on September 4, 2023 13:34
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from b2d8750 to 7556ffa on September 5, 2023 07:14
@peter-toth peter-toth changed the title [WIP][SPARK-45022][SQL] Provide context for dataset API errors [WIP][SPARK-45022][SQL][test-java11] Provide context for dataset API errors Sep 5, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from 7556ffa to 289222f on September 5, 2023 11:04
@github-actions github-actions bot added the INFRA label Sep 5, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from fadd5e9 to 9a8cb86 on September 5, 2023 14:26
@github-actions github-actions bot removed the INFRA label Sep 5, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch 2 times, most recently from f1bb66e to f390e73 on September 7, 2023 11:41
@github-actions github-actions bot removed the CONNECT label Sep 7, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch 2 times, most recently from e00a6a7 to 8b535be on September 7, 2023 17:24
@peter-toth peter-toth changed the title [WIP][SPARK-45022][SQL][test-java11] Provide context for dataset API errors [WIP][SPARK-45022][SQL] Provide context for dataset API errors Sep 8, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from f487a76 to 62c4a98 on September 9, 2023 14:09
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from e8d644e to 26ba6d7 on September 9, 2023 15:27
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from 26ba6d7 to 7adf30e on September 14, 2023 11:40
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from 7adf30e to 5b774a4 on September 17, 2023 13:36
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from cfa0332 to cb936a6 on September 20, 2023 12:27
@github-actions github-actions bot added the BUILD label Sep 20, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from cb936a6 to 935e74a on September 21, 2023 14:02
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch 2 times, most recently from cabd982 to f0a8e18 on September 24, 2023 08:58
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from f0a8e18 to 42028a6 on September 24, 2023 12:12
@peter-toth peter-toth changed the title [WIP][SPARK-45022][SQL] Provide context for dataset API errors [SPARK-45022][SQL] Provide context for dataset API errors Sep 25, 2023

override val objectType = originObjectType.getOrElse("")
Contributor:
why remove override?

Contributor Author (@peter-toth, Sep 26, 2023):

This change accidentally remained here from a different implementation. I will revert it.

    } else {
      methodName
    }
    val callSite = elements(1).toString
Contributor:
A follow-up: make the size of the call site configurable.
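
For context, a self-contained sketch of this kind of call-site extraction (hypothetical names, simplified relative to the PR's actual helper):

object CallSiteSketch {
  // Walk the current stack, skip JVM frames, Spark-internal frames and this
  // helper itself, then keep the first `stackDepth` user frames as the call
  // site. `stackDepth` is the size knob suggested above as a follow-up.
  def userCallSite(stackDepth: Int = 1): String =
    Thread.currentThread().getStackTrace
      .dropWhile { e =>
        e.getClassName.startsWith("java.lang.Thread") ||
        e.getClassName.startsWith("org.apache.spark.") ||
        e.getMethodName == "userCallSite"
      }
      .take(stackDepth)
      .mkString("; ")

  def main(args: Array[String]): Unit =
    println(userCallSite()) // prints something like CallSiteSketch$.main(CallSiteSketch.scala:18)
}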

@@ -947,8 +951,10 @@ class Dataset[T] private[sql](
    * @group untypedrel
    * @since 2.0.0
    */
-  def join(right: Dataset[_]): DataFrame = withPlan {
-    Join(logicalPlan, right.logicalPlan, joinType = Inner, None, JoinHint.NONE)
+  def join(right: Dataset[_]): DataFrame = withOrigin() {
Contributor (@cloud-fan, Sep 26, 2023):
Since this feature is mostly for ANSI mode, do we really need to capture the call site for DataFrame APIs that do not create expressions?

Contributor Author:

We don't need to, but please note that query context can be useful with non-ANSI-related errors too. Here is an example with over(): https://github.com/apache/spark/pull/42740/files#diff-728befc0ad63fec7c44a606c62f3f63204293d3ca87da2830fde8ae7a291656bR188-R189

Contributor:
Is it possible to do this in a central place, like withPlan?

Contributor Author:
Some methods' stack traces can be captured in withPlan, but others' can't, which is why I decided to add withOrigin to all top-level methods.
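
For illustration, a minimal, self-contained version of this wrapping pattern (a hypothetical CallSiteContext, not Spark's actual Origin/CurrentOrigin machinery):

object CallSiteContext {
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  // Each top-level Dataset method would wrap its body so the user's call site
  // is captured once and stays available when an error surfaces later.
  def withOrigin[T](framesToDrop: Int = 0)(f: => T): T = {
    // Frames 0..2 are Thread.getStackTrace, withOrigin and its Spark-internal
    // caller, so the first possibly-user frame is at index framesToDrop + 3.
    val site = Thread.currentThread().getStackTrace
      .lift(framesToDrop + 3).map(_.toString)
    val saved = current.get()
    if (saved.isEmpty) current.set(site) // only the outermost API call wins
    try f finally current.set(saved)
  }

  def callSite: Option[String] = current.get()
}

On the error path, Spark-side code could then read CallSiteContext.callSite and render it into the message, roughly like the "== Dataset ==" block shown in the PR description.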

      f
    } else {
      val st = Thread.currentThread().getStackTrace
      var i = framesToDrop + 3
Contributor:
why 3?

Contributor Author:
Thread.currentThread() and withOrigin() add 2 frames, and because user code can't call withOrigin() directly, we can be sure that the 3rd frame also belongs to Spark code.

Contributor:
I see!
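
To make the frame arithmetic above concrete, a standalone illustration (hypothetical names, unrelated to the PR's code):

object FrameDemo {
  // Stands in for the helper that captures the origin: frame 0 is
  // Thread.getStackTrace, frame 1 is this helper, frame 2 is its
  // (Spark-internal) caller, so the first possibly-user frame is index 3.
  def helper(): StackTraceElement =
    Thread.currentThread().getStackTrace.apply(3)

  def sparkInternal(): StackTraceElement = helper()

  def main(args: Array[String]): Unit =
    println(sparkInternal()) // prints the FrameDemo$.main frame
}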

@MaxGekk (Member) commented Oct 11, 2023

@peter-toth Would you mind if I take this PR over?

@peter-toth (Contributor Author):
> @peter-toth Would you mind if I take this PR over?

Sure, please do it.

@peter-toth peter-toth closed this Oct 11, 2023