
[SPARK-45022][SQL] Provide context for dataset API errors #42740

Conversation

@peter-toth (Contributor) commented Aug 30, 2023

What changes were proposed in this pull request?

This PR captures the Dataset API methods called from user code, together with the call site in the user code, and adds this context to error messages.

For example, consider the following Spark app, SimpleApp.scala:

   1  import org.apache.spark.sql.SparkSession
   2  import org.apache.spark.sql.functions._
   3
   4  object SimpleApp {
   5    def main(args: Array[String]): Unit = {
   6      val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate()
   7      import spark.implicits._
   8
   9      val c = col("a") / col("b")
  10
  11      Seq((1, 0)).toDF("a", "b").select(c).show()
  12
  13      spark.stop()
  14    }
  15  }

After this PR the error message contains the error context (i.e. which Spark Dataset API was called, and from where in the user code) in the following form:

Exception in thread "main" org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== Dataset ==
"div" was called from SimpleApp$.main(SimpleApp.scala:9)

	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
	at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672)
...

This is similar to the context already provided for SQL queries:

org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== SQL(line 1, position 1) ==
a / b
^^^^^

	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
	at org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)
...

Please note that the stack trace in spark-shell doesn't contain meaningful elements:

scala> Thread.currentThread().getStackTrace.foreach(println)
java.base/java.lang.Thread.getStackTrace(Thread.java:1602)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:23)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:27)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:29)
$line15.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
$line15.$read$$iw$$iw$$iw$$iw.<init>(<console>:33)
$line15.$read$$iw$$iw$$iw.<init>(<console>:35)
$line15.$read$$iw$$iw.<init>(<console>:37)
$line15.$read$$iw.<init>(<console>:39)
$line15.$read.<init>(<console>:41)
$line15.$read$.<init>(<console>:45)
$line15.$read$.<clinit>(<console>)
$line15.$eval$.$print$lzycompute(<console>:7)
$line15.$eval$.$print(<console>:6)
$line15.$eval.$print(<console>)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
...

so this change doesn't help with that use case.

Why are the changes needed?

To provide more user-friendly errors.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Added new UTs to QueryExecutionAnsiErrorsSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@peter-toth peter-toth changed the title [SPARK-][SQL] Provide context for dataset API errors [SPARK-45022][SQL] Provide context for dataset API errors Aug 30, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from 628f70d to 5cff672 on August 30, 2023 15:32
@peter-toth peter-toth changed the title [SPARK-45022][SQL] Provide context for dataset API errors [WIP][SPARK-45022][SQL] Provide context for dataset API errors Aug 31, 2023
@@ -241,29 +249,35 @@ object functions
    * @since 1.3.0
    */
   @deprecated("Use approx_count_distinct", "2.1.0")
-  def approxCountDistinct(e: Column): Column = approx_count_distinct(e)
+  def approxCountDistinct(e: Column): Column = withAggregateFunction {
Contributor:
Since we need to touch a lot of functions anyway, here is a long-standing cleanup that was never finished: most of these functions can be implemented by creating an UnresolvedFunction. We don't need to create the function expression directly.
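
To illustrate the suggestion, a minimal sketch (the object and helper names are hypothetical, not Spark's actual private helpers; it assumes catalyst's UnresolvedFunction and the public Column constructor):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction

object FunctionsCleanupSketch {
  // Wrap an UnresolvedFunction in a Column and let the analyzer resolve the
  // name, instead of constructing the concrete function expression by hand.
  def withUnresolvedFunction(name: String, args: Column*): Column =
    new Column(UnresolvedFunction(Seq(name), args.map(_.expr), isDistinct = false))

  // approx_count_distinct then needs no direct reference to its aggregate expression:
  def approx_count_distinct(e: Column): Column =
    withUnresolvedFunction("approx_count_distinct", e)
}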

Contributor Author:
Ok, I will update the PR.

Contributor Author:
#42864 is merged now and this PR is ready for review.

@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch 2 times, most recently from 6d218e4 to b2d8750 on September 4, 2023 13:34
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from b2d8750 to 7556ffa on September 5, 2023 07:14
@peter-toth peter-toth changed the title [WIP][SPARK-45022][SQL] Provide context for dataset API errors [WIP][SPARK-45022][SQL][test-java11] Provide context for dataset API errors Sep 5, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from 7556ffa to 289222f on September 5, 2023 11:04
@github-actions github-actions bot added the INFRA label Sep 5, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from fadd5e9 to 9a8cb86 on September 5, 2023 14:26
@github-actions github-actions bot removed the INFRA label Sep 5, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch 2 times, most recently from f1bb66e to f390e73 on September 7, 2023 11:41
@github-actions github-actions bot removed the CONNECT label Sep 7, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch 2 times, most recently from e00a6a7 to 8b535be on September 7, 2023 17:24
@peter-toth peter-toth changed the title [WIP][SPARK-45022][SQL][test-java11] Provide context for dataset API errors [WIP][SPARK-45022][SQL] Provide context for dataset API errors Sep 8, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from f487a76 to 62c4a98 on September 9, 2023 14:09
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from e8d644e to 26ba6d7 on September 9, 2023 15:27
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from 26ba6d7 to 7adf30e on September 14, 2023 11:40
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from 7adf30e to 5b774a4 on September 17, 2023 13:36
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from cfa0332 to cb936a6 on September 20, 2023 12:27
@github-actions github-actions bot added the BUILD label Sep 20, 2023
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from cb936a6 to 935e74a on September 21, 2023 14:02
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch 2 times, most recently from cabd982 to f0a8e18 on September 24, 2023 08:58
@peter-toth peter-toth force-pushed the SPARK-45022-context-for-dataset-api-errors branch from f0a8e18 to 42028a6 on September 24, 2023 12:12
@peter-toth peter-toth changed the title [WIP][SPARK-45022][SQL] Provide context for dataset API errors [SPARK-45022][SQL] Provide context for dataset API errors Sep 25, 2023

override val objectType = originObjectType.getOrElse("")
Contributor:
why remove override?

Contributor Author (@peter-toth, Sep 26, 2023):

This change accidentally remained here from a different implementation. I will revert it.

    } else {
      methodName
    }
    val callSite = elements(1).toString
Contributor:
A follow-up: make the size of the call site configurable.
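
For context, a self-contained sketch of this kind of call-site extraction (hypothetical names, simplified relative to the PR's actual helper):

object CallSiteSketch {
  // Walk the current stack, skip JVM frames, Spark-internal frames and this
  // helper itself, then keep the first `stackDepth` user frames as the call
  // site. `stackDepth` is the size knob suggested above as a follow-up.
  def userCallSite(stackDepth: Int = 1): String =
    Thread.currentThread().getStackTrace
      .dropWhile { e =>
        e.getClassName.startsWith("java.lang.Thread") ||
        e.getClassName.startsWith("org.apache.spark.") ||
        e.getMethodName == "userCallSite"
      }
      .take(stackDepth)
      .mkString("; ")

  def main(args: Array[String]): Unit =
    println(userCallSite()) // prints something like CallSiteSketch$.main(CallSiteSketch.scala:18)
}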

@@ -947,8 +951,10 @@ class Dataset[T] private[sql](
    * @group untypedrel
    * @since 2.0.0
    */
-  def join(right: Dataset[_]): DataFrame = withPlan {
-    Join(logicalPlan, right.logicalPlan, joinType = Inner, None, JoinHint.NONE)
+  def join(right: Dataset[_]): DataFrame = withOrigin() {
Contributor (@cloud-fan, Sep 26, 2023):
Since this feature is mostly for ANSI mode, do we really need to capture the call site for DataFrame APIs that do not create expressions?

Contributor Author:

We don't need to, but please note that query context can be useful with non-ANSI-related errors too. Here is an example with over(): https://github.com/apache/spark/pull/42740/files#diff-728befc0ad63fec7c44a606c62f3f63204293d3ca87da2830fde8ae7a291656bR188-R189

Contributor:
Is it possible to do this in a central place, like withPlan?

Contributor Author:
Some methods' stack traces can be captured in withPlan, but others' can't, which is why I decided to add withOrigin to all top-level methods.
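
For illustration, a minimal, self-contained version of this wrapping pattern (a hypothetical CallSiteContext, not Spark's actual Origin/CurrentOrigin machinery):

object CallSiteContext {
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  // Each top-level Dataset method would wrap its body so the user's call site
  // is captured once and stays available when an error surfaces later.
  def withOrigin[T](framesToDrop: Int = 0)(f: => T): T = {
    // Frames 0..2 are Thread.getStackTrace, withOrigin and its Spark-internal
    // caller, so the first possibly-user frame is at index framesToDrop + 3.
    val site = Thread.currentThread().getStackTrace
      .lift(framesToDrop + 3).map(_.toString)
    val saved = current.get()
    if (saved.isEmpty) current.set(site) // only the outermost API call wins
    try f finally current.set(saved)
  }

  def callSite: Option[String] = current.get()
}

On the error path, Spark-side code could then read CallSiteContext.callSite and render it into the message, roughly like the "== Dataset ==" block shown in the PR description.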

      f
    } else {
      val st = Thread.currentThread().getStackTrace
      var i = framesToDrop + 3
Contributor:
why 3?

Contributor Author:
Thread.currentThread() and withOrigin() add 2 frames, and because user code can't call withOrigin() directly, we can be sure that the 3rd frame also belongs to Spark code.

Contributor:
I see!
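
To make the frame arithmetic above concrete, a standalone illustration (hypothetical names, unrelated to the PR's code):

object FrameDemo {
  // Stands in for the helper that captures the origin: frame 0 is
  // Thread.getStackTrace, frame 1 is this helper, frame 2 is its
  // (Spark-internal) caller, so the first possibly-user frame is index 3.
  def helper(): StackTraceElement =
    Thread.currentThread().getStackTrace.apply(3)

  def sparkInternal(): StackTraceElement = helper()

  def main(args: Array[String]): Unit =
    println(sparkInternal()) // prints the FrameDemo$.main frame
}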

@MaxGekk (Member) commented Oct 11, 2023

@peter-toth Would you mind if I take this PR over?

@peter-toth (Contributor Author):
> @peter-toth Would you mind if I take this PR over?

Sure, please do it.

@peter-toth peter-toth closed this Oct 11, 2023