[SPARK-45022][SQL] Provide context for dataset API errors #43334
Conversation
@@ -45,4 +48,13 @@ public interface QueryContext {

  // The corresponding fragment of the query which throws the exception.
  String fragment();

  // The Spark code (API) that caused throwing the exception.
  String code();
Shall we reuse the `fragment` function to return the code fragment?
done
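For reference, a minimal sketch of the shape agreed on above (hedged; names follow the diff and the merged interface may differ in details): `fragment()` doubles as both the SQL fragment and the failing Dataset API call, so a separate `code()` method is unnecessary, and `callSite()` carries the user-code location.

```java
// Hedged sketch of QueryContext after this review comment: fragment() returns
// the SQL fragment for SQL contexts and the failing API call for DataFrame
// contexts, so no separate code() method is needed.
interface QueryContext {
    // The fragment of the query, or the Spark API call, that threw the exception.
    String fragment();

    // Where in the user code the failing API was called, e.g. "SimpleApp.scala:7".
    String callSite();
}
```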
-def to(schema: StructType): DataFrame = withPlan {
+def to(schema: StructType): DataFrame = withOrigin() {
   val replaced = CharVarcharUtils.failIfHasCharVarchar(schema).asInstanceOf[StructType]
   Project.matchSchema(logicalPlan, replaced, sparkSession.sessionState.conf)
I think it's good enough to attach the expression call site for ANSI mode; we can attach the plan call site later.
Could you elaborate a bit more on why ANSI mode is important here?
When I revert the changes in a method like this (`DataFrame -> DataFrame`), the fragment becomes blurry, like:

Before: `select`
After: `anonfun$select$4`
@cloud-fan For example, with `withOrigin` in `select`, we stop at index 3. Without `withOrigin` in `select`, the picture is different: we are in `withOrigin` in `Column`'s constructor, and stop at index 5.
} else {
  val st = Thread.currentThread().getStackTrace
  var i = framesToDrop + 3
  while (sparkCode(st(i))) i += 1
Since we have this loop here, why do we still need `framesToDrop`?
We set `framesToDrop = 1` in a few places:

- `Column.fn`
- `withExpr`
- `repartitionByExpression`
- `repartitionByRange`
- `withAggregateFunction`
- `createLambda`

So, there are two options: either the function `sparkCode` doesn't work properly and we forcibly skip one frame, or it is a premature optimization.
I will check that eventually, after all tests have passed.
I wanted to use it only as an optimization: at certain places we simply know at least how many frames deep we are in Spark's code. `sparkCode()` uses a regex, so it can be a bit slow...
> `sparkCode()` uses a regex so it can be a bit slow.

It shouldn't be that slow, I think, especially with just one pattern match. I'll remove the optimization for now.
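To make the discussion concrete, here is a hedged, self-contained sketch of the frame-skipping logic under review (names mirror the diff: `sparkCode()` and the loop over the stack trace). In this sketch "library" frames are identified by their class name; Spark's real check matches its own packages. With this loop in place, a fixed `framesToDrop` offset is only an optimization, never a necessity.

```java
import java.util.regex.Pattern;

// Sketch of withOrigin()-style call-site detection: walk the current stack
// trace past all library frames and report the first user-code frame.
class OriginSketch {
    // Stand-in for Spark's sparkCode() regex; here any frame whose class name
    // contains "OriginSketch" counts as a library frame.
    static final Pattern LIBRARY = Pattern.compile("OriginSketch");

    static boolean sparkCode(StackTraceElement e) {
        return LIBRARY.matcher(e.getClassName()).find();
    }

    // Skip getStackTrace itself (index 0), then every library frame, and
    // return the first remaining frame: the user's call site.
    static StackTraceElement userCallSite() {
        StackTraceElement[] st = Thread.currentThread().getStackTrace();
        int i = 1;
        while (i < st.length - 1 && sparkCode(st[i])) i++;
        return st[i];
    }

    // A fake Dataset-style API entry point; calling it from user code should
    // report the caller's frame, not these library frames.
    static StackTraceElement select() {
        return userCallSite();
    }
}
```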
One more question: does this feature work with the Spark Connect Scala client? If not, we should probably disable this feature for Spark Connect for now, since customers may get confused if they see contexts for Dataset API errors (likely raised in the Spark Connect planner) in the error message.
I don't think we should disable it even if it doesn't work; we have enough time to implement it before Spark 4.0.
/**
 * The type of {@link QueryContext}.
 *
 * @since 3.5.0
This should be `4.0.0`.
Updated
override val objectType = originObjectType.getOrElse("")
override val objectName = originObjectName.getOrElse("")
override val startIndex = originStartIndex.getOrElse(-1)
override val stopIndex = originStopIndex.getOrElse(-1)
Why remove `override`?
Will revert it back.
@Evolving
public enum QueryContextType {
  SQL,
  Dataset
There is no Dataset in PySpark; shall we use the name `DataFrame`? It also exists in Scala as a type alias of `Dataset[Row]`, and `DataFrame` is a more common name in the industry.
ok, I will rename it.
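The renamed enum would then look like this (a hedged sketch; the `@Evolving` annotation from the diff is omitted so the snippet stays self-contained):

```java
// Sketch of the enum after the rename agreed above: the Dataset value becomes
// DataFrame, matching PySpark and the Scala type alias Dataset[Row].
enum QueryContextType {
    SQL,
    DataFrame
}
```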
builder ++= fragment
builder ++= "\""
builder ++= " was called from "
Shall we add a `\n` before the call site?
Not sure about this. Currently it looks like:

== Dataset ==
"col" was called from org.apache.spark.sql.DatasetSuite.$anonfun$new$621(DatasetSuite.scala:2673)

but what you propose:

== Dataset ==
"col" was called from
org.apache.spark.sql.DatasetSuite.$anonfun$new$621(DatasetSuite.scala:2673)

@cloud-fan Are you sure?
OK let's leave it as it is
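A hedged sketch of how this summary string is assembled, mirroring the `StringBuilder` appends in the diff above (the exact labels and surrounding quoting in Spark may differ):

```java
// Builds the Dataset error-context summary shown in the examples above:
//   == Dataset ==
//   "<fragment>" was called from <callSite>
class SummaryBuilder {
    static String summary(String fragment, String callSite) {
        StringBuilder builder = new StringBuilder();
        builder.append("== Dataset ==\n");
        builder.append("\"").append(fragment).append("\"");
        builder.append(" was called from ").append(callSite);
        return builder.toString();
    }
}
```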
  f
} else {
  val st = Thread.currentThread().getStackTrace
  var i = 3
can we add a comment to explain this magic number?
This has been discussed at #42740 (comment).
LGTM except for some minor comments
Merging to master. Thank you, @peter-toth for the original PR, and @cloud-fan and @heyihong for the review.
@@ -1572,7 +1589,9 @@ class Dataset[T] private[sql](
   * @since 2.0.0
   */
  @scala.annotation.varargs
- def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
+ def select(col: String, cols: String*): DataFrame = withOrigin {
I don't think this is helpful -- the underlying `select` already has a `withOrigin` call, no?
We are reverting it in #44501
var i = 3
while (i < st.length && sparkCode(st(i))) i += 1
val origin =
  Origin(stackTrace = Some(Thread.currentThread().getStackTrace.slice(i - 1, i + 1)))
Isn't this super expensive, calling `currentThread().getStackTrace` in a loop? Can't we grab the stack trace only once and filter it as needed?
which loop do you mean?
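The reviewer's point is that the snippet above captures the stack trace once into `st` but then calls `Thread.currentThread().getStackTrace` a second time when building the `Origin`. A hedged sketch of the suggested fix (hypothetical names; Spark's real `sparkCode` check matches its own packages): capture the trace once and slice the same array for the two frames around the call site.

```java
import java.util.Arrays;

// Capture the stack trace a single time, walk to the first non-library frame,
// and reuse the same array for the slice, instead of re-capturing it.
class CallSiteSlice {
    static boolean sparkCode(StackTraceElement e) {
        return e.getClassName().startsWith("org.apache.spark.");
    }

    static StackTraceElement[] callSiteFrames() {
        StackTraceElement[] st = Thread.currentThread().getStackTrace();
        int i = 1;                        // st[0] is Thread.getStackTrace itself
        while (i < st.length - 1 && sparkCode(st[i])) i++;
        // The last library frame plus the first user frame, sliced from st.
        return Arrays.copyOfRange(st, Math.max(i - 1, 0), Math.min(i + 1, st.length));
    }
}
```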
Do we happen to have any specific plan or timeline for supporting this feature for Spark Connect? It seems it is not working on either the Spark Connect Scala or Python client for now.
I don't think so... We are still waiting for people who are familiar with Spark Connect to pick it up.
…taFrame API errors

### What changes were proposed in this pull request?

This PR introduces an enhancement to the error messages generated by PySpark's DataFrame API, adding detailed context about the location within the user's PySpark code where the error occurred. It adds the PySpark user call site information to the `DataFrameQueryContext` added in #43334, aiming to provide PySpark users with the same level of detailed error context for better usability and debugging efficiency for DataFrame APIs.

This PR also introduces `QueryContext.pysparkCallSite` and `QueryContext.pysparkFragment` to get the PySpark information from the query context easily, and enhances `check_error` so that it can test the query context if it exists.

### Why are the changes needed?

To improve debuggability. Errors originating from PySpark operations can be difficult to debug with the limited context in the error messages. While improvements on the JVM side have been made to offer detailed error contexts, PySpark errors often lack this level of detail.

### Does this PR introduce _any_ user-facing change?

No API changes, but error messages will include a reference to the exact line of user code that triggered the error, in addition to the existing descriptive error message. For example, consider the following PySpark code snippet that triggers a `DIVIDE_BY_ZERO` error:

```python
1 spark.conf.set("spark.sql.ansi.enabled", True)
2
3 df = spark.range(10)
4 df.select(df.id / 0).show()
```

**Before:**

```
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```

**After:**

```
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
/.../spark/python/test_pyspark_error.py:4
```

Now the error message points out the exact problematic code path, with the file name and line number that the user wrote.

## Points to the actual problem site instead of the site where the action was called

Even when the action is called after multiple mixed transform operations, the exact problematic site can be reported to the user:

**In:**

```python
1  spark.conf.set("spark.sql.ansi.enabled", True)
2  df = spark.range(10)
3
4  df1 = df.withColumn("div_ten", df.id / 10)
5  df2 = df1.withColumn("plus_four", df.id + 4)
6
7  # This is the problematic divide operation that causes DIVIDE_BY_ZERO.
8  df3 = df2.withColumn("div_zero", df.id / 0)
9  df4 = df3.withColumn("minus_five", df.id / 5)
10
11 df4.collect()
```

**Out:**

```
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
/.../spark/python/test_pyspark_error.py:8
```

### How was this patch tested?

Added UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45377 from itholic/error_context_for_dataframe_api.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR captures the dataset APIs used by the user code and the call site in the user code and provides better error messages.
E.g. consider the following Spark app, `SimpleApp.scala`:

After this PR, the error message contains the error context (which Spark Dataset API is called from where in the user code) in the following form:
which is similar to the already provided context in case of SQL queries:
Please note that the stack trace in `spark-shell` doesn't contain meaningful elements:

so this change doesn't help with that use case.
Why are the changes needed?
To provide more user-friendly errors.
Does this PR introduce any user-facing change?
Yes.
How was this patch tested?
Added new UTs to `QueryExecutionAnsiErrorsSuite`.

Was this patch authored or co-authored using generative AI tooling?
No.