[SPARK-45022][SQL] Provide context for dataset API errors #42740
Conversation
```diff
@@ -241,29 +249,35 @@ object functions {
    * @since 1.3.0
    */
   @deprecated("Use approx_count_distinct", "2.1.0")
-  def approxCountDistinct(e: Column): Column = approx_count_distinct(e)
+  def approxCountDistinct(e: Column): Column = withAggregateFunction {
```
Since we need to touch a lot of functions anyway, here is a long-standing cleanup that was never finished: most of these functions can be implemented by creating an `UnresolvedFunction`. We don't need to construct the function expression directly.
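A minimal, Spark-free sketch of the pattern being suggested (all type and helper names below are simplified stand-ins for Spark's `Column`/`UnresolvedFunction` machinery, not the actual API):

```scala
// Toy model of the cleanup: instead of constructing a concrete function
// expression class for every API, build a generic unresolved-function node
// by name and let the analyzer resolve it later.
sealed trait Expr
final case class ColumnRef(name: String) extends Expr
final case class UnresolvedFn(name: String, args: Seq[Expr]) extends Expr

object Fns {
  // One generic helper replaces many hand-written expression constructors.
  def fn(name: String, args: Expr*): Expr = UnresolvedFn(name, args)

  // e.g. approx_count_distinct no longer needs to instantiate its concrete
  // aggregate expression directly:
  def approx_count_distinct(e: Expr): Expr = fn("approx_count_distinct", e)
}
```

With this shape, adding or fixing a function only touches the name string, and resolution stays in one place.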
Ok, I will update the PR.
#42864 is merged now and this PR is ready for review.
```scala
override val objectType = originObjectType.getOrElse("")
```
Why remove `override`?
This change accidentally remained here from a different implementation. I will revert it.
```scala
} else {
  methodName
}
val callSite = elements(1).toString
```
A follow-up: make the size of the call site configurable.
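A hedged sketch of what such a follow-up could look like as a SQL conf entry (the key name and default below are illustrative only, not something this PR adds):

```properties
# Hypothetical conf: how many stack-trace entries to keep per Dataset call site.
spark.sql.stackTracesInDataFrameContext=1
```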
```diff
@@ -947,8 +951,10 @@ class Dataset[T] private[sql](
    * @group untypedrel
    * @since 2.0.0
    */
-  def join(right: Dataset[_]): DataFrame = withPlan {
-    Join(logicalPlan, right.logicalPlan, joinType = Inner, None, JoinHint.NONE)
+  def join(right: Dataset[_]): DataFrame = withOrigin() {
```
Since this feature is mostly for ANSI mode, do we really need to capture the call site for DataFrame APIs that do not create expressions?
We don't need to, but please note that query context can be useful with non-ANSI-related errors too. Here is an example with `over()`: https://github.com/apache/spark/pull/42740/files#diff-728befc0ad63fec7c44a606c62f3f63204293d3ca87da2830fde8ae7a291656bR188-R189
Is it possible to do this in a central place, like `withPlan`?
Some methods' stack traces can be captured in `withPlan`, but others' can't; that's why I decided to add `withOrigin` to all top-level methods.
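A minimal, Spark-free sketch of the `withOrigin` idea discussed here (names and details are simplified stand-ins; the real PR integrates with Spark's `Origin`/`CurrentOrigin` machinery rather than this toy version):

```scala
import scala.util.DynamicVariable

// Simplified stand-in for Spark's Origin: just the captured call site, if any.
final case class Origin(callSite: Option[String] = None)

object CurrentOrigin {
  private val current = new DynamicVariable[Origin](Origin())

  def get: Origin = current.value

  // Wraps a top-level API method. Only the outermost call captures the call
  // site; nested API calls see it already set and leave it alone.
  def withOrigin[T](framesToDrop: Int = 0)(f: => T): T =
    if (current.value.callSite.isDefined) f
    else {
      val st = Thread.currentThread().getStackTrace
      // skip getStackTrace, withOrigin, and the wrapped method itself
      val site = st.drop(framesToDrop + 3).headOption.map(_.toString)
      current.withValue(Origin(site))(f)
    }
}
```

With this shape, a public method like `def join(right: Dataset[_]): DataFrame = withOrigin() { ... }` records the user's call site once per API entry, and nested internal calls reuse it.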
```scala
  f
} else {
  val st = Thread.currentThread().getStackTrace
  var i = framesToDrop + 3
```
Why 3?
`Thread.currentThread()` and `withOrigin()` add 2 frames, and because user code can't call `withOrigin()` directly, we can be sure that the 3rd frame also belongs to Spark code.
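The frame arithmetic can be checked with a small Spark-free snippet (the helper names here are made up for the demonstration and are not the PR's code):

```scala
object FrameDemo {
  // Plays the role of withOrigin(): in the returned array, index 0 is
  // Thread.getStackTrace itself and index 1 is this method.
  def withOriginLike(): Array[StackTraceElement] =
    Thread.currentThread().getStackTrace

  // Plays the role of a public Dataset API method (index 2), so its caller —
  // the "user code" — is the element at index 3.
  def publicApi(): StackTraceElement = withOriginLike()(3)
}
```

Calling `FrameDemo.publicApi()` returns a frame outside both helpers, i.e. the first frame belonging to the caller.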
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see!
@peter-toth Would you mind if I take this PR over?
Sure, please do it.
What changes were proposed in this pull request?
This PR captures the Dataset API calls made by user code, together with their call sites in the user code, and uses them to provide better error messages.
E.g. consider the following Spark app, SimpleApp.scala:

After this PR the error message contains the error context (which Spark Dataset API is called from where in the user code) in the following form:
which is similar to the already provided context in case of SQL queries:
Please note that the stack trace in spark-shell doesn't contain meaningful elements:

so this change doesn't help with that use case.
Why are the changes needed?
To provide more user-friendly errors.
Does this PR introduce any user-facing change?
Yes.
How was this patch tested?
Added new UTs to QueryExecutionAnsiErrorsSuite.

Was this patch authored or co-authored using generative AI tooling?
No.