
[SPARK-48459][CONNECT][PYTHON] Implement DataFrameQueryContext in Spark Connect #46789

Closed

Conversation

@HyukjinKwon (Member) commented May 29, 2024

What changes were proposed in this pull request?

This PR proposes to implement `DataFrameQueryContext` in Spark Connect.

  1. Add two new protobuf messages packed together with Expression:

    message Origin {
      // (Required) Indicate the origin type.
      oneof function {
        PythonOrigin python_origin = 1;
      }
    }
    
    message PythonOrigin {
      // (Required) Name of the origin, for example, the name of the function
      string fragment = 1;
    
      // (Required) Callsite to show to end users, for example, stacktrace.
      string call_site = 2;
    }
  2. Merge `DataFrameQueryContext.pysparkFragment` and `DataFrameQueryContext.pysparkCallSite` into the existing `DataFrameQueryContext.fragment` and `DataFrameQueryContext.callSite`.

  3. Separate `QueryContext` into `SQLQueryContext` and `DataFrameQueryContext` for consistency with the Scala side.

  4. Implement the origin logic. The `current_origin` thread-local holds the current call site and function name, and each `Expression` reads them from it.
    They are set on individual expression messages and used when analysis happens - this resembles the Spark SQL implementation (see the sketches after this list).
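For illustration, here is how the new messages might be populated from Python, assuming standard protobuf codegen for the definitions above (the module path is an assumption, not necessarily the real one):

```python
from pyspark.sql.connect.proto import expressions_pb2 as pb

# Build an Origin carrying a Python-specific fragment and call site.
origin = pb.Origin(
    python_origin=pb.PythonOrigin(
        fragment="__truediv__",                 # name of the API call
        call_site="/path/to/user_code.py:42",   # call site shown to end users
    )
)
```

And a minimal sketch of the thread-local origin pattern described in item 4; the names mirror the PR description, but the code is illustrative, not the actual PySpark implementation:

```python
import threading
import traceback


class Origin:
    """Holds the call site info that Expression later picks up."""

    def __init__(self) -> None:
        self.fragment = None   # e.g. the DataFrame API method name
        self.call_site = None  # e.g. "user_code.py:42"


_local = threading.local()


def current_origin() -> Origin:
    if not hasattr(_local, "origin"):
        _local.origin = Origin()
    return _local.origin


def with_origin(func):
    """Wrap a DataFrame API method so expressions it creates see its origin."""
    def wrapper(*args, **kwargs):
        origin = current_origin()
        prev = (origin.fragment, origin.call_site)
        # extract_stack returns frames oldest-to-newest; with limit=2 the
        # first entry is the user frame that called this API method.
        frame = traceback.extract_stack(limit=2)[0]
        origin.fragment = func.__name__
        origin.call_site = f"{frame.filename}:{frame.lineno}"
        try:
            return func(*args, **kwargs)
        finally:
            origin.fragment, origin.call_site = prev
    return wrapper
```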

See also #45377.

Why are the changes needed?

See #45377

Does this PR introduce any user-facing change?

Yes, same as #45377 but in Spark Connect.

How was this patch tested?

The same unit tests are reused in Spark Connect.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon (Member, Author):

cc @zhengruifeng @hvanhovell @itholic @cloud-fan @MaxGekk I need your review here 🙏

}
}

message PythonOrigin {
Contributor:

Please don't name stuff Python/Scala if it is not language-specific.

@HyukjinKwon (Member, Author):

I actually intend this to be language-specific. For example, the Scala side could have a stack trace chain.

For Python, this is specifically a string for now.

@HyukjinKwon (Member, Author):

I mean, I don't mind combining them for now if you believe it won't be language-specific.

Contributor:

Why can't the Python code be modeled as a StackTraceElement? What's the difference?

@HyukjinKwon (Member, Author):

StackTraceElement has JDK-specific methods and fields (e.g., classLoaderName). I think we should have a dedicated one for each language.

While DataFrameQueryContext and SQLQueryContext carry common information (for now), I think we will end up with some language-specific, dedicated information for both APIs in the future.

However, I am open to having a common one. There is a way to keep it common (and, e.g., throw an exception if some information doesn't make sense in a given language).

@HyukjinKwon force-pushed the connect-context branch 2 times, most recently from 4c0d755 to a5fcb8d on May 30, 2024 00:03
@HyukjinKwon marked this pull request as draft on June 17, 2024 01:39
@HyukjinKwon marked this pull request as ready for review on June 17, 2024 02:10
@@ -342,6 +344,11 @@ message Expression {
}
}

message ExpressionCommon {
Contributor:

Are we over-engineering? What else can be put here?

@HyukjinKwon (Member, Author):

It's the same as Relation.RelationCommon, so it's more for consistency (and so we can reuse Origin for the call site as well). I think it's fine.

@@ -41,7 +41,7 @@ def test_dataframe_query_context(self):
error_class="DIVIDE_BY_ZERO",
message_parameters={"config": '"spark.sql.ansi.enabled"'},
query_context_type=QueryContextType.DataFrame,
-pyspark_fragment="divide",
+fragment="__truediv__",
Contributor:

Hmm, is it a regression?

@HyukjinKwon (Member, Author) commented Jun 17, 2024:

Ah, this is actually a bug fix. Previously the test was (wrongly) checking the Scala-side DataFrameQueryContext.fragment when it had to check DataFrameQueryContext.pysparkFragment.

Because we now merge pysparkFragment into fragment at DataFrameQueryContext, checking fragment is correct, and the test now verifies the right value.

Contributor:

On the Scala side, we use the call site that is closest to the user code as the fragment. I don't think __truediv__ is user-friendly.

For Java, we can find the latest stack frame that is from org.apache.spark, whose next frame is the user code. Can we do the same thing in Python?

@HyukjinKwon (Member, Author) commented Jun 17, 2024:

Yeah, I think we should. Let me take a look separately after this PR - this behavior comes from the original implementation, not this PR.

@HyukjinKwon (Member, Author):

Ah, actually it points to the right place. / is a shortcut for __truediv__, and it reports the line number properly.
The missing part is hiding pyspark.* frames. I will take a separate look.


Contributor:

I believe this is because the names are different in Python and Scala.

Python:

    def __truediv__(

Scala:

    def divide(other: Any): Column = this / other

> Can we do the same thing in Python?

I think one way to do this is to add a dict to with_origin that maps each Python function to the name of the corresponding Scala function (see the sketch below).
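One hedged reading of that suggestion, reusing the illustrative with_origin wrapper sketched earlier; the mapping below is an example for this sketch, not an actual PySpark table:

```python
# Illustrative mapping from Python dunder names to the Scala Column method
# names users are more likely to recognize (an assumption for this sketch).
PYTHON_TO_SCALA_NAME = {
    "__truediv__": "divide",
    "__add__": "plus",
    "__sub__": "minus",
    "__mul__": "multiply",
}


def user_facing_fragment(python_name: str) -> str:
    # Fall back to the Python name when no Scala counterpart is registered.
    return PYTHON_TO_SCALA_NAME.get(python_name, python_name)
```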

@HyukjinKwon (Member, Author):

Actually, we can leverage pyspark.util.try_simplify_traceback, but let's do this in a separate PR.

.build())
.setSummary(queryCtx.summary())
.build()
} else {
@grundprinzip (Contributor) commented Jun 18, 2024:

Is this really an unconditional else?

@HyukjinKwon (Member, Author):

For now, yes, because we only have QueryContextType.SQL and QueryContextType.DataFrame.

.newBuilder()
val context = if (queryCtx.contextType() == QueryContextType.SQL) {
builder
.setContextType(FetchErrorDetailsResponse.QueryContext.ContextType.SQL)
Contributor:

Did we never set this before?

@HyukjinKwon (Member, Author):

Yeah, so it has always been SQLQueryContext by default.

.setObjectType(queryCtx.objectType())
.setObjectName(queryCtx.objectName())
.setStartIndex(queryCtx.startIndex())
.setStopIndex(queryCtx.stopIndex())
.setFragment(queryCtx.fragment())
.build())
.setSummary(queryCtx.summary())
Contributor:

Same here?

@HyukjinKwon (Member, Author):

Yes, we did not have a QueryContext.summary() API before this change.

@@ -379,17 +386,12 @@ class UnknownException(CapturedException, BaseUnknownException):
"""


-class QueryContext(BaseQueryContext):
+class SQLQueryContext(BaseQueryContext):
Contributor:

Do we consider this a private / developer API?

@HyukjinKwon (Member, Author):

Only the parent class QueryContext is a public API (at pyspark.errors.QueryContext) for now. This is at least consistent with the Scala side.

return str(self._q.summary())


class DataFrameQueryContext(BaseQueryContext):
Contributor:

Isn't the type annotation wrong here?

Contributor:

I see this is for the classic side.

Comment on lines +93 to +95
origin = current_origin()
fragment = origin.fragment
call_site = origin.call_site
Contributor:

I think this is what I don't like about this approach, at least for Spark Connect. There is no need to use the thread-local for this propagation.

The outside wrapper manipulates global state for the constructor to pick up here. Shouldn't we simply generate the origin here?

@HyukjinKwon (Member, Author):

I don't quite like this either, actually. Yes, we could pass the explicit call site around. It ended up this way to match the implementation across the Catalyst optimizer, Scala, and PySpark with Spark Classic.

Let me revisit this to fix up the Spark Classic side together and see what we can do.

@grundprinzip (Contributor) left a comment:

I'm generally OK with the PR. My only concern is that we're using a thread-local to propagate the Python query context for Spark Connect without a good reason.

The constructor on the DataFrame class could simply extract this from the call stack. In all honesty, this would probably be true as well for the PySpark classic class.

This sideways loading of values is really not ideal.
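A minimal sketch of the alternative being proposed, under the assumption that the user frame is the first one outside pyspark; the filtering heuristic is illustrative, not the actual implementation:

```python
import inspect


class Expression:
    def __init__(self) -> None:
        self.origin = None
        # Walk outward from this constructor and take the first frame that
        # is not inside pyspark itself as the user call site.
        for frame in inspect.stack():
            if "pyspark" not in frame.filename:
                self.origin = f"{frame.filename}:{frame.lineno}"
                break
```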

@HyukjinKwon (Member, Author):

> The constructor on the DataFrame class could simply extract this from the call stack. In all honesty, this would probably be true as well for the PySpark classic class.

For this, I will revisit and see if I can fix Spark Classic together.

@grundprinzip (Contributor) left a comment:

Let's make sure we're keeping track of the follow-ups.

@HyukjinKwon (Member, Author):
Merged to master.

@@ -89,7 +90,16 @@ class Expression:
"""

 def __init__(self) -> None:
-    pass
+    origin = current_origin()
+    fragment = origin.fragment
@HyukjinKwon (Member, Author) commented Jun 19, 2024:

For #46789 (comment): cc @grundprinzip

Ah, okay. Now I remember why we came this way. The problem is that there's no way to get the caller API name within, say, Expression.__init__(), so we should either:

  1. Manually pass the function name, or
  2. Wrap each API method individually (but without a thread-local).
    The initial implementation in Spark Connect was like this ([SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors #45377), but then I realised that it is actually error-prone and flaky (e.g., it needed a follow-up: [SPARK-47852][PYTHON] Support DataFrameQueryContext for reverse operations #46053). I pointed it out here: [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors #45377 (comment)

My take was that both 1. and 2. have the same flakiness and rely on some manual string name, so we went this way (in addition to matching the Scala implementation). A sketch of option 2 follows below.
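For illustration, a hedged sketch of option 2 - wrapping each API method and attaching the name explicitly, with no thread-local; the fragility mentioned above is that every entry point (e.g. both __truediv__ and its reverse __rtruediv__) must be wrapped by hand:

```python
import functools
import traceback


def capture_origin(method):
    """Attach (fragment, call_site) to the expression the method returns."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # The caller of the wrapped API method is the user code.
        frame = traceback.extract_stack(limit=2)[0]
        result = method(self, *args, **kwargs)
        result.origin = (method.__name__, f"{frame.filename}:{frame.lineno}")
        return result
    return wrapper
```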

@HyukjinKwon (Member, Author) commented Jun 19, 2024:

For #46789 (comment), #47009 (WIP) cc @cloud-fan

@HyukjinKwon (Member, Author):

For SPARK-48639: #47024 cc @grundprinzip

HyukjinKwon added a commit that referenced this pull request Jun 24, 2024
### What changes were proposed in this pull request?

This PR proposes to add `Origin` (from #46789) to `Relation.RelationCommon`

### Why are the changes needed?

To have the common protobuf message to keep the source code info.

### Does this PR introduce _any_ user-facing change?

No. This is not used.

### How was this patch tested?

CI should validate the protobuf definition, and existing tests should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47024 from HyukjinKwon/SPARK-48639.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Jun 24, 2024

Closes #47024 from HyukjinKwon/SPARK-48639.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 09cb592)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Jun 27, 2024
### What changes were proposed in this pull request?

This PR proposes to add `Origin` (from #46789) to `Relation.RelationCommon`
This is a revert of the revert.

### Why are the changes needed?

To have the common protobuf message to keep the source code info.

```diff
-  // TODO(SPARK-48639): Add origin like Expression.ExpressionCommon
-
-  // (Required) Shared relation metadata.
-  string source_info = 1;
+  // (Optional) Shared relation metadata.
+  reserved 1;
```
This is considered a breaking change, and avoiding it would require fixing up all the other downstream branches, which isn't really worthwhile.

### Does this PR introduce _any_ user-facing change?

No. This is not used.

### How was this patch tested?

CI should validate the protobuf definition, and existing tests should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47115 from HyukjinKwon/SPARK-48639-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>