
[SPARK-48459][CONNECT][PYTHON] Implement DataFrameQueryContext in Spark Connect #46789

Closed

Conversation

@HyukjinKwon (Member) commented May 29, 2024

What changes were proposed in this pull request?

This PR proposes to implement `DataFrameQueryContext` in Spark Connect.

  1. Add two new protobuf messages packed together with Expression:

    message Origin {
      // (Required) Indicate the origin type.
      oneof function {
        PythonOrigin python_origin = 1;
      }
    }
    
    message PythonOrigin {
      // (Required) Name of the origin, for example, the name of the function
      string fragment = 1;
    
      // (Required) Callsite to show to end users, for example, stacktrace.
      string call_site = 2;
    }
  2. Merge `DataFrameQueryContext.pysparkFragment` and `DataFrameQueryContext.pysparkCallSite` into the existing `DataFrameQueryContext.fragment` and `DataFrameQueryContext.callSite`.

  3. Separate `QueryContext` into `SQLQueryContext` and `DataFrameQueryContext` for consistency with the Scala side.

  4. Implement the origin logic. The `current_origin` thread-local holds the current call site and function name, and each `Expression` reads them from it.
    They are set on individual expression messages and used when analysis happens - this resembles the Spark SQL implementation (see the sketches after this list).
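For illustration, here is how the new messages might be populated from Python, assuming standard protobuf codegen for the definitions above (the module path is an assumption, not necessarily the real one):

```python
from pyspark.sql.connect.proto import expressions_pb2 as pb

# Build an Origin carrying a Python-specific fragment and call site.
origin = pb.Origin(
    python_origin=pb.PythonOrigin(
        fragment="__truediv__",                 # name of the API call
        call_site="/path/to/user_code.py:42",   # call site shown to end users
    )
)
```

And a minimal sketch of the thread-local origin pattern described in item 4; the names mirror the PR description, but the code is illustrative, not the actual PySpark implementation:

```python
import threading
import traceback


class Origin:
    """Holds the call site info that Expression later picks up."""

    def __init__(self) -> None:
        self.fragment = None   # e.g. the DataFrame API method name
        self.call_site = None  # e.g. "user_code.py:42"


_local = threading.local()


def current_origin() -> Origin:
    if not hasattr(_local, "origin"):
        _local.origin = Origin()
    return _local.origin


def with_origin(func):
    """Wrap a DataFrame API method so expressions it creates see its origin."""
    def wrapper(*args, **kwargs):
        origin = current_origin()
        prev = (origin.fragment, origin.call_site)
        # extract_stack returns frames oldest-to-newest; with limit=2 the
        # first entry is the user frame that called this API method.
        frame = traceback.extract_stack(limit=2)[0]
        origin.fragment = func.__name__
        origin.call_site = f"{frame.filename}:{frame.lineno}"
        try:
            return func(*args, **kwargs)
        finally:
            origin.fragment, origin.call_site = prev
    return wrapper
```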

See also #45377.

Why are the changes needed?

See #45377

Does this PR introduce any user-facing change?

Yes, same as #45377 but in Spark Connect.

How was this patch tested?

The same unit tests are reused in Spark Connect.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon (Member, Author):

cc @zhengruifeng @hvanhovell @itholic @cloud-fan @MaxGekk I need your review here 🙏

}
}

message PythonOrigin {
Contributor:

Please don't name stuff Python/Scala if it is not language-specific.

@HyukjinKwon (Member, Author):

I actually intend this to be language-specific. For example, the Scala side could have a stack trace chain.

For Python, this is specifically a string for now.

@HyukjinKwon (Member, Author):

I mean, I don't mind combining them for now if you believe it won't be language-specific.

Contributor:

Why can't the Python code be modeled as a StackTraceElement? What's the difference?

@HyukjinKwon (Member, Author):

StackTraceElement has JDK-specific methods and fields (e.g., classLoaderName). I think we should have a dedicated one for each language.

While DataFrameQueryContext and SQLQueryContext carry common information (for now), I think we will end up with some language-specific, dedicated information for both APIs in the future.

However, I am open to having a common one. There is a way to keep it common (and, e.g., throw an exception if some information doesn't make sense in a given language).

@HyukjinKwon force-pushed the connect-context branch 2 times, most recently from 4c0d755 to a5fcb8d on May 30, 2024 00:03
@HyukjinKwon marked this pull request as draft on June 17, 2024 01:39
@HyukjinKwon marked this pull request as ready for review on June 17, 2024 02:10
@@ -342,6 +344,11 @@ message Expression {
}
}

message ExpressionCommon {
Contributor:

Are we over-engineering? What else can be put here?

@HyukjinKwon (Member, Author):

It's the same as Relation.RelationCommon, so it's more for consistency (and so we can reuse Origin for the call site as well). I think it's fine.

@@ -41,7 +41,7 @@ def test_dataframe_query_context(self):
error_class="DIVIDE_BY_ZERO",
message_parameters={"config": '"spark.sql.ansi.enabled"'},
query_context_type=QueryContextType.DataFrame,
-pyspark_fragment="divide",
+fragment="__truediv__",
Contributor:

Hmm, is it a regression?

@HyukjinKwon (Member, Author) commented Jun 17, 2024:

Ah, this is actually a bug fix. Previously the test was (wrongly) checking the Scala-side DataFrameQueryContext.fragment when it had to check DataFrameQueryContext.pysparkFragment.

Because we now merge pysparkFragment into fragment at DataFrameQueryContext, checking fragment is correct, and the test now verifies the right value.

Contributor:

On the Scala side, we use the call site that is closest to the user code as the fragment. I don't think __truediv__ is user-friendly.

For Java, we can find the latest stack frame that is from org.apache.spark, whose next frame is the user code. Can we do the same thing in Python?

@HyukjinKwon (Member, Author) commented Jun 17, 2024:

Yeah, I think we should. Let me take a look separately after this PR - this behavior comes from the original implementation, not this PR.

@HyukjinKwon (Member, Author):

Ah, actually it points to the right place. / is a shortcut for __truediv__, and it reports the line number properly.
The missing part is hiding pyspark.* frames. I will take a separate look.


Contributor:

I believe this is because the names are different in Python and Scala.

Python:

    def __truediv__(

Scala:

    def divide(other: Any): Column = this / other

> Can we do the same thing in Python?

I think one way to do this is to add a dict to with_origin that maps each Python function to the name of the corresponding Scala function (see the sketch below).
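One hedged reading of that suggestion, reusing the illustrative with_origin wrapper sketched earlier; the mapping below is an example for this sketch, not an actual PySpark table:

```python
# Illustrative mapping from Python dunder names to the Scala Column method
# names users are more likely to recognize (an assumption for this sketch).
PYTHON_TO_SCALA_NAME = {
    "__truediv__": "divide",
    "__add__": "plus",
    "__sub__": "minus",
    "__mul__": "multiply",
}


def user_facing_fragment(python_name: str) -> str:
    # Fall back to the Python name when no Scala counterpart is registered.
    return PYTHON_TO_SCALA_NAME.get(python_name, python_name)
```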

@HyukjinKwon (Member, Author):

Actually, we can leverage pyspark.util.try_simplify_traceback, but let's do this in a separate PR.

.build())
.setSummary(queryCtx.summary())
.build()
} else {
@grundprinzip (Contributor) commented Jun 18, 2024:

Is this really an unconditional else?

@HyukjinKwon (Member, Author):

For now, yes, because we only have QueryContextType.SQL and QueryContextType.DataFrame.

.newBuilder()
val context = if (queryCtx.contextType() == QueryContextType.SQL) {
builder
.setContextType(FetchErrorDetailsResponse.QueryContext.ContextType.SQL)
Contributor:

Did we never set this before?

@HyukjinKwon (Member, Author):

Yeah, so it has always been SQLQueryContext by default.

.setObjectType(queryCtx.objectType())
.setObjectName(queryCtx.objectName())
.setStartIndex(queryCtx.startIndex())
.setStopIndex(queryCtx.stopIndex())
.setFragment(queryCtx.fragment())
.build())
.setSummary(queryCtx.summary())
Contributor:

Same here?

@HyukjinKwon (Member, Author):

Yes, we did not have a QueryContext.summary() API before this change.

@@ -379,17 +386,12 @@ class UnknownException(CapturedException, BaseUnknownException):
"""


-class QueryContext(BaseQueryContext):
+class SQLQueryContext(BaseQueryContext):
Contributor:

Do we consider this a private / developer API?

@HyukjinKwon (Member, Author):

Only the parent class QueryContext is a public API (at pyspark.errors.QueryContext) for now. This is at least consistent with the Scala side.

return str(self._q.summary())


class DataFrameQueryContext(BaseQueryContext):
Contributor:

Isn't the type annotation wrong here?

Contributor:

I see this is for the classic side.

Comment on lines +93 to +95
origin = current_origin()
fragment = origin.fragment
call_site = origin.call_site
Contributor:

I think this is what I don't like about this approach, at least for Spark Connect. There is no need to use the thread-local for this propagation.

The outside wrapper manipulates global state for the constructor to pick up here. Shouldn't we simply generate the origin here?

@HyukjinKwon (Member, Author):

I don't quite like this either, actually. Yes, we could pass the explicit call site around. It ended up this way to match the implementation across the Catalyst optimizer, Scala, and PySpark with Spark Classic.

Let me revisit this to fix up the Spark Classic side together and see what we can do.

@grundprinzip (Contributor) left a comment:

I'm generally OK with the PR. My only concern is that we're using a thread-local to propagate the Python query context for Spark Connect without a good reason.

The constructor on the DataFrame class could simply extract this from the call stack. In all honesty, this would probably be true as well for the PySpark classic class.

This sideways loading of values is really not ideal.
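A minimal sketch of the alternative being proposed, under the assumption that the user frame is the first one outside pyspark; the filtering heuristic is illustrative, not the actual implementation:

```python
import inspect


class Expression:
    def __init__(self) -> None:
        self.origin = None
        # Walk outward from this constructor and take the first frame that
        # is not inside pyspark itself as the user call site.
        for frame in inspect.stack():
            if "pyspark" not in frame.filename:
                self.origin = f"{frame.filename}:{frame.lineno}"
                break
```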

@HyukjinKwon (Member, Author):

> The constructor on the DataFrame class could simply extract this from the call stack. In all honesty, this would probably be true as well for the PySpark classic class.

For this, I will revisit and see if I can fix Spark Classic together.

@grundprinzip (Contributor) left a comment:

Let's make sure we're keeping track of the follow-ups.

@HyukjinKwon (Member, Author):
Merged to master.

@@ -89,7 +90,16 @@ class Expression:
"""

 def __init__(self) -> None:
-    pass
+    origin = current_origin()
+    fragment = origin.fragment
@HyukjinKwon (Member, Author) commented Jun 19, 2024:

For #46789 (comment): cc @grundprinzip

Ah, okay. Now I remember why we came this way. The problem is that there's no way to get the caller API name within, say, Expression.__init__(), so we should either:

  1. Manually pass the function name, or
  2. Wrap each API method individually (but without a thread-local).
    The initial implementation in Spark Connect was like this ([SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors #45377), but then I realised that it is actually error-prone and flaky (e.g., it needed a follow-up: [SPARK-47852][PYTHON] Support DataFrameQueryContext for reverse operations #46053). I pointed it out here: [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors #45377 (comment)

My take was that both 1. and 2. have the same flakiness and rely on some manual string name, so we went this way (in addition to matching the Scala implementation). A sketch of option 2 follows below.
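For illustration, a hedged sketch of option 2 - wrapping each API method and attaching the name explicitly, with no thread-local; the fragility mentioned above is that every entry point (e.g. both __truediv__ and its reverse __rtruediv__) must be wrapped by hand:

```python
import functools
import traceback


def capture_origin(method):
    """Attach (fragment, call_site) to the expression the method returns."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # The caller of the wrapped API method is the user code.
        frame = traceback.extract_stack(limit=2)[0]
        result = method(self, *args, **kwargs)
        result.origin = (method.__name__, f"{frame.filename}:{frame.lineno}")
        return result
    return wrapper
```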

@HyukjinKwon (Member, Author) commented Jun 19, 2024:

For #46789 (comment), #47009 (WIP) cc @cloud-fan

@HyukjinKwon (Member, Author):

For SPARK-48639: #47024 cc @grundprinzip

HyukjinKwon added a commit that referenced this pull request Jun 24, 2024
### What changes were proposed in this pull request?

This PR proposes to add `Origin` (from #46789) to `Relation.RelationCommon`

### Why are the changes needed?

To have the common protobuf message to keep the source code info.

### Does this PR introduce _any_ user-facing change?

No. This is not used.

### How was this patch tested?

CI should validate the protobuf definition, and existing tests should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47024 from HyukjinKwon/SPARK-48639.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Jun 24, 2024

Closes #47024 from HyukjinKwon/SPARK-48639.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 09cb592)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Jun 27, 2024
### What changes were proposed in this pull request?

This PR proposes to add `Origin` (from #46789) to `Relation.RelationCommon`
This is a revert of the revert.

### Why are the changes needed?

To have the common protobuf message to keep the source code info.

```diff
-  // TODO(SPARK-48639): Add origin like Expression.ExpressionCommon
-
-  // (Required) Shared relation metadata.
-  string source_info = 1;
+  // (Optional) Shared relation metadata.
+  reserved 1;
```
This is considered a breaking change, and avoiding it would require fixing up all the other downstream branches, which isn't really worthwhile.

### Does this PR introduce _any_ user-facing change?

No. This is not used.

### How was this patch tested?

CI should validate the protobuf definition, and existing tests should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47115 from HyukjinKwon/SPARK-48639-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>