[SPARK-54020] Support spark.sql(...) Python API inside query functions for Spark Declarative Pipeline #53024
Conversation
sryza
left a comment
There are a bunch of code cleanup changes here that seem great but are outside the critical path of the main goal of this PR (supporting spark.sql inside pipelines). Would it be difficult to move those changes into a separate PR to reduce risk?
@sryza I will polish up the PR to reflect this, but unfortunately I think most of the changes are necessary:
dongjoon-hyun
left a comment
Hi, @SCHJonathan .
If you don't mind, please file a JIRA issue in the ASF community repository. It will help you avoid potential accidents like the following commits.
@dongjoon-hyun Absolutely! Thanks for the reminder!
I think there's an existing JIRA for this: https://issues.apache.org/jira/browse/SPARK-54020
Please have the contributor update the PR title before starting review, @sryza .
(Force-pushed 17800b5 to e515b85)
(Title changed from "spark.sql(...) Python API for Spark Declarative Pipeline" to "spark.sql(...) Python API inside query functions for Spark Declarative Pipeline")
sryza
left a comment
Thanks for this @SCHJonathan ! This closes a significant issue with declarative pipelines that would be great to get in this week and have fixed for Spark 4.1.
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/pipelines/PipelinesHandler.scala (four outdated review threads, all resolved)
sryza
left a comment
LGTM! I'll wait until the underlying user context extensions PR gets merged before merging this.
```python
@contextmanager
def block_spark_connect_execution_and_analysis() -> Generator[None, None, None]:
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the context on the server side, it might make more sense to block these operations there – then we don't need to replicate this weird monkeypatching logic across all the clients when we add support for other languages. Doesn't need to be part of this PR though.
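For context, here is a minimal sketch of the client-side monkeypatching approach being discussed. The `SparkConnectClient` attribute names and the error type below are assumptions for illustration, not the actual PySpark internals:

```python
from contextlib import contextmanager
from typing import Generator


@contextmanager
def block_spark_connect_execution_and_analysis() -> Generator[None, None, None]:
    """Temporarily patch the Spark Connect client so any RPC that would
    trigger execution or analysis raises instead of reaching the server."""
    from pyspark.sql.connect.client import SparkConnectClient

    def _blocked(*args, **kwargs):
        raise RuntimeError(
            "Execution/analysis is not allowed while pipeline query "
            "functions are being evaluated."
        )

    # Assumed method names; the real client internals may differ.
    original_execute = SparkConnectClient.execute_command
    original_analyze = SparkConnectClient._analyze
    SparkConnectClient.execute_command = _blocked
    SparkConnectClient._analyze = _blocked
    try:
        yield
    finally:
        # Restore the original methods even if the body raised.
        SparkConnectClient.execute_command = original_execute
        SparkConnectClient._analyze = original_analyze
```

Moving the check server-side, as suggested above, would make this per-client patching unnecessary.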
Could you make the CI happy, @SCHJonathan ? There are several test pipeline failures.
Absolutely, just fixed.
Thank you. Could you check these too?
Sounds good to me, too.
It turns out we revealed another failure which …
@dongjoon-hyun , validated that the CI fix works. Filed a ticket and created a CI fix PR: #53058
…ndard==0.25.0`

### What changes were proposed in this pull request?

In #53024 (comment), the PR CI Python unit tests failed due to

```
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] zstandard >= 0.25.0 must be installed; however, it was not found.
```

This PR adds the required dependency to the pre-merge CI.

### Why are the changes needed?

Recovers the Python unit tests CI.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PR #53024 Python CI is back to healthy with this change.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #53058 from SCHJonathan/jonathan-chang_data/fix-python-ci-dep.

Authored-by: Yuheng Chang <jonathanyuheng@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a916690)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
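As a quick way to confirm a local environment satisfies the dependency that CI was missing (illustrative only; this version check is not PySpark's actual import-time check):

```python
# Verify the environment has the package the CI image lacked:
# PySpark's affected code path requires zstandard >= 0.25.0.
import zstandard

major, minor = (int(p) for p in zstandard.__version__.split(".")[:2])
assert (major, minor) >= (0, 25), "zstandard >= 0.25.0 must be installed"
```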
| test("reading external datasets outside query function works") { | ||
| sql("CREATE TABLE spark_catalog.default.src AS SELECT * FROM RANGE(5)") | ||
| val graph = buildGraph(s""" | ||
| |spark_sql_df = spark.sql("SELECT * FROM spark_catalog.default.src") |
Indentation?
...nnect/server/src/test/scala/org/apache/spark/sql/connect/pipelines/PythonPipelineSuite.scala (three outdated review threads, all resolved)
@sryza @dongjoon-hyun Hi all, managed to get an all-green CI, and it's ready to be merged.
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @SCHJonathan and @sryza .
Merged to master/4.1 for Apache Spark 4.1.0.
[SPARK-54020] Support spark.sql(...) Python API inside query functions for Spark Declarative Pipeline
### What changes were proposed in this pull request?
This PR adds support for `spark.sql(...)` Python API inside query functions for Spark Declarative Pipelines. Users can now use `spark.sql(...)` to define query functions, and dependencies are correctly tracked.
**Example usage:**
```python
@dp.materialized_view()
def source():
    return spark.sql("SELECT * FROM RANGE(5)")

@dp.materialized_view()
def target():
    return spark.sql("SELECT * FROM source")
```
This PR also adds restrictions on the set of SQL commands users can execute. Unsupported commands (e.g., `spark.sql("CREATE TABLE ...")`) inside query functions will raise an error.
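For illustration, a sketch of the restricted case. The `dp` import alias, the session lookup, and the exact failure behavior are assumptions based on the description above:

```python
from pyspark import pipelines as dp  # assumed import alias for the SDP API
from pyspark.sql import SparkSession

spark = SparkSession.getActiveSession()  # session available in pipeline files

@dp.materialized_view()
def bad_flow():
    # CREATE TABLE is not an SDP-allowlisted command, so per this PR the
    # call below raises an error inside a query function.
    spark.sql("CREATE TABLE spark_catalog.default.t AS SELECT 1")
    return spark.sql("SELECT 1")
```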
**Implementation details:**
1. Added `PipelineAnalysisContext` to Spark Connect's user context extensions, enabling the server to identify requests originating from Spark Declarative Pipelines and apply appropriate restrictions.
2. The `flow_name` field in `PipelineAnalysisContext` determines execution behavior:
- **Inside query functions** (`flow_name` is set): Spark Connect server treats `spark.sql()` as a no-op and returns the raw logical plan to SDP for deferred analysis as part of the Dataflow Graph.
- **Outside query functions** (`flow_name` is empty): Spark Connect server eagerly executes the command, but only SDP-allowlisted commands are permitted.
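A minimal sketch of how the two modes differ from the user's point of view, reusing the assumed `dp`/`spark` setup from the earlier sketch (`upstream` is an illustrative name):

```python
# flow_name set (inside a query function): spark.sql() is not executed here.
# The server hands back the raw logical plan, so `upstream` may be defined
# elsewhere in the pipeline and is only resolved during graph analysis.
@dp.materialized_view()
def downstream():
    return spark.sql("SELECT * FROM upstream")

# flow_name empty (outside any query function): the statement runs eagerly
# on the Spark Connect server, subject to the SDP command allowlist.
spark.sql("SELECT * FROM RANGE(5)")
```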
### Why are the changes needed?
`spark.sql(...)` is a common and intuitive pattern for users who are more familiar with SQL to define query functions. Supporting this API improves usability and allows SQL-first developers to work more naturally with Spark Declarative Pipelines.
### Does this PR introduce _any_ user-facing change?
Yes. Previously, `spark.sql(...)` inside query functions was not supported and users would see an `ATTEMPT_ANALYSIS_IN_PIPELINE_QUERY_FUNCTION` exception. This PR lifts that restriction.
### How was this patch tested?
New test cases in the `PythonPipelineSuite` unit tests
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #53024 from SCHJonathan/jonathan-chang_data/spark-sql.
Authored-by: Yuheng Chang <jonathanyuheng@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit cc72c64)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@SCHJonathan . What is your Apache JIRA ID? I want to assign SPARK-54020 to you.