
Conversation

@SCHJonathan (Contributor) commented Nov 12, 2025

What changes were proposed in this pull request?

This PR adds support for the `spark.sql(...)` Python API inside query functions for Spark Declarative Pipelines. Users can now use `spark.sql(...)` to define query functions, and dependencies are correctly tracked.

**Example usage:**

```python
@dp.materialized_view()
def source():
    return spark.sql("SELECT * FROM RANGE(5)")

@dp.materialized_view()
def target():
    return spark.sql("SELECT * FROM source")
```

This PR also adds restrictions on the set of SQL commands users can execute. Unsupported commands (e.g., `spark.sql("CREATE TABLE ...")`) inside query functions will raise an error.
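
For concreteness, a minimal sketch of the restriction (hedged: the exact error class raised for a disallowed command is not specified in this description, and `t` is a hypothetical table name):

```python
@dp.materialized_view()
def bad_flow():
    # DDL is not allowed inside a query function; the server
    # rejects this instead of executing it.
    return spark.sql("CREATE TABLE t AS SELECT 1")  # raises an error
```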

**Implementation details:**

1. Added `PipelineAnalysisContext` to Spark Connect's user context extensions, enabling the server to identify requests originating from Spark Declarative Pipelines and apply appropriate restrictions.
2. The `flow_name` field in `PipelineAnalysisContext` determines execution behavior (see the sketch below):
   - **Inside query functions** (`flow_name` is set): the Spark Connect server treats `spark.sql()` as a no-op and returns the raw logical plan to SDP for deferred analysis as part of the Dataflow Graph.
   - **Outside query functions** (`flow_name` is empty): the Spark Connect server eagerly executes the command, but only SDP-allowlisted commands are permitted.
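
A minimal, self-contained sketch of this dispatch (illustrative Python; the real server-side logic is not shown in this description, and the allowlist and names below are hypothetical):

```python
ALLOWLISTED_COMMANDS = {"SELECT", "SHOW", "DESCRIBE"}  # illustrative, not the real SDP allowlist

def handle_sql(sql_text: str, flow_name: str) -> dict:
    """Dispatch a spark.sql() request based on PipelineAnalysisContext.flow_name."""
    if flow_name:
        # Inside a query function: no eager analysis; hand the raw logical
        # plan back to SDP for deferred analysis in the Dataflow Graph.
        return {"deferred_plan": sql_text}
    command = sql_text.lstrip().split(maxsplit=1)[0].upper()
    if command not in ALLOWLISTED_COMMANDS:
        # Outside a query function, only SDP-allowlisted commands may run.
        raise ValueError(f"Command not allowed by SDP: {command}")
    return {"executed": sql_text}
```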

Why are the changes needed?

`spark.sql(...)` is a common and intuitive pattern for users who are more familiar with SQL to define query functions. Supporting this API improves usability and allows SQL-first developers to work more naturally with Spark Declarative Pipelines.

Does this PR introduce any user-facing change?

Yes. Previously, `spark.sql(...)` inside query functions was not supported and users would see an `ATTEMPT_ANALYSIS_IN_PIPELINE_QUERY_FUNCTION` exception. This PR lifts that restriction.

How was this patch tested?

New test cases in the `PythonPipelineSuite` unit test suite.

Was this patch authored or co-authored using generative AI tooling?

No

@sryza (Contributor) left a comment

There are a bunch of code cleanup changes here that seem great but are outside the critical path of the main goal of this PR (supporting spark.sql inside pipelines). Would it be difficult to move those changes into a separate PR to reduce risk?

@SCHJonathan (Contributor, Author) commented Nov 12, 2025

> There are a bunch of code cleanup changes here that seem great but are outside the critical path of the main goal of this PR (supporting spark.sql inside pipelines). Would it be difficult to move those changes into a separate PR to reduce risk?

@sryza I will polish up the PR more to reflect this, but unfortunately I think most of the changes are necessary:

  1. We currently don't support eager analysis/execution outside the flow function when it needs to go through pipeline analysis (e.g., `spark.sql("SELECT * FROM external_table")` outside the flow function, or `spark.read.table("external_table").schema`). These need to go through pipeline analysis; otherwise identifiers won't be correctly qualified with the current catalog/schema tracked inside the pipeline. I introduced an `ExternalQueryAnalysisContext` to handle that (see the sketch after this list).
  2. Currently, changing the current catalog/schema is a SQL-only concept, and the related current catalog/schema tracking logic lives inside `SqlGraphRegistrationContext`; I need to port that to `GraphRegistrationContext`, since Python uses it.
  3. There are a few unrelated formatting changes caused by my local scalafmt; I will revert these before requesting formal review.
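
For illustration of point 1, a hedged sketch (`my_catalog.my_schema` and `external_table` are hypothetical names): suppose the pipeline's tracked current schema is `my_catalog.my_schema`. Then, outside a flow function:

```python
# Must be analyzed through the pipeline so the unqualified identifier
# resolves against the pipeline's tracked current catalog/schema,
# i.e. my_catalog.my_schema.external_table, rather than whatever
# catalog/schema the plain Spark session happens to default to.
df = spark.sql("SELECT * FROM external_table")
print(spark.read.table("external_table").schema)
```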

@dongjoon-hyun dongjoon-hyun marked this pull request as draft November 12, 2025 23:31
@dongjoon-hyun (Member) left a comment

Hi, @SCHJonathan.

If you don't mind, please file a JIRA issue in the ASF community repository. It will help prevent potential accidents like the following commits.

  • c707f59 Add PipelineAnalysisContext message to support pipeline analysis during Spark Connect query execution
  • f03c644 Fix: SparkML-connect can't load SparkML (legacy mode) saved model

@SCHJonathan (Contributor, Author) commented

@dongjoon-hyun Absolutely! Thanks for the reminder!

@sryza (Contributor) commented Nov 13, 2025

I think there's an existing JIRA for this: https://issues.apache.org/jira/browse/SPARK-54020

@dongjoon-hyun (Member) commented

> I think there's an existing JIRA for this: https://issues.apache.org/jira/browse/SPARK-54020

Please lead the contributor to update the PR title before starting review, @sryza .

@SCHJonathan SCHJonathan force-pushed the jonathan-chang_data/spark-sql branch from 17800b5 to e515b85 on November 13, 2025 03:13
@SCHJonathan SCHJonathan changed the title Jonathan chang data/spark sql [SPARK-54020] Support spark.sql(...) Python API for Spark Declarative Pipeline Nov 13, 2025
@SCHJonathan SCHJonathan changed the title [SPARK-54020] Support spark.sql(...) Python API for Spark Declarative Pipeline [SPARK-54020] Support spark.sql(...) Python API inside query functions for Spark Declarative Pipeline Nov 13, 2025
@SCHJonathan SCHJonathan requested a review from sryza November 13, 2025 03:36
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review November 13, 2025 04:32
@sryza (Contributor) left a comment

Thanks for this, @SCHJonathan! This closes a significant issue with declarative pipelines; it would be great to get it in this week and have it fixed for Spark 4.1.

@SCHJonathan SCHJonathan requested a review from sryza November 13, 2025 06:08
@sryza (Contributor) left a comment

LGTM! I'll wait until the underlying user context extensions PR gets merged before merging this.



```python
@contextmanager
def block_spark_connect_execution_and_analysis() -> Generator[None, None, None]:
```
Contributor left a comment

Now that we have the context on the server side, it might make more sense to block these operations there – then we don't need to replicate this weird monkeypatching logic across all the clients when we add support for other languages. Doesn't need to be part of this PR though.
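
For context, a generic sketch of the monkeypatching pattern under discussion (hypothetical helper; the real client-side context manager patches Spark Connect's analysis/execution entry points):

```python
from contextlib import contextmanager
from typing import Any, Generator

@contextmanager
def block_method(obj: Any, method_name: str, message: str) -> Generator[None, None, None]:
    """Temporarily replace obj.method_name so any call raises instead of running."""
    original = getattr(obj, method_name)

    def _blocked(*args: Any, **kwargs: Any) -> None:
        raise RuntimeError(message)

    setattr(obj, method_name, _blocked)
    try:
        yield
    finally:
        # Always restore the original method, even if the body raised.
        setattr(obj, method_name, original)
```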

@dongjoon-hyun (Member) commented

Could you make the CI happy, @SCHJonathan? There are several test pipeline failures.

@SCHJonathan (Contributor, Author) commented

> Could you make the CI happy, @SCHJonathan? There are several test pipeline failures.

Absolutely, just fixed

@dongjoon-hyun (Member) commented

Thank you. Could you check these too?

```
[info] *** 26 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.connect.pipelines.PythonPipelineSuite
[error] 	org.apache.spark.sql.connect.pipelines.SparkDeclarativePipelinesServerSuite
```

@dongjoon-hyun (Member) commented

Sounds good to me, too.

@github-actions github-actions bot added the INFRA label Nov 14, 2025
@dongjoon-hyun (Member) commented

It turns out this revealed more failures, which the Python failures were hiding from us.

```
[info] *** 21 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.connect.pipelines.EndToEndAPISuite
```

@SCHJonathan (Contributor, Author) commented Nov 14, 2025

```
[PIPELINE_STORAGE_ROOT_INVALID] Pipeline storage root must be an absolute path with a URI scheme (e.g., file://, s3a://, hdfs://). Got: `/home/runner/work/spark/spark/target/tmp/spark-06c9bfe0-9410-4887-a376-2eec929a70de/storage`. SQLSTATE: 42K03
```

Working on a fix.
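
(For illustration of the fix direction implied by the error message, a hedged snippet: convert the bare temp path into an absolute URI with an explicit scheme.)

```python
import pathlib

# Hypothetical illustration: turn a bare filesystem path into a file:// URI
# so it satisfies the "absolute path with a URI scheme" requirement.
storage_root = pathlib.Path("/tmp/pipeline-storage").resolve().as_uri()
print(storage_root)  # file:///tmp/pipeline-storage
```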

@SCHJonathan (Contributor, Author) commented

@dongjoon-hyun, validated that the CI fix works. Filed a ticket and created a CI fix PR: #53058

dongjoon-hyun pushed a commit that referenced this pull request Nov 14, 2025
…ndard==0.25.0`

### What changes were proposed in this pull request?

In #53024 (comment), the PR CI Python unit tests failed due to
```
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] zstandard >= 0.25.0 must be installed; however, it was not found.
```
This PR adds the required dependency to the pre-merge CI.

### Why are the changes needed?

Recover the Python unit test CI

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PR #53024's Python CI came back to healthy with this change

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #53058 from SCHJonathan/jonathan-chang_data/fix-python-ci-dep.

Authored-by: Yuheng Chang <jonathanyuheng@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Nov 14, 2025
…ndard==0.25.0`


Closes #53058 from SCHJonathan/jonathan-chang_data/fix-python-ci-dep.

Authored-by: Yuheng Chang <jonathanyuheng@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a916690)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@github-actions github-actions bot removed the INFRA label Nov 14, 2025
test("reading external datasets outside query function works") {
sql("CREATE TABLE spark_catalog.default.src AS SELECT * FROM RANGE(5)")
val graph = buildGraph(s"""
|spark_sql_df = spark.sql("SELECT * FROM spark_catalog.default.src")
Member left a comment

Indentation?

@SCHJonathan (Contributor, Author) commented

@sryza @dongjoon-hyun Hi all, I managed to get an all-green CI, and it's ready to be merged.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @SCHJonathan and @sryza .

Merged to master/4.1 for Apache Spark 4.1.0.

dongjoon-hyun pushed a commit that referenced this pull request Nov 15, 2025
…ons for Spark Declarative Pipeline


Closes #53024 from SCHJonathan/jonathan-chang_data/spark-sql.

Authored-by: Yuheng Chang <jonathanyuheng@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit cc72c64)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun (Member) commented

@SCHJonathan, what is your Apache JIRA ID? I want to assign SPARK-54020 to you.
