[SPARK-47816][CONNECT][DOCS] Document the lazy evaluation of views in spark.{sql, table}
#46007
Conversation
In Classic Spark, a referenced temporary view is resolved immediately, while in Spark
Connect it is lazily evaluated.
So in Spark Connect, if a view is dropped, modified, or replaced after `spark.sql`, the
execution may fail or generate different results.
Out of curiosity, in which cases may the execution fail?
Drop the view, for example:

```python
df = ...
df.createTempView("some_view")
df2 = spark.sql("SELECT * FROM some_view")
spark.catalog.dropTempView("some_view")  # <- drop the view
df2.show()  # <- should fail in Spark Connect
```
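To see why Spark Classic tolerates the drop while Spark Connect fails, here is a small standalone sketch. It is a toy model, not Spark itself: the `catalog` dict and the `EagerQuery`/`LazyQuery` classes are invented for illustration, standing in for eager analysis at query creation versus lazy analysis at execution.

```python
# Toy model (not Spark APIs): contrast eager view resolution (Spark Classic)
# with lazy resolution at execution time (Spark Connect).

catalog = {}  # stands in for the session's temp-view catalog

class EagerQuery:
    """Resolves (binds) the view when the query is created, like Spark Classic."""
    def __init__(self, view_name):
        self.rows = catalog[view_name]  # resolved immediately, reference kept

    def show(self):
        return list(self.rows)  # execution is still deferred until here

class LazyQuery:
    """Only remembers the name; resolves at execution, like Spark Connect."""
    def __init__(self, view_name):
        self.view_name = view_name

    def show(self):
        return list(catalog[self.view_name])  # resolved now; may fail

catalog["some_view"] = [1, 2, 3]
eager = EagerQuery("some_view")
lazy = LazyQuery("some_view")

del catalog["some_view"]  # drop the view after the queries were created

print(eager.show())  # [1, 2, 3] -- the view was already bound
try:
    lazy.show()
except KeyError:
    print("lazy query failed: view no longer exists at analysis time")
```

Note that both classes defer *execution* to `show()`; only the point of *name resolution* differs, which is the distinction the doc note is trying to make.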
LGTM thanks!
thanks @HyukjinKwon and @xinrong-meng, merged to master
```diff
@@ -1630,6 +1630,13 @@ def sql(
         -------
         :class:`DataFrame`

+        Notes
+        -----
+        In Spark Classic, a temporary view referenced in `spark.sql` is resolved immediately,
```
How about temp functions?
> Notes
> -----
> In Spark Classic, a temporary view referenced in `spark.sql` is resolved immediately,
> while in Spark Connect it is lazily evaluated.
I think this note might be very confusing to users, as DataFrames in Spark are all lazily evaluated, right? Maybe we can say "it is lazily analyzed".
We should probably document this as a behavior change for Spark Connect. I am pretty sure there are other behavior changes. Also, does this lazy analysis apply to persistent tables and views as well?
Sounds good, let me update it to "it is lazily analyzed".

Besides temp views, this lazy analysis applies to temp functions, configurations, and persistent tables. If the functions/configurations/tables are changed after `spark.table`/`spark.sql`, the results may differ from Spark Classic.
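As an illustration of the "different results" case for temp functions, here is another toy Python model (not Spark APIs; the `functions` registry and class names are invented for this sketch): re-registering a function between query creation and execution only affects the lazily analyzed query.

```python
# Toy model (not Spark APIs): a temp-function registry, showing why
# re-registering a function after spark.sql can change results under
# lazy analysis (Spark Connect) but not under eager analysis (Spark Classic).

functions = {}  # stands in for the session's temp-function registry

class EagerExpr:
    """Binds the function object at query creation, like Spark Classic."""
    def __init__(self, fn_name, arg):
        self.fn = functions[fn_name]  # bound immediately
        self.arg = arg

    def collect(self):
        return self.fn(self.arg)

class LazyExpr:
    """Looks the function up by name at execution, like Spark Connect."""
    def __init__(self, fn_name, arg):
        self.fn_name = fn_name
        self.arg = arg

    def collect(self):
        return functions[self.fn_name](self.arg)  # looked up at execution

functions["plus"] = lambda x: x + 1
eager = EagerExpr("plus", 1)
lazy = LazyExpr("plus", 1)

functions["plus"] = lambda x: x + 10  # re-register the temp function

print(eager.collect())  # 2  -- uses the definition bound at creation
print(lazy.collect())   # 11 -- picks up the new definition
```

The same pattern applies to configurations and persistent tables: any name looked up at execution time can observe changes made after the query object was built.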
Other DataFrame APIs may also have the same behavior change; we probably need to document it somewhere like docs/spark-connect-overview.md.
### What changes were proposed in this pull request?

`lazily evaluated` -> `lazily analyzed`

### Why are the changes needed?

to address #46007 (comment)

Closes #46118 from zhengruifeng/doc_nit.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Document the lazy evaluation of views in `spark.{sql, table}`
Why are the changes needed?
it is by design in Spark Connect, so we need to document it
Does this PR introduce any user-facing change?
doc change
How was this patch tested?
ci
Was this patch authored or co-authored using generative AI tooling?
no