
Conversation

Member

@xinrong-meng xinrong-meng commented Dec 4, 2024

What changes were proposed in this pull request?

Support DataFrame conversion to table arguments in Spark Classic, and enable UDTFs to accept table arguments in both PySpark and Scala.

Spark Connect support will be a follow-up, with the goal of completing it by the end of this month.

Why are the changes needed?

Part of SPARK-50391.
Table-Valued Functions (TVFs) and User-Defined Table Functions (UDTFs) are widely used in Spark workflows. These functions often require a table argument, which Spark internally represents as a Catalyst expression. While Spark SQL supports constructs like TABLE() for this purpose, there is no direct API in PySpark or Scala to convert a DataFrame into a table argument. We therefore propose supporting DataFrame conversion to table arguments (in Spark Classic first) and enabling UDTFs to accept table arguments in both PySpark and Scala.

Does this PR introduce any user-facing change?

Yes. DataFrame conversion to a table argument is now supported in Spark Classic, and UDTFs accept table arguments in both PySpark and Scala.

>>> from pyspark.sql.functions import udtf
>>> from pyspark.sql import Row
>>> 
>>> @udtf(returnType="a: int")
... class TestUDTF:
...     def eval(self, row: Row):
...         if row[0] > 5:
...             yield row[0],
... 
>>> df = spark.range(8)
>>> 
>>> TestUDTF(df.asTable()).show()
+---+                                                                           
|  a|
+---+
|  6|
|  7|
+---+

>>> TestUDTF(df.asTable().partitionBy(df.id)).show()
+---+
|  a|
+---+
|  6|
|  7|
+---+

>>> TestUDTF(df.asTable().partitionBy(df.id).orderBy(df.id)).show()
+---+
|  a|
+---+
|  6|
|  7|
+---+

>>> TestUDTF(df.asTable().withSinglePartition()).show()
+---+
|  a|
+---+
|  6|
|  7|
+---+

>>> TestUDTF(df.asTable().partitionBy(df.id).withSinglePartition()).show()
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.IllegalArgumentException: Cannot call withSinglePartition() after partitionBy() has been called.
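The partition-wise execution model behind partitionBy can be illustrated with a plain-Python sketch. This is a simulation, not Spark code: run_udtf and CountPerPartition are hypothetical names, and the assumed calling convention (matching UDTF semantics) is one UDTF instance per partition, eval() per row, terminate() once per partition:

```python
from itertools import groupby

# Hypothetical pure-Python simulation of partition-wise UDTF execution:
# one instance per partition, eval() per row, terminate() per partition.
class CountPerPartition:
    def __init__(self):
        self.count = 0

    def eval(self, row):
        self.count += 1  # called once per input row

    def terminate(self):
        yield (self.count,)  # called once at the end of the partition

def run_udtf(udtf_cls, rows, partition_key):
    out = []
    # Simulate partitionBy: group rows by key, one UDTF instance per group.
    for _, part in groupby(sorted(rows, key=partition_key), key=partition_key):
        instance = udtf_cls()
        for row in part:
            instance.eval(row)
        out.extend(instance.terminate())
    return out

rows = [(1, "a"), (1, "b"), (2, "c")]
print(run_udtf(CountPerPartition, rows, partition_key=lambda r: r[0]))
# → [(2,), (1,)]
```

With df.asTable().partitionBy(...), Spark performs the grouping itself; the sketch only mirrors the calling convention the UDTF class sees.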

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-50392][PYTHON] DataFrame conversion to table argument in Spark Classic [SPARK-50392][PYTHON] DataFrame conversion to table argument in Spark Classic Dec 13, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review December 13, 2024 01:46

@HyukjinKwon HyukjinKwon left a comment

I reviewed the design, and LGTM


@ueshin ueshin Dec 16, 2024

How about TVFArgument, TableFunctionArgument, or TableValuedFunctionArgument?
In SQL, this applies not only to UDTFs but to TVFs in general, although currently no built-in TVF supports table arguments.
cc @dtenedor @allisonwang-db

Member Author

Good point! UDTFs are a type of TVF; how about TableValuedFunctionArgument?

Contributor

TableValuedFunctionArgument sounds good. This way we don't need to limit it to user-defined table functions.

@xinrong-meng xinrong-meng requested a review from ueshin December 19, 2024 02:25
Member

What's the behavior of this? In SQL, ORDER BY alone is not allowed, IIRC. cc @dtenedor

Member

Similarly, what happens with, e.g., func(df.asTable().partitionBy(df.key).orderBy(df.value).partitionBy())?

Member Author

It is allowed here:

>>> TestUDTF(df.asTable().partitionBy("id").orderBy("id").partitionBy()).show()
+---+
|  a|
+---+
|  6|
|  7|
+---+

Member

@ueshin ueshin Dec 26, 2024

@dtenedor What's the behavior of this? I don't think we should allow it.

Member Author

@xinrong-meng xinrong-meng Dec 27, 2024

ORDER BY alone is not supported in SQL:

[PARSE_SYNTAX_ERROR] Syntax error at or near 'ORDER'. SQLSTATE: 42601 (line 1, pos 59)

== SQL ==
SELECT * FROM test_udtf(TABLE (SELECT id FROM range(0, 8)) ORDER BY id)
-----------------------------------------------------------^^^

Adjusted.

Member Author

@xinrong-meng xinrong-meng Dec 27, 2024

Multiple PARTITION BY clauses are not supported in SQL:

[PARSE_SYNTAX_ERROR] Syntax error at or near 'PARTITION'. SQLSTATE: 42601 (line 1, pos 87)

== SQL ==

SELECT * FROM test_udtf(TABLE (SELECT id FROM range(0, 8)) PARTITION BY id ORDER BY id PARTITION BY id)
---------------------------------------------------------------------------------------^^^

or

== SQL ==
SELECT * FROM test_udtf(TABLE (SELECT id FROM range(0, 8)) PARTITION BY id PARTITION BY id)
---------------------------------------------------------------------------^^^

Adjusted.

Member

Could you add tests with named arguments?

Member Author

@xinrong-meng xinrong-meng Jan 2, 2025

def partitionBy(self, *cols: "ColumnOrName") -> "TableArg":
does not accept keyword arguments; could you clarify what we are expecting here?

Member

func(row = df.asTable() ...)

Member Author

Got it, included named arguments.

Member Author

I created SPARK-50392 as a follow-up to support named arguments.
It requires a change in PythonSQLUtils.namedArgumentExpression, which depends on the TableArg class in Spark Connect.

Member

@ueshin ueshin left a comment

Otherwise, LGTM, pending #49055 (comment) and tests.

"org.apache.spark.sql.ExtendedExplainGenerator"),
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.UDTFRegistration"),
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataSourceRegistration"),
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.TableArg$"),
Member

Do we need this?

Member Author

@xinrong-meng xinrong-meng Jan 3, 2025

I think so; I added the line because of a test-failure hint.
Let me verify here.

@xinrong-meng
Member Author

Otherwise, LGTM, pending #49055 (comment) and tests.

I appreciate your detailed review. That's very helpful!

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @xinrong-meng , @HyukjinKwon , @ueshin , @allisonwang-db .

The newly added table_arg.py seems to break the Spark Connect Python-only CI. Could you take a look at the failure, please?

  File "/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pyspark/sql/table_arg.py", line 20, in <module>
    from pyspark.sql.classic.column import _to_java_column, _to_seq
ModuleNotFoundError: No module named 'pyspark.sql.classic'

@ueshin
Member

ueshin commented Jan 13, 2025

@dongjoon-hyun I submitted a follow-up PR #49472 to fix it. Thanks. cc @xinrong-meng
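The fix follows the deferred-import pattern: move the import from module level into the method that uses it, so the module itself stays importable in connect-only environments. A minimal sketch of the pattern with hypothetical names (this is not the actual table_arg.py code):

```python
# Sketch of the deferred-import pattern; TableArgSketch is hypothetical.
class TableArgSketch:
    def __init__(self, value):
        self.value = value

    def serialize(self):
        # Deferred import: resolved at call time rather than at module
        # import time, so importing this module never fails merely
        # because a dependency is unavailable in some build.
        import json
        return json.dumps(self.value)

print(TableArgSketch({"a": 1}).serialize())
# → {"a": 1}
```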

@dongjoon-hyun
Member

Thank you!

dongjoon-hyun pushed a commit that referenced this pull request Jan 14, 2025
…onnect-only` builds

### What changes were proposed in this pull request?

Move imports into methods to fix connect-only builds.

### Why are the changes needed?

#49055 broke the connect-only builds: #49055 (review)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49472 from ueshin/issues/SPARK-50392/fup.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>