[SPARK-50392][PYTHON] DataFrame conversion to table argument in Spark Classic #49055
Conversation
HyukjinKwon left a comment
I reviewed the design, and LGTM
python/pyspark/sql/udtf_argument.py (Outdated)
How about TVFArgument, TableFunctionArgument or TableValuedFunctionArgument?
In SQL, this is not only for UDTFs but for TVFs in general, although currently no built-in TVF supports table arguments.
cc @dtenedor @allisonwang-db
Good point! UDTFs are a type of TVF; how about TableValuedFunctionArgument?
TableValuedFunctionArgument sounds good. This way we don't need to limit it to user-defined table functions.
What's the behavior of this? In SQL, ORDER BY alone is not allowed, IIRC. cc @dtenedor
Similarly, what happens if, e.g., func(df.asTable().partitionBy(df.key).orderBy(df.value).partitionBy())?
It is allowed here:
>>> TestUDTF(df.asTable().partitionBy("id").orderBy("id").partitionBy()).show()
+---+
| a|
+---+
| 6|
| 7|
+---+
@dtenedor What's the behavior of this? I don't think we should allow this.
ORDER BY alone is not supported in SQL:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'ORDER'. SQLSTATE: 42601 (line 1, pos 59)
== SQL ==
SELECT * FROM test_udtf(TABLE (SELECT id FROM range(0, 8)) ORDER BY id)
-----------------------------------------------------------^^^
Adjusted.
Multiple PARTITION BY clauses are not supported either:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'PARTITION'. SQLSTATE: 42601 (line 1, pos 87)
== SQL ==
SELECT * FROM test_udtf(TABLE (SELECT id FROM range(0, 8)) PARTITION BY id ORDER BY id PARTITION BY id)
---------------------------------------------------------------------------------------^^^
or
== SQL ==
SELECT * FROM test_udtf(TABLE (SELECT id FROM range(0, 8)) PARTITION BY id PARTITION BY id)
---------------------------------------------------------------------------^^^
Adjusted.
Could you add tests with named arguments?
def partitionBy(self, *cols: "ColumnOrName") -> "TableArg":
does not accept keyword arguments; could you clarify what we are expecting here?
func(row=df.asTable() ...)
Got it, included named arguments.
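For reference, a minimal sketch of such a named table argument call, assuming the TestUDTF handle from the earlier example; the parameter name row is an assumption taken from the func(row=df.asTable() ...) suggestion above:

# Hypothetical sketch only: "row" as the parameter name is an assumption
# for illustration; the table argument itself is built the same way as in
# the positional examples in this thread.
TestUDTF(row=df.asTable().partitionBy("id").orderBy("id")).show()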
I created SPARK-50392 as a follow-up to support named arguments.
It requires a change in PythonSQLUtils.namedArgumentExpression, which depends on the TableArg class in Spark Connect.
ueshin left a comment
Otherwise, LGTM, pending #49055 (comment) and tests.
| "org.apache.spark.sql.ExtendedExplainGenerator"), | ||
| ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.UDTFRegistration"), | ||
| ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataSourceRegistration"), | ||
| ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.TableArg$"), |
Do we need this?
I think so; I added the line due to a test failure hint. Let me verify here.
I appreciate your detailed review. That's very helpful!
Hi @xinrong-meng, @HyukjinKwon, @ueshin, @allisonwang-db.
The newly added table_arg.py seems to break the Spark Connect Python-only CI. Could you take a look at the failure, please?
File "/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/pyspark/sql/table_arg.py", line 20, in <module>
from pyspark.sql.classic.column import _to_java_column, _to_seq
ModuleNotFoundError: No module named 'pyspark.sql.classic'
@dongjoon-hyun I submitted a follow-up PR #49472 to fix it. Thanks. cc @xinrong-meng
Thank you!
…onnect-only` builds

### What changes were proposed in this pull request?
Move imports into methods to fix connect-only builds.

### Why are the changes needed?
#49055 broke the connect-only builds: #49055 (review)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49472 from ueshin/issues/SPARK-50392/fup.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
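The fix follows the deferred-import pattern. A minimal sketch, assuming the table_arg.py layout from the traceback above; the method body here is illustrative, not the actual implementation:

# pyspark.sql.classic does not exist in connect-only installs, so importing
# it at module scope makes pyspark.sql.table_arg unimportable there.
class TableArg:
    def partitionBy(self, *cols):
        # Deferred import: only evaluated when the method actually runs,
        # i.e. on Spark Classic where pyspark.sql.classic is available.
        from pyspark.sql.classic.column import _to_java_column, _to_seq
        ...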
What changes were proposed in this pull request?
Support DataFrame conversion to table arguments in Spark Classic, and enable UDTFs to accept table arguments in both PySpark and Scala.
Spark Connect support will be a follow-up, with the goal of completing it by the end of this month.
Why are the changes needed?
Part of SPARK-50391.
Table-Valued Functions (TVFs) and User-Defined Table Functions (UDTFs) are widely used in Spark workflows. These functions often require a table argument, which Spark internally represents as a Catalyst expression. While Spark SQL supports constructs like TABLE(...) for this purpose, there is no direct API in PySpark or Scala to convert a DataFrame into a table argument. So we propose to support DataFrame conversion to table arguments (in Spark Classic first) and to enable UDTFs to accept table arguments in both PySpark and Scala.
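For illustration, a minimal end-to-end PySpark sketch of the new path, adapted from the examples in this conversation; it assumes an active SparkSession named spark, and the UDTF body is an assumption:

from pyspark.sql.functions import udtf

@udtf(returnType="a: int")
class TestUDTF:
    def eval(self, row):
        # Receives one Row of the table argument at a time.
        yield row["id"],

df = spark.range(0, 8)
# DataFrame -> table argument, with optional partitioning and ordering.
TestUDTF(df.asTable().partitionBy("id").orderBy("id")).show()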
Does this PR introduce any user-facing change?
Yes. DataFrame conversion to table argument is supported in Spark Classic, and UDTFs accept table arguments in both PySpark and Scala.
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.