-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-50220][PYTHON] Support listagg in PySpark #49231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50220][PYTHON] Support listagg in PySpark #49231
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @zhengruifeng FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also list this in python/docs/source/reference/pyspark.sql/functions.rst
please also add new function into |
---------- | ||
col : :class:`~pyspark.sql.Column` or column name | ||
target column to compute on. | ||
delimiter : :class:`~pyspark.sql.Column` or str, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delimiter : :class:`~pyspark.sql.Column` or str, optional | |
delimiter : :class:`~pyspark.sql.Column`, literal string or bytes, optional |
---------- | ||
col : :class:`~pyspark.sql.Column` or column name | ||
target column to compute on. | ||
delimiter : :class:`~pyspark.sql.Column` or str, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delimiter : :class:`~pyspark.sql.Column` or str, optional | |
delimiter : :class:`~pyspark.sql.Column`, literal string or bytes, optional |
# TODO(SPARK-50220): listagg functions will soon be added and removed from this list | ||
"listagg_distinct", | ||
"listagg", | ||
"string_agg", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we plan to add the aliases string_agg
and string_agg_distinct
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ofc. Just wanted to achieve green ci for listagg
first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
col: "ColumnOrName", delimiter: Optional[Union[Column, str, bytes]] = None | ||
) -> Column: | ||
if delimiter is None: | ||
return _invoke_function_over_columns("listagg_distinct", col) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this won't work, since we don't have listagg_distinct
in FunctionRegistry
.
In spark connect, _invoke_function_over_columns
just build a unresolved function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can refer to
spark/python/pyspark/sql/connect/functions/builtin.py
Lines 1090 to 1096 in 05750de
def count_distinct(col: "ColumnOrName", *cols: "ColumnOrName") -> Column: | |
from pyspark.sql.connect.column import Column as ConnectColumn | |
_exprs = [_to_col(c)._expr for c in [col] + list(cols)] | |
return ConnectColumn( | |
UnresolvedFunction("count", _exprs, is_distinct=True) # type: ignore[arg-type] | |
) |
thanks, merged to master |
What changes were proposed in this pull request?
Added new function
listagg
to pyspark.Follow-up of #48748.
Why are the changes needed?
Allows to use native Python functions to write queries with
listagg
. E.g.,df.select(F.listagg(df.value, ",").alias("r"))
.Does this PR introduce any user-facing change?
Yes, new functions
listagg
andlistagg_distinct
(with aliasesstring_agg
andstring_agg_distinct
) in pyspark.How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot