mikhailnik-db
Contributor

@mikhailnik-db mikhailnik-db commented Dec 18, 2024

What changes were proposed in this pull request?

Added a new function, listagg, to PySpark.
Follow-up of #48748.

Why are the changes needed?

Allows users to write queries with listagg using native Python functions, e.g., df.select(F.listagg(df.value, ",").alias("r")).

Does this PR introduce any user-facing change?

Yes, new functions listagg and listagg_distinct (with aliases string_agg and string_agg_distinct) in pyspark.
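
For readers unfamiliar with the aggregate, here is a minimal pure-Python sketch of the semantics these functions are expected to have. It is not PySpark code and not the PR's implementation. The names listagg_model, listagg_distinct_model, and the *_model aliases are hypothetical. The empty default delimiter and the None result for all-null input are assumptions about the SQL function's behavior. The model keeps input order, whereas Spark gives no ordering guarantee unless one is requested.

```python
from typing import Iterable, Optional


def listagg_model(values: Iterable[Optional[str]], delimiter: str = "") -> Optional[str]:
    """Join the non-null values with the delimiter; None if nothing is left."""
    kept = [v for v in values if v is not None]
    if not kept:
        return None
    return delimiter.join(kept)


def listagg_distinct_model(values: Iterable[Optional[str]], delimiter: str = "") -> Optional[str]:
    """Like listagg_model, but each distinct value contributes only once."""
    # dict.fromkeys drops duplicates while preserving first-seen order.
    unique = dict.fromkeys(v for v in values if v is not None)
    return listagg_model(unique, delimiter)


# Hypothetical alias bindings, mirroring string_agg / string_agg_distinct.
string_agg_model = listagg_model
string_agg_distinct_model = listagg_distinct_model

rows = ["a", None, "b", "a"]
print(listagg_model(rows, ","))           # a,b,a
print(listagg_distinct_model(rows, ","))  # a,b
```

The aliases are plain name bindings here, so both names always share one behavior.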

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot

@mikhailnik-db mikhailnik-db changed the title from "[WIP][SPARK-50220] add listagg to pyspark" to "[WIP][SPARK-50220][SQL] Support listagg in PySpark" on Dec 18, 2024
Member

@HyukjinKwon HyukjinKwon left a comment

cc @zhengruifeng FYI

Member

@HyukjinKwon HyukjinKwon left a comment

Let's also list this in python/docs/source/reference/pyspark.sql/functions.rst

@zhengruifeng
Contributor

zhengruifeng commented Dec 19, 2024

Please also add the new function to python/pyspark/sql/functions/__init__.py

----------
col : :class:`~pyspark.sql.Column` or column name
    target column to compute on.
delimiter : :class:`~pyspark.sql.Column` or str, optional
Contributor

Suggested change
- delimiter : :class:`~pyspark.sql.Column` or str, optional
+ delimiter : :class:`~pyspark.sql.Column`, literal string or bytes, optional

----------
col : :class:`~pyspark.sql.Column` or column name
    target column to compute on.
delimiter : :class:`~pyspark.sql.Column` or str, optional
Contributor

Suggested change
- delimiter : :class:`~pyspark.sql.Column` or str, optional
+ delimiter : :class:`~pyspark.sql.Column`, literal string or bytes, optional

# TODO(SPARK-50220): listagg functions will soon be added and removed from this list
"listagg_distinct",
"listagg",
"string_agg",
Contributor

do we plan to add the aliases string_agg and string_agg_distinct?

Contributor Author

Yes, of course. Just wanted to get CI green for listagg first.

Contributor Author

added

    col: "ColumnOrName", delimiter: Optional[Union[Column, str, bytes]] = None
) -> Column:
    if delimiter is None:
        return _invoke_function_over_columns("listagg_distinct", col)
Contributor

I suspect this won't work, since we don't have listagg_distinct in FunctionRegistry.

In Spark Connect, _invoke_function_over_columns just builds an unresolved function.

Contributor

I think you can refer to

def count_distinct(col: "ColumnOrName", *cols: "ColumnOrName") -> Column:
    from pyspark.sql.connect.column import Column as ConnectColumn
    _exprs = [_to_col(c)._expr for c in [col] + list(cols)]
    return ConnectColumn(
        UnresolvedFunction("count", _exprs, is_distinct=True)  # type: ignore[arg-type]
    )
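
The reviewer's concern can be illustrated with a small toy model. It is only a sketch: REGISTRY, UnresolvedCall, and resolve are hypothetical stand-ins and not Spark classes or APIs. The assumption is that name resolution happens later against a registry that knows listagg but has no entry named listagg_distinct, so the distinct variant has to be expressed as the base function plus a distinct flag, as count_distinct does with "count".

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical stand-in for the server-side function registry.
REGISTRY = {"count", "listagg", "string_agg"}


@dataclass(frozen=True)
class UnresolvedCall:
    """Toy unresolved function call: a name, argument names, and a distinct flag."""
    name: str
    args: Tuple[str, ...]
    is_distinct: bool = False


def resolve(call: UnresolvedCall) -> str:
    """Look the name up in the registry and render the call as SQL-like text."""
    if call.name not in REGISTRY:
        raise LookupError(f"undefined function: {call.name}")
    distinct = "DISTINCT " if call.is_distinct else ""
    return f"{call.name}({distinct}{', '.join(call.args)})"


# Base function name plus the flag resolves.
good = UnresolvedCall("listagg", ("value",), is_distinct=True)
print(resolve(good))  # listagg(DISTINCT value)

# A made-up function name builds fine but fails once it is resolved.
bad = UnresolvedCall("listagg_distinct", ("value",))
try:
    resolve(bad)
except LookupError as exc:
    print(exc)  # undefined function: listagg_distinct
```

Building the call object never fails in this model, which is why such a mistake would only surface at resolution time.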

@github-actions github-actions bot added the DOCS label Dec 20, 2024
@mikhailnik-db mikhailnik-db changed the title from "[WIP][SPARK-50220][SQL] Support listagg in PySpark" to "[SPARK-50220][SQL] Support listagg in PySpark" on Dec 20, 2024
@mikhailnik-db mikhailnik-db changed the title from "[SPARK-50220][SQL] Support listagg in PySpark" to "[SPARK-50220][PYTHON] Support listagg in PySpark" on Dec 20, 2024
@zhengruifeng
Contributor

thanks, merged to master
