[MINOR][DOCS] Clarify collect_list and collect_set -> ArrayType #43787
landlord-matt wants to merge 4 commits into apache:master from
@HyukjinKwon: I have now done the same for collect_set. Any comments on the proposal, both the idea of clarifying this and the solution itself? I thought you would appreciate this suggestion :'(
I personally think using |
@zhengruifeng: Interesting, because the problem I had was that Python has the built-in types list and set, and they have different properties. If I declare a variable as a set, it will ignore duplicates going forward, but if I declare it as a list, it will accept them. Java and Scala differentiate between arrays, lists, and sets. I think the naming refers to the collection step behaving like a list or a set, but the function returns a standard array, which behaves like a standard array in subsequent steps. If you have memorized the available types in Spark, it is perhaps obvious that the result will be an ArrayType, but I think the documentation should be clear about the specific return type for beginners as well.
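To make the distinction concrete, here is a minimal plain-Python sketch. It does not use Spark: `collect_like_list` and `collect_like_set` are hypothetical helpers that only imitate the behaviour described above, with a Python list standing in for an ArrayType value. Keeping first-seen order in `collect_like_set` is a simplification for the sketch, not a claim about how Spark orders the result.

```python
# Plain-Python analogy only; no Spark is involved here.
values = ["a", "b", "a", "c", "b"]

# Built-in types: a list keeps duplicates, a set silently drops them.
as_list = list(values)
as_set = set(values)


def collect_like_list(items):
    """Hypothetical helper: gather every item, duplicates included."""
    return list(items)


def collect_like_set(items):
    """Hypothetical helper: gather distinct items, but return a plain list.

    dict.fromkeys removes duplicates while keeping first-seen order
    (a simplification chosen for this sketch).
    """
    return list(dict.fromkeys(items))


# Both helpers hand back the same container type (a list here, standing
# in for an array), even though the collection step behaved differently.
listed = collect_like_list(values)
distinct = collect_like_set(values)

# Because the result is an ordinary list, it no longer enforces
# uniqueness in later steps: appending a duplicate succeeds.
distinct_then_appended = distinct + ["a"]
```

The point of the sketch is that "set" describes how the values are gathered, not the type of what comes back: the deduplicated result accepts duplicates again once it is used like any other array.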
This is how other functions have solved this:
@_try_remote_functions
def find_in_set(str: "ColumnOrName", str_array: "ColumnOrName") -> Column:
    """
    Returns the index (1-based) of the given string (`str`) in the comma-delimited
    list (`strArray`). Returns 0, if the string was not found or if the given string (`str`)
    contains a comma.
    .. versionadded:: 3.5.0
def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
    """
    Splits str around matches of the given pattern.
    Returns
    -------
    :class:`~pyspark.sql.Column`
        array of separated strings.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This PR clarifies that collect_list and collect_set return a standard ArrayType and not some new special list or set type.
Why are the changes needed?
The current description causes confusion about what type these functions return. This change also brings the description more in line with the Databricks versions of collect_list and collect_set.
Does this PR introduce any user-facing change?
It improves the documentation.
How was this patch tested?
Not applicable
Was this patch authored or co-authored using generative AI tooling?
No