[MINOR][DOCS] Clarify collect_list and collect_set -> ArrayType#43787

Closed
landlord-matt wants to merge 4 commits into apache:master from landlord-matt:patch-1

Conversation

@landlord-matt

@landlord-matt landlord-matt commented Nov 13, 2023

What changes were proposed in this pull request?

This PR clarifies that collect_list and collect_set return a standard ArrayType, not some new special list or set type.

Why are the changes needed?

The current description causes confusion about what type this function returns. This change makes it more in line with the Databricks versions for collect_list and collect_set as well.

Does this PR introduce any user-facing change?

No; it only improves the documentation.

How was this patch tested?

Not applicable

Was this patch authored or co-authored using generative AI tooling?

No

@landlord-matt landlord-matt changed the title Documenation: Clarify collect_list -> ArrayType [MINOR][DOCS]: Clarify collect_list -> ArrayType Nov 14, 2023
@HyukjinKwon HyukjinKwon changed the title [MINOR][DOCS]: Clarify collect_list -> ArrayType [MINOR][DOCS] Clarify collect_list -> ArrayType Nov 15, 2023
@landlord-matt
Author

landlord-matt commented Nov 15, 2023

@HyukjinKwon: I now also did a similar thing for collect_set. Any comments on the proposal? Both the idea of clarifying this and the solution. I thought you would appreciate this suggestion :'(

@landlord-matt landlord-matt changed the title [MINOR][DOCS] Clarify collect_list -> ArrayType [MINOR][DOCS] Clarify collect_list and collect_set -> ArrayType Nov 15, 2023
@zhengruifeng
Contributor

I personally think using "list" is fine; "array" in Python reminds me of the built-in array module and NumPy's array.

@landlord-matt
Author

landlord-matt commented Nov 17, 2023

@zhengruifeng: Interesting, because the problem I had was that Python has the built-in types list and set, and they have different properties. If I declare a variable as a set, it will ignore duplicates going forward, but if I declare it as a list, it will accept them. Java/Scala also differentiate arrays, lists, and sets.

I think the naming refers to the fact that the collection step behaves like a list or a set, but the function returns a standard array, which behaves like any other array in subsequent steps. If you have memorized the available Spark types, it is perhaps obvious that the result is an ArrayType, but I think the documentation should state the specific return type clearly for beginners as well.
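A minimal plain-Python sketch of the distinction described above (illustrative only, not PySpark code): collecting into a list keeps duplicates while collecting into a set drops them, yet either result can be handed onward as one ordinary sequence, which is the analogy to both functions returning a plain ArrayType.

```python
values = [1, 2, 2, 3]

# Collection step: "list-like" keeps duplicates (like collect_list),
# "set-like" drops them (like collect_set).
as_list = list(values)
as_set = set(values)

# Downstream step: both can be treated as the same ordinary sequence type,
# just as Spark exposes both results as a standard ArrayType.
print(sorted(as_list))  # [1, 2, 2, 3]
print(sorted(as_set))   # [1, 2, 3]
```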

@landlord-matt
Author

This is how other functions document their return types:

@_try_remote_functions
def find_in_set(str: "ColumnOrName", str_array: "ColumnOrName") -> Column:
    """
    Returns the index (1-based) of the given string (`str`) in the comma-delimited
    list (`strArray`). Returns 0, if the string was not found or if the given string (`str`)
    contains a comma.

    .. versionadded:: 3.5.0
    """

def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
    """
    Splits str around matches of the given pattern.

    Returns
    -------
    :class:`~pyspark.sql.Column`
        array of separated strings.
    """
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 26, 2024
@github-actions github-actions bot closed this Feb 27, 2024