[MINOR][DOCS] Clarify collect_list and collect_set -> ArrayType by landlord-matt · Pull Request #43787 · apache/spark

landlord-matt · 2023-11-13T15:35:53Z

What changes were proposed in this pull request?

This PR clarifies that collect_list and collect_set return a standard ArrayType, not some new special list or set type.

Why are the changes needed?

The current description causes confusion about what type these functions return. This change also brings the description more in line with the Databricks versions for collect_list and collect_set.

Does this PR introduce any user-facing change?

It improves the documentation

How was this patch tested?

Not applicable

Was this patch authored or co-authored using generative AI tooling?

No

landlord-matt · 2023-11-15T09:00:08Z

@HyukjinKwon: I have now made a similar change for collect_set. Any comments on the proposal, both the idea of clarifying this and the solution itself? I thought you would appreciate this suggestion :'(

zhengruifeng · 2023-11-16T10:55:24Z

I personally think using list is fine. In Python, array reminds me of the built-in array module and NumPy's array.

landlord-matt · 2023-11-17T13:53:53Z

@zhengruifeng: Interesting, because the problem I had was that Python has the built-in types list and set, and they have different properties. If I declare a variable as a set, it will ignore duplicates going forward, but if I declare it as a list, it will accept them. Java and Scala differentiate between arrays, lists and sets.
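
As a small illustration of that difference, here is a plain-Python sketch with no Spark involved. The variable names are made up for the example.

```python
# Plain Python only: a list keeps duplicates, a set silently drops them.
values_list = [1, 2, 2, 3]
values_list.append(2)       # accepted: the list now holds three 2s

values_set = {1, 2, 2, 3}   # the duplicate 2 is dropped at construction
values_set.add(2)           # no effect: 2 is already present

list_len = len(values_list)
set_len = len(values_set)
print(list_len, set_len)    # prints "5 3"
```

The point is only that the two Python types keep behaving differently after they are created, which is the expectation a reader might carry over to the names collect_list and collect_set.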

I think the naming refers to the fact that the collection step behaves like a list or a set, but the function returns a standard array, which behaves like a standard array in subsequent steps. If you have memorized the available types in Spark, it is perhaps obvious that the result will be an ArrayType, but I think the documentation should be clear about the specific return type for beginners as well.
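
The distinction can be sketched with a pure-Python analogy. This is not Spark's implementation: `collect_list_like` and `collect_set_like` are hypothetical helpers invented for the illustration, and unlike Spark (where the order of collected results is non-deterministic) this sketch keeps first-seen order.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, Hashable]  # (group key, value)


def collect_list_like(rows: Iterable[Row]) -> Dict[Hashable, List[Hashable]]:
    """Hypothetical helper: gather values per key, keeping duplicates."""
    groups: Dict[Hashable, List[Hashable]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


def collect_set_like(rows: Iterable[Row]) -> Dict[Hashable, List[Hashable]]:
    """Hypothetical helper: gather values per key, dropping duplicates.

    Deduplication happens only while collecting; the result is a plain list.
    """
    groups: Dict[Hashable, List[Hashable]] = {}
    for key, value in rows:
        bucket = groups.setdefault(key, [])
        if value not in bucket:
            bucket.append(value)
    return groups


rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]
as_list = collect_list_like(rows)
as_set = collect_set_like(rows)

# Both results are ordinary lists, so a later append accepts duplicates again.
as_set["a"].append(1)
```

In this analogy the "set" behaviour exists only during collection. Afterwards both results have the same type, which mirrors the claim that both functions return an ordinary array.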

landlord-matt · 2023-11-17T14:10:50Z

This is how other functions have solved this:

@_try_remote_functions
def find_in_set(str: "ColumnOrName", str_array: "ColumnOrName") -> Column:
    """
    Returns the index (1-based) of the given string (`str`) in the comma-delimited
    list (`strArray`). Returns 0, if the string was not found or if the given string (`str`)
    contains a comma.

    .. versionadded:: 3.5.0
    """

def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
    """
    Splits str around matches of the given pattern.

    Returns
    -------
    :class:`~pyspark.sql.Column`
        array of separated strings.
    """

github-actions · 2024-02-26T00:18:40Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Clarify collect_list = array

29d137c

github-actions bot added SQL PYTHON labels Nov 13, 2023

landlord-matt changed the title ~~Documenation: Clarify collect_list -> ArrayType~~ [MINOR][DOCS]: Clarify collect_list -> ArrayType Nov 14, 2023

HyukjinKwon changed the title ~~[MINOR][DOCS]: Clarify collect_list -> ArrayType~~ [MINOR][DOCS] Clarify collect_list -> ArrayType Nov 15, 2023

Same thing for collect_set

2ac726b

landlord-matt changed the title ~~[MINOR][DOCS] Clarify collect_list -> ArrayType~~ [MINOR][DOCS] Clarify collect_list and collect_set -> ArrayType Nov 15, 2023

Write as full sentence instead of vague paranthesis

51d2bc2

Also update agg_array

6f51b9a

landlord-matt mentioned this pull request Nov 23, 2023

[MINOR][DOCS] Clarify sort behaviour for structs #43871

Closed

github-actions bot added the Stale label Feb 26, 2024

github-actions bot closed this Feb 27, 2024

landlord-matt wants to merge 4 commits into apache:master from landlord-matt:patch-1
