[SPARK-47002][Python] Return better error message if UDTF 'analyze' method 'orderBy' field accidentally returns a list of strings #45062

dtenedor · 2024-02-07T22:01:35Z

What changes were proposed in this pull request?

This PR updates the Python UDTF API to detect when the analyze method returns an AnalyzeResult whose orderBy field is mistakenly set to a list of strings rather than OrderingColumn instances, and to return a clearer error message in that case.

For example, this UDTF accidentally sets the orderBy field in this way:

from pyspark.sql.functions import AnalyzeResult, OrderingColumn, PartitioningColumn
from pyspark.sql.types import IntegerType, Row, StructType


class Udtf:
    def __init__(self):
        self._partition_col = None
        self._count = 0
        self._sum = 0
        self._last = None

    @staticmethod
    def analyze(row: Row):
        return AnalyzeResult(
            schema=StructType()
                .add("user_id", IntegerType())
                .add("count", IntegerType())
                .add("total", IntegerType())
                .add("last", IntegerType()),
            partitionBy=[
                PartitioningColumn("user_id")
            ],
            orderBy=[
                # Mistake: a plain string instead of OrderingColumn("timestamp").
                "timestamp"
            ],
        )

    def eval(self, row: Row):
        self._partition_col = row["partition_col"]
        self._count += 1
        self._last = row["input"]
        self._sum += row["input"]

    def terminate(self):
        yield self._partition_col, self._count, self._sum, self._last
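
For illustration, the kind of check described above can be sketched in plain Python. This is not the code added by this PR: the `OrderingColumn` class below is a stand-in so the snippet runs without Spark, `check_order_by` is a hypothetical helper name, and the exception type and message wording are assumptions rather than Spark's actual error.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class OrderingColumn:
    """Stand-in for pyspark's OrderingColumn, so the sketch runs without Spark."""

    name: str
    ascending: bool = True
    overrideNullsFirst: Optional[bool] = None


def check_order_by(order_by: Sequence[Any]) -> None:
    """Hypothetical helper: reject 'orderBy' entries that are not OrderingColumn."""
    for index, item in enumerate(order_by):
        if not isinstance(item, OrderingColumn):
            raise TypeError(
                f"'orderBy' entry {index} is {item!r} of type {type(item).__name__}; "
                "expected an OrderingColumn instance, "
                "for example OrderingColumn('timestamp')"
            )


# The corrected form of the field from the example above passes the check.
check_order_by([OrderingColumn("timestamp")])

# The accidental list of strings is rejected with a descriptive message.
try:
    check_order_by(["timestamp"])
except TypeError as error:
    message = str(error)
    print(message)
```

In the UDTF above, the corresponding fix would be to write `orderBy=[OrderingColumn("timestamp")]` in the AnalyzeResult.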

Why are the changes needed?

This improves error messages and helps keep users from getting confused.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds test coverage.

Was this patch authored or co-authored using generative AI tooling?

No.

dtenedor · 2024-02-08T00:00:15Z

cc @ueshin here is the follow-up PR to add more checks for analyze result ordering fields.

dtenedor · 2024-02-08T18:41:20Z

The CI looks to be passing; the one failure is unrelated.

ueshin · 2024-02-08T19:16:37Z

Thanks! merging to master.

dtenedor added 3 commits February 7, 2024 13:09

commit

aa2981b

commit

86ab2a2

merge from master

a73b2e6

github-actions bot added SQL PYTHON labels Feb 7, 2024

commit

afb7a44

reformat reformat

dtenedor force-pushed the check-udtf-sort-columns branch from a00a84c to afb7a44 Compare February 7, 2024 23:29

fix test

20af3c1

ueshin approved these changes Feb 8, 2024

ueshin closed this in 6569f15 Feb 8, 2024
