Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47002][Python] Return better error message if UDTF 'analyze' method 'orderBy' field accidentally returns a list of strings #45062

Closed
wants to merge 5 commits into from

Conversation

dtenedor
Copy link
Contributor

@dtenedor dtenedor commented Feb 7, 2024

What changes were proposed in this pull request?

This PR updates the Python UDTF API to check and return a better error message if the analyze method returns an AnalyzeResult object with an orderBy field erroneously set to a list of strings, rather than OrderingColumn instances.

For example, this UDTF accidentally sets the orderBy field in this way:

from pyspark.sql.functions import AnalyzeResult, OrderingColumn, PartitioningColumn
from pyspark.sql.types import IntegerType, Row, StructType
class Udtf:
    def __init__(self):
        self._partition_col = None
        self._count = 0
        self._sum = 0
        self._last = None

    @staticmethod
    def analyze(row: Row):
        return AnalyzeResult(
            schema=StructType()
                .add("user_id", IntegerType())
                .add("count", IntegerType())
                .add("total", IntegerType())
                .add("last", IntegerType()),
            partitionBy=[
                PartitioningColumn("user_id")
            ],
            orderBy=[
                "timestamp"
            ],
            )

    def eval(self, row: Row):
        self._partition_col = row["partition_col"]
        self._count += 1
        self._last = row["input"]
        self._sum += row["input"]

    def terminate(self):
        yield self._partition_col, self._count, self._sum, self._last

Why are the changes needed?

This improves error messages and helps keep users from getting confused.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds test coverage.

Was this patch authored or co-authored using generative AI tooling?

No.

reformat

reformat
@dtenedor
Copy link
Contributor Author

dtenedor commented Feb 8, 2024

cc @ueshin here is the follow-up PR to add more checks for analyze result ordering fields.

@dtenedor
Copy link
Contributor Author

dtenedor commented Feb 8, 2024

The CI looks passing, the one failure is unrelated:

image

@ueshin
Copy link
Member

ueshin commented Feb 8, 2024

Thanks! merging to master.

@ueshin ueshin closed this in 6569f15 Feb 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants