[SPARK-55723][PYTHON] Generalize enforce_schema error to PySparkTypeError #54736

Closed
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-55723/fix/enforce-schema-error

Conversation

Contributor

@Yicong-Huang commented Mar 10, 2026

What changes were proposed in this pull request?

Replace the PySparkRuntimeError raised with the RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF error class in enforce_schema and ArrowStreamArrowUDTFSerializer with a general PySparkTypeError that reports the column name, expected type, and actual type without being specific to any UDF type.

Why are the changes needed?

The RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF error class was UDTF-specific, but enforce_schema is a general utility used across UDF types. The error message ("Column names ... do not match specified schema") was also misleading -- the actual failure is a type cast error, not a column name mismatch.

Does this PR introduce any user-facing change?

Yes. The error type changes from PySparkRuntimeError to PySparkTypeError, and the message now accurately describes the type mismatch:

Before:

PySparkRuntimeError: [RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF] Column names of the returned pyarrow.Table or pyarrow.RecordBatch do not match specified schema. Expected: int32 Actual: string

After:

PySparkTypeError: Result type of column 'id' does not match the expected type. Expected: int32, got: string.

How was this patch tested?

Updated existing test in test_arrow_udtf.py.
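The updated assertion can be approximated by a message-format check like the standalone sketch below. The real test in test_arrow_udtf.py runs an Arrow UDTF end to end and asserts on the raised PySparkTypeError; the message string here is taken from the example earlier in this description:

```python
import re

# Error message format introduced by this PR (example from the description).
message = "Result type of column 'id' does not match the expected type. Expected: int32, got: string."

# A test can assert that the message names the column and both types.
pattern = (
    r"Result type of column '(?P<col>\w+)' does not match the expected type\. "
    r"Expected: (?P<expected>\w+), got: (?P<actual>\w+)\."
)

match = re.fullmatch(pattern, message)
assert match is not None
print(match.group("col"), match.group("expected"), match.group("actual"))  # id int32 string
```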

Was this patch authored or co-authored using generative AI tooling?

No

@Yicong-Huang
Contributor Author

cc @gaogaotiantian

@gaogaotiantian
Contributor

I'm okay with this in general, but what's our policy to remove an error class? Can we just do it? @HyukjinKwon and @zhengruifeng

Comment on lines +326 to +327
raise PySparkTypeError(
f"Result type of column '{field.name}' does not "
Contributor

We don't want to use error class here?

Contributor Author

Nothing against error classes; it's just that the previous RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF error class was too UDTF-specific.

And this falls squarely into type-error territory, so I feel creating a new error class is not necessary. Any strong objection to using the general PySparkTypeError?

@zhengruifeng
Contributor

I'm okay with this in general, but what's our policy to remove an error class? Can we just do it? @HyukjinKwon and @zhengruifeng

We don't have a strict policy on removing error classes, and I think normally it won't be treated as a behavior change (it should be rare for a job to rely on the error class).

I think it is in general fine to remove duplicated ones and reuse existing ones.

@HyukjinKwon
Member

Merged to master.

terana pushed a commit to terana/spark that referenced this pull request Mar 23, 2026
…rror


Closes apache#54736 from Yicong-Huang/SPARK-55723/fix/enforce-schema-error.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
