[SPARK-55723][PYTHON] Generalize enforce_schema error to PySparkTypeError#54736
Yicong-Huang wants to merge 1 commit into apache:master
Conversation
I'm okay with this in general, but what's our policy on removing an error class? Can we just do it? @HyukjinKwon and @zhengruifeng
```
raise PySparkTypeError(
    f"Result type of column '{field.name}' does not "
```
We don't want to use error class here?
Nothing against error classes; it's just that the previous `RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF` error class was too UDTF-specific.
This falls squarely into the type-error range, so I feel creating a new error class is not necessary. Any strong objection to using the general `PySparkTypeError`?
We don't have a strict policy on removing error classes, and normally it won't be treated as a behavior change (it should be a rare case that a job relies on the error class). I think it is generally fine to remove duplicated ones and reuse existing ones.
Merged to master.
### What changes were proposed in this pull request?
Replace the `PySparkRuntimeError` raised with the `RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF` error class in `enforce_schema` and `ArrowStreamArrowUDTFSerializer` with a general `PySparkTypeError` that reports the column name, expected type, and actual type without being specific to any UDF type.
### Why are the changes needed?
The `RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF` error class was UDTF-specific, but `enforce_schema` is a general utility used across UDF types. The error message ("Column names ... do not match specified schema") was also misleading: the actual failure is a type cast error, not a column name mismatch.
### Does this PR introduce _any_ user-facing change?
Yes. The error type changes from `PySparkRuntimeError` to `PySparkTypeError`, and the message now accurately describes the type mismatch:
**Before:**
```
PySparkRuntimeError: [RESULT_COLUMNS_MISMATCH_FOR_ARROW_UDTF] Column names of the returned pyarrow.Table or pyarrow.RecordBatch do not match specified schema. Expected: int32 Actual: string
```
**After:**
```
PySparkTypeError: Result type of column 'id' does not match the expected type. Expected: int32, got: string.
```
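The shape of the new check can be sketched as follows. This is a hypothetical, simplified stand-in (the real `enforce_schema` lives in PySpark and compares pyarrow field types against the declared schema); the class and function here exist only to illustrate the message format above.

```python
class PySparkTypeError(TypeError):
    """Stand-in for pyspark.errors.PySparkTypeError, for illustration only."""


def enforce_schema(column_names, actual_types, expected_types):
    # Compare each result column's actual type against the expected schema
    # type, and raise a type error that names the offending column.
    for name, actual, expected in zip(column_names, actual_types, expected_types):
        if actual != expected:
            raise PySparkTypeError(
                f"Result type of column '{name}' does not match the expected "
                f"type. Expected: {expected}, got: {actual}."
            )


# Example: a result column 'id' comes back as string instead of int32.
try:
    enforce_schema(["id"], ["string"], ["int32"])
except PySparkTypeError as e:
    print(e)
```

Because the message is parameterized only on column name and types, the same code path serves Arrow UDTFs and any other UDF variant that routes through `enforce_schema`.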
### How was this patch tested?
Updated existing test in `test_arrow_udtf.py`.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#54736 from Yicong-Huang/SPARK-55723/fix/enforce-schema-error.
Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>