[SPARK-56248][PYTHON][SS] Optimize Python stateful processor serialization to skip unnecessary list/dict/row construction #55039
    return serializer

    field_names = [f.name for f in schema.fields]
    row_value = Row(**dict(zip(field_names, converted)))
The PR description explains why this is safe. TL;DR: both before and after this change, the input to Schema.toInternal is a tuple/list that is positionally ordered to match the input data.
Force-pushed from 9d1665d to 243a49b
Force-pushed from 6a3582f to 615af5d
…list/dict/row construction.
- Eliminate per-call Row(**dict(zip(...))) construction in _serialize_to_bytes; pass normalized tuples directly to schema.toInternal, which handles them by index
AI-assisted + human reviewed/edited
Generated-by: Claude Code (Claude Opus 4.6)
Force-pushed from 615af5d to 58b97b1
HeartSaVioR
left a comment
+1, but let's double-check with one of the PySpark folks that the PR relies on real invariants (so the behavior won't change in the future).
@zhengruifeng, would you mind taking a look? Thanks in advance!
zhengruifeng
left a comment
LGTM, it seems the intermediate Row is not necessary.
Also, Row(**dict(zip(field_names, converted))) doesn't support duplicate field names, whereas after this change they should be supported.
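The duplicate-name point can be seen without PySpark at all. The following is a minimal sketch in plain Python; the field names and values are hypothetical and are not taken from the PR or its tests. It only shows why routing values through dict(zip(...)) cannot carry two fields with the same name, while a positional tuple can.

```python
# Hypothetical schema with a duplicated field name (illustration only).
field_names = ["id", "value", "value"]
converted = (1, "a", "b")

# Old shape of the data: names are used as dict keys, and dict keys are unique,
# so the second "value" silently overwrites the first.
by_name = dict(zip(field_names, converted))
print(by_name)       # {'id': 1, 'value': 'b'}
print(len(by_name))  # 2

# New shape of the data: purely positional, so all three values survive.
positional = tuple(converted)
print(positional)    # (1, 'a', 'b')
```

Because the collapse happens inside the dict, anything built from that dict afterwards sees only two fields.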
Thanks! Merging to master. (Sorry for missing this.)
What changes were proposed in this pull request?
Eliminate per-call Row(**dict(zip(...))) construction in _serialize_to_bytes; pass normalized tuples directly to schema.toInternal, which handles them by index.
To better explain the removal of Row usage:
- dict(zip(...)) preserves insertion order in Python 3.7+ (dicts preserve insertion order).
- The keys passed to zip are field_names, assumed to be in the same order as the converted input data, which is derived from the original input tuple in the same order.
- Row is a tuple subclass, so it always hit Schema.toInternal's tuple/list positional branch. Replacing it with the input tuple or a converted list executes the same Schema.toInternal branch as before.

The result is that the extra list, zip, dict, and Row construction is no longer necessary: the end result remains equivalent to the input to the entire Row(**dict(zip(...))) sequence, namely a positionally ordered tuple/list.

AI-assisted + human reviewed/updated
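The argument above can be sketched in plain Python. Everything here is a stand-in: FakeRow and to_internal are hypothetical names written for illustration, not PySpark's Row or Schema.toInternal, and they only model the two properties the description relies on (a tuple subclass built from keyword arguments, and a converter that treats any tuple/list positionally).

```python
class FakeRow(tuple):
    """Hypothetical stand-in for Row: a tuple subclass built from keyword arguments."""

    def __new__(cls, **kwargs):
        # kwargs preserves insertion order (Python 3.7+), so values keep their order.
        row = super().__new__(cls, kwargs.values())
        row.__fields__ = list(kwargs.keys())
        return row


def to_internal(obj):
    """Toy converter modelling only the tuple/list positional branch."""
    if isinstance(obj, (tuple, list)):
        return tuple(obj)  # names are never consulted on this branch
    raise TypeError(f"unsupported input type: {type(obj).__name__}")


field_names = ["key", "count"]  # hypothetical field names
converted = ["user-1", 3]       # hypothetical normalized values, in schema order

old_input = FakeRow(**dict(zip(field_names, converted)))  # shape before the change
new_input = converted                                     # shape after the change

# Both inputs are tuples/lists, so both take the positional branch and agree.
print(isinstance(old_input, tuple))                       # True
print(to_internal(old_input) == to_internal(new_input))   # True
```

Under these assumptions, the names attached to the row never influence the converted result, which is why dropping the intermediate object leaves the output unchanged.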
Why are the changes needed?
This is a code cleanup/performance optimization. The original code performs unnecessary operations for every row, including rebuilding closures, extracting field names, building intermediate lists and dicts, and constructing Row objects (which sort by field unnecessarily). These all add minor overhead while having no effect on the result.
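One rough way to see this kind of per-row overhead is a micro-benchmark in plain Python. This is a sketch under stated assumptions: the field names, values, and both helper functions are hypothetical, it measures only the zip/dict round trip rather than the real serializer, and the timings will vary by machine, so no numbers are claimed here.

```python
import timeit

field_names = ["key", "count", "ts"]    # hypothetical field names
converted = ("user-1", 3, 1700000000)   # hypothetical normalized row


def with_intermediates():
    # Builds a zip iterator and a dict on every call, then goes back to a tuple.
    return tuple(dict(zip(field_names, converted)).values())


def direct():
    # Passes the positional values through unchanged.
    return tuple(converted)


# Both paths yield the same positional data.
assert with_intermediates() == direct()

n = 20_000
t_old = timeit.timeit(with_intermediates, number=n)
t_new = timeit.timeit(direct, number=n)
print(f"with intermediates: {t_old:.4f}s, direct: {t_new:.4f}s for {n} calls")
```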
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.6)