
[SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction.#55039

Closed
jiateoh wants to merge 2 commits into
apache:master from
jiateoh:tws_python_serialization_improvements

@jiateoh
Contributor

@jiateoh jiateoh commented Mar 26, 2026

What changes were proposed in this pull request?

Eliminate the per-call Row(**dict(zip(...))) construction in _serialize_to_bytes and pass normalized tuples directly to schema.toInternal, which handles them by index.

To better explain the removal of Row usage:

  • The positional ordering is retained because Row was constructed purely positionally:
    • Row.__new__ stores values in insertion order (link). This is also noted in the change notes since 3.0.0.
    • dict(zip(...)) preserves insertion order in Python 3.7+.
    • The inputs to zip are field_names and the converted input data. field_names is assumed to be in the same order as the converted data, which is derived from the original input tuple in that same order.
  • Row is a tuple subclass, so it always hit Schema.toInternal's tuple/list positional branch. Replacing it with the input tuple or a converted list executes the same Schema.toInternal branch as before.

As a result, the extra list, zip, dict, and Row construction is no longer necessary. The output of the entire Row(**dict(zip(...))) sequence was equivalent to its input: a positionally ordered tuple/list.
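The equivalence argued above can be illustrated without Spark. The following is a minimal sketch, not the PySpark code: `PositionalRow` is a hypothetical stand-in for Row (a tuple subclass built from keyword arguments in insertion order), `to_internal` is a hypothetical stand-in for the tuple/list positional branch of schema.toInternal, and the field names and converters are made up for illustration.

```python
# Hypothetical stand-ins for illustration only; these are not PySpark classes.


class PositionalRow(tuple):
    """Tuple subclass that stores keyword values in insertion order."""

    def __new__(cls, **kwargs):
        row = super().__new__(cls, kwargs.values())
        row.fields = list(kwargs.keys())
        return row


def to_internal(obj, converters):
    """Apply one converter per position, as a positional tuple/list branch would."""
    if isinstance(obj, (tuple, list)):
        return tuple(conv(value) for conv, value in zip(converters, obj))
    raise ValueError(f"unexpected value: {obj!r}")


field_names = ["id", "name", "score"]   # assumed schema field names
converters = [int, str, float]          # assumed per-field conversions
converted = (1, "alice", 9.5)           # normalized input tuple

# Old path: names -> zip -> dict -> Row, then positional conversion.
old_input = PositionalRow(**dict(zip(field_names, converted)))
# New path: hand the normalized tuple over directly.
new_input = converted

old_result = to_internal(old_input, converters)
new_result = to_internal(new_input, converters)
```

Both paths enter the same `isinstance(obj, (tuple, list))` branch and see the values in the same positions, so the intermediate objects in the old path add work without changing the result.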

AI-assisted + human reviewed/updated

Why are the changes needed?

This is a code cleanup and performance optimization. The original code performs unnecessary operations for every row, including rebuilding closures, extracting field names, building intermediate lists and dicts, and constructing Row objects. Each of these adds minor overhead while having no effect on the value that is ultimately serialized.
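One rough way to get a feel for the per-row overhead is a micro-benchmark on plain Python objects. This is only a sketch under stated assumptions: `Field` is a hypothetical stand-in for a schema field, `old_path` and `new_path` are illustrative names that mimic the shape of the work rather than the actual PySpark functions, and timings will vary by machine and say nothing about end-to-end streaming performance.

```python
import timeit
from collections import namedtuple

# Hypothetical stand-in for a schema field; not a PySpark type.
Field = namedtuple("Field", ["name"])
fields = [Field("id"), Field("name"), Field("score")]
converted = (1, "alice", 9.5)


def old_path(values):
    # Per call: extract names, zip, build a dict, then build a tuple from it.
    field_names = [f.name for f in fields]
    return tuple(dict(zip(field_names, values)).values())


def new_path(values):
    # Per call: pass the already-ordered tuple through unchanged.
    return values


old_seconds = timeit.timeit(lambda: old_path(converted), number=100_000)
new_seconds = timeit.timeit(lambda: new_path(converted), number=100_000)
print(f"old path: {old_seconds:.4f}s, new path: {new_seconds:.4f}s")
```

Both functions return the same positional values, so any difference in the printed timings comes from the intermediate list, dict, and tuple construction alone.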

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

return serializer

field_names = [f.name for f in schema.fields]
row_value = Row(**dict(zip(field_names, converted)))
Contributor Author

@jiateoh jiateoh Mar 26, 2026


The PR description explains why this is safe. TL;DR: both before and after this change, the input to Schema.toInternal is a tuple/list that is positionally ordered based on the input data.

@jiateoh jiateoh changed the title from "[WIP] Optimize python TWS stateful processor serialization calls" to "[SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls" on Mar 27, 2026
@jiateoh jiateoh changed the title from "[SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls" to "[WIP][SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls" on Mar 27, 2026
@jiateoh jiateoh force-pushed the tws_python_serialization_improvements branch from 9d1665d to 243a49b on March 27, 2026 00:30
Comment thread python/pyspark/sql/streaming/stateful_processor_api_client.py Outdated
@jiateoh jiateoh force-pushed the tws_python_serialization_improvements branch from 6a3582f to 615af5d on March 30, 2026 21:54
…list/dict/row construction.

- Eliminate per-call Row(**dict(zip(...))) construction in _serialize_to_bytes;
  pass normalized tuples directly to schema.toInternal which handles them by index

AI-assisted + human reviewed/edited

Generated-by: Claude Code (Claude Opus 4.6)
@jiateoh jiateoh force-pushed the tws_python_serialization_improvements branch from 615af5d to 58b97b1 on March 30, 2026 22:17
@jiateoh jiateoh changed the title from "[WIP][SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls" to "[WIP][SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction." on Mar 30, 2026
@jiateoh jiateoh changed the title from "[WIP][SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction." to "[SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction." on Mar 30, 2026
Contributor

@HeartSaVioR HeartSaVioR left a comment


+1, but let's double-check with one of the PySpark folks that the PR relies on actual invariants (so the behavior won't change in the future).

Maybe @zhengruifeng, could you please help take a look? Thanks in advance!

Contributor

@zhengruifeng zhengruifeng left a comment


LGTM, it seems the intermediate Row is not necessary.
Also, Row(**dict(zip(field_names, converted))) doesn't support duplicate field names, whereas after this change they should be supported.
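The duplicate-name limitation comes from the intermediate dict and can be seen in plain Python. In this small sketch the field names and values are assumed for illustration, and it does not show how the real schema.toInternal reacts to the shortened row.

```python
field_names = ["key", "key"]        # assumed schema with a duplicated field name
converted = ("first", "second")     # one value per field, in schema order

# A dict keeps one entry per distinct name, so the later value
# overwrites the earlier one and a position is lost.
as_dict = dict(zip(field_names, converted))
values_via_dict = tuple(as_dict.values())

# Passing the tuple directly keeps every position intact.
values_direct = converted
```

Here `values_via_dict` holds a single value while `values_direct` still holds two, which is why skipping the dict step keeps duplicated names from collapsing.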

@HeartSaVioR
Contributor

Thanks! Merging to master. (Sorry for missing this.)
