[SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction. by jiateoh · Pull Request #55039 · apache/spark

jiateoh · 2026-03-26T20:56:33Z

What changes were proposed in this pull request?

Eliminate the per-call Row(**dict(zip(...))) construction in _serialize_to_bytes and pass normalized tuples directly to schema.toInternal, which handles them by index.

To better explain the removal of Row usage:

Positional ordering is retained because the Row was constructed purely positionally:
- Row.__new__ stores values in insertion order. This has also been noted in the change notes since 3.0.0.
- dict(zip(...)) preserves insertion order in Python 3.7+.
- The inputs to zip are field_names and the converted input data. field_names is assumed to be in the same order as the converted data, which is derived from the original input tuple in the same order.

Row is a tuple subclass, so it always hit Schema.toInternal's tuple/list positional branch. Replacing it with the input tuple or a converted list executes the same Schema.toInternal branch as before.

As a result, the extra list, zip, dict, and Row construction is no longer necessary. The end result is equivalent to what the entire Row(**dict(zip(...))) sequence received as input: a positionally ordered tuple/list.
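The equivalence argued above can be sketched without PySpark. In the snippet below, MiniRow and to_internal_positional are hypothetical stand-ins (not the real pyspark.sql.Row or Schema.toInternal); they only model the two properties the description relies on: a tuple subclass that keeps keyword insertion order, and a consumer that reads tuple/list input by index.

```python
from typing import Any, Sequence, Tuple


class MiniRow(tuple):
    """Simplified, hypothetical stand-in for a Row: a tuple subclass built from kwargs."""

    def __new__(cls, **kwargs: Any) -> "MiniRow":
        # Values are stored in keyword insertion order, with no sorting.
        row = tuple.__new__(cls, kwargs.values())
        row.__fields__ = list(kwargs.keys())
        return row


def to_internal_positional(n_fields: int, obj: Sequence[Any]) -> Tuple[Any, ...]:
    """Hypothetical model of a tuple/list branch: values are read by index only."""
    if not isinstance(obj, (tuple, list)):
        raise TypeError("expected a tuple or list")
    return tuple(obj[i] for i in range(n_fields))


field_names = ["id", "name", "score"]
converted = [7, "alice", 0.5]

# Old path: list -> zip -> dict -> Row, then positional access.
old_input = MiniRow(**dict(zip(field_names, converted)))
# New path: hand the positional values over directly.
new_input = tuple(converted)

old_result = to_internal_positional(len(field_names), old_input)
new_result = to_internal_positional(len(field_names), new_input)
print(old_result == new_result)
```

Because the stand-in Row is itself a tuple, both inputs take the same branch and produce the same positional result, which is the invariant the change depends on.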

AI-assisted + human reviewed/updated

Why are the changes needed?

This is a code cleanup and performance optimization. The original code performs unnecessary operations for every row, including rebuilding closures, extracting field names, building intermediate lists and dicts, and constructing Row objects (which sort by field unnecessarily). Each of these adds minor overhead without affecting the result.
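As a rough illustration of moving schema-dependent work out of the per-row path, the sketch below builds a serializer once and reuses it for every row. make_serializer and its n_fields parameter are hypothetical names chosen for this example; this is not the code in stateful_processor_api_client.py.

```python
from typing import Any, Callable, Sequence, Tuple


def make_serializer(n_fields: int) -> Callable[[Sequence[Any]], Tuple[Any, ...]]:
    """Hypothetical factory: do schema-dependent setup once, not per row."""

    def serializer(values: Sequence[Any]) -> Tuple[Any, ...]:
        if len(values) != n_fields:
            raise ValueError(f"expected {n_fields} values, got {len(values)}")
        # No field-name list, dict, or Row is built here; values stay positional.
        return tuple(values)

    return serializer


serialize = make_serializer(n_fields=2)  # built once
rows = [(1, "a"), (2, "b"), (3, "c")]
out = [serialize(r) for r in rows]       # reused for every row
print(out)
```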

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

jiateoh · 2026-03-26T21:09:45Z

+            return serializer

-        field_names = [f.name for f in schema.fields]
-        row_value = Row(**dict(zip(field_names, converted)))


The PR description explains why this is safe. TL;DR: the input to Schema.toInternal, both before and after this change, is a tuple/list that is positionally ordered based on the input data.


HeartSaVioR

+1, but let's double-check with one of the PySpark folks that the PR relies on invariants (so the behavior won't change in the future).

Maybe @zhengruifeng, could you please take a look? Thanks in advance!

zhengruifeng

LGTM, it seems the intermediate Row is not necessary.
Also, Row(**dict(zip(field_names, converted))) doesn't support duplicated field names, whereas after this change they should be supported.
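The duplicate-name point can be seen with plain Python, independent of PySpark: dict keys are unique, so a repeated field name overwrites the earlier value before any Row is built, while a tuple keeps every position. The field names and values below are made up for illustration, and how schema.toInternal would then treat the shortened Row is not shown here.

```python
field_names = ["key", "key"]      # duplicated field name (illustrative)
converted = ["first", "second"]

as_dict = dict(zip(field_names, converted))
print(as_dict)        # {'key': 'second'}
print(len(as_dict))   # 1

as_tuple = tuple(converted)
print(as_tuple)       # ('first', 'second')
```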

HeartSaVioR · 2026-04-06T20:10:57Z

Thanks! Merging to master. (Sorry for missing this.)


jiateoh changed the title ~~[WIP] Optimize python TWS stateful processor serialization calls~~ [SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls Mar 27, 2026

jiateoh changed the title ~~[SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls~~ [WIP][SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls Mar 27, 2026

jiateoh force-pushed the tws_python_serialization_improvements branch from 9d1665d to 243a49b Compare March 27, 2026 00:30


Comment thread python/pyspark/sql/streaming/stateful_processor_api_client.py Outdated

jiateoh force-pushed the tws_python_serialization_improvements branch from 6a3582f to 615af5d Compare March 30, 2026 21:54

jiateoh force-pushed the tws_python_serialization_improvements branch from 615af5d to 58b97b1 Compare March 30, 2026 22:17

jiateoh changed the title ~~[WIP][SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls~~ [WIP][SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction. Mar 30, 2026

jiateoh changed the title ~~[WIP][SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction.~~ [SPARK-56248][PYTHON][SS] Optimize python stateful processor serialization to skip unnecessary list/dict/row construction. Mar 30, 2026

Commit 8a3bb7d: Change Schema.toInternal serializer input to consistent tuple type for linter

HeartSaVioR approved these changes Apr 2, 2026


zhengruifeng approved these changes Apr 2, 2026


HeartSaVioR closed this in ae7f6e3 Apr 6, 2026

jiateoh wants to merge 2 commits into apache:master from jiateoh:tws_python_serialization_improvements
