perf(is-seg): swap pydantic response for dataclass twins on workflow path (stacked on #22) #29

Open
aseembits93 wants to merge 1 commit into optimize-rfdetr-seg from perf/is-seg-workflow-dataclasses-stacked

Summary

Stacked on #22. Non-stacked version against main is #28. Open this PR once #22 lands; then its diff against main will be identical to #28's (modulo target-branch differences).

InferenceModelsInstanceSegmentationAdapter.postprocess built a full pydantic tree per frame: one Point per polygon vertex (V per polygon), InstanceSegmentationPrediction × N, then InstanceSegmentationInferenceResponse. The workflow block then called response.model_dump(by_alias=True, exclude_none=True) to get a plain dict for sv.Detections.from_inference. Neither pydantic's validation nor its serializer is needed on that path, because the block only consumes the dict.

This change adds slotted dataclass twins (PointDC, InferenceResponseImageDC, InstanceSegmentationPredictionDC, InstanceSegmentationInferenceResponseDC) plus _is_pred_dc_to_dict / _is_response_dc_to_dict helpers that emit the exact dict model_dump(by_alias=True, exclude_none=True) produces.

The adapter gates on kwargs.get("source") == "workflow-execution" and returns the dataclass response on that path. Every other caller — HTTP response_model at http_api.py:1640, isinstance-based cache dispatch at cache/serializers.py:71, draw_predictions visualization — keeps the pydantic path untouched.

The v3 workflow block detects the dataclass via isinstance and calls _is_response_dc_to_dict; it falls back to model_dump for any other response type.

Numbers

Microbench (4 dets × 6-vertex polygon, construct + dump):

| | µs/frame |
| --- | --- |
| pydantic | ~81 |
| dataclass | ~34 |

Δ ≈ 2.4× faster.

End-to-end on top of #22's full stack (rfdetr-seg-nano TRT + Triton preproc + Triton fullpost + CUDA graphs, vehicles_312px.mp4, 538 frames, 4 runs each):

| FPS | run 1 | run 2 | run 3 | run 4 | mean |
| --- | --- | --- | --- | --- | --- |
| baseline (optimize-rfdetr-seg HEAD) | 152.55 | 152.62 | 153.23 | 153.33 | 152.93 |
| this change | 156.33 | 156.88 | 157.10 | 155.86 | 156.54 |

Δ ≈ +3.6 FPS (+2.4%).
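The means and the delta can be rechecked from the per-run figures with a few lines of Python (the variable names are only for this example):

```python
from statistics import mean

# Per-run FPS figures copied from the end-to-end results above.
baseline = [152.55, 152.62, 153.23, 153.33]
candidate = [156.33, 156.88, 157.10, 155.86]

delta = mean(candidate) - mean(baseline)   # absolute FPS gain
pct = 100 * delta / mean(baseline)         # relative gain in percent

print(f"{mean(baseline):.2f} -> {mean(candidate):.2f} FPS, {delta:+.2f} FPS ({pct:+.1f}%)")
# prints: 152.93 -> 156.54 FPS, +3.61 FPS (+2.4%)
```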

Correctness

Bit-exact parity: _is_response_dc_to_dict(dc) == pyd.model_dump(by_alias=True, exclude_none=True) on mixed inputs including varying polygon lengths, empty polygons, and post-construction mutation of .time / .inference_id (assigned by Model.infer_from_request at inference/core/models/base.py:154-157). @dataclass(slots=True) permits the reassignment because those fields are declared.

Test plan

  • pytest tests/workflows/unit_tests/core_steps/models/roboflow/instance_segmentation/test_v3.py — 23/23 pass
  • Parity test (_is_response_dc_to_dict(dc) == pyd.model_dump(by_alias=True, exclude_none=True))
  • Microbench
  • End-to-end FPS benchmark on optimize-rfdetr-seg's full Triton stack
  • HTTP regression: hit /infer/instance_segmentation locally and confirm JSON is byte-identical to pre-change (expected — adapter falls through to pydantic when source != "workflow-execution")

🤖 Generated with Claude Code

…path

Stacks on top of PR#22 (optimize-rfdetr-seg: Triton fusion + CUDA graphs
+ scratch caching). See PR#28 for the same change against main.

`InferenceModelsInstanceSegmentationAdapter.postprocess` built a full
pydantic tree per frame — `Point × V` per polygon vertex,
`InstanceSegmentationPrediction × N`, then
`InstanceSegmentationInferenceResponse`. The workflow block then called
`response.model_dump(by_alias=True, exclude_none=True)` to get a plain
dict for `sv.Detections.from_inference`. Neither validation nor the
serializer is needed on that path — the block only consumes the dict.

This change adds slotted dataclass twins (`PointDC`,
`InferenceResponseImageDC`, `InstanceSegmentationPredictionDC`,
`InstanceSegmentationInferenceResponseDC`) plus `_is_pred_dc_to_dict`
and `_is_response_dc_to_dict` helpers that emit the exact dict
`model_dump(by_alias=True, exclude_none=True)` produces (same keys,
same `class` alias, same None-omission).

The adapter gates on `kwargs.get("source") == "workflow-execution"`
and returns the dataclass response on that path. Every other caller —
HTTP `response_model` at `http_api.py:1640`, `isinstance`-based cache
dispatch at `cache/serializers.py:71`, `draw_predictions`
visualization — keeps the pydantic path untouched.

The v3 workflow block detects the dataclass via `isinstance` and calls
`_is_response_dc_to_dict`; falls back to `model_dump` for any other
response type.

Microbench (4 dets × 6-vertex polygon, construct + dump):
  * pydantic:  ~81 us/frame
  * dataclass: ~34 us/frame  (2.43x faster)

End-to-end (rfdetr-seg-nano TRT + Triton preproc + Triton fullpost +
CUDA graphs, vehicles_312px.mp4, 538 frames, 4 runs each, on top of
optimize-rfdetr-seg HEAD c1406a8):
  * baseline (pydantic): 152.93 FPS mean
  * dataclass:           156.54 FPS mean  (+3.6 FPS, +2.4%)

Bit-exact parity verified: `_is_response_dc_to_dict(dc)` byte-equals
`pyd.model_dump(by_alias=True, exclude_none=True)` for mixed inputs
(varying polygon lengths, empty list, mutation of .time/.inference_id
post-construct).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>