spike(rows-first): two-message state_change protocol behind env gate by paddymul · Pull Request #787 · buckaroo-data/buckaroo

paddymul · 2026-05-20T22:39:31Z

Draft / spike — not for merge. Validates the wire shape for "show rows first, compute stats after" without going async. Synchronous server, just message reordering + an IOLoop.call_later yield.

Sits on top of #785 (uses _defer_summary_sd + recompute_summary_sd hooks that this PR adds to the dataflow).

What it proves

A single buckaroo_state_change produces two initial_state frames on the wire.
An infinite_request fired by the client between the two frames is serviced and its parquet rows arrive before phase 2. The call_later(0.01) between phase 1 and phase 2 gives inbound socket reads a real time window to win the race.

Test: tests/unit/server/test_rows_first_spike.py (2 cases).

How to try it

Set the env flag and exercise normally:

BUCKAROO_ROWS_FIRST_SPIKE=1 uv run ...

Default (flag off) keeps today's single-frame rebroadcast. Existing test_load_expr tests pass unchanged.

Open questions

xorq path coverage. XorqServerDataflow may not route through the _defer_summary_sd short-circuit on the base DataFlow. First spike test had to drop a "stats actually differ between phases" assertion because it was failing on xorq. This is the most important item to resolve before un-gating — without that assertion the spike test would still pass if recompute_summary_sd() were a no-op, so we can't detect regression of the spike's whole point.
Inter-phase state_change race. Between phase 1 broadcast and the call_later firing, a second state_change can land. Phase 2 then computes stats against a processed_df that already reflects the second state, but the broadcast payload mixes phase-2 stats with what the user thinks is the first interaction. The spike doesn't try to address this; needs a serialise / cancel-previous strategy before un-gating.
Delay tuning. 10ms is a localhost-only number — loopback WS RTT is well under 1ms. For remote deployments (RTT 30-100ms+) 10ms is shorter than the round-trip and the spike's premise breaks. Real prod tuning must measure end-to-end and pick a value that reliably loses the race to a network-bound infinite_request.
Cache-hit fast path. When filt_sd_key for the new state is already in summary_stats_cache (search-box backspace, oscillation), there's no compute to defer — phase 2 would be near-instant. The spike doesn't short-circuit and emits two frames anyway, which is wasteful.
Phase-2 message type. Both phases currently send type: "initial_state". JS will re-apply full state twice. Worth deciding whether phase 2 should be a distinct type (e.g. stats_update) so the front-end can target a cheaper re-render of just the stats-driven UI, rather than re-running the full initial_state handler.
No before/after timing. PR currently asserts the protocol works but doesn't measure whether it produces a perceptible win on a realistic dataset. One A/B measurement (rows-rendered latency, with/without flag, on a search that triggers a real compute) would justify or kill the protocol.

Stack

feat(pinned-rows): ?key prefix marks an optional pinned row #777 ✅ merged (JS ?key pinned-rows)
feat(sd-cache): keyed summary-stats cache (raw / clean / filt scopes) #783 ✅ merged (keyed SD cache)
feat(scoped-sd): merge raw + filt scope SDs into prefixed merged_sd #785 ⏳ open (merge raw + filt scope SDs into prefixed merged_sd)
This ⏳ draft (rows-first two-message protocol; depends on feat(scoped-sd): merge raw + filt scope SDs into prefixed merged_sd #785)

🤖 Generated with Claude Code

Proves the wire shape for "show rows first, compute stats after" without going async. Synchronous server, just message reordering + an ``IOLoop.call_later`` yield between the two ``initial_state`` frames so any ``infinite_request`` from the client gets serviced in between. Gated on ``BUCKAROO_ROWS_FIRST_SPIKE=1``. Default off so existing ``test_load_expr`` tests (which expect a single rebroadcast frame per state change) keep working unchanged. Mechanism: 1. State-change handler sets ``dataflow._defer_summary_sd = True``, then extracts state. The trait cascade still runs (``cleaned`` → ``processed_result`` → ``populate_df_meta``), but ``_summary_sd`` short-circuits — the analysis-pipeline pass is skipped. 2. Phase 1 ``initial_state`` is sent immediately. 3. ``IOLoop.call_later(0.01, _send_stats_update)`` schedules the deferred compute. The 10ms delay gives any ``infinite_request`` the client fires in response to phase 1 a real time window to arrive and get serviced before the stats compute starts. 4. Phase 2: ``recompute_summary_sd()`` triggers the analysis run (which fires ``_populate_sd_cache`` + ``_merged_sd`` downstream), then a second ``initial_state`` is sent. Spike tests in ``tests/unit/server/test_rows_first_spike.py``: - ``test_state_change_emits_two_initial_state_messages``: pins the two-message wire shape. - ``test_infinite_request_between_phases_returns_rows``: pins the key win — ``infinite_request`` between the two phases is serviced and its parquet frame arrives before phase 2. Not for merge as-is — this is exploration on top of #785 to validate the protocol before doing a real (un-gated) implementation. Open questions: - xorq path: ``XorqServerDataflow`` may not route through this ``_defer_summary_sd`` short-circuit; the first test had to drop a "stats actually differ between phases" assertion that was failing on the xorq backend. - 10ms delay is arbitrary. Real prod tuning would measure end-to-end and pick a value that reliably loses the race to a network-bound ``infinite_request`` without feeling sluggish. - Cache hit fast path: when ``filt_sd_key`` is already in ``summary_stats_cache`` (search-box backspace etc.), there's no reason to two-message. Spike doesn't short-circuit that. Sits on top of PR #785. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Review-driven cleanup on the spike, no behaviour change: - Hoist ``from tornado.ioloop import IOLoop`` out of ``_handle_buckaroo_state_change`` to the module top (CLAUDE.md bans in-function imports). - Extract ``_apply_state_and_broadcast`` so phase 1 and phase 2 share one copy of session-update + component_config merge + per-client send. Phase 1 still owns the ``session.buckaroo_state = new_state`` assignment. - Rewrite the ``_ROWS_FIRST_SPIKE_PHASE2_DELAY_S`` comment so it doesn't claim 10ms beats "typical browser-to-server WS round-trip" (only true for localhost). Flag the remote-deployment caveat. - Drop the dead ``monkeypatch.setenv`` in the spike test fixture — the module-level constant is resolved at import, so only the ``setattr`` actually flips the gate. Spike tests still pass; the flag-off ``test_load_expr`` suite is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

paddymul and others added 3 commits May 20, 2026 18:38

chore(spike): drop unused os import in rows-first spike test

f561070

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This was referenced May 21, 2026

buckaroo_state_change requests are not superseded; rapid typing wedges the server #794

Open

Status bar shows no signal that the server / dataflow is still computing after a state_change #813

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spike(rows-first): two-message state_change protocol behind env gate#787

spike(rows-first): two-message state_change protocol behind env gate#787
paddymul wants to merge 3 commits into
feat/scoped-sd-merged-prefix-v2from
spike/state-change-rows-first

paddymul commented May 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

paddymul commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What it proves

How to try it

Open questions

Stack

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

paddymul commented May 20, 2026 •

edited

Loading