Skip to content

spike(rows-first): two-message state_change protocol behind env gate#787

Draft
paddymul wants to merge 3 commits into
feat/scoped-sd-merged-prefix-v2from
spike/state-change-rows-first
Draft

spike(rows-first): two-message state_change protocol behind env gate#787
paddymul wants to merge 3 commits into
feat/scoped-sd-merged-prefix-v2from
spike/state-change-rows-first

Conversation

@paddymul
Copy link
Copy Markdown
Collaborator

@paddymul paddymul commented May 20, 2026

Draft / spike — not for merge. Validates the wire shape for "show rows first, compute stats after" without going async. Synchronous server, just message reordering + an IOLoop.call_later yield.

Sits on top of #785 (uses _defer_summary_sd + recompute_summary_sd hooks that this PR adds to the dataflow).

What it proves

  • A single buckaroo_state_change produces two initial_state frames on the wire.
  • An infinite_request fired by the client between the two frames is serviced and its parquet rows arrive before phase 2. The call_later(0.01) between phase 1 and phase 2 gives inbound socket reads a real time window to win the race.

Test: tests/unit/server/test_rows_first_spike.py (2 cases).

How to try it

Set the env flag and exercise normally:

BUCKAROO_ROWS_FIRST_SPIKE=1 uv run ...

Default (flag off) keeps today's single-frame rebroadcast. Existing test_load_expr tests pass unchanged.

Open questions

  • xorq path coverage. XorqServerDataflow may not route through the _defer_summary_sd short-circuit on the base DataFlow. First spike test had to drop a "stats actually differ between phases" assertion because it was failing on xorq. This is the most important item to resolve before un-gating — without that assertion the spike test would still pass if recompute_summary_sd() were a no-op, so we can't detect regression of the spike's whole point.
  • Inter-phase state_change race. Between phase 1 broadcast and the call_later firing, a second state_change can land. Phase 2 then computes stats against a processed_df that already reflects the second state, but the broadcast payload mixes phase-2 stats with what the user thinks is the first interaction. The spike doesn't try to address this; needs a serialise / cancel-previous strategy before un-gating.
  • Delay tuning. 10ms is a localhost-only number — loopback WS RTT is well under 1ms. For remote deployments (RTT 30-100ms+) 10ms is shorter than the round-trip and the spike's premise breaks. Real prod tuning must measure end-to-end and pick a value that reliably loses the race to a network-bound infinite_request.
  • Cache-hit fast path. When filt_sd_key for the new state is already in summary_stats_cache (search-box backspace, oscillation), there's no compute to defer — phase 2 would be near-instant. The spike doesn't short-circuit and emits two frames anyway, which is wasteful.
  • Phase-2 message type. Both phases currently send type: "initial_state". JS will re-apply full state twice. Worth deciding whether phase 2 should be a distinct type (e.g. stats_update) so the front-end can target a cheaper re-render of just the stats-driven UI, rather than re-running the full initial_state handler.
  • No before/after timing. PR currently asserts the protocol works but doesn't measure whether it produces a perceptible win on a realistic dataset. One A/B measurement (rows-rendered latency, with/without flag, on a search that triggers a real compute) would justify or kill the protocol.

Stack

🤖 Generated with Claude Code

paddymul and others added 3 commits May 20, 2026 18:38
Proves the wire shape for "show rows first, compute stats after"
without going async. Synchronous server, just message reordering +
an ``IOLoop.call_later`` yield between the two ``initial_state``
frames so any ``infinite_request`` from the client gets serviced in
between.

Gated on ``BUCKAROO_ROWS_FIRST_SPIKE=1``. Default off so existing
``test_load_expr`` tests (which expect a single rebroadcast frame
per state change) keep working unchanged.

Mechanism:

1. State-change handler sets ``dataflow._defer_summary_sd = True``,
   then extracts state. The trait cascade still runs (``cleaned`` →
   ``processed_result`` → ``populate_df_meta``), but ``_summary_sd``
   short-circuits — the analysis-pipeline pass is skipped.
2. Phase 1 ``initial_state`` is sent immediately.
3. ``IOLoop.call_later(0.01, _send_stats_update)`` schedules the
   deferred compute. The 10ms delay gives any ``infinite_request``
   the client fires in response to phase 1 a real time window to
   arrive and get serviced before the stats compute starts.
4. Phase 2: ``recompute_summary_sd()`` triggers the analysis run
   (which fires ``_populate_sd_cache`` + ``_merged_sd`` downstream),
   then a second ``initial_state`` is sent.

Spike tests in ``tests/unit/server/test_rows_first_spike.py``:

- ``test_state_change_emits_two_initial_state_messages``: pins the
  two-message wire shape.
- ``test_infinite_request_between_phases_returns_rows``: pins the
  key win — ``infinite_request`` between the two phases is serviced
  and its parquet frame arrives before phase 2.

Not for merge as-is — this is exploration on top of #785 to validate
the protocol before doing a real (un-gated) implementation. Open
questions:

- xorq path: ``XorqServerDataflow`` may not route through this
  ``_defer_summary_sd`` short-circuit; the first test had to drop a
  "stats actually differ between phases" assertion that was failing
  on the xorq backend.
- 10ms delay is arbitrary. Real prod tuning would measure
  end-to-end and pick a value that reliably loses the race to a
  network-bound ``infinite_request`` without feeling sluggish.
- Cache hit fast path: when ``filt_sd_key`` is already in
  ``summary_stats_cache`` (search-box backspace etc.), there's no
  reason to two-message. Spike doesn't short-circuit that.

Sits on top of PR #785.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review-driven cleanup on the spike, no behaviour change:

- Hoist ``from tornado.ioloop import IOLoop`` out of
  ``_handle_buckaroo_state_change`` to the module top (CLAUDE.md
  bans in-function imports).
- Extract ``_apply_state_and_broadcast`` so phase 1 and phase 2
  share one copy of session-update + component_config merge +
  per-client send. Phase 1 still owns the
  ``session.buckaroo_state = new_state`` assignment.
- Rewrite the ``_ROWS_FIRST_SPIKE_PHASE2_DELAY_S`` comment so it
  doesn't claim 10ms beats "typical browser-to-server WS round-trip"
  (only true for localhost). Flag the remote-deployment caveat.
- Drop the dead ``monkeypatch.setenv`` in the spike test fixture —
  the module-level constant is resolved at import, so only the
  ``setattr`` actually flips the gate.

Spike tests still pass; the flag-off ``test_load_expr`` suite is
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant