Skip to content

_operation_result re-fires itself via self.operations = result[3], doubling stats work on every quick_command_args change #780

@paddymul

Description

@paddymul

Problem

_operation_result in buckaroo/dataflow/dataflow.py is registered as an @observe callback on ('sampled_df', 'cleaning_method', 'quick_command_args', 'operations'). The body of the method writes to a trait it observes:

self.cleaned = result
self.operations = result[3]   # <-- re-fires the same observer

This causes the observer to fire twice per quick_command_args change: once for the external change, once for the inner reassignment. Every downstream observer (cleaned → processed_result → _summary_sd → populate_df_meta) runs twice. The existing (id(df), id(klasses)) dedupe in _summary_sd (added per #709 for widget construction) can't help because the second pass legitimately produces a fresh processed_df object id.

The docstring already flags the hazard:

Obviously this can lead to cycles so this code must be approached carefully.

…but the guard never landed.

Reproduction

Test data: 500k-row citibike parquet (~34MB, 13 columns). Sending buckaroo_state_change with quick_command_args = {"search": ["elect"]} from a WS client, measuring round-trip:

Backend state_change post-search page total
pandas (POST /load mode="buckaroo") 1369ms 4ms 1374ms
xorq (POST /load_expr, #776) 3970ms 139ms 4118ms
polars (POST /load mode="lazy") (timed out — see "out of scope" below)

For comparison, the equivalent naive df[c].str.contains('elect') across all 13 string columns standalone (no widget) is 505ms. So the widget overhead is ~2.7× even for pandas; this issue accounts for half of that.

Per-method timing during the xorq state_change (instrumentation reverted):

xorq backend, search "elect":
  _get_summary_sd:                    1837ms  processed_df_id=...578960
  populate_df_meta _expr_count(proc): 117ms
  populate_df_meta _expr_count(orig): 2ms
  [re-enters via self.operations = result[3]]
  _get_summary_sd:                    1846ms  processed_df_id=...066896  ← DUPLICATE
  populate_df_meta _expr_count(proc): 117ms                              ← DUPLICATE
  populate_df_meta _expr_count(orig): 2ms
  total:                              3970ms

Both passes produce the same logical result; the second is pure duplicate work.

Surfaced by the pydata-app companion (https://github.com/paddymul/pydata-app) once #776 wired up /load_expr and a 500k-row entry hit the search box.

Proposed fix

Reentry guard in _operation_result:

     def _operation_result(self, _change:Any) -> None:
         \"\"\"...
+        Concretely: this method sets ``self.operations = result[3]`` while
+        also observing the ``operations`` trait, so the observer re-fires
+        inside its own execution. Without the reentry guard below, every
+        downstream observer (cleaned → processed_result → _summary_sd →
+        populate_df_meta) runs twice per single ``quick_command_args``
+        change — measured as ~2× search latency for a 500k-row xorq frame.
         \"\"\"
+        if getattr(self, '_in_operation_result', False):
+            return
+        self._in_operation_result = True
+        try:
             result = self.ac_obj.handle_ops_and_clean(
                 self.sampled_df, self.cleaning_method, self.quick_command_args, self.operations)

             if result is None:
                 return
             else:
                 self.cleaned = result
                 self.operations = result[3]
             self.operation_results = {'transformed_df':None,
                 'generated_py_code': self.generated_code}
+        finally:
+            self._in_operation_result = False

Measured after fix:

Backend state_change total
pandas 704ms (was 1369ms) 708ms (was 1374ms)
xorq 1919ms (was 3970ms) 2061ms (was 4118ms)

Roughly 50% reduction across the board. 943 of 944 buckaroo unit tests pass (the one failure is tests/unit/server/test_mcp_uvx_install.py::test_uvx_no_stdout_pollution — unrelated to the dataflow change).

Why this matters

  • Halves search latency on every catalog-entry view with a non-empty filter input — the workload the audience hits the moment they type into the search box.
  • Removes pure duplicate work; no behavioural change. Both passes today produce equivalent state.
  • Necessary precondition for any per-column-stats optimization downstream (otherwise improvements get halved by the duplicate fire).

Out of scope

  • polars/lazy search isn't implemented server-side at all. _handle_buckaroo_state_change early-returns for mode != "buckaroo" (websocket_handler.py:54) and handle_infinite_request_lazy doesn't read search from payload_args (data_loading.py:164). My probe above timed out for that reason, not because of this bug. Worth a separate issue.
  • Remaining 1.8s in xorq is one _get_summary_sd call iterating per-column. Probably wants a future ticket: batch all column stats into one xorq query, or skip the per-column histogram refresh when only quick_command_args.search changed (schema hasn't actually moved).

Happy to PR this as a fixup on #776 or as a standalone PR against main — whichever you prefer.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions