
chore: Merge back changes from upstream #24

Merged
OlegWock merged 13 commits into main from oleh/merge-plan-resolvers-back-from-upstream on Mar 19, 2026

Conversation


@OlegWock OlegWock commented Mar 17, 2026

Merge without squashing

This PR brings recent changes from upstream and reconciles the git histories of upstream and the fork. I marked all fork-specific changes with a comment.

Resolves BLU-5815

Summary by CodeRabbit

Release Notes

  • New Features

    • Added pluggable plan resolver system for custom query execution paths
    • Introduced ExternalDataset support for deferred data loading
    • Added SQL unparsing capabilities across multiple SQL dialects
    • Enhanced string type handling across Arrow Utf8, LargeUtf8, and Utf8View variants
  • Dependencies

    • Upgraded DataFusion to 52.1.0 for improved performance and compatibility
    • Updated Tokio, PyO3, and related ecosystem crates to latest versions
    • Added tonic-prost for enhanced protobuf integration
  • Improvements

    • Streamlined architecture documentation across all modules
    • Enhanced Python PyO3 interoperability with improved type safety
    • Added automated DataFusion proto file generation and management

jonmmease and others added 10 commits December 26, 2025 19:48
* Add hierarchical CLAUDE.md structure for Claude Code optimization

Refactors the root CLAUDE.md to be more concise and adds per-crate
CLAUDE.md files that are loaded on-demand when working in each crate.

Key improvements:
- Document critical constraints (feature flag conflicts, proto workflow)
- Add common debugging scenarios
- Include crate-specific patterns (GIL management, async transforms, etc.)
- Reduce token usage by moving details to per-crate files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Migrate Intel macOS CI from macos-13 to macos-15-intel

GitHub deprecated macos-13 runners on December 8, 2025 with brownout
periods starting September 22, 2025. This caused osx-64 builds to fail
randomly during brownout windows.

macos-15-intel provides Intel x86_64 macOS builds and will be supported
until Fall 2027.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix PyO3 API compatibility for ARM64 cross-compilation

Replace PyDict::from_sequence with IntoPyDict trait for creating
keyword arguments. The from_sequence method was removed from the
PyDict API in the maturin cross-compilation Docker image.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix JupyterLab version mismatch in CI tests

Update JupyterLab and related packages to resolve version conflicts
between voila's bundled @jupyterlab/* packages (4.2.x) and jupytext's
extension requirements (^4.4.9+).

Changes:
- Update jupyterlab from 4.0.5 to >=4.4,<5 (resolves to 4.5.1)
- Update jupytext from 1.15.0 to >=1.16,<2 (resolves to 1.18.1)
- Update voila from 0.5.0 to >=0.5,<1
- Add pandas and ipykernel to build-dependencies
- Modify test-vegafusion-python-linux-64 to use pixi instead of pip
  for dependencies, ensuring consistent JupyterLab versions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Upgrade to DataFusion 49.0.0

- Update DataFusion deps from 48 (git patches) to 49.0.0 (crates.io)
- Update Arrow from 55.1.0 to 55.2.0
- Remove all [patch.crates-io] git patches (11 total)
- Fix AggregateFunctionParams.order_by API change: None -> vec![]

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Upgrade to DataFusion 50.1.0

- Update DataFusion to 50.1.0, Arrow to 56.0.0
- Update sqlparser to 0.58.0, pyo3-arrow to 0.11.0
- Add distinct and filter fields to WindowFunctionParams in 6 transform files
- Add PartialEq, Eq, Hash derives to ScalarUDFImpl types (MakeTimestamptzUDF, TimeunitStartUDF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Upgrade to DataFusion 51.0.0

- Update workspace dependencies to DataFusion 51.0.0 and Arrow 57.0.0
- Update sqlparser to 0.59.0 and pyo3-arrow to 0.15.0
- Add sql feature to datafusion-functions-nested for BinaryOperator compatibility
- Update NullTreatment import from datafusion_expr::utils to datafusion_functions::utils
- Adapt to pyo3-arrow 0.15 API (arro3-core ArrayIterator)
- Add TaskContext parameter to logical_plan_from_bytes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix PyO3 0.27 deprecation warnings

Replace deprecated PyO3 APIs with their modern equivalents:
- PyObject → Py<PyAny> (21 occurrences)
- Python::with_gil → Python::attach (6 occurrences)
- py.allow_threads → py.detach (5 occurrences)
- downcast/downcast_bound → cast/cast_bound (2 occurrences)

Remove #![allow(deprecated)] suppression now that all deprecated
APIs have been replaced.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Fix strict equality for mixed Arrow string types (vega#572)

Polars DataFrames use Utf8View for string columns, while VegaFusion
string literals compile to Utf8. The strict equality operators (===, !==)
were checking for exact type equality, causing string comparisons to fail
when types differed even if values matched.

This fix allows === and !== to compare any Arrow string types (Utf8,
LargeUtf8, Utf8View) correctly. Also fixes !== to return true (not false)
when types are incompatible.

Closes vega#572

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
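For illustration, the mismatch is easy to reproduce with plain pyarrow (a recent version that exposes `string_view`): the two arrays below hold equal values but different physical string types, which is exactly what the old exact-type check rejected.

```python
import pyarrow as pa

utf8_arr = pa.array(["a", "b"])              # physical type: utf8 (Utf8)
view_arr = utf8_arr.cast(pa.string_view())   # physical type: string_view (Utf8View)

# Types differ even though every value matches, so a strict type-equality
# check fails before values are ever compared.
assert utf8_arr.type != view_arr.type
assert view_arr.cast(pa.string()).equals(utf8_arr)
```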

* Handle Utf8View and mixed Arrow string types consistently

- Fix memory.rs to handle Utf8View when computing memory size
- Fix pivot transform to use as_string() for all string array types
- Fix vl_selection_test TryFrom implementations for Op, SelectionType,
  and FieldSpec to handle Utf8View alongside Utf8/LargeUtf8
- Update equality.rs to handle mixed string types in datetime comparison
- Add comprehensive tests for string type handling in pivot and selection
- Document Arrow string type handling guidelines in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Handle Utf8View consistently across JSON nesting and string extraction

* remove pre_processing string types

* Move extract_string_values helpers to vegafusion-common::datatypes

Consolidates the three-way Arrow string type downcast into shared
helpers using the AsArray trait, eliminating duplication across crates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Upgrade to DataFusion 52 and Arrow 57.3 (vega#581)

Upgrade DataFusion from 51.0.0 to 52.1.0 and Arrow from 57.0.0 to 57.3.0.
This resolves issue vega#577 (Polars Enum/Utf8View dictionary casting) via the
arrow-cast 57.3.0 fix for cast_to_dictionary with Utf8View values.

Dependency changes:
- DataFusion (all crates): 51.0.0 -> 52.1.0
- Arrow: 57.0.0 -> 57.3.0
- tonic/tonic-web/tonic-build: 0.12.3 -> 0.14.4
- prost/prost-types/prost-build: 0.13.3 -> 0.14.1
- itertools: 0.12.1 -> 0.14.0
- tokio: 1.36.0 -> 1.48.0, chrono: 0.4.35 -> 0.4.42
- New deps: tonic-prost 0.14.4, tonic-prost-build 0.14.4

Breaking change fixes:
- Replace itertools::sorted free function with .sorted() method (8 files)
- Remove NullTreatment from UDAFs for DF52 compatibility (6 files)
- Migrate build.rs to tonic-prost-build (2 files)
- Remove conflicting manual Eq/Hash impls for prost 0.14 (2 files)
- Replace tonic_web::enable() with GrpcWebLayer (1 file)
- Rename tonic tls feature to tls-ring (2 files)
- Update h2 0.3.26 -> 0.4 for tonic 0.14 compat

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix deprecated pixi.toml fields

Replace `[project]` with `[workspace]` and merge `[build-dependencies]`
into `[dependencies]` to resolve pixi deprecation warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix flaky widget test by pinning JupyterLab to match voila frontend

voila 0.5.x bundles JupyterLab 4.0.x shared modules, but jupyterlab
>=4.4 was installing extensions requiring ^4.4.9 modules. This caused
@jupyter-widgets/jupyterlab-manager to fail loading in voila, preventing
canvas rendering in the choropleth widget test.

Pin jupyterlab to >=4.0,<4.1 to match voila's bundled frontend. Also
loosen ipywidgets and anywidget pins.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: Add support for custom plan executors

* Tie some loose ends

* More cleanup

* Simplify from_sequence call

* Attempt at fixing jupyter tests

* Set embed options for alt notebook in test_jupyter_widget

* Port changes not applied correctly during rebase & cleanup

* Remove unnecessary changes to python tests

* Update new tests that cover expression compilation to be async

* Add name method to PlanExecutor trait

* Store MaterializedTaskValue instead of TaskValue on Task

* Rename ExportUpdateArrow -> MaterializedExportUpdate

* Remove .plan_executor() method from runtime trait

* Remove stale comment & formatting

* Remove pre_transform_logical_plan method

* Eagerly materialize plans when using DataFusion executor

* Fix deprecation warning in tests

* Fix Linux CI: build cp39 wheels to match requires-python >= 3.9

The manylinux 2014 container defaults to Python 3.8, producing cp38
wheels that can't be installed in the pixi test environment (Python 3.10).
Add -i python3.9 to maturin args for both x86_64 and arm64 Linux builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Increase canvas element retry timeout in jupyter widget tests

Increase wait from 0.5s intervals / 10s timeout to 1s intervals / 30s
timeout to reduce flaky failures on slower CI runners (e.g. choropleth).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…ration (vega#584)

* build: add protobuf dependencies and tooling config

Add protobuf-src to workspace Cargo.toml, pyo3-arrow to Python crate,
protobuf pip dependency to pixi, and configure ruff/mypy for new
proto modules. Add CI step for protobuf compile test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: replace PlanExecutor with PlanResolver trait

Introduce PlanResolver trait with ResolutionResult enum (Table | Plan)
replacing the simpler PlanExecutor. This enables resolvers to either
fully execute plans or rewrite them for DataFusion to execute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add VegaFusionCodec, ExternalTableProvider, and InlineTableProvider

VegaFusionCodec handles protobuf serialization with sidecar Arrow data.
ExternalTableProvider is a marker TableProvider carrying schema and
metadata for resolver dispatch. InlineTableProvider carries resolved
data back through the plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: wire PlanResolver through VegaFusionRuntime

Replace PlanExecutor with optional PlanResolver on runtime. Plans are
serialized via VegaFusionCodec, sent to resolver, and results
(table or rewritten plan with sidecar data) are handled. Remove old
DataFusionPlanExecutor and maybe_materialize_plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Python PlanResolver bridge and API

Implement PyPlanResolver Rust bridge with three-tier dispatch
(resolve_table -> resolve_plan -> resolve_plan_proto), Python
PlanResolver base class with protobuf tree walking, ExternalDataset
helper, proto module, and runtime integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
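As a rough sketch of the user-facing shape this gives: the method names below follow the dispatch order described above, but the parameter names and exact signatures are assumptions, not the real API.

```python
from vegafusion.plan_resolver import PlanResolver  # module added in this PR


class MyResolver(PlanResolver):
    """Hypothetical resolver; parameter names below are assumptions."""

    def resolve_table(self, plan, datasets):
        # First tier: return a fully materialized table to short-circuit
        # resolution, or defer to the base class to fall through.
        return super().resolve_table(plan, datasets)

    def resolve_plan(self, plan, datasets):
        # Second tier: return a rewritten plan for DataFusion to execute;
        # the third tier, resolve_plan_proto, works on raw protobuf bytes.
        return super().resolve_plan(plan, datasets)
```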

* refactor: update example and server for PlanResolver

Rename custom_executor example to custom_resolver and update to use
PlanResolver trait. Update server to pass None resolver to runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add Rust PlanResolver tests

Rename test_plan_executor to test_plan_resolver with updated trait
names. Add tests for no_resolver, table_returning_resolver, and
inline_table_codec_round_trip. Update test_image_comparison for
new runtime constructor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add comprehensive Python PlanResolver tests

Cover resolve_table, resolve_plan, resolve_plan_proto, deserialization,
external dataset registry, schema-only datasets, ResolvedPlan return,
multiple tables, error propagation, and no-override passthrough.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: add proto package __init__.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PlanResolver code review findings

- Handle LogicalPlanNode|bytes in ResolvedPlan extraction with
  SerializeToString fallback (C2)
- Add warning when resolve_ref returns None for GC'd datasets (S4)
- Extract maybe_materialize_plan helper to deduplicate 3 identical
  blocks in tasks.rs (S3)
- Add error propagation test for failing resolvers (S5)
- Use #[derive(Default)] for VegaFusionCodec
- Remove narration-style comments across Rust and Python files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: derive plan node child structure from protobuf descriptors

Replace hardcoded frozensets (_SINGLE_INPUT, _TWO_CHILD, _MULTI_INPUT,
_KNOWN_LEAF) with dynamic introspection of the LogicalPlanNode protobuf
descriptor at import time. This eliminates manual maintenance when
DataFusion adds new plan node types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: ExternalDataset registry stores data refs, not whole objects

The WeakValueDictionary now holds _DataRef wrappers around the raw data
object instead of the entire ExternalDataset. On the Rust→Python return
path, ExternalDatasets are reconstructed from protobuf-sourced schema and
metadata plus data recovered from the registry. This eliminates the
redundancy where metadata traveled both through protobuf and through the
weak reference, making protobuf the single source of truth for schema
and metadata.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add unparse_plan_to_sql for converting plans to SQL strings

Adds a PyO3 function that deserializes protobuf plan bytes and converts
them to SQL using DataFusion's Unparser with dialect support (default,
postgres, mysql, sqlite, duckdb, bigquery). Python resolvers can call
unparse_to_sql() on all or part of a logical plan to generate SQL for
execution against external engines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: support pipeline of plan resolvers

Change PlanResolver from a single optional resolver to a Vec of
resolvers that form a pipeline. Each resolver is called in order;
if one returns a materialized Table the pipeline short-circuits,
otherwise the rewritten Plan is passed to the next resolver. This
enables composing resolvers for different concerns (e.g., one that
fetches data from S3 + one that pushes queries to Spark).

Python API accepts a single resolver or a list via the
plan_resolver property, maintaining backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
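Conceptually the pipeline behaves like the sketch below (illustrative Python only; the names are placeholders rather than the crate's actual API):

```python
import pyarrow as pa


def run_pipeline(resolvers, plan, datafusion_resolver):
    """Illustrative model of the short-circuiting pipeline described above."""
    for resolver in resolvers:
        result = resolver.resolve_plan(plan)
        if isinstance(result, pa.Table):
            # A materialized table ends resolution immediately.
            return result
        # Otherwise the resolver rewrote the plan; pass it to the next one.
        plan = result
    # Nothing materialized the data, so the terminal DataFusion resolver
    # executes whatever plan remains.
    return datafusion_resolver.resolve_plan(plan)
```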

* feat: add required `kind` field to ExternalDataset for better error messages

ExternalDataset now takes a `kind` string (e.g. "spark", "snowflake") as
its first argument. This label is propagated through protobuf as a
top-level field in the codec envelope (separate from user metadata) and
appears in error messages when no resolver is registered to handle
external table references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
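A usage sketch follows; the constructor arguments beyond `kind` are assumptions based on the `.kind`/`.schema`/`.metadata` attributes referenced later in this PR.

```python
import pyarrow as pa
from vegafusion.dataset import ExternalDataset  # module added in this PR

# Schema-only external dataset; `kind` labels the backend so an unresolved
# reference produces a descriptive error instead of a generic one.
cities = ExternalDataset(
    kind="snowflake",
    schema=pa.schema([("city", pa.string()), ("population", pa.int64())]),
    metadata={"table": "ANALYTICS.CITIES"},
)
```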

* test: use inline snapshots for SQL unparse assertions

Replace generic string checks with exact inline_snapshot assertions
for all SQL dialect outputs, confirming external table names are
preserved in generated SQL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove inline_snapshot dependency, fix ruff formatting

Replace inline_snapshot assertions with plain string comparisons
to avoid adding a CI dependency. Fix ruff format and lint issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "fix: remove inline_snapshot dependency, fix ruff formatting"

This reverts commit aa3b00f.

* fix: add inline-snapshot dependency, fix ruff formatting

Add inline-snapshot to pypi-dependencies for snapshot testing of SQL
unparse output. Fix ruff format and lint issues in test and runtime files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: replace override detection with requires_external_tables option

Remove the check_has_override anti-pattern that introspected which
Python methods were overridden. Replace with a requires_external_tables
class attribute (default True) that resolvers can set to False to
receive all plans, not just those with external tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add thread_safe resolver option for thread-affine backends

When a PlanResolver sets thread_safe = False (e.g. for DuckDB in-memory
databases), the runtime uses tokio's current-thread executor instead of
multi-thread, ensuring resolver callbacks run on the calling thread.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: rename requires_external_tables to skip_when_no_external_tables

Clearer name that describes the behavior (skipping) rather than a
requirement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
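Combined with the thread_safe option above, a resolver opts into these behaviors through class attributes, roughly like this (the attribute defaults follow the commit descriptions; everything else is illustrative):

```python
from vegafusion.plan_resolver import PlanResolver


class DuckDbResolver(PlanResolver):
    # Receive every plan, not only those that reference external tables.
    skip_when_no_external_tables = False
    # DuckDB in-memory connections are thread-affine, so ask the runtime to
    # run resolver callbacks on the calling thread.
    thread_safe = False
```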

* fix: use VegaFusion's SessionContext for plan deserialization

SessionContext::new() lacks VegaFusion's custom UDFs (MAKE_UTC_TIMESTAMP,
TIMEUNIT_START_UDF, Q1, Q3). Use make_datafusion_context() so resolved
plans containing these functions can be deserialized correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: clarify resolve_table description in PlanResolver docstring

Remove internal implementation detail ("always returns Plan") that
isn't relevant to users of the class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: error when combining plan resolvers with gRPC runtime

Plan resolvers run locally and cannot work with remote gRPC runtimes.
Add guards in grpc_connect and plan_resolver setter to raise ValueError
when these incompatible options are combined. Also accept tuple input
in _normalize_resolvers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: pass ExternalDataset directly to Rust instead of tuple

Replace the anonymous (kind, schema, metadata) tuple with the
ExternalDataset object itself. Rust side now reads .kind, .schema,
.metadata by attribute name, making the FFI boundary self-documenting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove bare Schema inline dataset support in favor of ExternalDataset

The ability to pass Arrow schemas as inline datasets was an unreleased
interim pattern. ExternalDataset provides the same functionality with
explicit kind labeling and metadata. Remove the schema path from both
Python and Rust sides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: return tuple from plan_resolver getter and accept tuple input

The getter now returns a tuple (immutable) instead of a list for
multiple resolvers, preventing silent no-op mutations via .append().
All input sites (constructor, setter, normalizer) accept tuples too.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: minor test and API cleanup

Rename _simple_spec to simple_spec in tests (no need for dunder
prefix in test code). Return tuple from plan_resolver getter to
prevent silent mutation via .append(). Accept tuple input in
constructor and setter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: lazy-init _CHILD_FIELDS to avoid protobuf import at module load

_build_child_fields() imports google.protobuf which is not a declared
dependency. Defer computation to first use so importing vegafusion
doesn't require protobuf to be installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
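The deferral pattern is roughly the following; the set comprehension is a simplified stand-in for the real child-field computation, and the import path is the generated module named in this PR.

```python
_CHILD_FIELDS = None


def _child_fields():
    """Compute the child-field map on first use so that merely importing
    vegafusion does not require google.protobuf to be installed."""
    global _CHILD_FIELDS
    if _CHILD_FIELDS is None:
        # This import (and google.protobuf behind it) only happens here.
        from vegafusion.proto.datafusion_pb2 import LogicalPlanNode

        _CHILD_FIELDS = {
            field.name
            for field in LogicalPlanNode.DESCRIPTOR.fields
            if field.message_type is not None
        }
    return _CHILD_FIELDS
```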

* feat: declare protobuf as optional dependency with clear error messages

Add protobuf as an optional dependency under [plan-resolver] extra.
Wrap all protobuf imports with try/except that explains when protobuf
is needed and how to install it. Add protobuf to macOS/Windows CI
test environments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
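The guarded import follows the standard pattern below; apart from the [plan-resolver] extra named in the commit, the error wording here is an assumption.

```python
try:
    from google.protobuf import descriptor_pool  # noqa: F401
except ImportError as err:
    raise ImportError(
        "google.protobuf is required for plan resolver support; install it "
        "with the optional extra: pip install 'vegafusion[plan-resolver]'"
    ) from err
```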

* test: assert google.protobuf is not loaded on vegafusion import

Ensures the lazy init of protobuf is working correctly by checking
that google.protobuf is not in sys.modules after importing vegafusion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Strengthen plan resolver usage tests

* feat: introduce ResolverPipeline and DataFusionResolver

Replace separate `Vec<Arc<dyn PlanResolver>>` + `Arc<SessionContext>`
threading with a unified `ResolverPipeline` struct that wraps user
resolvers and a terminal `DataFusionResolver`. This collapses two
parameters into one across the entire call chain (TaskCall::eval,
CompilationConfig, VegaFusionRuntime, etc.).

Key changes:
- Rename `kind: String` to `protocol: Option<String>` on ExternalTableProvider
- Add `source: Option<String>` field to ExternalTableProvider
- New `DataFusionResolver` as terminal PlanResolver (always returns Table)
- New `ResolverPipeline` with `resolve()`, `has_user_resolvers()`, `ctx()`
- Remove `execute_plan`, `resolve_plan_pipeline` free functions
- Update all callers to use pipeline throughout

This lays the foundation for adding `resolve_schema` in a future PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve Python lint issues (ruff format + undefined name)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: dynamically download and generate datafusion proto files

Replace hardcoded datafusion version in download-datafusion-proto pixi
task with Python scripts that extract the version from Cargo.toml and
recursively resolve proto imports. Update proto files to match the
current datafusion 52.1.0 dependency (which splits common types into
a separate datafusion_common.proto file).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
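Reading the pinned version out of Cargo.toml can be done along these lines (illustrative only; the key path under [workspace.dependencies] is an assumption, and the actual automation scripts may parse the manifest differently):

```python
import tomllib  # Python 3.11+; earlier interpreters can use the tomli backport
from pathlib import Path

cargo = tomllib.loads(Path("Cargo.toml").read_text())
dep = cargo["workspace"]["dependencies"]["datafusion"]  # assumed location
version = dep["version"] if isinstance(dep, dict) else dep  # table or bare string
print(version)  # e.g. "52.1.0"
```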

* fix: regenerate proto Python code and exclude from ruff lint

Regenerate pb2 files from the updated 52.1.0 proto so they match
what gen-proto-py produces. Widen ruff exclude to cover all generated
*_pb2.py files, not just the top-level one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: exclude new proto_common_pb2 from mypy type checking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve mypy errors in plan_resolver.py

Suppress protobuf-related mypy errors (attr-defined, no-any-return,
union-attr) for plan_resolver.py since dynamically-generated pb2
classes are invisible to static analysis. Also wrap str() around
protobuf attribute returns to satisfy no-any-return in helpers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: pin protobuf >=5.27 and regenerate proto code

Pin protobuf to 5.x in pixi.toml so generated pb2 files are
consistent between local dev and CI. Regenerated files now use
protobuf 5.x runtime version validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip mypy analysis of _pytest internals

The installed pytest uses match/case syntax (Python 3.10+) which mypy
cannot parse when python_version is set to 3.9. Skip following imports
into _pytest since we don't need to type-check pytest itself.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: pin protobuf to exact version 5.29.3

Protobuf code generation embeds the exact version number in generated
files. Pin to 5.29.3 so local and CI produce identical output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add inline-snapshot and pin protobuf in CI test jobs

The osx-arm64 and win-64 test jobs install dependencies manually via
pip rather than pixi. Add missing inline-snapshot test dependency and
pin protobuf==5.29.3 to match the version used for proto codegen.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: add .gitattributes to collapse generated files in PR diffs

Marks lock files, downloaded .proto files, and generated *_pb2.py
files as linguist-generated so GitHub collapses them by default.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip datafusion-python tests when package not installed

The osx-arm64 and win-64 CI jobs don't install the datafusion Python
package. Use pytest.importorskip so these tests are skipped rather
than failing with ModuleNotFoundError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add name() method to PlanResolver trait

Each resolver now identifies itself with a name, enabling better
logging and diagnostics in the resolver pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Restore vegafusion-runtime/src/sql/ module (Spark SQL translation)
- Add SparkSqlPlanResolver implementing upstream PlanResolver trait
- Add new_vendor/new_embedded_vendor API for vendor-specific runtimes
- Use ExternalDataset for schema-only inline datasets in Spark tests
- Restore fork CI workflows (pr.yml, build-and-test.yml, release.yml)
- Restore Spark e2e test specs and test data
- Rewrite PostgreSQL regex operators (~, ~*, !~, !~*) to regexp_like()
  calls since Spark doesn't support the ~ operator that DF 52's unparser
  now generates
- Sort DataFrames before comparison in Spark e2e tests to handle
  non-deterministic GROUP BY ordering from DF 52's changed unparser
  output (see apache/datafusion#19483)
- Add unit test for the regex rewrite

coderabbitai Bot commented Mar 17, 2026

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


This PR introduces a plan resolver architecture to replace plan executors, enabling pluggable external table resolution strategies. Changes include: new PlanResolver trait in Rust with pipeline-based resolution; ExternalTableProvider and InlineTableProvider for deferring table execution; VegaFusionCodec for serializing extension nodes; MaterializedExportUpdate and MaterializedTaskValue for typed value materialization; DataFusion dependency upgraded to 52.1.0 with comprehensive datafusion_common.proto definitions; Python-side ExternalDataset and PlanResolver classes for user-defined resolution logic; proto code generation automation via download_datafusion_proto.py and gen_proto_py.py; and runtime refactoring to thread pipelines through task evaluation paths instead of direct context/executor references.

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Runtime as VegaFusion Runtime
    participant ResolverPipeline as Resolver Pipeline
    participant UserResolver as User Resolver
    participant DataFusionResolver as DataFusion Resolver
    
    User->>Runtime: pre_transform_spec(spec, inline_datasets)
    Runtime->>Runtime: Apply pre-transforms
    Runtime->>ResolverPipeline: resolve(logical_plan)
    loop For each user resolver
        ResolverPipeline->>UserResolver: resolve_plan(plan)
        alt Plan returned
            UserResolver-->>ResolverPipeline: Plan(rewritten_plan)
            ResolverPipeline->>ResolverPipeline: Continue with next resolver
        else Table returned
            UserResolver-->>ResolverPipeline: Table(data)
            ResolverPipeline-->>Runtime: VegaFusionTable
        end
    end
    alt No table from user resolvers
        ResolverPipeline->>DataFusionResolver: resolve_plan(final_plan)
        DataFusionResolver->>DataFusionResolver: Execute via DataFusion
        DataFusionResolver-->>ResolverPipeline: VegaFusionTable
        ResolverPipeline-->>Runtime: VegaFusionTable
    end
    Runtime-->>User: Transformed spec + data

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

The changes are extensive and heterogeneous: new trait-based resolver system replacing executors, major proto schema additions (datafusion_common), proto code generation automation, Python FFI and bindings redesign, serialization codec implementation, materialization type refactoring across multiple crates, and async pipeline threading throughout task evaluation. Each area requires separate reasoning due to architectural significance.

Possibly related PRs

  • Remove unusable public Python API #13 — Both PRs modify plan execution/resolution surface, replace PlanExecutor with PlanResolver, refactor Python executor paths, and update pre-transform logical plan APIs across runtime and Python modules.

Suggested reviewers

  • mfranczel

@linear

linear Bot commented Mar 17, 2026


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 28

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
vegafusion-core/build.rs (1)

31-43: 🧹 Nitpick | 🔵 Trivial

De-duplicate the proto list before it drifts.

main() and gen_tonic() now carry the same proto paths by hand. The next proto addition can easily update one codegen path and silently miss the other.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vegafusion-core/build.rs` around lines 31 - 43, Extract the list of proto
paths into a single shared constant or accessor (e.g. a const PROTOS: &[&str] or
fn proto_files() -> Vec<&'static str>) and use that symbol from both main() and
gen_tonic() instead of repeating the hard-coded array; update the
tonic_prost_build::configure().compile_protos call to pass that shared PROTOS
and ensure any other code that previously duplicated the list references the
same symbol so additions/removals only happen in one place.
vegafusion-python/src/utils.rs (1)

96-104: ⚠️ Potential issue | 🟡 Minor

Change cast_bound::<PyAny> to cast_bound::<PyDict> for proper type validation.

When input is already Py<PyAny>, casting to PyAny always succeeds, making the "must be a string or dict" fallback unreachable. Non-dict inputs incorrectly proceed to depythonize and fail with a serde error instead of the advertised contract. Casting to PyDict properly validates that second-case inputs are dictionaries.

Proposed fix
-        } else if let Ok(chart_spec) = chart_spec.cast_bound::<PyAny>(py) {
+        } else if let Ok(chart_spec) = chart_spec.cast_bound::<PyDict>(py) {
             match depythonize::<ChartSpec>(&chart_spec.clone()) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vegafusion-python/src/utils.rs` around lines 96 - 104, The branch that tries
to handle a dict currently uses chart_spec.cast_bound::<PyAny>(py) which always
succeeds and makes the final "must be a string or dict" fallback unreachable;
change the cast to chart_spec.cast_bound::<PyDict>(py) so non-dict values fail
the cast and hit the PyValErr. Locate the block using chart_spec.cast_bound,
keep the depythonize::<ChartSpec>(&chart_spec.clone()) call for successful
dicts, and ensure the error returned remains the same
(PyValErr::new_err("chart_spec must be a string or dict")) when the cast fails.
Also keep the existing serde error handling for depythonize::<ChartSpec>
unchanged.
vegafusion-core/src/task_graph/task.rs (1)

31-40: 🧹 Nitpick | 🔵 Trivial

Consider propagating error instead of unwrap().

Line 36 uses unwrap() on try_from(). If serialization fails, this panics. Since other methods in this impl return Result, consider making new_value return Result<Self> or use expect() with context.

Proposed fix
-    pub fn new_value(variable: Variable, scope: &[u32], value: MaterializedTaskValue) -> Self {
+    pub fn new_value(variable: Variable, scope: &[u32], value: MaterializedTaskValue) -> Result<Self> {
         Self {
             variable: Some(variable),
             scope: Vec::from(scope),
             task_kind: Some(TaskKind::Value(
-                ProtoMaterializedTaskValue::try_from(&value).unwrap(),
+                ProtoMaterializedTaskValue::try_from(&value)?,
             )),
             tz_config: None,
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vegafusion-core/src/task_graph/task.rs` around lines 31 - 40, The constructor
new_value currently unwraps the result of
ProtoMaterializedTaskValue::try_from(&value) which will panic on serialization
errors; change new_value to propagate the error by returning Result<Self, E>
(matching the crate error type or using anyhow::Error) or at minimum return
Result<Self, TryFromProtoError> and replace the unwrap() with ?; update the
function signature (pub fn new_value(...) -> Result<Self, _>) and construct the
TaskKind::Value using the ?-propagated ProtoMaterializedTaskValue so callers can
handle serialization failures instead of panicking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build-and-test.yml:
- Around line 101-104: The CI step "Verify generated proto files are up to date"
currently runs "pixi run gen-proto-py" and then "git diff --exit-code
vegafusion-python/vegafusion/proto/" which misses untracked new files; update
this step to check both tracked diffs and untracked files by running the
existing git diff check and then failing if "git ls-files --others
--exclude-standard vegafusion-python/vegafusion/proto/" returns any output (or
equivalently test the output of "git status --porcelain
vegafusion-python/vegafusion/proto" is empty) so that newly generated untracked
proto files also cause the job to fail.

In `@automation/download_datafusion_proto.py`:
- Line 67: Replace the unnecessary f-string print call: change the print
invocation print(f"  datafusion.proto") to a plain string
print("  datafusion.proto") (locate the print statement matching
print(f"  datafusion.proto") in automation/download_datafusion_proto.py and
remove the leading 'f').

In `@automation/gen_proto_py.py`:
- Line 32: The subprocess.run calls that assign to result (e.g., the call at the
line producing "result = subprocess.run(cmd, capture_output=True, text=True)"
and the similar call near line 95) manually check result.returncode; update both
subprocess.run invocations to include the explicit argument check=False so the
intent to handle return codes manually is clear and lint warnings are suppressed
(leave the existing manual returncode checks in place). Locate the calls in
automation/gen_proto_py.py by searching for "subprocess.run(" and add
check=False to those call sites.

In `@CLAUDE.md`:
- Line 5: The Markdown headings need blank lines above and below them; update
the file so "## Architecture" (and the other headings referenced at the same
locations) have an empty line immediately before and after the heading, and
apply the same fix for the headings at the other indicated positions (including
the ones mentioned for lines 19, 21, 27, and 33) so every heading is separated
by a blank line above and below.

In `@vegafusion-common/CLAUDE.md`:
- Line 5: Add a blank line immediately after each Markdown heading in CLAUDE.md
(e.g., the "## Key Modules" heading and the other headings noted at lines 21 and
27) so that there is a single empty line between the heading and the following
content; update each heading region to insert one newline below the heading to
satisfy MD022.

In `@vegafusion-core/CLAUDE.md`:
- Around line 30-35: Update the proto modification workflow in CLAUDE.md to
include the Python proto-generation step: after editing the .proto file and
running `cargo build -p vegafusion-core` and rebuilding dependents (e.g., `pixi
run dev-py`), also run the Python generator (`pixi run gen-proto-py` or the
exact generator command) to regenerate files under
vegafusion-python/vegafusion/proto/* so checked-in Python protobufs stay in sync
with Rust-generated code; mention build.rs as the Rust generator and instruct
contributors to rebuild both Rust and Python protobuf artifacts.

In `@vegafusion-core/src/runtime/runtime.rs`:
- Around line 406-422: The code currently pushes a PreTransformValuesWarning for
every table when row_limit is Some even if no truncation occurs; update the loop
that matches (row_limit, MaterializedTaskValue::Table(table)) to first check
whether the table actually exceeds the limit (e.g. if table.num_rows() >
row_limit or table.len() > row_limit depending on the table API) and only then
push the PreTransformValuesWarning (ValuesWarningType::RowLimit /
PreTransformRowLimitWarning) and return
MaterializedTaskValue::Table(table.head(row_limit)); otherwise leave
materialized_value unchanged and do not push a warning before pushing into
task_values.

In `@vegafusion-core/src/task_graph/task_value.rs`:
- Around line 99-116: The TryFrom implementation for MaterializedTaskValue calls
value.data.as_ref().unwrap(), which can panic on malformed
ProtoMaterializedTaskValue; update the try_from(&ProtoMaterializedTaskValue)
implementation to explicitly handle None by returning an appropriate
VegaFusionError instead of unwrapping (check for
value.data.as_ref().ok_or_else(...) or match on value.data.as_ref()), then
proceed with the existing MaterializedTaskValueData::Table and ::Scalar
branches; reference the TryFrom impl for MaterializedTaskValue, the try_from
function, and ProtoMaterializedTaskValue.data when making this change.

In `@vegafusion-python/checks/check_lazy_imports.py`:
- Around line 29-32: The print statements in check_lazy_imports.py that warn
about skipping the narwhals lazy import use implicit string concatenation so the
"{narwhals.__version__}" placeholder is not interpolated; update the offending
print calls (the one that constructs the warning message referencing
narwhals.__version__) to use an f-string by adding the leading f before the
second string literal so the {narwhals.__version__} is evaluated and the actual
version is printed.

In `@vegafusion-python/CLAUDE.md`:
- Around line 5-6: Update the CLAUDE.md "Structure" and build notes to reflect
that the crate is no longer just a single lib.rs with PyChartState: mention the
added pure-Python resolver and dataset modules and the generated
vegafusion/proto/* files, and add a proto-regeneration step (e.g., run the proto
codegen) alongside the existing "pixi run dev-py" instruction so readers know to
regenerate protos before running; reference lib.rs, PyChartState, the
resolver/dataset modules, vegafusion/proto/*, and the "pixi run dev-py" command
in the updated text.

In `@vegafusion-python/pyproject.toml`:
- Line 2: The distribution name in pyproject.toml is set to "vegafusion", which
conflicts with the upstream package; change the package distribution name (the
"name" field in pyproject.toml) back to the fork-specific value (e.g.,
"deepnote_vegafusion") so installs target deepnote_vegafusion while imports can
remain vegafusion; update the "name" entry accordingly to avoid colliding with
the upstream identity.

In `@vegafusion-python/src/plan_resolver.rs`:
- Around line 35-64: The code currently calls Python::attach directly inside the
async resolve_plan path (using Python::attach in the async task), which blocks
the Tokio worker; change resolve_plan to perform all Python GIL work inside a
blocking closure (e.g., tokio::task::spawn_blocking) instead of calling
Python::attach inline. Use the resolver's thread_safe flag (self.thread_safe or
the existing runtime selection logic) to decide whether to use spawn_blocking
for non-thread-safe resolvers or a non-blocking path for thread-safe ones, and
move/clone the py_resolver into the blocking closure and call Python::attach (or
acquire GIL) there; follow the pattern in vendor.rs spawn_blocking usage as a
reference. Ensure you only mutate or read Python objects inside that blocking
closure and return plain Rust types back to the async context.

In `@vegafusion-python/src/utils.rs`:
- Around line 27-45: The code branch that reconstructs an ExternalTableProvider
from inline_dataset is dropping the ExternalDataset.source field; extract the
source (e.g., let source: Option<String> =
inline_dataset.getattr("source")?.extract()?) from inline_dataset and preserve
it when creating the provider (either by passing it into
ExternalTableProvider::new or calling the provider's setter/constructor that
accepts source) so the reconstructed provider created in the provider variable
retains the original source value.

In `@vegafusion-python/tests/test_jupyter_widget.py`:
- Around line 464-466: The current retry on get_canvas() uses tenacity to retry
any exception for 30s; narrow it to only retry transient Selenium lookup errors
(e.g., NoSuchElementException and StaleElementReferenceException) by adding a
retry=retry_if_exception_type(...) clause to the decorator and import those
exceptions from selenium.common.exceptions, and make the stop timeout
configurable (use a fixture or an environment variable like CANVAS_RETRY_TIMEOUT
and pass int(...) into stop_after_delay) so long-running parametrized test runs
don't accumulate long delays.

In `@vegafusion-python/tests/test_plan_resolver.py`:
- Line 22: Tests introduced new ruff violations (ARG001/ARG002/RUF059/ANN401)
mainly from unused parameters and overly broad Any annotations; update the
signatures and variables to prefix intentionally unused names with an underscore
(e.g., change setup_module(module: Any) -> None to setup_module(_module: Any) ->
None) and for places where the resolver protocol requires dynamic types, replace
broad Any with a narrower type if possible or add a targeted inline noqa comment
(# noqa: ANN401) on that specific annotation; apply the same pattern to the
other flagged locations (lines referenced: 52, 94-99, 227-228, 289-290, 320-323,
371-372, 389-390, 443-446, 504-508, 576-577, 625-630, 688-689), and ensure
unused import warnings like warnings are renamed to _warnings or prefixed
similarly to satisfy ARG001/ARG002/RUF059.
- Around line 188-195: Replace the use of a fixed temp filename (the csv_path
variable created with tempfile.gettempdir() + "vf_resolver_test.csv") with a
per-test temporary file; e.g., use the pytest tmp_path fixture (or
tempfile.NamedTemporaryFile) to create a unique path and pass that to
pcsv.write_csv when creating the table in test_plan_resolver.py (update the
csv_path assignment and any test teardown accordingly). Apply the same change to
the other occurrences noted (the blocks around the csv usage at the other
occurrences referenced) so each test uses its own tmp_path/NamedTemporaryFile
instead of a shared file.

In `@vegafusion-python/tests/test_spark_e2e.py`:
- Around line 137-139: The test sorts both inmemory_df and spark_df using
sort_cols which can yield inconsistent ordering for NaN values; update the two
calls to inmemory_df.sort_values(...) and spark_df.sort_values(...) to pass an
explicit na_position argument (e.g., na_position='last') so NaN ordering is
deterministic across implementations when comparing results.

In `@vegafusion-python/vegafusion/dataset.py`:
- Line 13: The __slots__ declaration is unsorted and triggers RUF023; update the
__slots__ tuple in the class (the "__slots__" symbol in this file) to have its
entries sorted alphabetically (e.g., "__weakref__" before "data") so the tuple
becomes sorted and ruff-clean.
- Around line 60-64: The current branch skips registering live data when
self._metadata already contains "_vf_ref_id", causing stale/missing registry
entries; change the logic in ExternalDataset initialization so that whenever
data is provided you always create a new _DataRef(data) and store it into
ExternalDataset._registry under the existing "_vf_ref_id" if present (reuse that
id) or under a newly generated uuid if not, then ensure
self._metadata["_vf_ref_id"] is set to that id; update references to _DataRef,
ExternalDataset._registry, and self._metadata["_vf_ref_id"] accordingly.

In `@vegafusion-python/vegafusion/plan_resolver.py`:
- Around line 147-197: The recursive walker _resolve_external_tables can
infinite-recurse on cyclic plans; modify its signature to accept an optional
visited set (e.g., visited: set[int] = None) and/or a depth counter with a
max_depth default, initialize visited if None, and at each call check and add
id(node) (or another stable node identity) before recursing; if id(node) is
already in visited or depth exceeds max_depth, return immediately. Update all
recursive calls (including the initial/public caller) to pass the visited
set/depth along so cycles are detected and traversal stops safely (references:
_resolve_external_tables, LogicalPlanNode, _get_child_nodes, sidecar).
- Around line 40-43: Replace the typing.Union usage for the ResolutionResult
alias with the modern pipe union syntax: change ResolutionResult = Union[Table,
ResolvedPlan] to use Table | ResolvedPlan, and ensure the file has from
__future__ import annotations at the top so forward references/type annotations
work; keep the same comment about the return semantics and update any references
in resolve_plan / resolve_plan_proto to the new alias if needed.
- Around line 200-210: The function _extract_table_name has unnecessary chained
elifs after returns; simplify control flow by replacing the elif branches with
plain if statements following the initial if so each branch returns immediately
without using elif/else, keeping the same conditions that use
table_ref.WhichOneof("table_reference_enum") and the same ValueError fallback
for unknown variants; ensure the function still handles "bare", "partial", and
"full" and raises the same ValueError when which is None or unexpected.

In `@vegafusion-runtime/CLAUDE.md`:
- Around line 5-85: The markdown file CLAUDE.md has lint warnings due to missing
blank lines before headings and around fenced code blocks; fix by inserting a
single blank line before each top-level and subsection heading (e.g., "Key
Modules", "Patterns", "Async Transforms", "Custom Functions", "Testing",
"Feature Flags", "Arrow String Type Handling") and ensure there is a blank line
both before and after each fenced code block (e.g., the Rust examples under
"When checking DataType", "When pattern matching ScalarValue", and "When
downcasting ArrayRef") so that all headings and ``` fenced blocks are separated
by a blank line from surrounding text.

In
`@vegafusion-runtime/src/expression/compiler/builtin_functions/data/vl_selection_test.rs`:
- Around line 545-604: Add a regression test that verifies FieldSpec::try_from
accepts all three Arrow string types: create a test function (in the same tests
mod) that constructs ScalarValue::Utf8, ScalarValue::LargeUtf8, and
ScalarValue::Utf8View with the string values that should map to the respective
FieldSpec variants, call FieldSpec::try_from on each, and assert the returned
value matches the expected FieldSpec variant (use matches! like the existing
tests for Op and SelectionType).

In `@vegafusion-runtime/src/expression/compiler/config.rs`:
- Around line 21-31: The Default impl for CompilationConfig currently does a
heavy DataFusion setup (make_datafusion_context + ResolverPipeline::new) which
causes wasted allocations; change Default to avoid creating a real pipeline
(e.g., set pipeline to a lightweight placeholder or make it an Option::None) and
add a new explicit constructor (e.g., CompilationConfig::with_pipeline(...) or
CompilationConfig::lightweight()) for cases that need a real ResolverPipeline;
then update build_compilation_config to stop relying on ..Default::default() and
instead set the pipeline explicitly using the new constructor or by passing a
real ResolverPipeline created with make_datafusion_context so the expensive
make_datafusion_context/ResolverPipeline::new call is only performed when
actually needed.

In `@vegafusion-runtime/src/sql/spark.rs`:
- Around line 473-478: The match over ast::BinaryOperator currently captures a
_case_insensitive flag for PGRegexIMatch and PGRegexNotIMatch but never uses it;
update the code path that builds the Spark expression (after the match that
defines negate and _case_insensitive) to pass a third flags argument to Spark's
regexp_like when _case_insensitive is true (e.g., supply "i"), and ensure the
negate behavior (from negate) remains applied; look for the match that defines
(negate, _case_insensitive) and the code that generates regexp_like to add the
conditional flags argument so case-insensitive operators (~* and !~*) behave
correctly.

In `@vegafusion-runtime/tests/test_spark_sql.rs`:
- Around line 346-372: The test currently uses regexp_like() directly so it
never triggers rewrite_pg_regex; change the test to create a filter that uses
the Postgres regex operator (e.g., the `~` or `~*` operator) instead of calling
regexp_like() directly, then run logical_plan_to_spark_sql(&plan) and assert
that the generated SQL contains regexp_like(...) and does not contain the `~`
operator; update assertions around regexp_like, the absence of " ~ " and the
expected_sql to match the rewritten form so rewrite_pg_regex is exercised and
validated.

In `@vegafusion-server/CLAUDE.md`:
- Around line 1-23: The markdown lacks required blank lines around headings and
code blocks; update the document so each heading (e.g., "Build", "CLI", "Feature
Flags", "Pitfalls") is preceded and followed by a blank line and ensure the
fenced code block for the build commands has a blank line before the opening
```bash and after the closing ``` to satisfy markdownlint rules; keep the same
content and ordering, only add the missing blank lines around the headings and
the code block.

---

Outside diff comments:
In `@vegafusion-core/build.rs`:
- Around line 31-43: Extract the list of proto paths into a single shared
constant or accessor (e.g. a const PROTOS: &[&str] or fn proto_files() ->
Vec<&'static str>) and use that symbol from both main() and gen_tonic() instead
of repeating the hard-coded array; update the
tonic_prost_build::configure().compile_protos call to pass that shared PROTOS
and ensure any other code that previously duplicated the list references the
same symbol so additions/removals only happen in one place.

In `@vegafusion-core/src/task_graph/task.rs`:
- Around line 31-40: The constructor new_value currently unwraps the result of
ProtoMaterializedTaskValue::try_from(&value) which will panic on serialization
errors; change new_value to propagate the error by returning Result<Self, E>
(matching the crate error type or using anyhow::Error) or at minimum return
Result<Self, TryFromProtoError> and replace the unwrap() with ?; update the
function signature (pub fn new_value(...) -> Result<Self, _>) and construct the
TaskKind::Value using the ?-propagated ProtoMaterializedTaskValue so callers can
handle serialization failures instead of panicking.

In `@vegafusion-python/src/utils.rs`:
- Around line 96-104: The branch that tries to handle a dict currently uses
chart_spec.cast_bound::<PyAny>(py) which always succeeds and makes the final
"must be a string or dict" fallback unreachable; change the cast to
chart_spec.cast_bound::<PyDict>(py) so non-dict values fail the cast and hit the
PyValErr. Locate the block using chart_spec.cast_bound, keep the
depythonize::<ChartSpec>(&chart_spec.clone()) call for successful dicts, and
ensure the error returned remains the same (PyValErr::new_err("chart_spec must
be a string or dict")) when the cast fails. Also keep the existing serde error
handling for depythonize::<ChartSpec> unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: aea57135-cb5b-4702-b597-8ff475607758

📥 Commits

Reviewing files that changed from the base of the PR and between 3a5ea1d and ccbfd64.

⛔ Files ignored due to path filters (3)
  • Cargo.lock is excluded by !**/*.lock
  • pixi.lock is excluded by !**/*.lock
  • vegafusion-wasm/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (113)
  • .gitattributes
  • .github/workflows/build-and-test.yml
  • CLAUDE.md
  • Cargo.toml
  • automation/download_datafusion_proto.py
  • automation/gen_proto_py.py
  • examples/rust-examples/Cargo.toml
  • examples/rust-examples/examples/custom_resolver.rs
  • pixi.toml
  • ruff.toml
  • vegafusion-common/CLAUDE.md
  • vegafusion-common/Cargo.toml
  • vegafusion-common/src/data/json_writer.rs
  • vegafusion-common/src/data/table.rs
  • vegafusion-common/src/datatypes.rs
  • vegafusion-common/src/error.rs
  • vegafusion-core/CLAUDE.md
  • vegafusion-core/Cargo.toml
  • vegafusion-core/build.rs
  • vegafusion-core/src/data/tasks.rs
  • vegafusion-core/src/expression/ast/expression.rs
  • vegafusion-core/src/planning/apply_pre_transform.rs
  • vegafusion-core/src/planning/projection_pushdown.rs
  • vegafusion-core/src/planning/stringify_local_datetimes.rs
  • vegafusion-core/src/planning/watch.rs
  • vegafusion-core/src/proto/pretransform.proto
  • vegafusion-core/src/runtime/mod.rs
  • vegafusion-core/src/runtime/plan_executor.rs
  • vegafusion-core/src/runtime/plan_resolver.rs
  • vegafusion-core/src/runtime/runtime.rs
  • vegafusion-core/src/spec/chart.rs
  • vegafusion-core/src/spec/data.rs
  • vegafusion-core/src/spec/visitors.rs
  • vegafusion-core/src/task_graph/memory.rs
  • vegafusion-core/src/task_graph/task.rs
  • vegafusion-core/src/task_graph/task_value.rs
  • vegafusion-core/src/transform/pipeline.rs
  • vegafusion-core/src/transform/sequence.rs
  • vegafusion-core/src/variable/mod.rs
  • vegafusion-python/.mypy.ini
  • vegafusion-python/CLAUDE.md
  • vegafusion-python/Cargo.toml
  • vegafusion-python/checks/check_lazy_imports.py
  • vegafusion-python/proto/SOURCE.txt
  • vegafusion-python/proto/datafusion.proto
  • vegafusion-python/proto/datafusion/proto-common/proto/datafusion_common.proto
  • vegafusion-python/pyproject.toml
  • vegafusion-python/src/chart_state.rs
  • vegafusion-python/src/lib.rs
  • vegafusion-python/src/plan_resolver.rs
  • vegafusion-python/src/unparse.rs
  • vegafusion-python/src/utils.rs
  • vegafusion-python/src/vendor.rs
  • vegafusion-python/tests/test_jupyter_widget.py
  • vegafusion-python/tests/test_plan_resolver.py
  • vegafusion-python/tests/test_spark_e2e.py
  • vegafusion-python/vegafusion/__init__.py
  • vegafusion-python/vegafusion/dataset.py
  • vegafusion-python/vegafusion/plan_resolver.py
  • vegafusion-python/vegafusion/proto/__init__.py
  • vegafusion-python/vegafusion/proto/datafusion/__init__.py
  • vegafusion-python/vegafusion/proto/datafusion/proto_common/__init__.py
  • vegafusion-python/vegafusion/proto/datafusion/proto_common/proto/__init__.py
  • vegafusion-python/vegafusion/proto/datafusion/proto_common/proto/datafusion_common_pb2.py
  • vegafusion-python/vegafusion/proto/datafusion_pb2.py
  • vegafusion-python/vegafusion/runtime.py
  • vegafusion-runtime/CLAUDE.md
  • vegafusion-runtime/Cargo.toml
  • vegafusion-runtime/src/data/codec.rs
  • vegafusion-runtime/src/data/datafusion_resolver.rs
  • vegafusion-runtime/src/data/external_table.rs
  • vegafusion-runtime/src/data/inline_table.rs
  • vegafusion-runtime/src/data/mod.rs
  • vegafusion-runtime/src/data/pipeline.rs
  • vegafusion-runtime/src/data/tasks.rs
  • vegafusion-runtime/src/datafusion/udfs/datetime/make_timestamptz.rs
  • vegafusion-runtime/src/datafusion/udfs/datetime/timeunit.rs
  • vegafusion-runtime/src/expression/compiler/binary.rs
  • vegafusion-runtime/src/expression/compiler/builtin_functions/data/vl_selection_test.rs
  • vegafusion-runtime/src/expression/compiler/call.rs
  • vegafusion-runtime/src/expression/compiler/config.rs
  • vegafusion-runtime/src/lib.rs
  • vegafusion-runtime/src/plan_executor.rs
  • vegafusion-runtime/src/signal/mod.rs
  • vegafusion-runtime/src/sql/spark.rs
  • vegafusion-runtime/src/task_graph/runtime.rs
  • vegafusion-runtime/src/task_graph/task.rs
  • vegafusion-runtime/src/transform/aggregate.rs
  • vegafusion-runtime/src/transform/collect.rs
  • vegafusion-runtime/src/transform/extent.rs
  • vegafusion-runtime/src/transform/fold.rs
  • vegafusion-runtime/src/transform/identifier.rs
  • vegafusion-runtime/src/transform/impute.rs
  • vegafusion-runtime/src/transform/pivot.rs
  • vegafusion-runtime/src/transform/stack.rs
  • vegafusion-runtime/src/transform/window.rs
  • vegafusion-runtime/tests/test_expression_evaluation.rs
  • vegafusion-runtime/tests/test_image_comparison.rs
  • vegafusion-runtime/tests/test_plan_executor.rs
  • vegafusion-runtime/tests/test_plan_resolver.rs
  • vegafusion-runtime/tests/test_planning.rs
  • vegafusion-runtime/tests/test_spark_sql.rs
  • vegafusion-runtime/tests/test_task_graph_runtime.rs
  • vegafusion-runtime/tests/test_transform_pivot.rs
  • vegafusion-runtime/tests/util/equality.rs
  • vegafusion-server/CLAUDE.md
  • vegafusion-server/Cargo.toml
  • vegafusion-server/build.rs
  • vegafusion-server/src/main.rs
  • vegafusion-server/tests/test_task_graph_runtime.rs
  • vegafusion-wasm/CLAUDE.md
  • vegafusion-wasm/Cargo.toml
  • vegafusion-wasm/package.json
💤 Files with no reviewable changes (7)
  • vegafusion-core/src/runtime/plan_executor.rs
  • vegafusion-runtime/src/lib.rs
  • vegafusion-runtime/tests/test_plan_executor.rs
  • vegafusion-runtime/src/plan_executor.rs
  • vegafusion-common/src/error.rs
  • vegafusion-core/src/variable/mod.rs
  • vegafusion-runtime/tests/test_planning.rs

Author

Fork-only changes in this file

Comment on lines +137 to +139
sort_cols = list(inmemory_df.columns)
inmemory_df = inmemory_df.sort_values(sort_cols, ignore_index=True)
spark_df = spark_df.sort_values(sort_cols, ignore_index=True)
Author

With the newer version of DataFusion, rows were returned in a different order in some cases, so sorting them before comparing is required. This shouldn't affect charting results.

Author

Fork-only changes in this file

Author

Fork-only changes in this file

Merge main into feature branch to create a clean merge path back.
All conflicts resolved in favor of the feature branch, which contains
the correct state (upstream PlanResolver API + Spark integration).

# Conflicts:
#	vegafusion-python/checks/check_lazy_imports.py
@OlegWock
Author

@coderabbitai resolve and pause reviews

coderabbitai[bot]
coderabbitai Bot previously approved these changes Mar 18, 2026

coderabbitai Bot commented Mar 18, 2026

@OlegWock Sure, resolving all comments and pausing reviews.

✅ Actions performed

Comments resolved and changes approved.


Reviews paused.

@OlegWock OlegWock force-pushed the oleh/merge-plan-resolvers-back-from-upstream branch from ccbfd64 to fb5498b on March 18, 2026 08:04
@OlegWock OlegWock marked this pull request as ready for review March 18, 2026 10:40
@OlegWock
Author

@coderabbitai approve

coderabbitai[bot]
coderabbitai Bot previously approved these changes Mar 18, 2026

coderabbitai Bot commented Mar 18, 2026

✅ Actions performed

Comments resolved and changes approved.

@OlegWock OlegWock requested review from m1so and mfranczel March 18, 2026 10:58
m1so
m1so previously approved these changes Mar 18, 2026

@m1so m1so left a comment


LGTM, left one minor comment about redundant asserts in regex rewrite test

@OlegWock OlegWock dismissed stale reviews from m1so and coderabbitai[bot] via 3b1110b March 18, 2026 15:06
@OlegWock
Author

@coderabbitai approve


coderabbitai Bot commented Mar 18, 2026

✅ Actions performed

Comments resolved and changes approved.

@OlegWock OlegWock requested a review from m1so March 18, 2026 15:25
@OlegWock OlegWock merged commit 3b1110b into main Mar 19, 2026
10 checks passed