ci: use uv in lint job build step #1
Open
geruh wants to merge 28 commits into
…siblings (lance-format#6679)

## Summary

`Field::apply_projection` special-cases `Map` types by cloning all children unconditionally (to preserve the entries-struct `Struct<key, value>` physical layout), but applies that special case **before** checking whether the parent Map is in the projection. Result: every `Map` column survives every projection regardless of what the caller asked for.

On real workloads this turns a narrow scan into a wide one: a scalar-index training scan over an `Int32` column on a table with sibling `Map<…, binary>` columns ends up feeding all of them into `SortExec` and blowing external-sort spill from a few GiB to >100 GiB.

## Fix

One early return: if the parent Map isn't selected, drop the subtree the same way every other non-selected leaf is dropped.

```rust
if self.logical_type.is_map() && !projection.contains_field_id(self.id) {
    return None;
}
```

The clone-children branch is unchanged for the case where the parent _is_ selected.

## Test

`scan_training_data_does_not_pull_unrelated_map_siblings` in `rust/lance/src/index/scalar.rs`: builds a 4-column dataset (`indexed: Int32` + three `Map` siblings), trains a BTree on `indexed`, asserts the scan plan's `LanceRead` projection is exactly `[indexed]`. Pins `LanceFileVersion::V2_2` since `Map` requires 2.2+.

## Notes

- No behavior change for projections that include the Map column, or for non-Map types.
- Cherry-picks cleanly onto `v4.0.x`: `Field::apply_projection` is identical between `main` and `v4.0.0`.
- The test was generated by Claude.
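The ordering problem described above can be sketched with a toy model. This is a minimal illustration, not Lance's code: `Kind`, `Field`, `project`, and `sample_schema` are hypothetical stand-ins, and the projection is modelled as a plain set of field ids.

```rust
use std::collections::HashSet;

// Toy stand-ins for illustration only; these are not Lance's real types.
#[derive(Clone, Debug, PartialEq)]
enum Kind {
    Leaf,
    Struct,
    Map,
}

#[derive(Clone, Debug, PartialEq)]
struct Field {
    id: i32,
    kind: Kind,
    children: Vec<Field>,
}

impl Field {
    /// Keep this field only if the projection (a set of field ids) asks for it.
    fn apply_projection(&self, selected: &HashSet<i32>) -> Option<Field> {
        if self.kind == Kind::Map {
            // Early return: an unselected Map is dropped before the
            // clone-all-children special case can run.
            if !selected.contains(&self.id) {
                return None;
            }
            // A selected Map keeps every child so its entries struct stays intact.
            return Some(self.clone());
        }
        let children: Vec<Field> = self
            .children
            .iter()
            .filter_map(|c| c.apply_projection(selected))
            .collect();
        if selected.contains(&self.id) || !children.is_empty() {
            Some(Field { id: self.id, kind: self.kind.clone(), children })
        } else {
            None
        }
    }
}

fn leaf(id: i32) -> Field {
    Field { id, kind: Kind::Leaf, children: vec![] }
}

/// A Map field whose single child is an entries struct with key and value leaves.
fn map(id: i32) -> Field {
    let entries = Field {
        id: id + 1,
        kind: Kind::Struct,
        children: vec![leaf(id + 2), leaf(id + 3)],
    };
    Field { id, kind: Kind::Map, children: vec![entries] }
}

fn sample_schema() -> Vec<Field> {
    vec![leaf(0), map(1), map(5)]
}

fn project(schema: &[Field], ids: &[i32]) -> Vec<Field> {
    let selected: HashSet<i32> = ids.iter().copied().collect();
    schema.iter().filter_map(|f| f.apply_projection(&selected)).collect()
}

fn main() {
    let narrow = project(&sample_schema(), &[0]);
    println!("projection [0] keeps {} top-level field(s)", narrow.len());
    let wide = project(&sample_schema(), &[0, 1]);
    println!("projection [0, 1] keeps {} top-level field(s)", wide.len());
}
```

In this toy, removing the `!selected.contains(&self.id)` check makes both Map columns survive the `[0]` projection, which is the shape of the bug the summary describes.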
…format#6674)

## Changes

Adding `equals()` and `hashCode()` to `MatchQuery`, `PhraseQuery`, and `MultiMatchQuery` in `FullTextQuery`.

## Why

The concrete subclasses currently inherit reference equality from `Object`. Two independently constructed instances with identical fields are not `equals()` and produce different hash codes.

This was surfaced while trying to sketch an impl for SQL FTS query support in lance-spark, which allows users to push `lance_match`, `lance_phrase`, and `lance_multi_match` predicates down to the Lance FTS inverted index at query planning time. That work requires transporting `FullTextQuery` instances across the driver -> executor boundary, serializing them to a stable intermediate form and reconstructing them on the other side. Without structural equality, comparing a reconstructed instance to the original always returns `false` even when every field is identical, making correct scan identity checks impossible.

More broadly, any consumer that serializes a `FullTextQuery` and reconstructs it later faces the same problem. The fix is mechanical and non-breaking: callers that never compared these objects structurally are unaffected.
…-format#6704) The JNI vector trainer hardcoded `MetricType::L2` in both `inner_train_ivf_centroids` and `inner_train_pq_codebook`, ignoring the metric the caller intends to use when later building per-fragment index segments. For non-L2 metrics (cosine, dot, hamming) this silently trained centroids and PQ codebooks on L2 geometry while encoding happened against the user's actual metric, producing degraded recall with no error or warning. Add `DistanceType distanceType` as a parameter to `VectorTrainer.trainIvfCentroids` and `trainPqCodebook` (mirroring the shape of the underlying Rust `build_ivf_model` / `build_pq_model`, which take `metric_type` separately from `IvfBuildParams` / `PQBuildParams`). The metric belongs at the function call site, not on the algorithm-config struct — the param structs stay isomorphic to their Rust counterparts and there is a single source of truth for the metric per training call. Backward-compat overloads default `distanceType` to `DistanceType.L2` to preserve the existing call sites. The native methods take the distance type as a separate `String` argument and parse it via `MetricType::try_from`. Add regression tests `testTrainIvfCentroidsHonorsDistanceType` and `testTrainPqCodebookHonorsDistanceType` that train the same dataset twice with L2 and Cosine and assert the resulting arrays differ. With the bug present, the two arrays were identical because both paths fell through to L2.
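To see why training on the wrong metric matters, here is a small sketch under stated assumptions: `Metric` is a hypothetical enum (not Lance's `MetricType`), the accepted spellings in `try_from` are invented for illustration, and only L2 and cosine are modelled. It shows the same point being assigned to different centroids depending on the metric.

```rust
use std::convert::TryFrom;

// Hypothetical stand-in for a metric enum; not Lance's `MetricType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Metric {
    L2,
    Cosine,
}

impl TryFrom<&str> for Metric {
    type Error = String;

    // The accepted spellings here are an assumption for illustration.
    fn try_from(name: &str) -> Result<Self, Self::Error> {
        match name.to_ascii_lowercase().as_str() {
            "l2" => Ok(Metric::L2),
            "cosine" => Ok(Metric::Cosine),
            other => Err(format!("unsupported metric: {}", other)),
        }
    }
}

/// Squared L2 distance, or cosine distance (1 - cosine similarity).
fn distance(metric: Metric, a: &[f32], b: &[f32]) -> f32 {
    match metric {
        Metric::L2 => a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>(),
        Metric::Cosine => {
            let dot = a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
            let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
            1.0 - dot / (norm(a) * norm(b))
        }
    }
}

/// Index of the centroid closest to `point` under `metric`.
fn nearest(metric: Metric, point: &[f32], centroids: &[Vec<f32>]) -> usize {
    let mut best = 0;
    for (i, c) in centroids.iter().enumerate() {
        if distance(metric, point, c) < distance(metric, point, &centroids[best]) {
            best = i;
        }
    }
    best
}

fn main() {
    let centroids = vec![vec![9.0, 5.0], vec![1.0, 0.0]];
    let point = [10.0, 1.0];
    for name in ["l2", "cosine"] {
        let metric = Metric::try_from(name).expect("known metric");
        println!("{}: nearest centroid = {}", name, nearest(metric, &point, &centroids));
    }
}
```

The point `(10, 1)` is closest to centroid 0 by L2 but closest to centroid 1 by cosine, so partitions trained under one metric and queried under another can disagree.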
Update rewrites affected rows using the dataset physical schema. Avoid
wrapping the scan output in DatasetRecordBatchStream, which may convert
internal JSON columns from lance.json/LargeBinary to arrow.json/Utf8 for
user-facing reads and cause schema mismatches during rewrite. Add
coverage for updating both regular and JSON columns.
Error message:
```
Traceback (most recent call last):
File "/Users/bytedance/Project/emr/jinglun-lance-hello-python/main.py", line 298, in <module>
chunsheng_debug()
~~~~~~~~~~~~~~~^^
File "/Users/bytedance/Project/emr/jinglun-lance-hello-python/main.py", line 291, in chunsheng_debug
ds.update({'speaker_id': '"SPEAKER_9172"'}, where="name='沈逸'")
~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/bytedance/miniconda3/lib/python3.13/site-packages/lance/dataset.py", line 2577, in update
return self._ds.update(updates, where, conflict_retries, retry_timeout)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: Encountered internal error. Please file a bug report at https://github.com/lance-format/lance/issues. Expected schema Schema { fields: [Field { name: "id", data_type: Int64, nullable: true }, Field { name: "users_id", data_type: Int64, nullable: true }, Field { name: "user_id", data_type: Utf8, nullable: true }, Field { name: "name", data_type: Utf8, nullable: true }, Field { name: "files", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:metadata": "", "ARROW:extension:name": "lance.json"} }, Field { name: "user_tags", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:name": "lance.json", "ARROW:extension:metadata": ""} }, Field { name: "output_text", data_type: Utf8, nullable: true }, Field { name: "ai_tags", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:name": "lance.json", "ARROW:extension:metadata": ""} }, Field { name: "created_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "updated_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "del_flag", data_type: Int32, nullable: true }, Field { name: "speaker_id", data_type: Utf8, nullable: true }], metadata: {} } but got Schema { fields: [Field { name: "id", data_type: Int64, nullable: true }, Field { name: "users_id", data_type: Int64, nullable: true }, Field { name: "user_id", data_type: Utf8, nullable: true }, Field { name: "name", data_type: Utf8, nullable: true }, Field { name: "files", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:metadata": "", "ARROW:extension:name": "arrow.json"} }, Field { name: "user_tags", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:name": "arrow.json", "ARROW:extension:metadata": ""} }, Field { name: "output_text", data_type: Utf8, nullable: true }, Field { name: "ai_tags", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:name": "arrow.json", "ARROW:extension:metadata": ""} }, 
Field { name: "created_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "updated_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "del_flag", data_type: Int32, nullable: true }, Field { name: "speaker_id", data_type: Utf8, nullable: true }], metadata: {} }, /Users/runner/work/lance/lance/rust/lance/src/dataset/write/update.rs:274:24
Process finished with exit code 1
```
Closes lance-format#6329
…e-format#6749)

## Summary

- Adds `LsmScanner::without_base_table(schema, base_path, snapshots, pk_columns)` and `LsmDataSourceCollector::without_base_table(base_path, snapshots)` so callers can scan only the active memtable and flushed L0 generations.
- Internally, `base_table` becomes `Option<Arc<Dataset>>`; `collect()` / `collect_for_shards()` skip the base source when it is absent. All three planners (scan, point lookup, vector search) are unaffected since they already drive off the collector and keep their own `base_schema` field.
- Existing `LsmScanner::new(...)` and `LsmDataSourceCollector::new(...)` signatures are unchanged, so Python bindings and benches keep compiling without edits.

## Motivation

Some callers own the base read path elsewhere and only need the WAL's contribution to a query: the active memtable ∪ L0 flushed generations. Without this change, `LsmScanner` always pulls the base dataset in, which duplicates work and forces those callers to subtract the base contribution out of band. With the new constructor, the same scanner code path serves both modes; dedup semantics across generations are unchanged.

## Test plan

- [x] `cargo test -p lance --lib dataset::mem_wal::scanner`: 55/55 passing, including four new tests:
  - `test_lsm_scan_without_base_table`: verifies the plan excludes any base-table `LanceRead` and that ids only present in the base are absent from the result.
  - `test_lsm_scan_without_base_table_no_flushed_no_active`: empty fresh tier returns an empty plan cleanly.
  - `test_point_lookup_without_base_table`: point lookup hits a flushed generation; missing key returns empty.
  - `test_vector_search_without_base_table`: plan construction succeeds and excludes the base.
- [x] `cargo clippy -p lance --tests -- -D warnings` clean
- [x] `cargo fmt -p lance` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ance-format#6566)

## Summary

- refactor the Lance format specification so index formats are a top-level section
- split catalog specs and namespace client spec into distinct sections in the published docs
- move full website assembly into reusable docs scripts and CI wiring
- keep the docs buildable with placeholders when external repos are absent

Namespace PR: lance-format/lance-namespace#326
Vote: lance-format#6567
Preview: https://jackye1995.github.io/lance/
…format#6772)

## Summary

Adds `Dataset.takeRows(List<Long> rowIds, List<String> columns)` to the Java SDK, mirroring Rust's existing `Dataset::take_rows()` by adding a JNI binding.

`take()` accepts logical row indices (positions). `takeRows()` accepts physical row IDs from the `_rowid` system column. These are stable across compaction and deletion, which makes them suitable for applications that store row IDs externally (e.g., in a secondary index) and later need to fetch the corresponding row data.

## Changes

- `java/lance-jni/src/blocking_dataset.rs`: JNI shim `nativeTakeRows` + `inner_take_rows`, which mirrors `inner_take` but calls `dataset.take_rows()` instead of `dataset.take()`
- `java/src/main/java/org/lance/Dataset.java`: public `takeRows()` method + native declaration
- `java/src/test/java/org/lance/DatasetTest.java`: `testTakeRows` covering correctness, input-order preservation, and edge cases (empty/null rejection)

---------

Co-authored-by: Alexandra Li <alexandra.li@databricks.com>
…tate (lance-format#6753)

## Summary

The existing `memory://` object store provider returns a fresh `InMemory` backend on every `new_store` call, which makes it impossible for independent components in the same process to coordinate through a shared in-memory store; each writes to its own isolated bytes even when the URL matches.

This PR adds a sibling `shared-memory://` scheme that wraps `MemoryStoreProvider` and swaps the inner backend for a process-global `Arc<InMemory>` keyed by URL authority.

- All `shared-memory://bucket-a/...` URLs resolve to the same underlying `InMemory` backend (the URL path is the object key within that backend), so a write to `shared-memory://bucket-a/x` from one component is visible to another component reading `shared-memory://bucket-a/x`, across `ObjectStoreRegistry` instances, threads, and unrelated callers in the same process.
- `shared-memory://bucket-a/...` and `shared-memory://bucket-b/...` resolve to separate backends and remain isolated.
- `memory://` is **unchanged**, so existing tests that rely on per-call isolation are unaffected.

The cache lives in a process-global `LazyLock<Mutex<HashMap<String, Arc<InMemory>>>>` rather than on the provider instance because `ObjectStore::from_uri` constructs a fresh `ObjectStoreRegistry` per call, so a per-provider cache wouldn't actually share state across components.

### Why a new scheme rather than caching `memory://`

`"memory://"` (no path) is reused as a generic "give me a temp store" URL across ~70+ test sites in the codebase (e.g. `Dataset::write(data, "memory://", ...)`). With `cargo test` running in parallel, adding caching to `memory://` would cause cross-test contamination at the dataset root. Opt-in via a new scheme avoids any migration churn.

### Use case

Tests and harnesses that need multiple actors to coordinate through a common in-memory object store, e.g. a writer and an independent reader, multi-pod fence simulations, manifest-store + WAL-writer agreement on the same backing bytes.

## Test plan

- [x] `cargo test -p lance-io --lib shared_memory`: 5 new tests pass (cross-registry sharing, per-authority isolation, path extraction, prefix uniqueness, end-to-end via `from_uri_and_params`)
- [x] `cargo test -p lance-io --lib memory`: existing `memory://` tests still pass
- [x] `cargo clippy -p lance-io --tests -- -D warnings` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
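The process-global cache keyed by URL authority can be sketched with the standard library alone. This is an illustration under assumptions, not the provider's implementation: `InMemory` here is a toy byte map, `split_url`, `shared_store`, `put`, and `get` are hypothetical helpers, and URL parsing is simplified to string splitting. `LazyLock` requires Rust 1.80 or newer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, Mutex};

// Hypothetical stand-in for an in-memory object store backend.
#[derive(Default)]
struct InMemory {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

// Process-global cache: one backend per URL authority (the "bucket" part).
static SHARED: LazyLock<Mutex<HashMap<String, Arc<InMemory>>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// Split `shared-memory://authority/path` into (authority, path).
/// Simplified string parsing, for illustration only.
fn split_url(url: &str) -> Option<(&str, &str)> {
    let rest = url.strip_prefix("shared-memory://")?;
    Some(rest.split_once('/').unwrap_or((rest, "")))
}

/// Return the backend shared by every caller that uses the same authority,
/// plus the object key (the URL path) within that backend.
fn shared_store(url: &str) -> Option<(Arc<InMemory>, String)> {
    let (authority, path) = split_url(url)?;
    let mut cache = SHARED.lock().unwrap();
    let store = Arc::clone(cache.entry(authority.to_string()).or_default());
    Some((store, path.to_string()))
}

fn put(url: &str, bytes: &[u8]) -> bool {
    match shared_store(url) {
        Some((store, path)) => {
            store.objects.lock().unwrap().insert(path, bytes.to_vec());
            true
        }
        None => false,
    }
}

fn get(url: &str) -> Option<Vec<u8>> {
    let (store, path) = shared_store(url)?;
    let objects = store.objects.lock().unwrap();
    let value = objects.get(&path).cloned();
    value
}

fn main() {
    // Two unrelated call sites resolve the same authority to the same backend.
    put("shared-memory://bucket-a/x", b"hello");
    println!("bucket-a/x: {:?}", get("shared-memory://bucket-a/x"));
    // A different authority is a separate, isolated backend.
    println!("bucket-b/x: {:?}", get("shared-memory://bucket-b/x"));
}
```

Because the map is a `static`, it outlives any registry or provider instance, which is the property the summary relies on.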
) ## Summary Adds three opt-in `ShardWriter` APIs that let callers (tests, custom compaction loops, supervised flush flows) drive memtable flushing and fence detection synchronously rather than waiting on automatic triggers or the next manifest round-trip. - `force_seal_active()` — freezes the active memtable and enqueues it for L0 flush. No-op on an empty memtable; errors in WAL-only mode. - `wait_for_flush_drain()` — blocks until every frozen memtable in the L0 flush queue has landed and been recorded in the manifest. Pair with `force_seal_active` for "everything-on-disk" semantics. Does not wait on the active memtable. - `check_fenced()` — surfaces a successor writer's higher-epoch claim immediately, without waiting for the next manifest put. All three are pure additions on `ShardWriter`; no existing behavior changes. ## Test plan - [x] `cargo check -p lance --tests` - [x] `cargo clippy -p lance --tests -- -D warnings` - [x] New `test_force_seal_active_and_wait_for_flush_drain` verifies seal + drain advances the memtable generation and appends a flushed generation to the manifest - [x] New `test_check_fenced_detects_successor_claim` verifies fence is observable from the loser writer before any subsequent put 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ormat#6786) Updated filename suffix from '.lance' to '.arrow' for WAL entries.
## Summary

- Add `DeleteBuilder::execute_uncommitted()` so deletes can be staged and committed later with `CommitBuilder`.
- Return an `UncommittedDelete` wrapper containing the delete `Transaction`, `affected_rows`, and `num_deleted_rows` so staged commits can preserve row-level conflict rebasing.
- Factor delete transaction construction into a helper shared by committed and uncommitted delete paths.

## Test plan

- `cargo fmt --all`
- `cargo test -p lance dataset::write::delete::tests::test_delete_execute_uncommitted_preserves_affected_rows_for_rebase --lib`
- `cargo test -p lance dataset::write::delete::tests::test_delete_false_predicate_still_commits --lib`
- `cargo test -p lance dataset::write::delete::tests::test_concurrent_delete_with_retries --lib`
- `cargo test -p lance dataset::write::delete::tests::test_delete_concurrency --lib`
- `cargo test -p lance dataset::write::delete::tests --lib`

Fixes lance-format#6658.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
…es (lance-format#6767) ## Summary After a writer flushed a memtable to L0 and an external compactor merged that generation into the base table — legitimately draining `flushed_generations` to empty — a subsequent restart re-replayed the original WAL entries into the new active memtable, duplicating rows on read. Two bugs were interacting: 1. **Disambiguation:** `replay_memtable_from_wal` distinguished "fresh shard" from "flushed and compacted" via `flushed_generations.is_empty()`. That works in a closed-world deployment but breaks the moment an external compactor enters the picture — and the compactor is the *intended* consumer that drains that vector, so the signal is structurally broken under OSS-WAL. 2. **Cursor never advanced:** `MemTableFlusher::flush` read `covered_wal_entry_position` from `memtable.last_flushed_wal_entry_position()`, but that field is only set by the `mark_wal_flushed` test helper. In production it stayed at 0, so `replay_after_wal_entry_position` never advanced past 0. Under 0-based WAL positions this masked bug #1 — both "fresh" and "post-flush-of-0" produced cursor=0. ## Fix - **WAL positions are now 1-based** (`FIRST_WAL_ENTRY_POSITION = 1`). A cursor of `0` unambiguously means "no flush has stamped this shard," so replay collapses to `cursor.saturating_add(1)` without consulting `flushed_generations`. - **`WalFlushHandler::handle`** writes the just-appended position back into `state.last_flushed_wal_entry_position` under the state lock before signalling the completion cell. - **`MemTableFlusher::flush` / `flush_with_indexes`** now take an explicit `covered_wal_entry_position` arg. The production caller derives it per-memtable from the `WalFlushResult` carried in the completion cell — authoritative under concurrent flushes — falling back to `memtable.frozen_at_wal_entry_position()` when freeze did not trigger a flush. 
- **State seed at open** uses the post-replay WAL tip, not `manifest.wal_entry_position_last_seen` (the latter is bumped on every tailer read and can sit above any flushed generation). - Proto field docs on `ShardManifest.replay_after_wal_entry_position` / `wal_entry_position_last_seen` updated to spell out the 1-based convention and what default-0 means. ## Test plan - [x] Added `test_memtable_replay_skips_entries_after_external_compaction` in `rust/lance/src/dataset/mem_wal/write.rs`: open writer, put rows, close (flush), simulate the compactor by directly committing a manifest with empty `flushed_generations`, reopen, assert the memtable is empty. Fails on the pre-fix code; passes now. - [x] `cargo test -p lance --lib dataset::mem_wal` — 236/236 pass - [x] `cargo test -p lance --lib` — 1600/1600 pass - [x] `cargo test -p lance-index --lib` — 302/302 pass - [x] `cargo clippy --all --tests --benches -- -D warnings` — clean - [x] `cargo fmt --all -- --check` — clean ## Compatibility WAL position numbering changes from 0-based to 1-based. Existing on-disk manifests / WAL files written by the prior `oss-wal-multiplex` code are not migrated — coordinated with downstream consumers (sophon) to start fresh. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
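The 1-based cursor convention can be worked through in a few lines. This is a sketch of the arithmetic only: `replay_start` and `entries_to_replay` are hypothetical helpers, and `wal_tip` is assumed to be the position of the last WAL entry (0 when the WAL is empty).

```rust
/// Taken from the description: positions start at 1, so a cursor of 0
/// means "no flush has stamped this shard".
const FIRST_WAL_ENTRY_POSITION: u64 = 1;

/// First WAL position to replay, given the manifest's
/// replay-after cursor. No other state is consulted.
fn replay_start(replay_after_wal_entry_position: u64) -> u64 {
    replay_after_wal_entry_position.saturating_add(1)
}

/// Positions that would be replayed into the new active memtable.
/// `wal_tip` is assumed to be the last written position (0 if none).
fn entries_to_replay(cursor: u64, wal_tip: u64) -> Vec<u64> {
    (replay_start(cursor)..=wal_tip).collect()
}

fn main() {
    // Fresh shard, nothing written: nothing to replay.
    println!("{:?}", entries_to_replay(0, 0));
    // Fresh shard with three unflushed entries: replay all of them.
    println!("{:?}", entries_to_replay(0, 3));
    // Flushed through position 3, then compacted externally: nothing to replay.
    println!("{:?}", entries_to_replay(3, 3));
    println!("first position = {}", FIRST_WAL_ENTRY_POSITION);
}
```

Under a 0-based scheme, "fresh" and "flushed through position 0" would both produce a cursor of 0, which is the ambiguity the fix removes.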
…urces (lance-format#6761) ## Summary `LsmVectorSearchPlanner::plan_search` produced wrong or no output whenever more than one LSM source contributed candidates. Three bugs combined to swallow the `_distance` column: 1. **Schema mismatch panic.** The base/flushed arm calls `scanner.fast_search()`, which forces `_rowid` into Lance's projection (`[id, vector, _rowid, _distance]`). The active-memtable arm produces `[id, vector, _distance]`. After `MemtableGenTagExec` adds `_memtable_gen`, `UnionExec` rejects the 5-vs-4 mismatch and the query panics before any `_distance` reaches the caller. 2. **Silent partition loss.** `UnionExec` emits one partition per input source. Both `SortExec` and `FilterStaleExec` only read partition 0 of their multi-partition input, so KNN hits (and their `_distance` values) from every source past the first silently vanished. 3. **Internal columns leaked.** `MemtableGenTagExec` was wrapped unconditionally, so `_memtable_gen` ended up in user-visible output even when no bloom filters were configured. ### Fixes - Add `project_to_canonical` to normalize every KNN source to `[projection..., _distance]` before union. This drops `_rowid` from the base/flushed arms so all inputs share one schema. - Insert `CoalescePartitionsExec` before `SortExec`, and again before `FilterStaleExec` on the bloom-filter path, so the merge sees every KNN candidate. - Only wrap with `MemtableGenTagExec` when bloom filters are configured; project `_memtable_gen` back out after `FilterStaleExec` so the public schema stays clean. - Route the active arm through `build_projection_for_knn` so explicit user projections don't drop the PK that staleness detection needs. ## Test plan - [x] `test_vector_search_base_plus_active_returns_distance` — end-to-end search across base + active memtable; asserts `_distance` present, nearest neighbor has near-zero distance, and `_rowid` / `_memtable_gen` are NOT in the output. 
- [x] `test_vector_search_with_projection_returns_distance_and_pk`: regression for the active-arm projection path; PK column auto-included when user projects only `vector`.
- [x] `test_vector_search_with_bloom_filter_strips_memtable_gen`: exercises the bloom-filter branch; verifies the post-`FilterStaleExec` projection strips `_memtable_gen` and that union partitions feed correctly into the filter.
- [x] All existing `mem_wal` tests pass.
- [x] `cargo clippy -p lance --tests --benches -- -D warnings` clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t#6740) When stable row IDs are enabled and rows are updated, the updated rows cannot be scanned with a limit condition. <img width="1160" height="418" alt="image" src="https://github.com/user-attachments/assets/86a9dfbc-bed9-440d-b784-ecc2b638d953" /> <img width="1268" height="542" alt="image" src="https://github.com/user-attachments/assets/4b338fcb-9891-4377-9fef-b2302e72dc03" />
## Summary
A blob column can be decoded under two distinct shapes that share the
same physical column:
- the **descriptor** view (`Struct<position, size>`) used when scanning
with `BlobHandling::BlobsDescriptions` (the default for `to_table` and
`take_blobs`);
- the **bytes** view (`LargeBinary`) used by `BlobHandling::AllBinary`
and `compact_files`.
Both paths build a `StructuralPrimitiveFieldScheduler`, but the
page-level schedulers they instantiate cache different concrete
`CachedPageData` types. The cache key (`FieldDataCacheKey`) carried only
`column_index`, so when one shape populated the cache and a second
reader on the same `Dataset` / `Session` hit it with the other shape,
`BlobPageScheduler::load` would downcast the wrong state and panic:
```
panicked at rust/lance-encoding/src/encodings/logical/primitive/blob.rs:335:14:
called `Result::unwrap()` on an `Err` value: Any { .. }
```
This change mixes a Debug-formatted representation of the target field's
`DataType` into the cache key. The two blob views now get their own
entries while non-blob columns keep sharing as before (their `DataType`
is invariant for a given field).
The accompanying regression test
(`test_blob_cache_key_distinguishes_views`) runs three back-to-back
scans on a single `Arc<Dataset>` — descriptor view, bytes view,
descriptor view — and asserts they all succeed and return the expected
bytes. Without the fix the second scan panics; with it the test passes.
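The cache-key change can be illustrated with a toy cache. This is a sketch under assumptions: `DataType`, `CacheKey`, `DescriptorState`, `BytesState`, and `lookup` are hypothetical stand-ins for the real encoding types, and the cache is a plain `HashMap` of type-erased values.

```rust
use std::any::Any;
use std::collections::HashMap;
use std::sync::Arc;

// Toy data type enum; not Arrow's `DataType`.
#[derive(Debug, Clone)]
enum DataType {
    UInt64,
    LargeBinary,
    Struct(Vec<(String, DataType)>),
}

// The key carries the column index plus a Debug-formatted data type,
// so two views of one physical column get separate entries.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CacheKey {
    column_index: u32,
    data_type: String,
}

impl CacheKey {
    fn new(column_index: u32, data_type: &DataType) -> Self {
        Self { column_index, data_type: format!("{:?}", data_type) }
    }
}

// Hypothetical cached page states for the two blob views.
struct DescriptorState(Vec<(u64, u64)>);
struct BytesState(Vec<u8>);

type Cache = HashMap<CacheKey, Arc<dyn Any + Send + Sync>>;

/// Typed lookup: returns None on a miss or when the stored type differs.
fn lookup<T: Any + Send + Sync>(cache: &Cache, key: &CacheKey) -> Option<Arc<T>> {
    cache.get(key).cloned()?.downcast::<T>().ok()
}

fn descriptor_type() -> DataType {
    DataType::Struct(vec![
        ("position".to_string(), DataType::UInt64),
        ("size".to_string(), DataType::UInt64),
    ])
}

fn main() {
    let mut cache: Cache = HashMap::new();
    let descriptor_key = CacheKey::new(0, &descriptor_type());
    let bytes_key = CacheKey::new(0, &DataType::LargeBinary);

    // The descriptor view populates the cache first.
    cache.insert(descriptor_key.clone(), Arc::new(DescriptorState(vec![(0, 16)])));

    // The bytes view misses cleanly instead of hitting the wrong state.
    println!("bytes view hit: {}", lookup::<BytesState>(&cache, &bytes_key).is_some());

    cache.insert(bytes_key.clone(), Arc::new(BytesState(vec![1, 2, 3])));
    let descriptor = lookup::<DescriptorState>(&cache, &descriptor_key).unwrap();
    let bytes = lookup::<BytesState>(&cache, &bytes_key).unwrap();
    println!("descriptor entries: {}, bytes: {}", descriptor.0.len(), bytes.0.len());
}
```

With a key made of `column_index` alone, the second reader would find the first reader's entry and the downcast to its own state type would fail, which corresponds to the panic shown above.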
Co-authored-by: Vova Kolmakov <wombatukun@apache.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
…ce-format#6805) ## Summary The `shared-memory://` object store provider (added in lance-format#6753) was gated behind `#[cfg(test)]`, so it only compiled for this project's own test builds. Downstream crates and integration harnesses depending on `lance-io` could never resolve the scheme — which defeats the purpose of having a shareable, cross-component in-memory store. This removes the `#[cfg(test)]` gates from the four production sites required for the provider to function: - **`providers.rs`** — module declaration + registration in `ObjectStoreRegistry::default()` - **`object_store.rs`** — `is_cloud()` classification (folded into the existing local/`memory` check) - **`commit.rs`** — commit-handler routing to `ConditionalPutCommitHandler` (folded into the existing cloud-scheme arm) The provider stays strictly **opt-in**: callers must explicitly use the `shared-memory://` scheme, so existing `memory://` per-call isolation is unchanged. The in-module `#[cfg(test)] mod tests` remains test-only. ## Testing - `cargo check -p lance-io -p lance-table` — clean - `cargo clippy -p lance-io -p lance-table --tests -- -D warnings` — clean - `cargo test -p lance-io shared_memory` — 5/5 pass - `cargo test -p lance-table test_commit_handler_from_url_memory_schemes` — pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

This PR adds a Lance-native in-memory HNSW implementation for MemWAL and wires it into the async `ShardWriter` path.

The benchmark shape follows @jackye1995's suggestion to use a FineWeb-like baseline: a text payload around 1.5 KiB plus a 1024-dim `f32` vector column, or about 5760 bytes per row.

I used hnswlib's construction hot path as a reference while adapting the implementation to Lance MemWAL's requirements: multi-reader/single-writer access, reader-visible publication during writes, and reuse of the vector data already held by the MemTable instead of copying vectors into a separate HNSW-owned buffer.

Main changes:

- add `rust/lance/src/dataset/mem_wal/hnsw/` with a MemWAL-oriented HNSW graph and Arrow-backed vector storage
- update `mem_wal/index/hnsw.rs` to use the new graph for active MemTable vector search
- add WAL queue stats so benchmarks can distinguish WAL flush lag from final close/drain work
- add native Rust benchmarks under `rust/lance/benches/`, including side-by-side HNSW/hnswlib comparison scaffolding and the shard-writer WAL backpressure benchmark
- add `--schema-shape fineweb|vector_only` to the native shard-writer benchmark so we can run both the FineWeb-shaped case and the older small `id + 1024-dim vector` case requested in review

## Benchmark Summary So Far

Baseline before this work was roughly:

- safe durable async+index throughput: ~3.66 MB/s (`batch=50`, 512 KiB WAL, no backpressure)
- previous peak with manageable backpressure: ~6.17 MB/s (`batch=100`, 2 MiB WAL, ~24s drain)
- bottleneck identified as active in-memory HNSW insert throughput, not S3 bandwidth

Best current WAL-queue results on `c7i.16xlarge`, S3 bucket `jack-devland-build`, FineWeb-shaped rows, `--skip-close`, and explicit WAL queue stats:

| batch | WAL | target rows/s | actual rows/s | MB/s | final WAL queue | max WAL queue | thread setting |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| 10 | 16 MiB | 8000 | 8000 | 46.08 | 0.310% | 1.526% | rayon=48, tokio=16 |
| 100 | 8 MiB | 6000 | 6000 | 34.56 | 0.130% | 1.070% | rayon=64, tokio=16 |
| 1000 | 8 MiB | 8000 | 8000 | 46.08 | 0.200% | 1.500% | rayon=64, tokio=8 |

Interpretation: for these paced single-shard runs, WAL flush can keep up at 34-46 MB/s without accumulating material WAL backlog. This is about 7.5x over the previous 6.17 MB/s peak and about 12.6x over the earlier 3.66 MB/s no-backpressure point on the same FineWeb-shaped workload.

Thread ablation suggests high Tokio worker counts are not required for this path; the important CPU knob is the Rayon pool used by async index construction. Moderate Tokio settings (`4-16`) were enough in the tested single-writer workload.

Small-schema follow-up for @jackye1995's comment, using `--schema-shape vector_only` (`id + 1024-dim f32 vector`, default row estimate 4160 bytes) on the 12xlarge EC2 runner, S3 bucket `jack-devland-build`, `--skip-close`, and the same WAL-queue stats:

| batch | WAL | target rows/s | actual rows/s | MB/s | p99 ms | final WAL queue | max WAL queue | thread setting |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| 10 | 16 MiB | 8000 | 7999.855 | 33.279 | 0.020 | 0.481% | 1.998% | rayon=64, tokio=64 |
| 10 | 16 MiB | 11000 | 10999.782 | 45.759 | 0.019 | 0.481% | 2.495% | rayon=64, tokio=64 |
| 100 | 8 MiB | 6000 | 5999.954 | 24.960 | 0.042 | 0.340% | 1.400% | rayon=64, tokio=32 |
| 100 | 8 MiB | 9000 | 8999.850 | 37.439 | 0.045 | 0.340% | 2.100% | rayon=64, tokio=32 |
| 1000 | 8 MiB | 8000 | 7999.913 | 33.280 | 0.084 | 0.100% | 1.800% | rayon=64, tokio=8 |
| 1000 | 8 MiB | 11000 | 10999.909 | 45.760 | 0.154 | 4.600% | 4.600% | rayon=64, tokio=8 |

This is not a full max-throughput sweep for the small schema, but it confirms the benchmark can run the requested older shape and that it reaches the same ~45.8 MB/s region at 11k rows/s, with the `batch=1000` high-rate case starting to show more WAL queue backlog than the smaller batches.
## Analysis Artifacts Saved local analysis: - `/Users/jackye/ai/analysis/lance/jack-mem-wal-hnsw/wal-queue-all-runs-20260515/RESULTS.md` - `/Users/jackye/ai/analysis/lance/jack-mem-wal-hnsw/thread-ablation-20260515/RESULTS.md` - `/Users/jackye/ai/analysis/lance/jack-mem-wal-hnsw/vector-only-pr-20260515/` S3 result artifacts: - `s3://jack-devland-build/memwal-walqueue-panel-20260514T171240Z/_bench_results/` - `s3://jack-devland-build/memwal-walqueue-supplement-20260514T182826Z/_bench_results/` - `s3://jack-devland-build/memwal-walqueue-higher-20260515T001817Z/_bench_results/` - `s3://jack-devland-build/memwal-thread-ablation-20260515T060229Z/_bench_results/` - `s3://jack-devland-build/memwal-vector-only-pr-20260515T090611Z/_bench_results/` ## Validation - `cargo fmt --all --check` - `cargo fmt --manifest-path python/Cargo.toml --all --check` - `cargo check -p lance --bench mem_wal_shard_writer_backpressure` - `cargo check --manifest-path python/Cargo.toml --lib` - `cargo clippy -p lance --tests -- -D warnings` - `cargo test -p lance dataset::mem_wal::index::hnsw::tests::test_index_concurrent_insert_and_search -- --exact` cc @hamersaw --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
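The throughput and speedup figures in the benchmark summary follow from the stated row sizes. The sketch below redoes that arithmetic, assuming MB means 10^6 bytes (which matches the reported numbers); `throughput_mb_per_s` and `speedup` are hypothetical helper names.

```rust
/// Throughput in MB/s (1 MB = 10^6 bytes, an assumption that matches
/// the reported figures) for a given row rate and row size.
fn throughput_mb_per_s(rows_per_s: f64, bytes_per_row: f64) -> f64 {
    rows_per_s * bytes_per_row / 1_000_000.0
}

/// Ratio of a new throughput to a baseline throughput.
fn speedup(new_mb_per_s: f64, baseline_mb_per_s: f64) -> f64 {
    new_mb_per_s / baseline_mb_per_s
}

fn main() {
    // FineWeb-shaped rows: about 5760 bytes per row.
    let fineweb_8k = throughput_mb_per_s(8000.0, 5760.0);
    let fineweb_6k = throughput_mb_per_s(6000.0, 5760.0);
    println!("fineweb @ 8000 rows/s: {:.2} MB/s", fineweb_8k);
    println!("fineweb @ 6000 rows/s: {:.2} MB/s", fineweb_6k);

    // vector_only rows: default estimate of 4160 bytes per row.
    let vector_11k = throughput_mb_per_s(11000.0, 4160.0);
    println!("vector_only @ 11000 rows/s: {:.2} MB/s", vector_11k);

    // Speedups over the two baselines quoted in the summary.
    println!("vs 6.17 MB/s peak: {:.1}x", speedup(fineweb_8k, 6.17));
    println!("vs 3.66 MB/s point: {:.1}x", speedup(fineweb_8k, 3.66));
}
```

The 8000 rows/s FineWeb case works out to 46.08 MB/s, the 6000 rows/s case to 34.56 MB/s, and the 11000 rows/s vector-only case to 45.76 MB/s, consistent with the tables.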
PEP 735 dependency-groups in pyproject.toml cause maturin to pass --group to pip, but the venv-created pip (24.0 bundled with Python 3.11) does not support this flag. Switch to uv venv + maturin develop --uv.
No description provided.