ci: use uv in lint job build step #1

Open
geruh wants to merge 28 commits into main from pydev-ci-fix
geruh commented May 18, 2026

No description provided.

Xuanwo and others added 23 commits May 14, 2026 04:48
…siblings (lance-format#6679)

## Summary

`Field::apply_projection` special-cases `Map` types by cloning all
children unconditionally (to preserve the entries-struct `Struct<key,
value>` physical layout), but applies that special case **before**
checking whether the parent Map is in the projection. Result: every
`Map` column survives every projection regardless of what the caller
asked for.

On real workloads this turns a narrow scan into a wide one: a
scalar-index training scan over an `Int32` column on a table with
sibling `Map<…, binary>` columns ends up feeding all of them into
`SortExec` and blowing external-sort spill from a few GiB to >100 GiB.

## Fix

One early return: if the parent Map isn't selected, drop the subtree the
same way every other non-selected leaf is dropped.

```rust
if self.logical_type.is_map() && !projection.contains_field_id(self.id) {
    return None;
}
```

The clone-children branch is unchanged for the case where the parent
_is_ selected.
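
The effect of the early return can be illustrated with a toy model. This is a minimal sketch, not the Lance implementation: the `Field` struct, its `is_map` flag, the `HashSet<i32>` projection, and the `map_field` helper are all hypothetical simplifications of the real schema types.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a schema field.
#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub id: i32,
    pub is_map: bool,
    pub children: Vec<Field>,
}

impl Field {
    /// Returns the projected field, or `None` if nothing under it is selected.
    pub fn apply_projection(&self, selected: &HashSet<i32>) -> Option<Field> {
        // The early return: an unselected Map is dropped like any other field.
        if self.is_map && !selected.contains(&self.id) {
            return None;
        }
        if self.is_map {
            // Selected Map: keep every child so the entries struct stays intact.
            return Some(self.clone());
        }
        let children: Vec<Field> = self
            .children
            .iter()
            .filter_map(|c| c.apply_projection(selected))
            .collect();
        if selected.contains(&self.id) || !children.is_empty() {
            Some(Field { id: self.id, is_map: false, children })
        } else {
            None
        }
    }
}

/// Builds a toy Map field: one entries struct holding a key and a value leaf.
pub fn map_field(id: i32) -> Field {
    let leaf = |leaf_id: i32| Field { id: leaf_id, is_map: false, children: vec![] };
    let entries = Field {
        id: id + 1,
        is_map: false,
        children: vec![leaf(id + 2), leaf(id + 3)],
    };
    Field { id, is_map: true, children: vec![entries] }
}
```

In this model, projecting a root that holds one `Int32` leaf and two Map siblings onto only the leaf's id yields a single child; without the first `if`, both Map subtrees would survive.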

## Test

`scan_training_data_does_not_pull_unrelated_map_siblings` in
`rust/lance/src/index/scalar.rs`: builds a 4-column dataset (`indexed:
Int32` + three `Map` siblings), trains a BTree on `indexed`, asserts the
scan plan's `LanceRead` projection is exactly `[indexed]`. Pins
`LanceFileVersion::V2_2` since `Map` requires 2.2+.

## Notes

- No behavior change for projections that include the Map column, or for
non-Map types.
- Cherry-picks cleanly onto `v4.0.x`: `Field::apply_projection` is
identical between `main` and `v4.0.0`.
- The test was generated by Claude.
…format#6674)

## Changes

Adding `equals()` and `hashCode()` to `MatchQuery`, `PhraseQuery`, and
`MultiMatchQuery` in `FullTextQuery`.

## Why

The concrete subclasses currently inherit reference equality from
`Object`. Two independently constructed instances with identical fields
are not `equals()` and produce different hash codes.

This was surfaced while trying to sketch an impl for SQL FTS query
support in lance-spark, which allows users to push `lance_match`,
`lance_phrase`, and `lance_multi_match` predicates down to the Lance FTS
inverted index at query planning time. That work requires transporting
`FullTextQuery` instances across the driver -> executor boundary,
serializing them to a stable intermediate form and reconstructing them
on the other side. Without structural equality, comparing a
reconstructed instance to the original always returns `false` even when
every field is identical, making correct scan identity checks
impossible.

More broadly, any consumer that serializes a `FullTextQuery` and
reconstructs it later faces the same problem. The fix is mechanical and
non-breaking: callers that never compared these objects structurally are
unaffected.
…-format#6704)

The JNI vector trainer hardcoded `MetricType::L2` in both
`inner_train_ivf_centroids` and `inner_train_pq_codebook`, ignoring the
metric the caller intends to use when later building per-fragment index
segments. For non-L2 metrics (cosine, dot, hamming) this silently
trained centroids and PQ codebooks on L2 geometry while encoding
happened against the user's actual metric, producing degraded recall
with no error or warning.

Add `DistanceType distanceType` as a parameter to
`VectorTrainer.trainIvfCentroids` and `trainPqCodebook` (mirroring the
shape of the underlying Rust `build_ivf_model` / `build_pq_model`, which
take `metric_type` separately from `IvfBuildParams` / `PQBuildParams`).
The metric belongs at the function call site, not on the
algorithm-config struct — the param structs stay isomorphic to their
Rust counterparts and there is a single source of truth for the metric
per training call.

Backward-compat overloads default `distanceType` to `DistanceType.L2` to
preserve the existing call sites. The native methods take the distance
type as a separate `String` argument and parse it via
`MetricType::try_from`.

Add regression tests `testTrainIvfCentroidsHonorsDistanceType` and
`testTrainPqCodebookHonorsDistanceType` that train the same dataset
twice with L2 and Cosine and assert the resulting arrays differ. With
the bug present, the two arrays were identical because both paths fell
through to L2.
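
Why the metric matters can be seen in a small, self-contained sketch. It does not use the Lance trainer; the `Metric` enum and `nearest_centroid` function are hypothetical, and the example only shows that L2 and cosine can pick different nearest centroids for the same vector, which is the kind of mismatch the hardcoded metric introduced.

```rust
/// Hypothetical metric enum for illustration; not the Lance `MetricType`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Metric {
    L2,
    Cosine,
}

/// Squared Euclidean distance.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Cosine distance, defined here as 1 - cosine similarity.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Index of the centroid closest to `v` under `metric`.
/// Assumes `centroids` is non-empty.
pub fn nearest_centroid(v: &[f32], centroids: &[Vec<f32>], metric: Metric) -> usize {
    let dist = |c: &Vec<f32>| match metric {
        Metric::L2 => l2_sq(v, c),
        Metric::Cosine => cosine_distance(v, c),
    };
    let mut best = 0;
    for i in 1..centroids.len() {
        if dist(&centroids[i]) < dist(&centroids[best]) {
            best = i;
        }
    }
    best
}
```

With `v = [3.0, 0.5]` and centroids `[10.0, 1.0]` and `[3.0, 2.0]`, L2 selects the second centroid (it is closer in space) while cosine selects the first (it points in nearly the same direction).
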
Update rewrites affected rows using the dataset physical schema. Avoid
wrapping the scan output in DatasetRecordBatchStream, which may convert
internal JSON columns from lance.json/LargeBinary to arrow.json/Utf8 for
user-facing reads and cause schema mismatches during rewrite. Add
coverage for updating both regular and JSON columns.

Error message:
```
Traceback (most recent call last):
  File "/Users/bytedance/Project/emr/jinglun-lance-hello-python/main.py", line 298, in <module>
    chunsheng_debug()
    ~~~~~~~~~~~~~~~^^
  File "/Users/bytedance/Project/emr/jinglun-lance-hello-python/main.py", line 291, in chunsheng_debug
    ds.update({'speaker_id': '"SPEAKER_9172"'}, where="name='沈逸'")
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bytedance/miniconda3/lib/python3.13/site-packages/lance/dataset.py", line 2577, in update
    return self._ds.update(updates, where, conflict_retries, retry_timeout)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: Encountered internal error. Please file a bug report at https://github.com/lance-format/lance/issues. Expected schema Schema { fields: [Field { name: "id", data_type: Int64, nullable: true }, Field { name: "users_id", data_type: Int64, nullable: true }, Field { name: "user_id", data_type: Utf8, nullable: true }, Field { name: "name", data_type: Utf8, nullable: true }, Field { name: "files", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:metadata": "", "ARROW:extension:name": "lance.json"} }, Field { name: "user_tags", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:name": "lance.json", "ARROW:extension:metadata": ""} }, Field { name: "output_text", data_type: Utf8, nullable: true }, Field { name: "ai_tags", data_type: LargeBinary, nullable: true, metadata: {"ARROW:extension:name": "lance.json", "ARROW:extension:metadata": ""} }, Field { name: "created_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "updated_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "del_flag", data_type: Int32, nullable: true }, Field { name: "speaker_id", data_type: Utf8, nullable: true }], metadata: {} } but got Schema { fields: [Field { name: "id", data_type: Int64, nullable: true }, Field { name: "users_id", data_type: Int64, nullable: true }, Field { name: "user_id", data_type: Utf8, nullable: true }, Field { name: "name", data_type: Utf8, nullable: true }, Field { name: "files", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:metadata": "", "ARROW:extension:name": "arrow.json"} }, Field { name: "user_tags", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:name": "arrow.json", "ARROW:extension:metadata": ""} }, Field { name: "output_text", data_type: Utf8, nullable: true }, Field { name: "ai_tags", data_type: Utf8, nullable: true, metadata: {"ARROW:extension:name": "arrow.json", "ARROW:extension:metadata": ""} }, 
Field { name: "created_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "updated_time", data_type: Timestamp(Microsecond, Some("Asia/Shanghai")), nullable: true }, Field { name: "del_flag", data_type: Int32, nullable: true }, Field { name: "speaker_id", data_type: Utf8, nullable: true }], metadata: {} }, /Users/runner/work/lance/lance/rust/lance/src/dataset/write/update.rs:274:24

Process finished with exit code 1
```

Closes lance-format#6329
…e-format#6749)

## Summary

- Adds `LsmScanner::without_base_table(schema, base_path, snapshots,
pk_columns)` and `LsmDataSourceCollector::without_base_table(base_path,
snapshots)` so callers can scan only the active memtable and flushed L0
generations.
- Internally, `base_table` becomes `Option<Arc<Dataset>>`; `collect()` /
`collect_for_shards()` skip the base source when it is absent. All three
planners (scan, point lookup, vector search) are unaffected since they
already drive off the collector and keep their own `base_schema` field.
- Existing `LsmScanner::new(...)` and `LsmDataSourceCollector::new(...)`
signatures are unchanged, so Python bindings and benches keep compiling
without edits.
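
The optional-base pattern described above might look roughly like the following. This is a sketch under stated assumptions: `Dataset`, `Source`, and `Collector` are hypothetical toy types, not the real `LsmDataSourceCollector`, and the constructor arguments are simplified.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a dataset handle.
#[derive(Debug)]
pub struct Dataset {
    pub name: String,
}

/// Hypothetical description of one data source feeding a scan.
#[derive(Debug, PartialEq)]
pub enum Source {
    Base(String),
    Flushed(u64),
    ActiveMemtable,
}

/// Toy collector whose base table is optional.
pub struct Collector {
    base_table: Option<Arc<Dataset>>,
    flushed_generations: Vec<u64>,
    has_active_memtable: bool,
}

impl Collector {
    /// Existing-style constructor: the base table is always present.
    pub fn new(base: Arc<Dataset>, flushed: Vec<u64>, has_active: bool) -> Self {
        Self { base_table: Some(base), flushed_generations: flushed, has_active_memtable: has_active }
    }

    /// New-style constructor: only flushed generations and the active memtable.
    pub fn without_base_table(flushed: Vec<u64>, has_active: bool) -> Self {
        Self { base_table: None, flushed_generations: flushed, has_active_memtable: has_active }
    }

    /// Collects sources, skipping the base source when it is absent.
    pub fn collect(&self) -> Vec<Source> {
        let mut sources = Vec::new();
        if let Some(base) = &self.base_table {
            sources.push(Source::Base(base.name.clone()));
        }
        sources.extend(self.flushed_generations.iter().copied().map(Source::Flushed));
        if self.has_active_memtable {
            sources.push(Source::ActiveMemtable);
        }
        sources
    }
}
```

Because downstream planners only consume what `collect()` returns, the same code path serves both modes in this sketch.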

## Motivation

Some callers own the base read path elsewhere and only need the WAL's
contribution to a query — the active memtable ∪ L0 flushed generations.
Without this change, `LsmScanner` always pulls the base dataset in,
which duplicates work and forces those callers to subtract the base
contribution out of band. With the new constructor, the same scanner
code path serves both modes; dedup semantics across generations are
unchanged.

## Test plan

- [x] `cargo test -p lance --lib dataset::mem_wal::scanner`: 55/55
passing, including these new tests:
  - `test_lsm_scan_without_base_table`: verifies the plan excludes any
  base-table `LanceRead` and that ids only present in the base are absent
  from the result.
  - `test_lsm_scan_without_base_table_no_flushed_no_active`: empty fresh
  tier returns an empty plan cleanly.
  - `test_point_lookup_without_base_table`: point lookup hits a flushed
  generation; missing key returns empty.
  - `test_vector_search_without_base_table`: plan construction succeeds
  and excludes the base.
- [x] `cargo clippy -p lance --tests -- -D warnings` clean
- [x] `cargo fmt -p lance` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ance-format#6566)

## Summary
- refactor the Lance format specification so index formats are a
top-level section
- split catalog specs and namespace client spec into distinct sections
in the published docs
- move full website assembly into reusable docs scripts and CI wiring
- keep the docs buildable with placeholders when external repos are
absent

Namespace PR: lance-format/lance-namespace#326
Vote: lance-format#6567
Preview: https://jackye1995.github.io/lance/
…format#6772)

## Summary
Adds `Dataset.takeRows(List<Long> rowIds, List<String> columns)` to the
Java SDK. A new JNI binding mirrors Rust's existing
`Dataset::take_rows()`.

`take()` accepts logical row indices (positions). `takeRows()` accepts
physical row IDs from the `_rowid` system column — these are stable
across compaction and deletion, which makes them suitable for
applications that store row IDs externally (e.g., in a secondary index)
and later need to fetch the corresponding row data.

## Changes
- `java/lance-jni/src/blocking_dataset.rs`: JNI shim `nativeTakeRows` +
`inner_take_rows` — mirrors `inner_take` but calls
  `dataset.take_rows()` instead of `dataset.take()`
- `java/src/main/java/org/lance/Dataset.java`: public `takeRows()`
method + native declaration
- `java/src/test/java/org/lance/DatasetTest.java`: `testTakeRows`
covering correctness, input-order preservation, and edge cases
(empty/null rejection)

---------

Co-authored-by: Alexandra Li <alexandra.li@databricks.com>
…tate (lance-format#6753)

## Summary

The existing `memory://` object store provider returns a fresh
`InMemory` backend on every `new_store` call, which makes it impossible
for independent components in the same process to coordinate through a
shared in-memory store — each writes to its own isolated bytes even when
the URL matches.

This PR adds a sibling `shared-memory://` scheme that wraps
`MemoryStoreProvider` and swaps the inner backend for a process-global
`Arc<InMemory>` keyed by URL authority.

- All `shared-memory://bucket-a/...` URLs resolve to the same underlying
`InMemory` backend (the URL path is the object key within that backend),
so a write to `shared-memory://bucket-a/x` from one component is visible
to another component reading `shared-memory://bucket-a/x` — across
`ObjectStoreRegistry` instances, threads, and unrelated callers in the
same process.
- `shared-memory://bucket-a/...` and `shared-memory://bucket-b/...`
resolve to separate backends and remain isolated.
- `memory://` is **unchanged**, so existing tests that rely on per-call
isolation are unaffected.

The cache lives in a process-global `LazyLock<Mutex<HashMap<String,
Arc<InMemory>>>>` rather than on the provider instance because
`ObjectStore::from_uri` constructs a fresh `ObjectStoreRegistry` per
call — a per-provider cache wouldn't actually share state across
components.
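
A process-global cache keyed by URL authority could be sketched as follows. This is an illustration only: the toy `InMemory` type, the `shared_store` and `fresh_store` functions, and the `SHARED` static are hypothetical and do not reproduce the `object_store` backend or the Lance provider API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, Mutex};

/// Hypothetical toy backend mapping object keys to bytes.
#[derive(Default)]
pub struct InMemory {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl InMemory {
    pub fn put(&self, key: &str, bytes: &[u8]) {
        self.objects.lock().unwrap().insert(key.to_string(), bytes.to_vec());
    }

    pub fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.lock().unwrap().get(key).cloned()
    }
}

/// Process-global registry: one backend per URL authority.
static SHARED: LazyLock<Mutex<HashMap<String, Arc<InMemory>>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// `shared-memory://`-style lookup: the same authority yields the same backend.
pub fn shared_store(authority: &str) -> Arc<InMemory> {
    let mut stores = SHARED.lock().unwrap();
    stores.entry(authority.to_string()).or_default().clone()
}

/// `memory://`-style lookup: every call yields a fresh, isolated backend.
pub fn fresh_store() -> Arc<InMemory> {
    Arc::new(InMemory::default())
}
```

Keeping the map in a `static` rather than on a provider instance is what lets two callers that each build their own registry still reach the same bytes.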

### Why a new scheme rather than caching `memory://`

`"memory://"` (no path) is reused as a generic "give me a temp store"
URL across ~70+ test sites in the codebase (e.g. `Dataset::write(data,
"memory://", ...)`). With `cargo test` running in parallel, adding
caching to `memory://` would cause cross-test contamination at the
dataset root. Opt-in via a new scheme avoids any migration churn.

### Use case

Tests and harnesses that need multiple actors to coordinate through a
common in-memory object store — e.g. a writer and an independent reader,
multi-pod fence simulations, manifest-store + WAL-writer agreement on
the same backing bytes.

## Test plan

- [x] `cargo test -p lance-io --lib shared_memory` — 5 new tests pass
(cross-registry sharing, per-authority isolation, path extraction,
prefix uniqueness, end-to-end via `from_uri_and_params`)
- [x] `cargo test -p lance-io --lib memory` — existing `memory://` tests
still pass
- [x] `cargo clippy -p lance-io --tests -- -D warnings` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
)

## Summary

Adds three opt-in `ShardWriter` APIs that let callers (tests, custom
compaction loops, supervised flush flows) drive memtable flushing and
fence detection synchronously rather than waiting on automatic triggers
or the next manifest round-trip.

- `force_seal_active()` — freezes the active memtable and enqueues it
for L0 flush. No-op on an empty memtable; errors in WAL-only mode.
- `wait_for_flush_drain()` — blocks until every frozen memtable in the
L0 flush queue has landed and been recorded in the manifest. Pair with
`force_seal_active` for "everything-on-disk" semantics. Does not wait on
the active memtable.
- `check_fenced()` — surfaces a successor writer's higher-epoch claim
immediately, without waiting for the next manifest put.

All three are pure additions on `ShardWriter`; no existing behavior
changes.

## Test plan

- [x] `cargo check -p lance --tests`
- [x] `cargo clippy -p lance --tests -- -D warnings`
- [x] New `test_force_seal_active_and_wait_for_flush_drain` verifies
seal + drain advances the memtable generation and appends a flushed
generation to the manifest
- [x] New `test_check_fenced_detects_successor_claim` verifies fence is
observable from the loser writer before any subsequent put

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ormat#6786)

Updated filename suffix from '.lance' to '.arrow' for WAL entries.

## Summary
- Add `DeleteBuilder::execute_uncommitted()` so deletes can be staged
and committed later with `CommitBuilder`.
- Return an `UncommittedDelete` wrapper containing the delete
`Transaction`, `affected_rows`, and `num_deleted_rows` so staged commits
can preserve row-level conflict rebasing.
- Factor delete transaction construction into a helper shared by
committed and uncommitted delete paths.

## Test plan
- `cargo fmt --all`
- `cargo test -p lance
dataset::write::delete::tests::test_delete_execute_uncommitted_preserves_affected_rows_for_rebase
--lib`
- `cargo test -p lance
dataset::write::delete::tests::test_delete_false_predicate_still_commits
--lib`
- `cargo test -p lance
dataset::write::delete::tests::test_concurrent_delete_with_retries
--lib`
- `cargo test -p lance
dataset::write::delete::tests::test_delete_concurrency --lib`
- `cargo test -p lance dataset::write::delete::tests --lib`

Fixes lance-format#6658.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
…es (lance-format#6767)

## Summary

After a writer flushed a memtable to L0 and an external compactor merged
that generation into the base table — legitimately draining
`flushed_generations` to empty — a subsequent restart re-replayed the
original WAL entries into the new active memtable, duplicating rows on
read.

Two bugs were interacting:

1. **Disambiguation:** `replay_memtable_from_wal` distinguished "fresh
shard" from "flushed and compacted" via
`flushed_generations.is_empty()`. That works in a closed-world
deployment but breaks the moment an external compactor enters the
picture — and the compactor is the *intended* consumer that drains that
vector, so the signal is structurally broken under OSS-WAL.

2. **Cursor never advanced:** `MemTableFlusher::flush` read
`covered_wal_entry_position` from
`memtable.last_flushed_wal_entry_position()`, but that field is only set
by the `mark_wal_flushed` test helper. In production it stayed at 0, so
`replay_after_wal_entry_position` never advanced past 0. Under 0-based
WAL positions this masked bug #1 — both "fresh" and "post-flush-of-0"
produced cursor=0.

## Fix

- **WAL positions are now 1-based** (`FIRST_WAL_ENTRY_POSITION = 1`). A
cursor of `0` unambiguously means "no flush has stamped this shard," so
replay collapses to `cursor.saturating_add(1)` without consulting
`flushed_generations`.
- **`WalFlushHandler::handle`** writes the just-appended position back
into `state.last_flushed_wal_entry_position` under the state lock before
signalling the completion cell.
- **`MemTableFlusher::flush` / `flush_with_indexes`** now take an
explicit `covered_wal_entry_position` arg. The production caller derives
it per-memtable from the `WalFlushResult` carried in the completion cell
— authoritative under concurrent flushes — falling back to
`memtable.frozen_at_wal_entry_position()` when freeze did not trigger a
flush.
- **State seed at open** uses the post-replay WAL tip, not
`manifest.wal_entry_position_last_seen` (the latter is bumped on every
tailer read and can sit above any flushed generation).
- Proto field docs on `ShardManifest.replay_after_wal_entry_position` /
`wal_entry_position_last_seen` updated to spell out the 1-based
convention and what default-0 means.
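
The 1-based cursor arithmetic can be sketched in a few lines. `FIRST_WAL_ENTRY_POSITION` and the `saturating_add(1)` step come from the description above; the `replay_start` and `positions_to_replay` helpers, and the convention that a WAL tip of `0` means an empty WAL, are assumptions made for illustration.

```rust
/// First valid WAL position under the 1-based convention.
pub const FIRST_WAL_ENTRY_POSITION: u64 = 1;

/// First position to replay, given the manifest cursor.
/// A cursor of 0 means no flush has stamped this shard yet.
pub fn replay_start(replay_after_wal_entry_position: u64) -> u64 {
    replay_after_wal_entry_position.saturating_add(1)
}

/// Positions to replay given the cursor and the last written position.
/// Assumes `wal_tip == 0` denotes an empty WAL (hypothetical convention).
pub fn positions_to_replay(cursor: u64, wal_tip: u64) -> Vec<u64> {
    (replay_start(cursor)..=wal_tip).collect()
}
```

A fresh shard (cursor `0`) replays from position `1`, while a shard whose cursor equals the WAL tip replays nothing, with no need to inspect `flushed_generations`.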

## Test plan

- [x] Added
`test_memtable_replay_skips_entries_after_external_compaction` in
`rust/lance/src/dataset/mem_wal/write.rs`: open writer, put rows, close
(flush), simulate the compactor by directly committing a manifest with
empty `flushed_generations`, reopen, assert the memtable is empty. Fails
on the pre-fix code; passes now.
- [x] `cargo test -p lance --lib dataset::mem_wal` — 236/236 pass
- [x] `cargo test -p lance --lib` — 1600/1600 pass
- [x] `cargo test -p lance-index --lib` — 302/302 pass
- [x] `cargo clippy --all --tests --benches -- -D warnings` — clean
- [x] `cargo fmt --all -- --check` — clean

## Compatibility

WAL position numbering changes from 0-based to 1-based. Existing on-disk
manifests / WAL files written by the prior `oss-wal-multiplex` code are
not migrated — coordinated with downstream consumers (sophon) to start
fresh.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urces (lance-format#6761)

## Summary

`LsmVectorSearchPlanner::plan_search` produced wrong or no output
whenever more than one LSM source contributed candidates. Three bugs
combined to swallow the `_distance` column:

1. **Schema mismatch panic.** The base/flushed arm calls
`scanner.fast_search()`, which forces `_rowid` into Lance's projection
(`[id, vector, _rowid, _distance]`). The active-memtable arm produces
`[id, vector, _distance]`. After `MemtableGenTagExec` adds
`_memtable_gen`, `UnionExec` rejects the 5-vs-4 mismatch and the query
panics before any `_distance` reaches the caller.
2. **Silent partition loss.** `UnionExec` emits one partition per input
source. Both `SortExec` and `FilterStaleExec` only read partition 0 of
their multi-partition input, so KNN hits (and their `_distance` values)
from every source past the first silently vanished.
3. **Internal columns leaked.** `MemtableGenTagExec` was wrapped
unconditionally, so `_memtable_gen` ended up in user-visible output even
when no bloom filters were configured.

### Fixes

- Add `project_to_canonical` to normalize every KNN source to
`[projection..., _distance]` before union. This drops `_rowid` from the
base/flushed arms so all inputs share one schema.
- Insert `CoalescePartitionsExec` before `SortExec`, and again before
`FilterStaleExec` on the bloom-filter path, so the merge sees every KNN
candidate.
- Only wrap with `MemtableGenTagExec` when bloom filters are configured;
project `_memtable_gen` back out after `FilterStaleExec` so the public
schema stays clean.
- Route the active arm through `build_projection_for_knn` so explicit
user projections don't drop the PK that staleness detection needs.
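
The schema-normalization step can be illustrated without DataFusion. The sketch below models a schema as a list of column names; this `project_to_canonical` is a hypothetical toy with the same name as the helper described above, not its actual signature or implementation.

```rust
/// Name of the distance column appended to every KNN source.
pub const DISTANCE_COL: &str = "_distance";

/// Normalizes a source's columns to `[projection..., _distance]`.
/// Returns `None` if the source lacks a required column.
pub fn project_to_canonical(source_columns: &[&str], projection: &[&str]) -> Option<Vec<String>> {
    let mut wanted: Vec<&str> = projection.to_vec();
    wanted.push(DISTANCE_COL);

    let mut out = Vec::with_capacity(wanted.len());
    for col in wanted {
        // Every canonical column must exist in the source.
        if !source_columns.iter().any(|c| *c == col) {
            return None;
        }
        out.push(col.to_string());
    }
    Some(out)
}
```

Applied to a base arm with `[id, vector, _rowid, _distance]` and an active arm with `[id, vector, _distance]`, both normalize to the same three columns, so a union over them no longer sees mismatched schemas.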

## Test plan

- [x] `test_vector_search_base_plus_active_returns_distance` —
end-to-end search across base + active memtable; asserts `_distance`
present, nearest neighbor has near-zero distance, and `_rowid` /
`_memtable_gen` are NOT in the output.
- [x] `test_vector_search_with_projection_returns_distance_and_pk` —
regression for the active-arm projection path; PK column auto-included
when user projects only `vector`.
- [x] `test_vector_search_with_bloom_filter_strips_memtable_gen` —
exercises the bloom-filter branch; verifies the post-`FilterStaleExec`
projection strips `_memtable_gen` and that union partitions feed
correctly into the filter.
- [x] All existing `mem_wal` tests pass.
- [x] `cargo clippy -p lance --tests --benches -- -D warnings` clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t#6740)

When stable row IDs are enabled, updated rows cannot be scanned with a
limit condition.

<img width="1160" height="418" alt="image"
src="https://github.com/user-attachments/assets/86a9dfbc-bed9-440d-b784-ecc2b638d953"
/>

<img width="1268" height="542" alt="image"
src="https://github.com/user-attachments/assets/4b338fcb-9891-4377-9fef-b2302e72dc03"
/>
## Summary

A blob column can be decoded under two distinct shapes that share the
same physical column:

- the **descriptor** view (`Struct<position, size>`) used when scanning
with `BlobHandling::BlobsDescriptions` (the default for `to_table` and
`take_blobs`);
- the **bytes** view (`LargeBinary`) used by `BlobHandling::AllBinary`
and `compact_files`.

Both paths build a `StructuralPrimitiveFieldScheduler`, but the
page-level schedulers they instantiate cache different concrete
`CachedPageData` types. The cache key (`FieldDataCacheKey`) carried only
`column_index`, so when one shape populated the cache and a second
reader on the same `Dataset` / `Session` hit it with the other shape,
`BlobPageScheduler::load` would downcast the wrong state and panic:

```
panicked at rust/lance-encoding/src/encodings/logical/primitive/blob.rs:335:14:
called `Result::unwrap()` on an `Err` value: Any { .. }
```

This change mixes a Debug-formatted representation of the target field's
`DataType` into the cache key. The two blob views now get their own
entries while non-blob columns keep sharing as before (their `DataType`
is invariant for a given field).

The accompanying regression test
(`test_blob_cache_key_distinguishes_views`) runs three back-to-back
scans on a single `Arc<Dataset>` — descriptor view, bytes view,
descriptor view — and asserts they all succeed and return the expected
bytes. Without the fix the second scan panics; with it the test passes.
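
The cache-key change can be sketched with a type-erased cache. All types here (`DescriptorState`, `BytesState`, `DataType`, `CacheKey`, `Cache`) are hypothetical stand-ins rather than the Lance or Arrow types; the point is only that adding a Debug-formatted data type to the key gives each view its own entry, so a downcast never sees the other view's state.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Hypothetical cached state for the descriptor view: (position, size) pairs.
#[derive(Debug, PartialEq)]
pub struct DescriptorState(pub Vec<(u64, u64)>);

/// Hypothetical cached state for the bytes view.
#[derive(Debug, PartialEq)]
pub struct BytesState(pub Vec<u8>);

/// Hypothetical simplified data type for the target field.
#[derive(Debug, Clone)]
pub enum DataType {
    Struct(Vec<String>),
    LargeBinary,
}

/// Cache key: column index plus a Debug-formatted data type.
#[derive(Debug, Hash, PartialEq, Eq)]
pub struct CacheKey {
    column_index: u32,
    data_type: String,
}

impl CacheKey {
    pub fn new(column_index: u32, data_type: &DataType) -> Self {
        Self { column_index, data_type: format!("{data_type:?}") }
    }
}

/// Type-erased cache; readers downcast to the state type they expect.
#[derive(Default)]
pub struct Cache {
    entries: HashMap<CacheKey, Box<dyn Any>>,
}

impl Cache {
    pub fn insert<T: Any>(&mut self, key: CacheKey, value: T) {
        self.entries.insert(key, Box::new(value));
    }

    /// Returns `None` on a miss or if the stored type is not `T`.
    pub fn get<T: Any>(&self, key: &CacheKey) -> Option<&T> {
        self.entries.get(key)?.downcast_ref::<T>()
    }
}
```

If the key held only `column_index`, the second insert would overwrite the first and a reader expecting `DescriptorState` would fail its downcast, which corresponds to the `unwrap()` panic quoted above.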

Co-authored-by: Vova Kolmakov <wombatukun@apache.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
…ce-format#6805)

## Summary

The `shared-memory://` object store provider (added in lance-format#6753) was gated
behind `#[cfg(test)]`, so it only compiled for this project's own test
builds. Downstream crates and integration harnesses depending on
`lance-io` could never resolve the scheme — which defeats the purpose of
having a shareable, cross-component in-memory store.

This removes the `#[cfg(test)]` gates from the four production sites
required for the provider to function:

- **`providers.rs`** — module declaration + registration in
`ObjectStoreRegistry::default()`
- **`object_store.rs`** — `is_cloud()` classification (folded into the
existing local/`memory` check)
- **`commit.rs`** — commit-handler routing to
`ConditionalPutCommitHandler` (folded into the existing cloud-scheme
arm)

The provider stays strictly **opt-in**: callers must explicitly use the
`shared-memory://` scheme, so existing `memory://` per-call isolation is
unchanged. The in-module `#[cfg(test)] mod tests` remains test-only.

## Testing

- `cargo check -p lance-io -p lance-table` — clean
- `cargo clippy -p lance-io -p lance-table --tests -- -D warnings` —
clean
- `cargo test -p lance-io shared_memory` — 5/5 pass
- `cargo test -p lance-table
test_commit_handler_from_url_memory_schemes` — pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

This PR adds a Lance-native in-memory HNSW implementation for MemWAL and
wires it into the async `ShardWriter` path.

The benchmark shape follows @jackye1995's suggestion to use a
FineWeb-like baseline: a text payload around 1.5 KiB plus a 1024-dim
`f32` vector column, or about 5760 bytes per row. I used hnswlib's
construction hot path as a reference while adapting the implementation
to Lance MemWAL's requirements: multi-reader/single-writer access,
reader-visible publication during writes, and reuse of the vector data
already held by the MemTable instead of copying vectors into a separate
HNSW-owned buffer.

Main changes:
- add `rust/lance/src/dataset/mem_wal/hnsw/` with a MemWAL-oriented HNSW
graph and Arrow-backed vector storage
- update `mem_wal/index/hnsw.rs` to use the new graph for active
MemTable vector search
- add WAL queue stats so benchmarks can distinguish WAL flush lag from
final close/drain work
- add native Rust benchmarks under `rust/lance/benches/`, including
side-by-side HNSW/hnswlib comparison scaffolding and the shard-writer
WAL backpressure benchmark
- add `--schema-shape fineweb|vector_only` to the native shard-writer
benchmark so we can run both the FineWeb-shaped case and the older small
`id + 1024-dim vector` case requested in review

## Benchmark Summary So Far

Baseline before this work was roughly:
- safe durable async+index throughput: ~3.66 MB/s (`batch=50`, 512 KiB
WAL, no backpressure)
- previous peak with manageable backpressure: ~6.17 MB/s (`batch=100`, 2
MiB WAL, ~24s drain)
- bottleneck identified as active in-memory HNSW insert throughput, not
S3 bandwidth

Best current WAL-queue results on `c7i.16xlarge`, S3 bucket
`jack-devland-build`, FineWeb-shaped rows, `--skip-close`, and explicit
WAL queue stats:

| batch | WAL | target rows/s | actual rows/s | MB/s | final WAL queue | max WAL queue | thread setting |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| 10 | 16 MiB | 8000 | 8000 | 46.08 | 0.310% | 1.526% | rayon=48, tokio=16 |
| 100 | 8 MiB | 6000 | 6000 | 34.56 | 0.130% | 1.070% | rayon=64, tokio=16 |
| 1000 | 8 MiB | 8000 | 8000 | 46.08 | 0.200% | 1.500% | rayon=64, tokio=8 |

Interpretation: for these paced single-shard runs, WAL flush can keep up
at 34-46 MB/s without accumulating material WAL backlog. This is about
7.5x over the previous 6.17 MB/s peak and about 12.6x over the earlier
3.66 MB/s no-backpressure point on the same FineWeb-shaped workload.
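
The throughput figures follow from rows per second times the estimated row size. The sketch below reproduces that arithmetic; the function names are hypothetical, and it assumes MB means 10^6 bytes, which matches the reported numbers.

```rust
/// Throughput in MB/s (1 MB = 1,000,000 bytes) from a row rate and row size.
pub fn throughput_mb_per_s(rows_per_s: f64, bytes_per_row: f64) -> f64 {
    rows_per_s * bytes_per_row / 1_000_000.0
}

/// Ratio of a new throughput to an old one.
pub fn speedup(new_mb_per_s: f64, old_mb_per_s: f64) -> f64 {
    new_mb_per_s / old_mb_per_s
}
```

For example, 8000 rows/s at about 5760 bytes per row gives 46.08 MB/s, and dividing that by the earlier 6.17 MB/s and 3.66 MB/s figures gives roughly 7.5x and 12.6x.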

Thread ablation suggests high Tokio worker counts are not required for
this path; the important CPU knob is the Rayon pool used by async index
construction. Moderate Tokio settings (`4-16`) were enough in the tested
single-writer workload.


Small-schema follow-up for @jackye1995's comment, using `--schema-shape
vector_only` (`id + 1024-dim f32 vector`, default row estimate 4160
bytes) on the 12xlarge EC2 runner, S3 bucket `jack-devland-build`,
`--skip-close`, and the same WAL-queue stats:

| batch | WAL | target rows/s | actual rows/s | MB/s | p99 ms | final WAL queue | max WAL queue | thread setting |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| 10 | 16 MiB | 8000 | 7999.855 | 33.279 | 0.020 | 0.481% | 1.998% | rayon=64, tokio=64 |
| 10 | 16 MiB | 11000 | 10999.782 | 45.759 | 0.019 | 0.481% | 2.495% | rayon=64, tokio=64 |
| 100 | 8 MiB | 6000 | 5999.954 | 24.960 | 0.042 | 0.340% | 1.400% | rayon=64, tokio=32 |
| 100 | 8 MiB | 9000 | 8999.850 | 37.439 | 0.045 | 0.340% | 2.100% | rayon=64, tokio=32 |
| 1000 | 8 MiB | 8000 | 7999.913 | 33.280 | 0.084 | 0.100% | 1.800% | rayon=64, tokio=8 |
| 1000 | 8 MiB | 11000 | 10999.909 | 45.760 | 0.154 | 4.600% | 4.600% | rayon=64, tokio=8 |

This is not a full max-throughput sweep for the small schema, but it
confirms the benchmark can run the requested older shape and that it
reaches the same ~45.8 MB/s region at 11k rows/s, with the `batch=1000`
high-rate case starting to show more WAL queue backlog than the smaller
batches.

## Analysis Artifacts

Saved local analysis:
-
`/Users/jackye/ai/analysis/lance/jack-mem-wal-hnsw/wal-queue-all-runs-20260515/RESULTS.md`
-
`/Users/jackye/ai/analysis/lance/jack-mem-wal-hnsw/thread-ablation-20260515/RESULTS.md`
-
`/Users/jackye/ai/analysis/lance/jack-mem-wal-hnsw/vector-only-pr-20260515/`

S3 result artifacts:
-
`s3://jack-devland-build/memwal-walqueue-panel-20260514T171240Z/_bench_results/`
-
`s3://jack-devland-build/memwal-walqueue-supplement-20260514T182826Z/_bench_results/`
-
`s3://jack-devland-build/memwal-walqueue-higher-20260515T001817Z/_bench_results/`
-
`s3://jack-devland-build/memwal-thread-ablation-20260515T060229Z/_bench_results/`
-
`s3://jack-devland-build/memwal-vector-only-pr-20260515T090611Z/_bench_results/`

## Validation

- `cargo fmt --all --check`
- `cargo fmt --manifest-path python/Cargo.toml --all --check`
- `cargo check -p lance --bench mem_wal_shard_writer_backpressure`
- `cargo check --manifest-path python/Cargo.toml --lib`
- `cargo clippy -p lance --tests -- -D warnings`
- `cargo test -p lance
dataset::mem_wal::index::hnsw::tests::test_index_concurrent_insert_and_search
-- --exact`

cc @hamersaw

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
PEP 735 `dependency-groups` in `pyproject.toml` cause maturin to pass
`--group` to pip, but the venv-created pip (24.0, bundled with Python 3.11)
does not support this flag. Switch to `uv venv` + `maturin develop --uv`.