Add hadamard rotation to vector fields by shubhamvishu · Pull Request #16092 · apache/lucene

shubhamvishu · 2026-05-21T00:32:52Z

Description

I worked with CC (Claude Code, which did a great job in all phases from initial implementation to testing) on this PR, which adds Hadamard rotation (Fast Walsh-Hadamard Transform) to vector fields (default false; configurable codec param; no codec bump required). It is inspired by @xande's TurboQuant PR (he works with me on Amazon Product Search) but is a stripped-down version that adds rotation to vectors in isolation. This addresses the 2nd item, "Implement random rotation of vectors and queries", from the Data-blind scalar quantization issue @mccullocht is working on.

I'm opening this to gather community feedback, as it shows promising recall improvements. I'd like to see whether we want to incorporate this into Lucene, reuse some of these ideas, or discard the approach if there are concerns.

This shows up to ~5-7% recall improvement in luceneutil benchmarks with Cohere V3 and Amazon's internal 4K-dim vector embeddings. The current approach rotates the incoming float vectors at insertion (so we index the vectors in rotated space in the .vec file) and the rest of the flow continues as is. Whether rotation is enabled for a vector field is stored in the FieldInfos. At query time, it checks whether the field has rotation enabled and, if so, rotates the query.

TL;DR: Randomized orthogonal rotation (sign flips + permutation + FWHT) that Gaussianizes the per-dimension distributions of the vectors, which improves scalar quantization (OSQ) accuracy while preserving distances.

Here's an ASCII diagram Claude generated explaining the Hadamard rotation steps :

 HadamardRotation: out = R · in
 ════════════════════════════════

Input vector (arbitrary distribution)
┌─────────────────────────────────────────────────────────┐
│  in[0]   in[1]   in[2]   in[3]   in[4]  ...  in[d-1]    │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│  STEP 1: Random Sign Flips                              │
│                                                         │
│  For each component, randomly negate (±1):              │
│    scratch[i] = signFlips[i] ? -in[i] : in[i]           │
│                                                         │
│  Purpose: breaks correlations between components        │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│  STEP 2: Random Permutation (Fisher-Yates shuffle)      │
│                                                         │
│  Rearrange components randomly:                         │
│    out[perm[i]] = scratch[i]                            │
│                                                         │
│  Purpose: distributes information across all positions  │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│  STEP 3: Block-Diagonal FWHT (in-place on out[])        │
│                                                         │
│  For non-power-of-2 dims, decompose into blocks:        │
│    e.g. dim=768 → [512] [256]                           │
│                                                         │
│  ┌──────────────────────┐ ┌────────────────┐            │
│  │  FWHT on block (512) │ │ FWHT on (256)  │            │
│  │                      │ │                │            │
│  │   ╱╲    butterfly    │ │   ╱╲           │            │
│  │  ╱  ╲   operations   │ │  ╱  ╲          │            │
│  │ x+y  x-y             │ │ x+y  x-y       │            │
│  │                      │ │                │            │
│  │  × 1/√512 normalize  │ │ × 1/√256       │            │
│  └──────────────────────┘ └────────────────┘            │
│                                                         │
│  Purpose: mixes components → Central Limit Theorem      │
│           drives distribution toward Gaussian           │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
Output vector (≈ Gaussian distribution per component)
┌─────────────────────────────────────────────────────────┐
│ out[0]  out[1]  out[2]  out[3]  out[4]  ...  out[d-1]   │
└─────────────────────────────────────────────────────────┘


KEY PROPERTIES:
═══════════════
• Orthogonal:  ‖Rx‖ = ‖x‖         (preserves L2 norm)
               (Rx)·(Ry) = x·y     (preserves dot products)

• Deterministic: same (dim, seed) → same rotation always

• Cost: O(d log d) for FWHT + O(d) for signs & permutation

• Invertible: R⁻¹ = Rᵀ  (just reverse the 3 steps)


WHY IT HELPS QUANTIZATION:
══════════════════════════

Before rotation:              After rotation:
(skewed/sparse/uniform)       (≈ Gaussian)

 ▌                               ▄
 ▌                              ▄█▄
 ▌▌                           ▄▄███▄▄
 ▌▌▌                        ▄▄██████▄▄▄
▄▌▌▌▌▄▄▄▄▄▄▄▄▄▄▄▄▄▄     ▄▄▄████████████▄▄▄
─────────────────────     ─────────────────────
→ Quantization bins       → Quantization bins
  poorly utilized            evenly utilized
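
The three steps in the diagram can be sketched in Java roughly as follows. This is a minimal illustration, not the PR's `HadamardRotation` class: the class name `HadamardRotationSketch`, its fields, and the use of `java.util.Random` to draw the sign flips and the shuffle are all assumptions made for the example.

```java
import java.util.Random;

/** Illustrative sketch only; this is not the PR's HadamardRotation class. */
final class HadamardRotationSketch {
  private final int dim;
  private final boolean[] signFlips; // signFlips[i] == true negates component i
  private final int[] perm; // component i moves to position perm[i]

  HadamardRotationSketch(int dim, long seed) {
    this.dim = dim;
    this.signFlips = new boolean[dim];
    this.perm = new int[dim];
    Random rnd = new Random(seed); // assumption: same (dim, seed) gives the same rotation
    for (int i = 0; i < dim; i++) {
      signFlips[i] = rnd.nextBoolean();
      perm[i] = i;
    }
    for (int i = dim - 1; i > 0; i--) { // Fisher-Yates shuffle
      int j = rnd.nextInt(i + 1);
      int tmp = perm[i];
      perm[i] = perm[j];
      perm[j] = tmp;
    }
  }

  float[] rotate(float[] in) {
    float[] out = new float[dim];
    // Steps 1 and 2 fused: flip the sign, then scatter to the permuted slot.
    for (int i = 0; i < dim; i++) {
      out[perm[i]] = signFlips[i] ? -in[i] : in[i];
    }
    // Step 3: block-diagonal FWHT over power-of-two blocks, largest first.
    int offset = 0;
    for (int remaining = dim; remaining > 0; ) {
      int block = Integer.highestOneBit(remaining);
      fwht(out, offset, block);
      offset += block;
      remaining -= block;
    }
    return out;
  }

  /** In-place orthonormal FWHT on a[offset, offset + n); n must be a power of two. */
  static void fwht(float[] a, int offset, int n) {
    for (int h = 1; h < n; h <<= 1) {
      for (int i = 0; i < n; i += h << 1) {
        for (int j = i; j < i + h; j++) {
          float x = a[offset + j];
          float y = a[offset + j + h];
          a[offset + j] = x + y; // butterfly
          a[offset + j + h] = x - y;
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n));
    for (int i = 0; i < n; i++) {
      a[offset + i] *= scale;
    }
  }

  static double dot(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (double) a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    int dim = 768; // decomposes into blocks of 512 and 256
    Random rnd = new Random(7);
    float[] x = new float[dim];
    float[] y = new float[dim];
    for (int i = 0; i < dim; i++) {
      x[i] = rnd.nextFloat();
      y[i] = rnd.nextFloat();
    }
    HadamardRotationSketch r = new HadamardRotationSketch(dim, 42L);
    // The two dot products should agree up to float rounding.
    System.out.println("x.y   = " + dot(x, y));
    System.out.println("Rx.Ry = " + dot(r.rotate(x), r.rotate(y)));
  }
}
```

Each block is scaled by 1/√n, so every step is orthogonal and the composition preserves norms and dot products up to floating-point rounding.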

Cohere V3

Baseline (`main`) :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: num_segments = 1 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.695        2.413   2.409        0.998     1 bits     9589     96.07       5204.65          110.07         2102.03      2020.836       67.711
 0.816        2.512   2.510        0.999     2 bits     8614     98.17       5093.00          137.36         2156.30      2081.871      128.746
 0.918        2.711   2.709        0.999     4 bits     8221     96.06       5205.08          135.73         2276.80      2204.895      251.770
 0.970        3.789   3.788        1.000     7 bits     8129    101.13       4944.28          228.06         2520.64      2449.036      495.911
 0.976        3.804   3.803        1.000     8 bits     8129    101.92       4905.71          230.64         2520.66      2449.036      495.911

Candidate (`main` + rotation i.e. this PR) :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: num_segments = 1 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.729        2.243   2.241        0.999     1 bits     9174     96.59       5176.47          100.21         2099.73      2020.836       67.711
 0.841        2.478   2.474        0.998     2 bits     8468     97.67       5119.12          128.30         2155.75      2081.871      128.746
 0.937        2.613   2.612        1.000     4 bits     8162    100.38       4980.97          124.06         2276.62      2204.895      251.770
 0.982        3.704   3.702        0.999     7 bits     8124    102.14       4895.48          227.00         2520.65      2449.036      495.911
 0.985        3.685   3.684        1.000     8 bits     8120     99.64       5017.91          227.74         2520.64      2449.036      495.911

Recall

Bits	Baseline	Rotation	Delta	% Gain
1	0.695	0.729	+0.034	+4.9%
2	0.816	0.841	+0.025	+3.1%
4	0.918	0.937	+0.019	+2.1%
7	0.970	0.982	+0.012	+1.2%
8	0.976	0.985	+0.009	+0.9%

A lightweight wrapping FlatVectorsFormat that applies a randomized Hadamard rotation to vectors before handing them to a delegate format (e.g. Lucene104ScalarQuantizedVectorsFormat), and rotates query vectors at search time. Because the rotation is orthogonal, dot product, cosine similarity, and Euclidean distance are all preserved, so the delegate's similarity math is unchanged. The rotation redistributes variance across dimensions, which makes OSQ's assumption of Gaussian components hold on datasets whose raw components are skewed or uniform (image pixels, histograms, non-transformer embeddings).
Motivation and approach come from the discussion in Apache Lucene PR apache#15903 (TurboQuant) and Elastic's April 2026 blog on BBQ preconditioning, which measured 41-74% recall improvements on GIST / SIFT / Fashion-MNIST at ~2-4% query overhead.
Implementation:
- HadamardRotation: immutable, thread-safe, O(d log d) via Fast Walsh-Hadamard Transform with random sign flips and a Fisher-Yates permutation. Supports non-power-of-2 dimensions through a block decomposition (e.g. 768 = 512 + 256). Provides both forward and inverse rotations.
- RotationPreconditionedVectorsFormat: public FlatVectorsFormat with a no-arg constructor (required for SPI) that defaults to wrapping Lucene104ScalarQuantizedVectorsFormat, plus constructors that take a custom delegate and seed.
- RotationPreconditionedVectorsWriter: intercepts addValue to rotate each vector before forwarding to the delegate's field writer. Byte vectors pass through unchanged.
- RotationPreconditionedVectorsReader: rotates float query vectors before scoring, inverse-rotates stored vectors in getFloatVectorValues for rescore/CheckIndex callers, and exposes the raw rotated values via getMergeInstance() so that the delegate's merge runs entirely in rotated space (preserving byte-copy merge where the delegate supports it).
- Global deterministic rotation per (dim, seed): the same rotation across all segments enables byte-copy merges in the underlying format.
SPI wiring:
- META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat registers the new format alongside FaissKnnVectorsFormat.
- module-info.java exports the package and lists the format under the KnnVectorsFormat 'provides' clause.
Tests:
- TestHadamardRotation: 11 unit tests covering orthogonality (L2 norm preservation, dot product preservation, Euclidean distance preservation), determinism (same seed -> same rotation), non-identity (different seeds differ), block decomposition correctness, and the spreading property on concentrated inputs.
- TestRotationPreconditionedVectorsFormat: extends BaseKnnVectorsFormatTestCase and runs the full suite of KNN-format correctness tests (merge, sort, delete, multi-field, mismatched fields, random exceptions, etc.). Eight tests in the base suite assert bit-exact round-trip equality of indexed vectors; those are overridden with explanatory comments because rotate+inverse_rotate introduces ~1e-7 floating-point drift. Search correctness is unaffected because the rotation is orthogonal.
All sandbox tests pass (336 tests, 64 pre-existing skips).

Refactors the rotation preconditioner out of a sandbox FlatVectorsFormat wrapper and into the existing Lucene104ScalarQuantizedVectorsFormat (OSQ) as an opt-in feature controlled by a rotationSeed constructor argument.
Rotation is now a first-class capability of OSQ: when a non-zero seed is supplied, every incoming vector is Hadamard-rotated before centroid computation and quantization, and every query is rotated the same way at search time. Because the rotation is orthogonal, all similarity functions (dot product, cosine, Euclidean) are preserved, but per-coordinate distributions become much more Gaussian, which makes OSQ's initialization assumption hold on datasets with skewed or uniform components (image pixels, histograms, non-transformer embeddings).
Motivation comes from Apache Lucene PR apache#15903 (TurboQuant) discussion and Elastic's April 2026 blog on BBQ preconditioning, which measured 41-74% recall improvements on GIST / SIFT / Fashion-MNIST at ~2-4% query overhead.
Changes:
- Move HadamardRotation + its test from lucene/sandbox to lucene/core/src/java/org/apache/lucene/util/quantization/ so it lives next to the existing OptimizedScalarQuantizer.
- Lucene104ScalarQuantizedVectorsFormat: add a rotationSeed constructor parameter (default ROTATION_DISABLED = 0 preserves existing behaviour). Bump the on-disk format to VERSION_PRECONDITIONED (1). Old segments (version 0) are still readable; their seed is implicitly 0.
- Lucene104HnswScalarQuantizedVectorsFormat: add a matching ctor overload so the HNSW wrapper can enable preconditioning.
- Lucene104ScalarQuantizedVectorsWriter: constructor takes the seed; FieldWriter.addValue rotates the incoming vector up front so all downstream OSQ math (centroid accumulation, raw storage, quantization) runs in the rotated basis. writeMeta persists the seed.
- Lucene104ScalarQuantizedVectorsReader: FieldEntry now carries rotationSeed; readField reads it when the version supports it. getRandomVectorScorer(String, float[]) rotates the query before scoring. getFloatVectorValues wraps the raw delegate with an InverseRotatedFloatVectorValues so external callers (rerank, CheckIndex, FieldExistsQuery, etc.) see the original vectors they indexed. getMergeInstance() returns a lightweight MergeReader that skips the inverse rotation; the downstream merge then operates entirely in rotated space, preserving consistency across segments.
- Remove the sandbox/rotation package and its tests; revert the sandbox module-info and SPI service registration.
- Update OSQ and HNSW toString() tests to include rotationSeed. Add TestLucene104ScalarQuantizedVectorsFormatPreconditioning covering end-to-end search with rotation enabled, round-tripping vectors through rotate+inverseRotate via getFloatVectorValues, seed=0 equivalence to the default format, and toString observability.
All existing OSQ flat/HNSW/backward-compat tests continue to pass. The 4 new preconditioning tests and the 11 HadamardRotation math tests pass.

Replaces the previous attempt that modified Lucene104 in place. Since Lucene104 is a shipped codec with a frozen on-disk format, any layout change belongs in a new codec family. This commit:
- Restores Lucene104ScalarQuantizedVectorsFormat (and the matching HNSW wrapper / writer / reader / tests) to their exact pre-patch state. Anyone with a Lucene104 index can still read it byte-for-byte the same as before.
- Introduces Lucene105ScalarQuantizedVectorsFormat + the HNSW wrapper as a new codec family (package org.apache.lucene.codecs.lucene105). The codec-name headers and internal NAME strings all use 'Lucene105' so the new layout can be distinguished at read time. File extensions (.veq, .vemq) are the same because the codec-name header in each file is what disambiguates.
- Adds rotation preconditioning natively to Lucene105 as an opt-in feature controlled by a rotationSeed constructor argument:
  * Default / sentinel value ROTATION_DISABLED (0) keeps the format layout shape matching Lucene104 aside from one extra long per field in metadata.
  * A non-zero seed enables Hadamard rotation at index and query time. The rotation is orthogonal so dot product / cosine / Euclidean distances are preserved end to end; what changes is the per-coordinate distribution of the stored vectors, which becomes much more Gaussian. This helps OSQ initialization on datasets with skewed / uniform components (image pixels, histograms, non-transformer embeddings).
  * The seed is persisted in per-field metadata. Reader rotates queries in getRandomVectorScorer, inverse-rotates stored values in getFloatVectorValues (so external rerank / CheckIndex / FieldExistsQuery callers see the original vectors), and exposes an unrotated view via getMergeInstance so merges stay in the rotated basis end to end.
- Clones the scorer (Lucene105ScalarQuantizedVectorScorer) and the two Off-heap value classes (OffHeapScalarQuantizedVectorValues, OffHeapScalarQuantizedFloatVectorValues) into the new package so the Lucene104 package-private members don't have to be made public for the Lucene105 codec to use them. HadamardRotation lives once, in lucene/core/src/java/org/apache/lucene/util/quantization/, because it's a utility rather than a codec.
- Registers Lucene105ScalarQuantizedVectorsFormat and Lucene105HnswScalarQuantizedVectorsFormat via SPI (META-INF services file and the module-info 'provides' clause), and exports the new package.
- Adds TestLucene105ScalarQuantizedVectorsFormatPreconditioning with four targeted tests covering end-to-end preconditioned search, vector round-trip through rotate+inverseRotate via getFloatVectorValues, seed=0 equivalence to the default format, and toString observability.
All existing Lucene104 OSQ tests, Lucene105 preconditioning tests, HadamardRotation math tests, backward-compat Lucene99 OSQ tests, and sandbox tests pass.
Usage:
    // Pick the old codec for backward compat.
    new Lucene104ScalarQuantizedVectorsFormat();
    // Pick the new codec with no rotation (default).
    new Lucene105ScalarQuantizedVectorsFormat();
    // Pick the new codec with rotation preconditioning enabled.
    new Lucene105ScalarQuantizedVectorsFormat(ScalarEncoding.UNSIGNED_BYTE, 0x5eedCafeBabeL);

…ring merge
During force_merge, the HNSW graph builder needs to compare documents against each other. For 1-bit and 2-bit (asymmetric) encodings, this requires building a temporary 4-bit "query" representation of each document by reading back its float vector and re-quantizing it against the segment centroid.
The bug: getRandomVectorScorerSupplierForMerge() called getFloatVectorValues(), which inverse-rotates the stored vectors back to original space (designed for external callers). These un-rotated vectors were then quantized against the centroid, which lives in rotated space (computed from rotated vectors during indexing). The centering step (vector[i] - centroid[i]) mixed original-space vectors with a rotated-space centroid, producing meaningless 4-bit representations. The HNSW graph built from these scores was essentially random, dropping recall from 0.695 to 0.050 (1-bit) and 0.816 to 0.055 (2-bit).
Only 1-bit and 2-bit are affected. 4-bit, 7-bit, and 8-bit use symmetric scoring which reads already-quantized bytes directly: no float vectors involved, no rotation mismatch possible.
The fix: use rawVectorsReader.getFloatVectorValues() to read the stored rotated vectors directly, matching the rotated centroid.
Indices built with the buggy code have corrupted HNSW graphs for 1-bit and 2-bit segments and need reindexing or re-merging.
Benchmark results (Cohere v3 1024d, 500K docs, DOT_PRODUCT):
bits  baseline  before-fix  after-fix
1     0.695     0.050       0.729
2     0.816     0.055       0.841
4     0.918     0.937       0.937
7     0.970     0.982       0.982
8     0.976     0.985       0.985
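
The mixed-basis centering problem described in this commit can be reproduced with a toy example. The sketch below is only an illustration under simplifying assumptions: it uses a plain orthonormal FWHT as a stand-in for the full rotation (no sign flips or permutation), and the class and method names (`MixedBasisCenteringSketch`, `rotate`, `centeredNorm`) are hypothetical, not taken from the PR.

```java
/** Toy reproduction of centering a vector against a centroid from a different basis. */
final class MixedBasisCenteringSketch {

  /** Orthonormal FWHT on a copy of the input; length must be a power of two. */
  static float[] rotate(float[] in) {
    int n = in.length;
    float[] a = in.clone();
    for (int h = 1; h < n; h <<= 1) {
      for (int i = 0; i < n; i += h << 1) {
        for (int j = i; j < i + h; j++) {
          float x = a[j];
          float y = a[j + h];
          a[j] = x + y;
          a[j + h] = x - y;
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n));
    for (int i = 0; i < n; i++) {
      a[i] *= scale;
    }
    return a;
  }

  /** L2 norm of (v - centroid), i.e. the size of the residual that gets quantized. */
  static double centeredNorm(float[] v, float[] centroid) {
    double sum = 0;
    for (int i = 0; i < v.length; i++) {
      double d = v[i] - centroid[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    float[] centroid = {10f, 0f, 0f, 0f, 0f, 0f, 0f, 0f};
    float[] doc = {10.5f, 0.2f, -0.1f, 0.3f, 0f, -0.2f, 0.1f, 0f}; // close to the centroid

    float[] rotatedCentroid = rotate(centroid); // what the segment stores
    float[] rotatedDoc = rotate(doc); // what the raw reader returns

    // Both operands in rotated space: same residual size as in original space.
    System.out.println("original space: " + centeredNorm(doc, centroid));
    System.out.println("rotated space:  " + centeredNorm(rotatedDoc, rotatedCentroid));
    // Inverse-rotated doc against rotated centroid: the residual is dominated by the mismatch.
    System.out.println("mixed bases:    " + centeredNorm(doc, rotatedCentroid));
  }
}
```

When both operands live in the same basis, the residual has the same length before and after rotation; once the bases are mixed, the residual mostly reflects the distance between the centroid and its rotated image rather than anything about the document.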

This reverts commit 305839b.

shubhamvishu · 2026-05-21T00:36:56Z

I also ran the same luceneutil benchmark with 4K-dimensional vectors and saw an even higher impact on recall (~6-7% improvement overall), with a slight slowdown in indexing rate (~5%) due to rotation overhead.

With internal Amazon 4K vector embeddings

Setup:

500K docs, 4096 dimensions, DOT_PRODUCT similarity, HNSW (maxConn=64, beamWidth=250), 10K queries.

Bits	Avg Baseline Recall	Avg Candidate Recall(Rotation)	Avg Delta	% Diff
1	0.828	0.889	+0.061	+7.4%
2	0.858	0.916	+0.058	+6.7%
4	0.893	0.958	+0.066	+7.3%
7	0.920	0.972	+0.052	+5.6%
8	0.927	0.974	+0.047	+5.1%
All bits	0.885	0.942	+0.057	+6.4%

Metric	Baseline	Rotation	% Diff	Impact
Recall (avg all)	0.885	0.942	+6.4%	Improvement
Search latency	~2.0 ms	~2.0 ms	~0%	No change
Index rate	~3830 docs/s	~3620 docs/s	-5.5%	Slightly slower
Index size	8091-9799 MB	8091-9799 MB	0%	Identical
Force merge time	~213 s	~194 s	-8.9%	No regression

mccullocht · 2026-05-21T03:28:26Z

+    FieldInfo info = fieldInfos.fieldInfo(field);
+    return info != null
+        && "true"
+            .equals(info.getAttribute(Lucene104ScalarQuantizedVectorsFormat.ROTATION_ENABLED_KEY));


I haven't ever seen attributes used to enable codec features like this; at least, it doesn't appear to be common practice in the core codecs. More typically this would tick VERSION_CURRENT and something (probably just the rotation seed?) would be encoded as part of field metadata. I don't have a good sense as to why; maybe @mikemccand or @benwtrent has a firmer and better-reasoned opinion.

Yeah, this resonates with me too. The initial version (1st commit) kept it in the sandbox module; I then moved it into the core codec, but that wrote a random seed to the field metadata, so I had to bump the codec to 105. I honestly didn't want to bump the codec for this use case, so I switched to a constant seed and assumed rotation is enabled for all vector fields by default (which obviously simplified all of this but takes away the user's ability to configure it per field). So eventually I leaned towards passing the rotationEnabled flag (configured per vector field) to query time via FieldInfos and avoiding a codec bump, since this way we don't break backward compatibility and it stays simple. I'm open to whichever approach we'd want to choose, or to a better way to share this info (rotation enabled/disabled), hopefully avoiding the codec bump.

mccullocht · 2026-05-21T03:33:58Z

+    if (isRotationEnabled(field) && target != null) {
+      HadamardRotation rotation = rotationFor(field, fi.dimension);
+      float[] rotated = new float[target.length];
+      rotation.rotate(target, rotated);


IIUC this rotation operation is probably fairly expensive -- at least as expensive as quantization but possibly even more expensive. In Lucene's segment structure this operation will be repeated for every segment searched. In your tests you probably ran with everything merged down to a single segment but I'm interested in what this costs in a more typical multi-segment setup. A microbenchmark for the rotation or a full luceneutil run would be helpful here.

Depending on the cost we might want to figure out a way to reuse this computation across segments somehow which probably requires upstream integration with the knn query classes.

rotate is an O(d log d) operation here. I ran luceneutil without forceMerge too, but I don't have the Cohere results handy anymore (I can pull those up again). I do have results with 4K-dim embeddings on a multi-segment index; sharing those below for your reference (will share the Cohere ones soon too). It didn't seem to regress latency as such. We can do more JMH benchmarking or whatever gives us more confidence, but so far it appears to be cheap (I don't know whether the quantization overhead was significant in the past).

We cannot do a rotation per segment, that is a non starter.

This would significantly impact latency in the most common scenario, which is many segments.

Instead I suggest queries need to have an "additional phase" that looks to see if any of the KnnFormats can apply a "globalPrecondition" step or something and then apply it once for the query.

To see the significance of the performance impact, you will need many vectors spread over many segments (which is very common, I mean 10s of millions spread over 30-50+ segments).
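
One way to picture the "rotate once per query" idea is a small per-query holder that computes the rotated vector lazily and hands the same array to every segment. This is only a sketch of the concept: `RotateOncePerQuerySketch` and its methods are hypothetical names, and it is not tied to any existing Lucene query or codec API.

```java
import java.util.function.UnaryOperator;

/** Hypothetical per-query holder that rotates the query at most once. */
final class RotateOncePerQuerySketch {
  private final float[] original;
  private final UnaryOperator<float[]> rotation; // stand-in for the real rotation
  private float[] rotated; // computed on first use, then shared by all segments
  private int rotationCount;

  RotateOncePerQuerySketch(float[] original, UnaryOperator<float[]> rotation) {
    this.original = original;
    this.rotation = rotation;
  }

  /** Synchronized because segments may be searched concurrently. */
  synchronized float[] rotatedQuery() {
    if (rotated == null) {
      rotated = rotation.apply(original);
      rotationCount++;
    }
    return rotated;
  }

  synchronized int rotationCount() {
    return rotationCount;
  }

  public static void main(String[] args) {
    // Toy "rotation": negate every component (orthogonal, but purely illustrative).
    UnaryOperator<float[]> negate =
        v -> {
          float[] out = new float[v.length];
          for (int i = 0; i < v.length; i++) {
            out[i] = -v[i];
          }
          return out;
        };
    RotateOncePerQuerySketch query =
        new RotateOncePerQuerySketch(new float[] {1f, 2f, 3f, 4f}, negate);
    int segments = 40;
    for (int s = 0; s < segments; s++) {
      float[] q = query.rotatedQuery(); // each segment would score against q
    }
    System.out.println("segments=" + segments + " rotations=" + query.rotationCount());
  }
}
```

Sharing one rotated array this way relies on every segment using the same (dim, seed) rotation for the field; segments with different rotations would each need their own cached copy.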

@benwtrent I see, yeah I think moving in upstream and doing it once makes sense to me. Maybe just moving it to KnnQuery as Trevor mentioned below?

To see the significance of the performance impact, you will need many vectors spread over many segments

That could be the case, yes. Sharing below the Cohere V3 run I did with a multi-segment index; it doesn't seem to have much impact, but larger segments of 30M might move the needle.

Result : Cohere V3 without forceMerge

Baseline :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.695        1.916   5.285        2.758     1 bits    54581     94.09       5314.12             5         2092.52      2020.836       67.711
 0.817        1.887   6.052        3.208     2 bits    50314     95.73       5223.08             5         2147.02      2081.871      128.746
 0.921        2.107   7.275        3.453     4 bits    50877     97.97       5103.60             6         2266.48      2204.895      251.770
 0.976        3.300  12.843        3.892     7 bits    61241    101.92       4906.00             9         2506.59      2449.036      495.911
 0.983        3.280  12.810        3.905     8 bits    61080     99.65       5017.51             9         2506.62      2449.036      495.911

Candidate:

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.729        1.845   5.056        2.740     1 bits    51755     96.87       5161.40             5         2090.45      2020.836       67.711
 0.843        1.870   5.904        3.157     2 bits    49123     98.38       5082.18             5         2146.54      2081.871      128.746
 0.940        2.089   7.143        3.419     4 bits    50584     95.99       5209.04             6         2266.39      2204.895      251.770
 0.988        3.255  12.872        3.955     7 bits    60998    100.92       4954.52             9         2506.59      2449.036      495.911
 0.993        3.244  12.887        3.973     8 bits    60955    100.85       4958.01             9         2506.60      2449.036      495.911

shubhamvishu · 2026-05-21T04:44:33Z

Luceneutil with Amazon 4K vectors embeddings (forceMerge=False)

NOTE: Runs 1 and 2 are on separate 4K embedding datasets (500K docs), so sharing both

Run 1

Baseline :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.847        1.502   6.435        4.284     1 bits    30107    122.60       4078.17            11         8090.94      8063.316      250.816
 0.873        1.753   7.827        4.464     2 bits    28103    124.40       4019.26            12         8333.67      8307.457      494.957
 0.906        2.459  10.420        4.238     4 bits    28075    123.27       4056.01            12         8821.31      8796.692      984.192
 0.931        3.068  11.445        3.730     7 bits    19281    145.29       3441.42             7         9798.58      9773.254     1960.754
 0.936        3.467  10.623        3.064     8 bits    18113    144.72       3454.88             6         9798.82      9773.254     1960.754

Candidate:

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.906        1.176   4.908        4.174     1 bits    22856    134.87       3707.16             9         8090.28      8063.316      250.816
 0.936        1.459   5.608        3.845     2 bits    20761    126.32       3958.30             9         8333.64      8307.457      494.957
 0.971        2.228   9.872        4.430     4 bits    27231    140.24       3565.42            12         8821.80      8796.692      984.192
 0.988        3.446  15.027        4.361     7 bits    25644    148.67       3363.27            10         9798.57      9773.254     1960.754
 0.989        3.164  11.303        3.573     8 bits    19280    145.93       3426.35             7         9799.14      9773.254     1960.754

Run 2

Baseline :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.813        1.145   4.208        3.676     1 bits    17977    121.83       4104.15             8         8089.82      8063.316      250.816
 0.849        1.423   6.169        4.336     2 bits    21256    115.82       4317.01            10         8332.16      8307.457      494.957
 0.885        1.805   6.790        3.761     4 bits    17201    130.58       3829.22             9         8820.67      8796.692      984.192
 0.921        2.918  12.660        4.339     7 bits    20622    136.97       3650.57            10         9796.88      9773.254     1960.754
 0.926        2.678   8.020        2.995     8 bits    13287    141.24       3540.07             5         9797.46      9773.254     1960.754

Candidate:

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.878        1.104   4.588        4.154     1 bits    20309    133.12       3756.15             9         8089.49      8063.316      250.816
 0.906        1.333   5.127        3.847     2 bits    17963    136.62       3659.87             9         8333.10      8307.457      494.957
 0.957        1.797   5.546        3.085     4 bits    15179    140.91       3548.41             6         8821.53      8796.692      984.192
 0.969        2.928  10.953        3.741     7 bits    17888    143.28       3489.79             9         9797.76      9773.254     1960.754
 0.969        3.004  11.197        3.727     8 bits    18177    136.01       3676.17             9         9797.79      9773.254     1960.754
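For a quick read of Run 2, the recall gain per bit width can be computed directly from the two tables above. The sketch below copies the `recall` column values from the Baseline and Candidate tables; the class and method names are only illustrative and are not part of luceneutil.

```java
import java.util.Locale;

/** Recall gain of Candidate over Baseline for Run 2, using the recall columns above. */
final class RecallDelta {
  static final int[] BITS = {1, 2, 4, 7, 8};
  static final double[] BASELINE = {0.813, 0.849, 0.885, 0.921, 0.926};
  static final double[] CANDIDATE = {0.878, 0.906, 0.957, 0.969, 0.969};

  /** Absolute recall difference (candidate minus baseline) for each bit width. */
  static double[] deltas() {
    double[] d = new double[BITS.length];
    for (int i = 0; i < d.length; i++) {
      d[i] = CANDIDATE[i] - BASELINE[i];
    }
    return d;
  }

  public static void main(String[] args) {
    double[] d = deltas();
    for (int i = 0; i < d.length; i++) {
      System.out.printf(Locale.ROOT, "%d bits: %+.3f%n", BITS[i], d[i]);
    }
  }
}
```

For this run the absolute gains fall between 0.043 (8 bits) and 0.072 (4 bits).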

cc - @mccullocht

benwtrent · 2026-05-21T11:35:40Z

+  private final FieldInfos fieldInfos;
+
+  /** Lazily built Hadamard rotations, keyed by field name. */
+  private final Map<String, HadamardRotation> rotations = new ConcurrentHashMap<>();


I don't know about this. This isn't cheap. We have "field * num_segments * size(HadamardRotationObject)".

Seems to me the rotation matrix should just be stored off heap :/ Or the rotation is by dimension, not by field name (idk why we need a new random matrix for this for two fields that have the same dimension).

> idk why we need a new random matrix for this for two fields that have the same dimension

It was because the random seed takes the field name into account, but I agree with your point: we could reuse the same matrix across vectors of the same dimension. There is likely little or no benefit to having different random seeds other than avoiding the possibility of choosing a bad seed for all vector fields, and the simplification outweighs that risk even though it does not rule it out completely. I'll try to stick to a single rotation matrix per dimension. Do you think there is enough value in moving it off heap once there is only one rotation, or would that be overkill?
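A minimal sketch of what "one rotation per dimension" could look like, assuming the seed is derived from the dimension alone. `RotationCache`, `Rotation`, `seedFor` and the mixing constant are all hypothetical stand-ins; `Rotation` only records the inputs and is not the PR's `HadamardRotation`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: one shared rotation per vector dimension instead of one per field. */
final class RotationCache {
  /** Stand-in for the PR's HadamardRotation; it only records what the rotation is derived from. */
  static final class Rotation {
    final int dimension;
    final long seed;

    Rotation(int dimension, long seed) {
      this.dimension = dimension;
      this.seed = seed;
    }
  }

  /** Shared across fields and segments, keyed by dimension only. */
  private static final Map<Integer, Rotation> BY_DIMENSION = new ConcurrentHashMap<>();

  /** The seed depends only on the dimension (the multiplier is an arbitrary assumption). */
  static long seedFor(int dimension) {
    return 0x9E3779B97F4A7C15L * dimension;
  }

  /** Builds the rotation on first use and returns the same instance afterwards. */
  static Rotation forDimension(int dimension) {
    return BY_DIMENSION.computeIfAbsent(dimension, d -> new Rotation(d, seedFor(d)));
  }
}
```

With this shape, two fields of dimension 1024 in any number of segments would share one object, so the heap cost scales with the number of distinct dimensions rather than with fields times segments. Whether the remaining object should live off heap is left open here.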

benwtrent · 2026-05-21T11:47:57Z

It's a nice idea. I think we should strive to have a general "precondition vectors" interface. I am not sure about the idea of having it integrated via field infos...I need to do some thinking here.

Two big issues besides the API that are bothering me:

- Keeping a bunch of rotation matrices on heap for every segment is unnecessarily expensive.
- One precondition for all segments is critical.

I don't know of another API in Lucene that has lazy state cached that is global over segments...this would be a fairly new thing here. Maybe we can "hack it" and add something to the KnnFormat reader interface, e.g. "globalPreconditioning" or something that queries can iterate and utilize...ugh, but then queries don't have the very nice API of just "search", and need to do this other step of "precondition"... we don't want to make things super complex for ALL other vector queries that Lucene has or that others wrote :(

this is a tough one.

mccullocht · 2026-05-21T16:59:28Z

Maybe this information should appear as part of FieldInfo? That would solve the "global" aspect of configuration as it would be uniform across all segments, and rotation could be handled in IndexChain and KnnQuery.

shubhamvishu · 2026-05-21T19:53:17Z

@benwtrent @mccullocht Currently we create the rotation seed from the field name and cache the rotation for each segment reader, so it is calculated once and reused for a field. I agree with Ben that there could be benefits to keeping it off heap. Alternatively, we could drop the field name from the seed so that it is driven only by the dimension (i.e. one seed per unique dimension among the indexed vector fields). That way we would not need this per segment, just one global object for a given dimension. Thoughts?

> Maybe this information should appear as part of FieldInfo?

Right, I like the approach of doing the rotation upfront in the KnnQuery using the FieldInfos setting. That way it would be global and less intrusive.
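To illustrate why rotating the query upfront is safe, here is a minimal sketch of a randomized Hadamard rotation (seeded sign flips followed by a normalized FWHT). It is not the PR's `HadamardRotation`: it omits the permutation step, assumes a power-of-two dimension, and `RotationSketch`, `rotate` and `dot` are hypothetical names.

```java
import java.util.Random;

/** Minimal sketch of a randomized Hadamard rotation (sign flips + FWHT); not the PR's class. */
final class RotationSketch {
  /** Returns R*v with R = H*D/sqrt(n): D is a seeded +/-1 diagonal, H the Hadamard matrix. */
  static float[] rotate(float[] v, long seed) {
    int n = v.length;
    if (Integer.bitCount(n) != 1) {
      throw new IllegalArgumentException("sketch assumes a power-of-two dimension");
    }
    float[] out = v.clone();
    Random rnd = new Random(seed);
    for (int i = 0; i < n; i++) {
      if (rnd.nextBoolean()) {
        out[i] = -out[i]; // random sign flip, reproducible from the seed
      }
    }
    // In-place fast Walsh-Hadamard transform: log2(n) butterfly passes.
    for (int len = 1; len < n; len <<= 1) {
      for (int i = 0; i < n; i += len << 1) {
        for (int j = i; j < i + len; j++) {
          float a = out[j];
          float b = out[j + len];
          out[j] = a + b;
          out[j + len] = a - b;
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n)); // makes the transform orthonormal
    for (int i = 0; i < n; i++) {
      out[i] *= scale;
    }
    return out;
  }

  /** Dot product accumulated in double precision. */
  static double dot(float[] a, float[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      s += (double) a[i] * b[i];
    }
    return s;
  }
}
```

Because the transform is orthonormal, `dot(rotate(doc, seed), rotate(query, seed))` matches `dot(doc, query)` up to floating-point error, but only when both sides use the same seed. That is the reason a single, globally agreed rotation per field (or per dimension) matters at query time.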

shubhamvishu and others added 10 commits May 20, 2026 18:53

Dont bump codec

69ec702

Add flag to enable rotation

e61e258

Set flag to always true

5901d6a

Fix gradle test issues

28a9f68

Enable rotation to luceneutil benchmarks

2106077

Revert "Enable rotation to luceneutil benchmarks"

2fdc47e

This reverts commit 305839b.

github-actions Bot added the module:core/codecs label May 21, 2026

shubhamvishu added the skip-changelog label May 21, 2026

shubhamvishu requested a review from mccullocht May 21, 2026 00:38

mccullocht requested changes May 21, 2026

benwtrent reviewed May 21, 2026
