Add Hadamard rotation to vector fields #16092
A lightweight wrapping FlatVectorsFormat that applies a randomized Hadamard rotation to vectors before handing them to a delegate format (e.g. Lucene104ScalarQuantizedVectorsFormat), and rotates query vectors at search time. Because the rotation is orthogonal, dot product, cosine similarity, and Euclidean distance are all preserved, so the delegate's similarity math is unchanged. The rotation redistributes variance across dimensions, which makes OSQ's assumption of Gaussian components hold on datasets whose raw components are skewed or uniform (image pixels, histograms, non-transformer embeddings).
Motivation and approach come from the discussion in Apache Lucene PR apache#15903 (TurboQuant) and Elastic's April 2026 blog on BBQ preconditioning, which measured 41-74% recall improvements on GIST / SIFT / Fashion-MNIST at ~2-4% query overhead.
Implementation:
- HadamardRotation: immutable, thread-safe, O(d log d) via Fast Walsh-Hadamard Transform with random sign flips and a Fisher-Yates permutation. Supports non-power-of-2 dimensions through a block decomposition (e.g. 768 = 512 + 256). Provides both forward and inverse rotations.
- RotationPreconditionedVectorsFormat: public FlatVectorsFormat with a no-arg constructor (required for SPI) that defaults to wrapping Lucene104ScalarQuantizedVectorsFormat, plus constructors that take a custom delegate and seed.
- RotationPreconditionedVectorsWriter: intercepts addValue to rotate each vector before forwarding to the delegate's field writer. Byte vectors pass through unchanged.
- RotationPreconditionedVectorsReader: rotates float query vectors before scoring, inverse-rotates stored vectors in getFloatVectorValues for rescore/CheckIndex callers, and exposes the raw rotated values via getMergeInstance() so that the delegate's merge runs entirely in rotated space (preserving byte-copy merge where the delegate supports it).
- Global deterministic rotation per (dim, seed): the same rotation across all segments enables byte-copy merges in the underlying format.
SPI wiring:
- META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat registers the new format alongside FaissKnnVectorsFormat.
- module-info.java exports the package and lists the format under the KnnVectorsFormat 'provides' clause.
Tests:
- TestHadamardRotation: 11 unit tests covering orthogonality (L2 norm preservation, dot product preservation, Euclidean distance preservation), determinism (same seed -> same rotation), non-identity (different seeds differ), block decomposition correctness, and the spreading property on concentrated inputs.
- TestRotationPreconditionedVectorsFormat: extends BaseKnnVectorsFormatTestCase and runs the full suite of KNN-format correctness tests (merge, sort, delete, multi-field, mismatched fields, random exceptions, etc.). Eight tests in the base suite assert bit-exact round-trip equality of indexed vectors; those are overridden with explanatory comments because rotate+inverse_rotate introduces ~1e-7 floating-point drift. Search correctness is unaffected because the rotation is orthogonal.
All sandbox tests pass (336 tests, 64 pre-existing skips).
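The HadamardRotation bullet above can be pictured with a minimal sketch. This is not the PR's class: the name HadamardRotationSketch, the order of the steps (permute, flip signs, then transform each block), and the greedy power-of-two split are all assumptions made for illustration. It only shows why such a rotation is deterministic per (dim, seed), invertible, and distance-preserving.

```java
import java.util.Random;

/** Illustrative only: a hypothetical stand-in, not the PR's HadamardRotation. */
final class HadamardRotationSketch {
  private final int dim;
  private final float[] signs; // random +1/-1 per output slot
  private final int[] perm; // perm[i] = input index that feeds slot i
  private final int[] blocks; // power-of-two block sizes summing to dim

  HadamardRotationSketch(int dim, long seed) {
    this.dim = dim;
    Random rnd = new Random(seed); // same (dim, seed) -> same rotation
    signs = new float[dim];
    perm = new int[dim];
    for (int i = 0; i < dim; i++) {
      signs[i] = rnd.nextBoolean() ? 1f : -1f;
      perm[i] = i;
    }
    for (int i = dim - 1; i > 0; i--) { // Fisher-Yates shuffle
      int j = rnd.nextInt(i + 1);
      int t = perm[i];
      perm[i] = perm[j];
      perm[j] = t;
    }
    blocks = blockSizes(dim);
  }

  /** Greedy split into descending powers of two, e.g. 768 -> [512, 256]. */
  static int[] blockSizes(int dim) {
    int[] out = new int[Integer.bitCount(dim)];
    int n = 0;
    for (int rem = dim; rem > 0; ) {
      int b = Integer.highestOneBit(rem);
      out[n++] = b;
      rem -= b;
    }
    return out;
  }

  float[] rotate(float[] in) {
    float[] out = new float[dim];
    for (int i = 0; i < dim; i++) {
      out[i] = signs[i] * in[perm[i]]; // permutation and sign flip
    }
    transformBlocks(out);
    return out;
  }

  float[] inverseRotate(float[] in) {
    float[] tmp = in.clone();
    transformBlocks(tmp); // the normalized Hadamard matrix is its own inverse
    float[] out = new float[dim];
    for (int i = 0; i < dim; i++) {
      out[perm[i]] = signs[i] * tmp[i]; // undo sign flip and permutation
    }
    return out;
  }

  private void transformBlocks(float[] a) {
    int off = 0;
    for (int b : blocks) {
      fwht(a, off, b);
      off += b;
    }
  }

  /** In-place normalized Fast Walsh-Hadamard Transform on a[off, off + n). */
  static void fwht(float[] a, int off, int n) {
    for (int h = 1; h < n; h <<= 1) {
      for (int i = 0; i < n; i += h << 1) {
        for (int j = i; j < i + h; j++) {
          float x = a[off + j];
          float y = a[off + j + h];
          a[off + j] = x + y;
          a[off + j + h] = x - y;
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n));
    for (int k = 0; k < n; k++) {
      a[off + k] *= scale;
    }
  }

  static double dot(float[] a, float[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      s += (double) a[i] * b[i];
    }
    return s;
  }
}
```

Each step (permutation, sign flip, block-diagonal normalized Hadamard) is orthogonal, so their composition is too; the round trip through inverseRotate is exact only up to float rounding, which matches the drift noted for the overridden tests.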
Refactors the rotation preconditioner out of a sandbox FlatVectorsFormat wrapper and into the existing Lucene104ScalarQuantizedVectorsFormat (OSQ) as an opt-in feature controlled by a rotationSeed constructor argument.
Rotation is now a first-class capability of OSQ: when a non-zero seed is supplied, every incoming vector is Hadamard-rotated before centroid computation and quantization, and every query is rotated the same way at search time. Because the rotation is orthogonal, all similarity functions (dot product, cosine, Euclidean) are preserved, but per-coordinate distributions become much more Gaussian, which makes OSQ's initialization assumption hold on datasets with skewed or uniform components (image pixels, histograms, non-transformer embeddings).
Motivation comes from Apache Lucene PR apache#15903 (TurboQuant) discussion and Elastic's April 2026 blog on BBQ preconditioning, which measured 41-74% recall improvements on GIST / SIFT / Fashion-MNIST at ~2-4% query overhead.
Changes:
- Move HadamardRotation + its test from lucene/sandbox to lucene/core/src/java/org/apache/lucene/util/quantization/ so it lives next to the existing OptimizedScalarQuantizer.
- Lucene104ScalarQuantizedVectorsFormat: add a rotationSeed constructor parameter (default ROTATION_DISABLED = 0 preserves existing behaviour). Bump the on-disk format to VERSION_PRECONDITIONED (1). Old segments (version 0) are still readable; their seed is implicitly 0.
- Lucene104HnswScalarQuantizedVectorsFormat: add a matching ctor overload so the HNSW wrapper can enable preconditioning.
- Lucene104ScalarQuantizedVectorsWriter: constructor takes the seed; FieldWriter.addValue rotates the incoming vector up front so all downstream OSQ math (centroid accumulation, raw storage, quantization) runs in the rotated basis. writeMeta persists the seed.
- Lucene104ScalarQuantizedVectorsReader: FieldEntry now carries rotationSeed; readField reads it when the version supports it. getRandomVectorScorer(String, float[]) rotates the query before scoring. getFloatVectorValues wraps the raw delegate with an InverseRotatedFloatVectorValues so external callers (rerank, CheckIndex, FieldExistsQuery, etc.) see the original vectors they indexed. getMergeInstance() returns a lightweight MergeReader that skips the inverse rotation; the downstream merge then operates entirely in rotated space, preserving consistency across segments.
- Remove the sandbox/rotation package and its tests; revert the sandbox module-info and SPI service registration.
- Update OSQ and HNSW toString() tests to include rotationSeed. Add TestLucene104ScalarQuantizedVectorsFormatPreconditioning covering end-to-end search with rotation enabled, round-tripping vectors through rotate+inverseRotate via getFloatVectorValues, seed=0 equivalence to the default format, and toString observability.
All existing OSQ flat/HNSW/backward-compat tests continue to pass. The 4 new preconditioning tests and the 11 HadamardRotation math tests pass.
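The version-gated seed described above (old version 0 segments have no seed field and are read as seed 0) can be sketched as follows. This uses plain java.io streams as a stand-in for Lucene's metadata input/output, and the class name, the VERSION_START constant, and the field layout are hypothetical; only VERSION_PRECONDITIONED and ROTATION_DISABLED come from the commit message.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative only: mimics a version-gated metadata field with java.io streams. */
final class VersionedSeedMetaSketch {
  static final int VERSION_START = 0; // hypothetical name for the pre-rotation layout
  static final int VERSION_PRECONDITIONED = 1;
  static final long ROTATION_DISABLED = 0L;

  /** Writes version and dimension, plus the seed only when the version has that field. */
  static byte[] writeMeta(int version, int dimension, long rotationSeed) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(version);
      out.writeInt(dimension);
      if (version >= VERSION_PRECONDITIONED) {
        out.writeLong(rotationSeed);
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads the seed if present; older layouts fall back to ROTATION_DISABLED. */
  static long readRotationSeed(byte[] meta) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(meta));
      int version = in.readInt();
      in.readInt(); // dimension, not needed for this sketch
      return version >= VERSION_PRECONDITIONED ? in.readLong() : ROTATION_DISABLED;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The reader never looks for the extra long in an old segment, which is what keeps version 0 data readable after the bump.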
Replaces the previous attempt that modified Lucene104 in place. Since
Lucene104 is a shipped codec with a frozen on-disk format, any layout
change belongs in a new codec family.
This commit:
- Restores Lucene104ScalarQuantizedVectorsFormat (and the matching
HNSW wrapper / writer / reader / tests) to their exact pre-patch
state. Anyone with a Lucene104 index can still read it byte-for-byte
the same as before.
- Introduces Lucene105ScalarQuantizedVectorsFormat + the HNSW wrapper
as a new codec family (package
org.apache.lucene.codecs.lucene105). The codec-name headers and
internal NAME strings all use 'Lucene105' so the new layout can be
distinguished at read time. File extensions (.veq, .vemq) are the
same because the codec-name header in each file is what
disambiguates.
- Adds rotation preconditioning natively to Lucene105 as an opt-in
feature controlled by a rotationSeed constructor argument:
* Default / sentinel value ROTATION_DISABLED (0) keeps the format
layout shape matching Lucene104 aside from one extra long per
field in metadata.
* A non-zero seed enables Hadamard rotation at index and query
time. The rotation is orthogonal so dot product / cosine /
Euclidean distances are preserved end to end; what changes is
the per-coordinate distribution of the stored vectors, which
becomes much more Gaussian. This helps OSQ initialization on
datasets with skewed / uniform components (image pixels,
histograms, non-transformer embeddings).
* The seed is persisted in per-field metadata. Reader rotates
queries in getRandomVectorScorer, inverse-rotates stored values
in getFloatVectorValues (so external rerank / CheckIndex /
FieldExistsQuery callers see the original vectors), and exposes
the raw rotated values (no inverse rotation) via
getMergeInstance so merges stay in the rotated basis end to end.
- Clones the scorer (Lucene105ScalarQuantizedVectorScorer) and the two
off-heap value classes (OffHeapScalarQuantizedVectorValues,
OffHeapScalarQuantizedFloatVectorValues) into the new package so the
Lucene104 package-private members don't have to be made public for
the Lucene105 codec to use them. HadamardRotation lives once, in
lucene/core/src/java/org/apache/lucene/util/quantization/, because
it's a utility rather than a codec.
- Registers Lucene105ScalarQuantizedVectorsFormat and
Lucene105HnswScalarQuantizedVectorsFormat via SPI (META-INF
services file and the module-info 'provides' clause), and exports
the new package.
- Adds TestLucene105ScalarQuantizedVectorsFormatPreconditioning with
four targeted tests covering end-to-end preconditioned search,
vector round-trip through rotate+inverseRotate via
getFloatVectorValues, seed=0 equivalence to the default format, and
toString observability.
All existing Lucene104 OSQ tests, Lucene105 preconditioning tests,
HadamardRotation math tests, backward-compat Lucene99 OSQ tests, and
sandbox tests pass.
Usage:
// Pick the old codec for backward compat.
new Lucene104ScalarQuantizedVectorsFormat();
// Pick the new codec with no rotation (default).
new Lucene105ScalarQuantizedVectorsFormat();
// Pick the new codec with rotation preconditioning enabled.
new Lucene105ScalarQuantizedVectorsFormat(
ScalarEncoding.UNSIGNED_BYTE, 0x5eedCafeBabeL);
…ring merge
During force_merge, the HNSW graph builder needs to compare documents
against each other. For 1-bit and 2-bit (asymmetric) encodings, this
requires building a temporary 4-bit "query" representation of each
document by reading back its float vector and re-quantizing it against
the segment centroid.
The bug: getRandomVectorScorerSupplierForMerge() called
getFloatVectorValues(), which inverse-rotates the stored vectors back
to original space (designed for external callers). These un-rotated
vectors were then quantized against the centroid, which lives in
rotated space (computed from rotated vectors during indexing). The
centering step (vector[i] - centroid[i]) mixed original-space vectors
with a rotated-space centroid, producing meaningless 4-bit
representations. The HNSW graph built from these scores was
essentially random, dropping recall from 0.695 to 0.050 (1-bit) and
0.816 to 0.055 (2-bit).
Only 1-bit and 2-bit are affected. 4-bit, 7-bit, and 8-bit use
symmetric scoring which reads already-quantized bytes directly — no
float vectors involved, no rotation mismatch possible.
The fix: use rawVectorsReader.getFloatVectorValues() to read the
stored rotated vectors directly, matching the rotated centroid.
Indices built with the buggy code have corrupted HNSW graphs for
1-bit and 2-bit segments and need reindexing or re-merging.
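A tiny numeric sketch of the mismatch described above, assuming a plain normalized Walsh-Hadamard transform as the rotation (no sign flips or permutation) and made-up 4-dimensional values. It compares the centering residual when vector and centroid are in the same space against the buggy mix of an original-space vector with a rotated-space centroid.

```java
/** Illustrative only: shows why centering must use vector and centroid from the same basis. */
final class RotatedCentroidMismatchSketch {
  /** Normalized Walsh-Hadamard transform of a power-of-two length vector (returns a copy). */
  static double[] rotate(double[] v) {
    double[] a = v.clone();
    int n = a.length;
    for (int h = 1; h < n; h <<= 1) {
      for (int i = 0; i < n; i += h << 1) {
        for (int j = i; j < i + h; j++) {
          double x = a[j];
          double y = a[j + h];
          a[j] = x + y;
          a[j + h] = x - y;
        }
      }
    }
    double scale = 1.0 / Math.sqrt(n);
    for (int k = 0; k < n; k++) {
      a[k] *= scale;
    }
    return a;
  }

  /** Length of (vector - centroid), i.e. the residual that gets quantized. */
  static double residualNorm(double[] vector, double[] centroid) {
    double s = 0;
    for (int i = 0; i < vector.length; i++) {
      double d = vector[i] - centroid[i];
      s += d * d;
    }
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    double[] x = {1, 0, 0, 0}; // a document vector in original space
    double[] c = {0.9, 0.1, 0, 0}; // a nearby centroid in original space
    double[] rotatedCentroid = rotate(c); // what the segment actually stores
    System.out.println("original space:      " + residualNorm(x, c));
    System.out.println("rotated vs rotated:  " + residualNorm(rotate(x), rotatedCentroid));
    System.out.println("original vs rotated: " + residualNorm(x, rotatedCentroid));
  }
}
```

The first two residuals agree because the rotation is orthogonal; the third does not, which is the kind of distortion that made the re-quantized 4-bit "query" representations meaningless.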
Benchmark results (Cohere v3 1024d, 500K docs, DOT_PRODUCT):
bits  baseline  before-fix  after-fix
1     0.695     0.050       0.729
2     0.816     0.055       0.841
4     0.918     0.937       0.937
7     0.970     0.982       0.982
8     0.976     0.985       0.985
This reverts commit 305839b.
I also ran the same luceneutil benchmark with 4K-dimensional vectors and saw an even higher impact on recall (~6-7% improvement) overall, with a slightly slower indexing rate (~5%) due to rotation overhead.
With internal Amazon 4K vector embeddings
Setup:
FieldInfo info = fieldInfos.fieldInfo(field);
return info != null
    && "true"
        .equals(info.getAttribute(Lucene104ScalarQuantizedVectorsFormat.ROTATION_ENABLED_KEY));
I haven't ever seen attributes used to enable codec features like this; at least it doesn't appear to be common practice in the core codecs. More typically this would tick VERSION_CURRENT and something (probably just the rotation seed?) would be encoded as part of field metadata. I don't have a good sense as to why; maybe @mikemccand or @benwtrent has a firmer and better-reasoned opinion.
Yeah, this resonates with me too. The initial version (1st commit) tried to keep it in the sandbox module, then moved it into the main core codec, but it was writing a random seed to the field's metadata and I had to bump the codec to 105. I honestly didn't want to bump the codec for this use case, so I switched to a constant seed and assumed rotation is enabled for all vector fields by default (which obviously simplified all of this but takes away the user's ability to configure it on a per-field basis). So eventually I leaned towards sharing the rotationEnabled flag (configured per vector field) with query time via FieldInfos and avoiding the codec bump, since this way we were not breaking backward compatibility and the approach stayed simple. I'm open to ideas on whichever we would want to choose, or on a better way to share this info (rotation enabled/disabled), hopefully avoiding the codec bump.
if (isRotationEnabled(field) && target != null) {
  HadamardRotation rotation = rotationFor(field, fi.dimension);
  float[] rotated = new float[target.length];
  rotation.rotate(target, rotated);
IIUC this rotation operation is probably fairly expensive -- at least as expensive as quantization but possibly even more expensive. In Lucene's segment structure this operation will be repeated for every segment searched. In your tests you probably ran with everything merged down to a single segment but I'm interested in what this costs in a more typical multi-segment setup. A microbenchmark for the rotation or a full luceneutil run would be helpful here.
Depending on the cost we might want to figure out a way to reuse this computation across segments somehow which probably requires upstream integration with the knn query classes.
rotate is an O(d log d) operation here. I ran luceneutil without forceMerge too, but I don't have the Cohere results handy anymore (I can pull those up again). I do have the results with 4K-dim embeddings in a multi-segment index; sharing those below for your reference (I will share the Cohere ones soon too). It didn't seem to regress latency as such. We can do more JMH benchmarking or whatever gives us more confidence, but so far it appears to be cheap (I don't know whether the quantization overhead was significant in the past).
We cannot do a rotation per segment, that is a non starter.
This would significantly impact latency in the most common scenario, which is many segments.
Instead I suggest queries need to have an "additional phase" that looks to see if any of the KnnFormats can apply a "globalPrecondition" step or something and then apply it once for the query.
To see the significance of the performance impact, you will need many vectors spread over many segments (which is very common, I mean 10s of millions spread over 30-50+ segments).
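To make the "apply it once for the query" suggestion concrete, here is a rough sketch that counts rotations under the two shapes being discussed. Everything in it is hypothetical: the class and method names are invented, a plain normalized Walsh-Hadamard transform stands in for the real rotation, and each "segment" is reduced to a single stored vector. It is not a proposal for the actual Lucene query API.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: per-segment rotation versus one query-level preconditioning step. */
final class GlobalPreconditionSketch {
  /** Counts rotate() calls so the two strategies can be compared. */
  static final AtomicInteger ROTATIONS = new AtomicInteger();

  /** Normalized FWHT of a power-of-two length query; stands in for the real rotation. */
  static float[] rotate(float[] query) {
    ROTATIONS.incrementAndGet();
    float[] a = query.clone();
    int n = a.length;
    for (int h = 1; h < n; h <<= 1) {
      for (int i = 0; i < n; i += h << 1) {
        for (int j = i; j < i + h; j++) {
          float x = a[j];
          float y = a[j + h];
          a[j] = x + y;
          a[j + h] = x - y;
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n));
    for (int k = 0; k < n; k++) {
      a[k] *= scale;
    }
    return a;
  }

  /** Stand-in for one segment's scorer: dot product with a stored, already rotated vector. */
  static float scoreSegment(float[] rotatedQuery, float[] storedRotatedVector) {
    float s = 0;
    for (int i = 0; i < rotatedQuery.length; i++) {
      s += rotatedQuery[i] * storedRotatedVector[i];
    }
    return s;
  }

  /** Shape under review: every segment reader rotates the raw query itself. */
  static float[] searchRotatingPerSegment(float[] query, float[][] segments) {
    float[] scores = new float[segments.length];
    for (int s = 0; s < segments.length; s++) {
      scores[s] = scoreSegment(rotate(query), segments[s]);
    }
    return scores;
  }

  /** Suggested shape: one preconditioning step, then every segment reuses the result. */
  static float[] searchRotatingOnce(float[] query, float[][] segments) {
    float[] rotated = rotate(query);
    float[] scores = new float[segments.length];
    for (int s = 0; s < segments.length; s++) {
      scores[s] = scoreSegment(rotated, segments[s]);
    }
    return scores;
  }
}
```

Both paths return the same scores; the only difference is that the first performs one rotation per segment and the second performs one per query, which is why segment count matters for the cost being debated.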
@benwtrent I see, yeah, I think moving it upstream and doing it once makes sense to me. Maybe just moving it to KnnQuery as Trevor mentioned below?
To see the significance of the performance impact, you will need many vectors spread over many segments
That could be the case, yes. Below is the Cohere v3 multi-segment run; it doesn't show much impact, but larger segments of 30M might move the needle.
Result: Cohere V3 without forceMerge
Baseline:
Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
0.695   1.916        5.285   2.758        1 bits     54581    94.09     5314.12       5             2092.52         2020.836      67.711
0.817   1.887        6.052   3.208        2 bits     50314    95.73     5223.08       5             2147.02         2081.871      128.746
0.921   2.107        7.275   3.453        4 bits     50877    97.97     5103.60       6             2266.48         2204.895      251.770
0.976   3.300        12.843  3.892        7 bits     61241    101.92    4906.00       9             2506.59         2449.036      495.911
0.983   3.280        12.810  3.905        8 bits     61080    99.65     5017.51       9             2506.62         2449.036      495.911
Candidate:
Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
0.729   1.845        5.056   2.740        1 bits     51755    96.87     5161.40       5             2090.45         2020.836      67.711
0.843   1.870        5.904   3.157        2 bits     49123    98.38     5082.18       5             2146.54         2081.871      128.746
0.940   2.089        7.143   3.419        4 bits     50584    95.99     5209.04       6             2266.39         2204.895      251.770
0.988   3.255        12.872  3.955        7 bits     60998    100.92    4954.52       9             2506.59         2449.036      495.911
0.993   3.244        12.887  3.973        8 bits     60955    100.85    4958.01       9             2506.60         2449.036      495.911
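For readers comparing the two Cohere V3 tables, the recall columns can be turned into absolute and relative gains with a few lines of arithmetic. The sketch below only restates the recall values from the baseline and candidate rows above; the class name is made up. On these numbers the relative gain works out to roughly 4.9% at 1 bit and about 1.0% at 8 bits.

```java
/** Illustrative only: recall deltas computed from the two Cohere V3 tables. */
final class RecallDeltaSketch {
  static final int[] BITS = {1, 2, 4, 7, 8};
  static final double[] BASELINE = {0.695, 0.817, 0.921, 0.976, 0.983};
  static final double[] CANDIDATE = {0.729, 0.843, 0.940, 0.988, 0.993};

  /** Relative recall gain of candidate over baseline, in percent. */
  static double relativeGainPercent(double baseline, double candidate) {
    return 100.0 * (candidate - baseline) / baseline;
  }

  public static void main(String[] args) {
    for (int i = 0; i < BITS.length; i++) {
      double abs = CANDIDATE[i] - BASELINE[i];
      double rel = relativeGainPercent(BASELINE[i], CANDIDATE[i]);
      System.out.printf("%d bits: %+.3f absolute, %+.2f%% relative%n", BITS[i], abs, rel);
    }
  }
}
```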
Luceneutil with Amazon 4K vector embeddings (forceMerge=false)
NOTE: Runs 1 and 2 are on separate 4K embedding datasets (500K), so sharing both.
Run 1
Baseline:
Candidate:
Run 2
Baseline:
Candidate:
cc @mccullocht
private final FieldInfos fieldInfos;

/** Lazily built Hadamard rotations, keyed by field name. */
private final Map<String, HadamardRotation> rotations = new ConcurrentHashMap<>();
I don't know about this. This isn't cheap. We have "fieldnum * segments * size(HadamardRotationObject)".
Seems to me the rotation matrix should just be stored off heap :/ Or the rotation is by dimension, not by field name (idk why we need a new random matrix for this for two fields that have the same dimension).
idk why we need a new random matrix for this for two fields that have the same dimension
It was because the random seed takes the field name into account, but I agree with your point: we could reuse the same matrix across vectors of the same dimension (there is likely little or no benefit to having different random seeds, other than avoiding the possibility of choosing a bad seed for all vector fields, and this simplification outweighs that possibility without completely eliminating it). I'll try to stick to a single rotation matrix per dimension. Do you think there is enough value in moving it off heap once there is only one rotation, or would that be overkill?
It's a nice idea. I think we should strive to have a general "precondition vectors" interface. I am sure on the idea of having it integrated via field infos...I need to do some thinking here. Two big issues besides the API that are bothering me:
I don't know of another API in Lucene that has lazy state cached that is global over segments...this would be a fairly new thing here. Maybe we can "hack it" and add something to the KnnFormat reader interface, e.g. "globalPreconditioning" or something that queries can iterate and utilize...ugh, but then queries don't have the very nice API of just "search", and need to do this other step of "precondition"... we don't want to make things super complex for ALL other vector queries that Lucene has or that others wrote :( this is a tough one.
Maybe this information should appear as part of FieldInfo? That would solve the "global" aspect of configuration as it would be uniform across all segments, and rotation could be handled in IndexChain and KnnQuery.
@benwtrent @mccullocht Currently we are creating the rotation seed from the field name and caching that for each segment reader, so this would be calculated once and reused for a field. But I agree with Ben that there could be benefits to keeping it off heap, or I'm thinking we could even drop the field from the seed so that it's driven only by the dimension (i.e. one seed per unique dimension among the indexed vector fields). That way we don't need to have this per segment (just one global object for a specific dimension). Thoughts?
Right, I like the approach of doing the rotation up front in the KnnQuery using the FieldInfos setting. That way it would be global and less intrusive.
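The "one rotation per dimension" idea from this thread could look roughly like the sketch below. It is only an illustration: the class name, the BASE_SEED value, the seed derivation (base seed XOR dimension), and caching just the sign-flip pattern rather than a full rotation object are all assumptions, not what the PR does.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: rotation state shared by dimension instead of by field name. */
final class RotationByDimensionSketch {
  static final long BASE_SEED = 0x5eedCafeBabeL; // hypothetical constant seed

  /** One entry per distinct vector dimension, shared by every field and segment. */
  private static final Map<Integer, float[]> SIGNS_BY_DIM = new ConcurrentHashMap<>();

  /**
   * Returns the shared sign-flip pattern for this dimension, building it at most once.
   * Callers must treat the returned array as read-only.
   */
  static float[] signsFor(int dimension) {
    return SIGNS_BY_DIM.computeIfAbsent(
        dimension,
        unused -> {
          Random rnd = new Random(BASE_SEED ^ dimension); // deterministic per dimension
          float[] signs = new float[dimension];
          for (int i = 0; i < dimension; i++) {
            signs[i] = rnd.nextBoolean() ? 1f : -1f;
          }
          return signs;
        });
  }
}
```

With this keying, two fields of the same dimension resolve to the same cached object, so the on-heap cost scales with the number of distinct dimensions rather than with fields times segments.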
Description
I worked with CC (Claude Code; it did a great job in all phases, from initial implementation to testing) on this PR, which adds the Hadamard rotation (Fast Walsh-Hadamard Transform) to vector fields (default false; configurable codec param; no codec bump required). It is inspired by the TurboQuant PR from @xande (who works with me on Amazon Product Search), but is a stripped-down version that just adds rotation to vectors in isolation. This addresses the 2nd item, "Implement random rotation of vectors and queries", from the Data-blind scalar quantization issue @mccullocht is working on.
I'm opening this to gather community feedback, as it shows promising recall improvements. I'd like to see whether we want to incorporate this into Lucene, reuse some of these ideas, or discard the approach if there are concerns.
This shows up to ~5-7% recall improvement in luceneutil benchmarks with Cohere V3 and Amazon's internal 4K-dim vector embeddings. The current approach rotates the incoming float vectors at insertion (so we index the vectors in rotated space in the .vec file) and the rest of the flow continues as is. It stores whether to rotate a given vector field in the FieldInfos. At query time, it checks whether the field has rotation enabled and rotates the query if so.
TL;DR: Randomized orthogonal rotation (sign flips + permutation + FWHT) that Gaussianizes the per-dimension distributions of vectors to improve scalar quantization (OSQ) accuracy while preserving distances.
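As a small illustration of the spreading effect behind the TL;DR, the sketch below applies random sign flips and a normalized FWHT to a one-hot vector. It is a simplification under stated assumptions: the permutation step is left out, the dimension must be a power of two, and the class and method names are invented for this example.

```java
import java.util.Random;

/** Illustrative only: shows how sign flips + FWHT spread a concentrated vector. */
final class SpreadingSketch {
  /** Random sign flips followed by a normalized FWHT; length must be a power of two. */
  static float[] signFlipThenFwht(float[] in, long seed) {
    float[] a = in.clone();
    Random rnd = new Random(seed);
    for (int i = 0; i < a.length; i++) {
      if (rnd.nextBoolean()) {
        a[i] = -a[i];
      }
    }
    int n = a.length;
    for (int h = 1; h < n; h <<= 1) {
      for (int i = 0; i < n; i += h << 1) {
        for (int j = i; j < i + h; j++) {
          float x = a[j];
          float y = a[j + h];
          a[j] = x + y;
          a[j + h] = x - y;
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n));
    for (int k = 0; k < n; k++) {
      a[k] *= scale;
    }
    return a;
  }

  /** Largest absolute component, a crude measure of how concentrated a vector is. */
  static float maxAbs(float[] v) {
    float m = 0;
    for (float x : v) {
      m = Math.max(m, Math.abs(x));
    }
    return m;
  }

  public static void main(String[] args) {
    float[] oneHot = new float[1024];
    oneHot[17] = 1f; // all energy in a single coordinate
    float[] rotated = signFlipThenFwht(oneHot, 42L);
    System.out.println("max |component| before: " + maxAbs(oneHot));
    System.out.println("max |component| after:  " + maxAbs(rotated));
  }
}
```

After the transform every component has the same magnitude, 1/sqrt(d), so the energy that sat in one coordinate is shared evenly while the vector's length is unchanged; that evening-out is what a per-coordinate scalar quantizer benefits from.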
Here's an ASCII diagram Claude generated explaining the Hadamard rotation steps :
Cohere V3
Baseline (main):
Candidate (main + rotation, i.e. this PR):
Recall