Add hadamard rotation to vector fields#16092

Open
shubhamvishu wants to merge 10 commits into
apache:mainfrom
shubhamvishu:rotation

Contributor

@shubhamvishu shubhamvishu commented May 21, 2026

Description

I worked with CC (Claude Code, which did a great job in all phases from initial implementation to testing) on this PR, which adds Hadamard rotation (Fast Walsh-Hadamard Transform) to vector fields (default false; configurable codec param; no codec bump required). It is inspired by @xande's TurboQuant PR (he works with me on Amazon Product Search), but it is a stripped-down version that only adds rotation to vectors in isolation. This addresses the 2nd item, "Implement random rotation of vectors and queries", from the Data-blind scalar quantization issue that @mccullocht is working on.

I'm opening this to gather community feedback, as it shows promising recall improvements. I'd like to see whether we want to incorporate this into Lucene, reuse some of these ideas, or discard the approach if there are concerns.

It shows up to ~5-7% recall improvement in luceneutil benchmarks with Cohere V3 and Amazon's internal 4K-dim vector embeddings. The current approach rotates the incoming float vectors at insertion (so we index the vectors in rotated space in the .vec file) and the rest of the flow continues as is. Whether a vector field is rotated is stored in the FieldInfos. At query time, it checks whether the field has rotation enabled and, if so, rotates the query.

TL;DR: Randomized orthogonal rotation (sign flips + permutation + FWHT) that Gaussianizes the per-dimension distributions of the vectors, which favors scalar quantization (OSQ) accuracy while preserving distances.

Here's an ASCII diagram Claude generated explaining the Hadamard rotation steps :
 HadamardRotation: out = R · in
 ════════════════════════════════

Input vector (arbitrary distribution)
┌─────────────────────────────────────────────────────────┐
│  in[0]   in[1]   in[2]   in[3]   in[4]  ...  in[d-1]    │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│  STEP 1: Random Sign Flips                              │
│                                                         │
│  For each component, randomly negate (±1):              │
│    scratch[i] = signFlips[i] ? -in[i] : in[i]           │
│                                                         │
│  Purpose: breaks correlations between components        │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│  STEP 2: Random Permutation (Fisher-Yates shuffle)      │
│                                                         │
│  Rearrange components randomly:                         │
│    out[perm[i]] = scratch[i]                            │
│                                                         │
│  Purpose: distributes information across all positions  │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│  STEP 3: Block-Diagonal FWHT (in-place on out[])        │
│                                                         │
│  For non-power-of-2 dims, decompose into blocks:        │
│    e.g. dim=768 → [512] [256]                           │
│                                                         │
│  ┌──────────────────────┐ ┌────────────────┐            │
│  │  FWHT on block (512) │ │ FWHT on (256)  │            │
│  │                      │ │                │            │
│  │   ╱╲    butterfly    │ │   ╱╲           │            │
│  │  ╱  ╲   operations   │ │  ╱  ╲          │            │
│  │ x+y  x-y             │ │ x+y  x-y       │            │
│  │                      │ │                │            │
│  │  × 1/√512 normalize  │ │ × 1/√256       │            │
│  └──────────────────────┘ └────────────────┘            │
│                                                         │
│  Purpose: mixes components → Central Limit Theorem      │
│           drives distribution toward Gaussian           │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
Output vector (≈ Gaussian distribution per component)
┌─────────────────────────────────────────────────────────┐
│ out[0]  out[1]  out[2]  out[3]  out[4]  ...  out[d-1]   │
└─────────────────────────────────────────────────────────┘


KEY PROPERTIES:
═══════════════
• Orthogonal:  ‖Rx‖ = ‖x‖         (preserves L2 norm)
               (Rx)·(Ry) = x·y     (preserves dot products)

• Deterministic: same (dim, seed) → same rotation always

• Cost: O(d log d) for FWHT + O(d) for signs & permutation

• Invertible: R⁻¹ = Rᵀ  (just reverse the 3 steps)


WHY IT HELPS QUANTIZATION:
══════════════════════════

Before rotation:              After rotation:
(skewed/sparse/uniform)       (≈ Gaussian)

 ▌                               ▄
 ▌                              ▄█▄
 ▌▌                           ▄▄███▄▄
 ▌▌▌                        ▄▄██████▄▄▄
▄▌▌▌▌▄▄▄▄▄▄▄▄▄▄▄▄▄▄     ▄▄▄████████████▄▄▄
─────────────────────     ─────────────────────
→ Quantization bins       → Quantization bins
  poorly utilized            evenly utilized
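
The three steps in the diagram can be sketched in a few dozen lines of Java. This is a minimal illustration written for this description, not the PR's `HadamardRotation` class: the class name `HadamardRotationSketch`, the use of `java.util.Random` for the sign flips and shuffle, and the largest-block-first ordering are all assumptions.

```java
import java.util.Random;

/** Illustrative sketch only; names and seeding scheme are assumptions. */
final class HadamardRotationSketch {
  private final boolean[] signFlips; // true means negate that component
  private final int[] perm;          // out[perm[i]] = scratch[i]

  HadamardRotationSketch(int dim, long seed) {
    Random rnd = new Random(seed);   // same (dim, seed) gives the same rotation
    signFlips = new boolean[dim];
    perm = new int[dim];
    for (int i = 0; i < dim; i++) {
      signFlips[i] = rnd.nextBoolean();
      perm[i] = i;
    }
    for (int i = dim - 1; i > 0; i--) { // Fisher-Yates shuffle
      int j = rnd.nextInt(i + 1);
      int tmp = perm[i];
      perm[i] = perm[j];
      perm[j] = tmp;
    }
  }

  float[] rotate(float[] in) {
    if (in.length != perm.length) {
      throw new IllegalArgumentException("expected dimension " + perm.length);
    }
    float[] out = new float[in.length];
    // Steps 1 and 2: sign flip, then scatter through the permutation.
    for (int i = 0; i < in.length; i++) {
      out[perm[i]] = signFlips[i] ? -in[i] : in[i];
    }
    // Step 3: block-diagonal FWHT, largest power-of-two block first.
    int offset = 0;
    while (offset < out.length) {
      int block = Integer.highestOneBit(out.length - offset);
      fwht(out, offset, block);
      offset += block;
    }
    return out;
  }

  /** In-place FWHT on a[offset, offset + n), n a power of two, scaled by 1/sqrt(n). */
  static void fwht(float[] a, int offset, int n) {
    for (int h = 1; h < n; h <<= 1) {
      for (int i = 0; i < n; i += h << 1) {
        for (int j = i; j < i + h; j++) {
          float x = a[offset + j];
          float y = a[offset + j + h];
          a[offset + j] = x + y;       // butterfly: sum
          a[offset + j + h] = x - y;   // butterfly: difference
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n));
    for (int i = 0; i < n; i++) {
      a[offset + i] *= scale;
    }
  }

  /** Dot product accumulated in double, handy for checking orthogonality. */
  static double dot(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (double) a[i] * b[i];
    }
    return sum;
  }
}
```

Rotating two vectors with the same instance and comparing `dot(x, y)` against `dot(rotate(x), rotate(y))` should agree up to float rounding, which is the orthogonality property listed above.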

Cohere V3

Baseline (main) :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: num_segments = 1 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.695        2.413   2.409        0.998     1 bits     9589     96.07       5204.65          110.07         2102.03      2020.836       67.711
 0.816        2.512   2.510        0.999     2 bits     8614     98.17       5093.00          137.36         2156.30      2081.871      128.746
 0.918        2.711   2.709        0.999     4 bits     8221     96.06       5205.08          135.73         2276.80      2204.895      251.770
 0.970        3.789   3.788        1.000     7 bits     8129    101.13       4944.28          228.06         2520.64      2449.036      495.911
 0.976        3.804   3.803        1.000     8 bits     8129    101.92       4905.71          230.64         2520.66      2449.036      495.911

Candidate (main + rotation i.e. this PR) :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: num_segments = 1 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.729        2.243   2.241        0.999     1 bits     9174     96.59       5176.47          100.21         2099.73      2020.836       67.711
 0.841        2.478   2.474        0.998     2 bits     8468     97.67       5119.12          128.30         2155.75      2081.871      128.746
 0.937        2.613   2.612        1.000     4 bits     8162    100.38       4980.97          124.06         2276.62      2204.895      251.770
 0.982        3.704   3.702        0.999     7 bits     8124    102.14       4895.48          227.00         2520.65      2449.036      495.911
 0.985        3.685   3.684        1.000     8 bits     8120     99.64       5017.91          227.74         2520.64      2449.036      495.911

Recall

Bits  Baseline  Rotation   Delta  % Gain
   1     0.695     0.729  +0.034   +4.9%
   2     0.816     0.841  +0.025   +3.1%
   4     0.918     0.937  +0.019   +2.1%
   7     0.970     0.982  +0.012   +1.2%
   8     0.976     0.985  +0.009   +0.9%

shubhamvishu and others added 10 commits May 20, 2026 18:53
A lightweight wrapping FlatVectorsFormat that applies a randomized
Hadamard rotation to vectors before handing them to a delegate format
(e.g. Lucene104ScalarQuantizedVectorsFormat), and rotates query vectors
at search time. Because the rotation is orthogonal, dot product, cosine
similarity, and Euclidean distance are all preserved, so the delegate's
similarity math is unchanged. The rotation redistributes variance across
dimensions, which makes OSQ's assumption of Gaussian components hold on
datasets whose raw components are skewed or uniform (image pixels,
histograms, non-transformer embeddings).

Motivation and approach come from the discussion in Apache Lucene PR
apache#15903 (TurboQuant) and Elastic's April 2026 blog on BBQ
preconditioning, which measured 41-74% recall improvements on GIST /
SIFT / Fashion-MNIST at ~2-4% query overhead.

Implementation:
- HadamardRotation: immutable, thread-safe, O(d log d) via Fast
  Walsh-Hadamard Transform with random sign flips and a Fisher-Yates
  permutation. Supports non-power-of-2 dimensions through a block
  decomposition (e.g. 768 = 512 + 256). Provides both forward and
  inverse rotations.
- RotationPreconditionedVectorsFormat: public FlatVectorsFormat with a
  no-arg constructor (required for SPI) that defaults to wrapping
  Lucene104ScalarQuantizedVectorsFormat, plus constructors that take a
  custom delegate and seed.
- RotationPreconditionedVectorsWriter: intercepts addValue to rotate
  each vector before forwarding to the delegate's field writer.
  Byte vectors pass through unchanged.
- RotationPreconditionedVectorsReader: rotates float query vectors
  before scoring, inverse-rotates stored vectors in
  getFloatVectorValues for rescore/CheckIndex callers, and exposes the
  raw rotated values via getMergeInstance() so that the delegate's
  merge runs entirely in rotated space (preserving byte-copy merge
  where the delegate supports it).
- Global deterministic rotation per (dim, seed): the same rotation
  across all segments enables byte-copy merges in the underlying
  format.
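
The block decomposition mentioned above (768 = 512 + 256) can be read off the binary representation of the dimension. The following is a small sketch of that idea; `BlockDecompositionSketch` and `powerOfTwoBlocks` are hypothetical names, and the largest-first ordering is an assumption rather than something this commit message specifies.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not code from this commit. */
final class BlockDecompositionSketch {
  /** Splits dim into descending power-of-two block sizes that sum to dim. */
  static List<Integer> powerOfTwoBlocks(int dim) {
    if (dim <= 0) {
      throw new IllegalArgumentException("dim must be positive: " + dim);
    }
    List<Integer> blocks = new ArrayList<>();
    int remaining = dim;
    while (remaining > 0) {
      int block = Integer.highestOneBit(remaining); // largest power of two <= remaining
      blocks.add(block);
      remaining -= block;
    }
    return blocks;
  }

  public static void main(String[] args) {
    System.out.println(powerOfTwoBlocks(768));  // [512, 256]
    System.out.println(powerOfTwoBlocks(1024)); // [1024]
  }
}
```

Each block is transformed independently, so components only mix within their own block. Dimensions with many small set bits would therefore end up with small trailing blocks that mix few components.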

SPI wiring:
- META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat registers
  the new format alongside FaissKnnVectorsFormat.
- module-info.java exports the package and lists the format under the
  KnnVectorsFormat 'provides' clause.

Tests:
- TestHadamardRotation: 11 unit tests covering orthogonality (L2 norm
  preservation, dot product preservation, Euclidean distance
  preservation), determinism (same seed -> same rotation), non-identity
  (different seeds differ), block decomposition correctness, and the
  spreading property on concentrated inputs.
- TestRotationPreconditionedVectorsFormat: extends
  BaseKnnVectorsFormatTestCase and runs the full suite of KNN-format
  correctness tests (merge, sort, delete, multi-field, mismatched
  fields, random exceptions, etc.). Eight tests in the base suite
  assert bit-exact round-trip equality of indexed vectors; those are
  overridden with explanatory comments because rotate+inverse_rotate
  introduces ~1e-7 floating-point drift. Search correctness is
  unaffected because the rotation is orthogonal.
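
To illustrate why bit-exact round-trip assertions cannot hold, here is a minimal sketch that applies a normalized FWHT twice. The normalized transform is its own inverse in exact arithmetic, so any difference from the input is float rounding. `RoundTripDriftSketch` is a hypothetical name, and the sketch leaves out the sign flips and permutation, which are exact operations and add no drift.

```java
/** Illustrative sketch only; not one of the tests in this commit. */
final class RoundTripDriftSketch {
  /** In-place FWHT scaled by 1/sqrt(n); a.length must be a power of two. */
  static void fwht(float[] a) {
    int n = a.length;
    for (int h = 1; h < n; h <<= 1) {
      for (int i = 0; i < n; i += h << 1) {
        for (int j = i; j < i + h; j++) {
          float x = a[j];
          float y = a[j + h];
          a[j] = x + y;
          a[j + h] = x - y;
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n));
    for (int i = 0; i < n; i++) {
      a[i] *= scale;
    }
  }

  /** Largest absolute difference between the input and its double transform. */
  static float roundTripError(float[] original) {
    float[] copy = original.clone();
    fwht(copy); // forward
    fwht(copy); // inverse (same operation for the normalized transform)
    float max = 0f;
    for (int i = 0; i < original.length; i++) {
      max = Math.max(max, Math.abs(original[i] - copy[i]));
    }
    return max;
  }
}
```

The returned error is typically tiny but not guaranteed to be zero, which is why a round-trip comparison needs a small tolerance rather than exact equality.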

All sandbox tests pass (336 tests, 64 pre-existing skips).
Refactors the rotation preconditioner out of a sandbox FlatVectorsFormat
wrapper and into the existing Lucene104ScalarQuantizedVectorsFormat
(OSQ) as an opt-in feature controlled by a rotationSeed constructor
argument. Rotation is now a first-class capability of OSQ: when a
non-zero seed is supplied, every incoming vector is Hadamard-rotated
before centroid computation and quantization, and every query is
rotated the same way at search time. Because the rotation is
orthogonal, all similarity functions (dot product, cosine, Euclidean)
are preserved, but per-coordinate distributions become much more
Gaussian — which makes OSQ's initialization assumption hold on
datasets with skewed or uniform components (image pixels, histograms,
non-transformer embeddings).

Motivation comes from Apache Lucene PR apache#15903 (TurboQuant) discussion
and Elastic's April 2026 blog on BBQ preconditioning, which measured
41-74% recall improvements on GIST / SIFT / Fashion-MNIST at ~2-4%
query overhead.

Changes:
- Move HadamardRotation + its test from lucene/sandbox to
  lucene/core/src/java/org/apache/lucene/util/quantization/ so it
  lives next to the existing OptimizedScalarQuantizer.
- Lucene104ScalarQuantizedVectorsFormat: add a rotationSeed
  constructor parameter (default ROTATION_DISABLED = 0 preserves
  existing behaviour). Bump the on-disk format to
  VERSION_PRECONDITIONED (1). Old segments (version 0) are still
  readable; their seed is implicitly 0.
- Lucene104HnswScalarQuantizedVectorsFormat: add a matching ctor
  overload so the HNSW wrapper can enable preconditioning.
- Lucene104ScalarQuantizedVectorsWriter: constructor takes the seed;
  FieldWriter.addValue rotates the incoming vector up front so all
  downstream OSQ math (centroid accumulation, raw storage, quantization)
  runs in the rotated basis. writeMeta persists the seed.
- Lucene104ScalarQuantizedVectorsReader: FieldEntry now carries
  rotationSeed; readField reads it when the version supports it.
  getRandomVectorScorer(String, float[]) rotates the query before
  scoring. getFloatVectorValues wraps the raw delegate with an
  InverseRotatedFloatVectorValues so external callers (rerank,
  CheckIndex, FieldExistsQuery, etc.) see the original vectors they
  indexed. getMergeInstance() returns a lightweight MergeReader that
  skips the inverse rotation — the downstream merge then operates
  entirely in rotated space, preserving consistency across segments.
- Remove the sandbox/rotation package and its tests; revert the
  sandbox module-info and SPI service registration.
- Update OSQ and HNSW toString() tests to include rotationSeed.

Add TestLucene104ScalarQuantizedVectorsFormatPreconditioning covering
end-to-end search with rotation enabled, round-tripping vectors
through rotate+inverseRotate via getFloatVectorValues, seed=0
equivalence to the default format, and toString observability.

All existing OSQ flat/HNSW/backward-compat tests continue to pass.
The 4 new preconditioning tests and the 11 HadamardRotation math tests
pass.
Replaces the previous attempt that modified Lucene104 in place. Since
Lucene104 is a shipped codec with a frozen on-disk format, any layout
change belongs in a new codec family.

This commit:

- Restores Lucene104ScalarQuantizedVectorsFormat (and the matching
  HNSW wrapper / writer / reader / tests) to their exact pre-patch
  state. Anyone with a Lucene104 index can still read it byte-for-byte
  the same as before.

- Introduces Lucene105ScalarQuantizedVectorsFormat + the HNSW wrapper
  as a new codec family (package
  org.apache.lucene.codecs.lucene105). The codec-name headers and
  internal NAME strings all use 'Lucene105' so the new layout can be
  distinguished at read time. File extensions (.veq, .vemq) are the
  same because the codec-name header in each file is what
  disambiguates.

- Adds rotation preconditioning natively to Lucene105 as an opt-in
  feature controlled by a rotationSeed constructor argument:
    * Default / sentinel value ROTATION_DISABLED (0) keeps the format
      layout shape matching Lucene104 aside from one extra long per
      field in metadata.
    * A non-zero seed enables Hadamard rotation at index and query
      time. The rotation is orthogonal so dot product / cosine /
      Euclidean distances are preserved end to end; what changes is
      the per-coordinate distribution of the stored vectors, which
      becomes much more Gaussian. This helps OSQ initialization on
      datasets with skewed / uniform components (image pixels,
      histograms, non-transformer embeddings).
    * The seed is persisted in per-field metadata. Reader rotates
      queries in getRandomVectorScorer, inverse-rotates stored values
      in getFloatVectorValues (so external rerank / CheckIndex /
      FieldExistsQuery callers see the original vectors), and exposes
      the raw rotated values (no inverse rotation) via
      getMergeInstance so merges stay in the rotated basis end to end.

- Clones the scorer (Lucene105ScalarQuantizedVectorScorer) and the two
  Off-heap value classes (OffHeapScalarQuantizedVectorValues,
  OffHeapScalarQuantizedFloatVectorValues) into the new package so the
  Lucene104 package-private members don't have to be made public for
  the Lucene105 codec to use them. HadamardRotation lives once, in
  lucene/core/src/java/org/apache/lucene/util/quantization/, because
  it's a utility rather than a codec.

- Registers Lucene105ScalarQuantizedVectorsFormat and
  Lucene105HnswScalarQuantizedVectorsFormat via SPI (META-INF
  services file and the module-info 'provides' clause), and exports
  the new package.

- Adds TestLucene105ScalarQuantizedVectorsFormatPreconditioning with
  four targeted tests covering end-to-end preconditioned search,
  vector round-trip through rotate+inverseRotate via
  getFloatVectorValues, seed=0 equivalence to the default format, and
  toString observability.

All existing Lucene104 OSQ tests, Lucene105 preconditioning tests,
HadamardRotation math tests, backward-compat Lucene99 OSQ tests, and
sandbox tests pass.

Usage:
    // Pick the old codec for backward compat.
    new Lucene104ScalarQuantizedVectorsFormat();

    // Pick the new codec with no rotation (default).
    new Lucene105ScalarQuantizedVectorsFormat();

    // Pick the new codec with rotation preconditioning enabled.
    new Lucene105ScalarQuantizedVectorsFormat(
        ScalarEncoding.UNSIGNED_BYTE, 0x5eedCafeBabeL);
…ring merge

During force_merge, the HNSW graph builder needs to compare documents
against each other. For 1-bit and 2-bit (asymmetric) encodings, this
requires building a temporary 4-bit "query" representation of each
document by reading back its float vector and re-quantizing it against
the segment centroid.

The bug: getRandomVectorScorerSupplierForMerge() called
getFloatVectorValues(), which inverse-rotates the stored vectors back
to original space (designed for external callers). These un-rotated
vectors were then quantized against the centroid, which lives in
rotated space (computed from rotated vectors during indexing). The
centering step (vector[i] - centroid[i]) mixed original-space vectors
with a rotated-space centroid, producing meaningless 4-bit
representations. The HNSW graph built from these scores was
essentially random, dropping recall from 0.695 to 0.050 (1-bit) and
0.816 to 0.055 (2-bit).

Only 1-bit and 2-bit are affected. 4-bit, 7-bit, and 8-bit use
symmetric scoring which reads already-quantized bytes directly — no
float vectors involved, no rotation mismatch possible.

The fix: use rawVectorsReader.getFloatVectorValues() to read the
stored rotated vectors directly, matching the rotated centroid.

Indices built with the buggy code have corrupted HNSW graphs for
1-bit and 2-bit segments and need reindexing or re-merging.
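
The space mismatch described above can be reproduced with a toy example. The sketch below uses a 2-D normalized Hadamard matrix as a stand-in rotation and two made-up document vectors. All names are hypothetical and none of this is the codec's code; it only shows that centering an original-space vector against a rotated-space centroid gives a residual unrelated to the true one.

```java
/** Illustrative sketch only; toy data and a 2-D stand-in rotation. */
final class RotatedCentroidMismatchSketch {
  /** 2-D normalized Hadamard rotation: [[1, 1], [1, -1]] / sqrt(2). */
  static float[] rotate(float[] v) {
    float s = (float) (1.0 / Math.sqrt(2.0));
    return new float[] {(v[0] + v[1]) * s, (v[0] - v[1]) * s};
  }

  /** Component-wise mean of 2-D vectors. */
  static float[] centroid(float[][] vectors) {
    float[] c = new float[2];
    for (float[] v : vectors) {
      c[0] += v[0];
      c[1] += v[1];
    }
    c[0] /= vectors.length;
    c[1] /= vectors.length;
    return c;
  }

  /** Length of (v - c), i.e. the size of the centered residual. */
  static double residualNorm(float[] v, float[] c) {
    double dx = v[0] - c[0];
    double dy = v[1] - c[1];
    return Math.sqrt(dx * dx + dy * dy);
  }

  public static void main(String[] args) {
    float[][] docs = {{10f, 0f}, {12f, 0f}};
    float[][] rotatedDocs = {rotate(docs[0]), rotate(docs[1])};
    float[] rotatedCentroid = centroid(rotatedDocs);
    // Consistent spaces: same residual size as in original space (about 1).
    System.out.println(residualNorm(rotatedDocs[0], rotatedCentroid));
    // Mixed spaces (the bug): original vector minus rotated centroid, much larger.
    System.out.println(residualNorm(docs[0], rotatedCentroid));
  }
}
```

In the consistent case the residual matches the original-space residual because the rotation is orthogonal. In the mixed case the residual is dominated by the offset between the two coordinate systems, so quantizing it carries little information about the document.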

Benchmark results (Cohere v3 1024d, 500K docs, DOT_PRODUCT):

  bits  baseline  before-fix  after-fix
     1     0.695       0.050      0.729
     2     0.816       0.055      0.841
     4     0.918       0.937      0.937
     7     0.970       0.982      0.982
     8     0.976       0.985      0.985
@shubhamvishu added the skip-changelog label May 21, 2026
@shubhamvishu
Contributor Author

I also ran the same luceneutil benchmark with 4K-dimensional vectors and I see an even higher impact on recall (~6-7% improvement) overall, with a slight slowdown in indexing rate (~5%) due to rotation overhead.
 

With internal Amazon 4K vectors embeddings

Setup:

  • 500K docs, 4096 dimensions, DOT_PRODUCT similarity, HNSW (maxConn=64, beamWidth=250), 10K queries.
Bits      Avg Baseline Recall  Avg Candidate Recall (Rotation)  Avg Delta  % Diff
1         0.828                0.889                            +0.061     +7.4%
2         0.858                0.916                            +0.058     +6.7%
4         0.893                0.958                            +0.066     +7.3%
7         0.920                0.972                            +0.052     +5.6%
8         0.927                0.974                            +0.047     +5.1%
All bits  0.885                0.942                            +0.057     +6.4%

Metric            Baseline      Rotation      % Diff  Impact
Recall (avg all)  0.885         0.942         +6.4%   Improvement
Search latency    ~2.0 ms       ~2.0 ms       ~0%     No change
Index rate        ~3830 docs/s  ~3620 docs/s  -5.5%   Slightly slower
Index size        8091-9799 MB  8091-9799 MB  0%      Identical
Force merge time  ~213 s        ~194 s        -8.9%   No regression

@shubhamvishu shubhamvishu requested a review from mccullocht May 21, 2026 00:38
FieldInfo info = fieldInfos.fieldInfo(field);
return info != null
    && "true"
        .equals(info.getAttribute(Lucene104ScalarQuantizedVectorsFormat.ROTATION_ENABLED_KEY));
Contributor

I haven't ever seen attributes used to enable codec features like this; at least it doesn't appear to be common practice in the core codecs. More typically this would tick VERSION_CURRENT and something (probably just the rotation seed?) would be encoded as part of field metadata. I don't have a good sense as to why; maybe @mikemccand or @benwtrent has a firmer and better-reasoned opinion.

Contributor Author

@shubhamvishu shubhamvishu May 21, 2026

Yeah, this resonates with me too. The initial version (1st commit) tried to keep it in the sandbox module. I then moved it into the core codec, but that version wrote a random seed to the field's metadata, and I had to bump the codec to 105. I honestly didn't want to bump the codec for this use case, so I switched to a constant seed and assumed rotation is enabled for all vector fields by default (which obviously simplified all of this, but takes away the user's ability to configure it on a per-field basis). So eventually I leaned towards sharing the rotationEnabled flag (configured per vector field) with query time via FieldInfos and avoiding the codec bump, since this way we don't break backward compatibility and it stays simple. I'm open to whichever approach we want to choose, or to a better way of sharing this info (rotation enabled/disabled), hopefully without the codec bump.

if (isRotationEnabled(field) && target != null) {
  HadamardRotation rotation = rotationFor(field, fi.dimension);
  float[] rotated = new float[target.length];
  rotation.rotate(target, rotated);
Contributor

IIUC this rotation operation is probably fairly expensive -- at least as expensive as quantization but possibly even more expensive. In Lucene's segment structure this operation will be repeated for every segment searched. In your tests you probably ran with everything merged down to a single segment but I'm interested in what this costs in a more typical multi-segment setup. A microbenchmark for the rotation or a full luceneutil run would be helpful here.

Depending on the cost we might want to figure out a way to reuse this computation across segments somehow which probably requires upstream integration with the knn query classes.

Contributor Author

@shubhamvishu shubhamvishu May 21, 2026

rotate is an O(d log d) operation here. I ran luceneutil without forceMerge too, but I don't have the Cohere results handy anymore (I can pull those up again). I do have the results with 4K-dim embeddings in a multi-segment index; sharing those below for your reference (I will share the Cohere ones soon too). It didn't seem to regress the latency as such. We can do more JMH benchmarking or whatever could give us more confidence, but so far it appears to be cheap (I don't know whether the quantization overhead was significant in the past).

Member

We cannot do a rotation per segment, that is a non starter.

This would significantly impact latency in the most common scenario, which is many segments.

Instead I suggest queries need to have an "additional phase" that looks to see if any of the KnnFormats can apply a "globalPrecondition" step or something and then apply it once for the query.
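
One possible reading of this suggestion, sketched with purely hypothetical names (`SegmentScorer`, `searchRotatingOnce`; no such "globalPrecondition" hook exists in Lucene today): rotate the query a single time before visiting segments, instead of paying the O(d log d) cost inside each segment's scorer.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Illustrative sketch only; these types are stand-ins, not Lucene APIs. */
final class GlobalPreconditionSketch {
  /** Hypothetical per-segment scorer over vectors stored in rotated space. */
  interface SegmentScorer {
    float bestScore(float[] rotatedQuery);
  }

  /** Shape of the current PR: the rotation is repeated for every segment. */
  static float searchRotatingPerSegment(
      float[] query, List<SegmentScorer> segments, UnaryOperator<float[]> rotation) {
    float best = Float.NEGATIVE_INFINITY;
    for (SegmentScorer segment : segments) {
      float[] rotated = rotation.apply(query); // repeated work per segment
      best = Math.max(best, segment.bestScore(rotated));
    }
    return best;
  }

  /** Suggested shape: precondition once, then reuse across all segments. */
  static float searchRotatingOnce(
      float[] query, List<SegmentScorer> segments, UnaryOperator<float[]> rotation) {
    float[] rotated = rotation.apply(query); // single rotation for the whole query
    float best = Float.NEGATIVE_INFINITY;
    for (SegmentScorer segment : segments) {
      best = Math.max(best, segment.bestScore(rotated));
    }
    return best;
  }
}
```

Both methods return the same score as long as every segment uses the same rotation, which is the case when the rotation is global per (dim, seed). The difference is only how many times the rotation runs.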

Member

@benwtrent benwtrent May 21, 2026

To see the significance of the performance impact, you will need many vectors spread over many segments (which is very common, I mean 10s of millions spread over 30-50+ segments).

Contributor Author

@benwtrent I see. Yeah, I think moving it upstream and doing it once makes sense to me. Maybe just moving it to KnnQuery as Trevor mentioned below?

"To see the significance of the performance impact, you will need many vectors spread over many segments"

That could be the case, yes. Sharing below the Cohere V3 run I did with multiple segments; it doesn't seem to have much impact, but larger segments of 30M might move the needle.

Result : Cohere V3 without forceMerge

Baseline :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.695        1.916   5.285        2.758     1 bits    54581     94.09       5314.12             5         2092.52      2020.836       67.711
 0.817        1.887   6.052        3.208     2 bits    50314     95.73       5223.08             5         2147.02      2081.871      128.746
 0.921        2.107   7.275        3.453     4 bits    50877     97.97       5103.60             6         2266.48      2204.895      251.770
 0.976        3.300  12.843        3.892     7 bits    61241    101.92       4906.00             9         2506.59      2449.036      495.911
 0.983        3.280  12.810        3.905     8 bits    61080     99.65       5017.51             9         2506.62      2449.036      495.911

Candidate:

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.729        1.845   5.056        2.740     1 bits    51755     96.87       5161.40             5         2090.45      2020.836       67.711
 0.843        1.870   5.904        3.157     2 bits    49123     98.38       5082.18             5         2146.54      2081.871      128.746
 0.940        2.089   7.143        3.419     4 bits    50584     95.99       5209.04             6         2266.39      2204.895      251.770
 0.988        3.255  12.872        3.955     7 bits    60998    100.92       4954.52             9         2506.59      2449.036      495.911
 0.993        3.244  12.887        3.973     8 bits    60955    100.85       4958.01             9         2506.60      2449.036      495.911

@shubhamvishu
Contributor Author

Luceneutil with Amazon 4K vectors embeddings (forceMerge=False)

NOTE: Runs 1 and 2 are on separate 4K embedding datasets (500K docs), so I'm sharing both.

Run 1

Baseline :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.847        1.502   6.435        4.284     1 bits    30107    122.60       4078.17            11         8090.94      8063.316      250.816
 0.873        1.753   7.827        4.464     2 bits    28103    124.40       4019.26            12         8333.67      8307.457      494.957
 0.906        2.459  10.420        4.238     4 bits    28075    123.27       4056.01            12         8821.31      8796.692      984.192
 0.931        3.068  11.445        3.730     7 bits    19281    145.29       3441.42             7         9798.58      9773.254     1960.754
 0.936        3.467  10.623        3.064     8 bits    18113    144.72       3454.88             6         9798.82      9773.254     1960.754

Candidate:

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.906        1.176   4.908        4.174     1 bits    22856    134.87       3707.16             9         8090.28      8063.316      250.816
 0.936        1.459   5.608        3.845     2 bits    20761    126.32       3958.30             9         8333.64      8307.457      494.957
 0.971        2.228   9.872        4.430     4 bits    27231    140.24       3565.42            12         8821.80      8796.692      984.192
 0.988        3.446  15.027        4.361     7 bits    25644    148.67       3363.27            10         9798.57      9773.254     1960.754
 0.989        3.164  11.303        3.573     8 bits    19280    145.93       3426.35             7         9799.14      9773.254     1960.754
Run 2

Baseline :

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.813        1.145   4.208        3.676     1 bits    17977    121.83       4104.15             8         8089.82      8063.316      250.816
 0.849        1.423   6.169        4.336     2 bits    21256    115.82       4317.01            10         8332.16      8307.457      494.957
 0.885        1.805   6.790        3.761     4 bits    17201    130.58       3829.22             9         8820.67      8796.692      984.192
 0.921        2.918  12.660        4.339     7 bits    20622    136.97       3650.57            10         9796.88      9773.254     1960.754
 0.926        2.678   8.020        2.995     8 bits    13287    141.24       3540.07             5         9797.46      9773.254     1960.754

Candidate:

Results:
NOTE: nDoc = 500000 for all runs; skipping column
NOTE: searchType = KNN for all runs; skipping column
NOTE: topK = 100 for all runs; skipping column
NOTE: fanout = 100 for all runs; skipping column
NOTE: resultSimilarity = N/A for all runs; skipping column
NOTE: decay = N/A for all runs; skipping column
NOTE: resultCount = 100.000 for all runs; skipping column
NOTE: maxConn = 64 for all runs; skipping column
NOTE: beamWidth = 250 for all runs; skipping column
NOTE: force_merge(s) = 0.00 for all runs; skipping column
NOTE: filterStrategy = null for all runs; skipping column
NOTE: filterSelectivity = N/A for all runs; skipping column
NOTE: overSample = 1.000 for all runs; skipping column
NOTE: bp-reorder = false for all runs; skipping column
NOTE: indexType = HNSW for all runs; skipping column
NOTE: rerank = no for all runs; skipping column
recall  latency(ms)  netCPU  avgCpuCount  quantized  visited  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)
 0.878        1.104   4.588        4.154     1 bits    20309    133.12       3756.15             9         8089.49      8063.316      250.816
 0.906        1.333   5.127        3.847     2 bits    17963    136.62       3659.87             9         8333.10      8307.457      494.957
 0.957        1.797   5.546        3.085     4 bits    15179    140.91       3548.41             6         8821.53      8796.692      984.192
 0.969        2.928  10.953        3.741     7 bits    17888    143.28       3489.79             9         9797.76      9773.254     1960.754
 0.969        3.004  11.197        3.727     8 bits    18177    136.01       3676.17             9         9797.79      9773.254     1960.754
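
To make the two runs easier to compare, the recall columns above can be diffed per quantization level. This is only a minimal sketch: `RecallDelta` is an illustrative helper and not part of luceneutil, and the numbers are copied from the Run 1 and Run 2 tables above.

```java
import java.util.Locale;

/** Hypothetical helper (not part of luceneutil): recall gain of candidate over baseline. */
final class RecallDelta {
  static final int[] BITS = {1, 2, 4, 7, 8};

  /** Returns candidate[i] - baseline[i] for each quantization level. */
  static double[] deltas(double[] baseline, double[] candidate) {
    if (baseline.length != candidate.length) {
      throw new IllegalArgumentException("length mismatch");
    }
    double[] out = new double[baseline.length];
    for (int i = 0; i < out.length; i++) {
      out[i] = candidate[i] - baseline[i];
    }
    return out;
  }

  public static void main(String[] args) {
    // Recall columns copied from the Run 1 and Run 2 tables above.
    double[][] baseline = {
      {0.847, 0.873, 0.906, 0.931, 0.936},
      {0.813, 0.849, 0.885, 0.921, 0.926}
    };
    double[][] candidate = {
      {0.906, 0.936, 0.971, 0.988, 0.989},
      {0.878, 0.906, 0.957, 0.969, 0.969}
    };
    for (int run = 0; run < baseline.length; run++) {
      double[] d = deltas(baseline[run], candidate[run]);
      for (int i = 0; i < d.length; i++) {
        System.out.println(
            String.format(Locale.ROOT, "Run %d, %d bits: %+.3f", run + 1, BITS[i], d[i]));
      }
    }
  }
}
```

On these numbers the absolute recall gain ranges from 0.043 (Run 2, 8 bits) to 0.072 (Run 2, 4 bits).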

cc - @mccullocht

private final FieldInfos fieldInfos;

/** Lazily built Hadamard rotations, keyed by field name. */
private final Map<String, HadamardRotation> rotations = new ConcurrentHashMap<>();
Member

I don't know about this. This isn't cheap. We have "field * num_segments * size(HadamardRotationObject)".

Seems to me the rotation matrix should just be stored off heap :/ Or the rotation is by dimension, not by field name (idk why we need a new random matrix for this for two fields that have the same dimension).
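
The cost concern can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a measurement of the PR: it assumes each rotation keeps one float sign and one int permutation index per dimension, padded to a power of two (which the FWHT requires). How `HadamardRotation` actually stores its state is not shown in this thread, and all names here are hypothetical.

```java
/** Back-of-the-envelope heap estimate; the per-rotation layout is an assumption, not the PR's. */
final class RotationHeapEstimate {
  /** Smallest power of two >= dim (the FWHT needs a power-of-two length). */
  static int paddedDim(int dim) {
    return dim <= 1 ? 1 : Integer.highestOneBit(dim - 1) << 1;
  }

  /** Assumed state: one float sign and one int permutation index per padded dimension. */
  static long bytesPerRotation(int dim) {
    return (long) paddedDim(dim) * (Float.BYTES + Integer.BYTES);
  }

  /** Cost when every (field, segment) pair lazily builds its own rotation. */
  static long perFieldPerSegment(int numFields, int numSegments, int dim) {
    return (long) numFields * numSegments * bytesPerRotation(dim);
  }

  /** Cost when one rotation is shared by all fields and segments of a given dimension. */
  static long perDimension(int dim) {
    return bytesPerRotation(dim);
  }
}
```

Under these assumptions, 2 fields across 12 segments at 4096 dimensions come to 786,432 bytes, against 32,768 bytes for a single shared rotation.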

Contributor Author

> idk why we need a new random matrix for this for two fields that have the same dimension

That was because the random seed takes the field name into account, but I agree with your point: we could reuse the same matrix across fields of the same dimension. There is likely little or no benefit to per-field seeds beyond avoiding the chance that one bad seed affects every vector field, and the simplification outweighs that risk without ruling it out entirely. I'll try to stick to a single rotation matrix per dimension. Do you think there is enough value in moving it off heap once there is only one rotation per dimension, or would that be overkill?
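
A single rotation per dimension could look roughly like the sketch below. It is not the PR's code: `SharedRotations`, the seed constant and the power-of-two restriction are assumptions for illustration, and the permutation step from the PR description is left out for brevity.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: one randomized Hadamard rotation per dimension, shared by all fields. */
final class SharedRotations {
  private static final Map<Integer, SharedRotations> BY_DIM = new ConcurrentHashMap<>();

  private final float[] signs; // random +1/-1 per dimension

  private SharedRotations(int dim) {
    if (dim <= 0 || Integer.bitCount(dim) != 1) {
      throw new IllegalArgumentException("sketch only handles power-of-two dims: " + dim);
    }
    // The seed depends only on the dimension (arbitrary constant), never on the field name.
    Random random = new Random(0x9E3779B97F4A7C15L ^ dim);
    signs = new float[dim];
    for (int i = 0; i < dim; i++) {
      signs[i] = random.nextBoolean() ? 1f : -1f;
    }
  }

  /** Returns the shared rotation for this dimension, building it on first use. */
  static SharedRotations forDimension(int dim) {
    return BY_DIM.computeIfAbsent(dim, d -> new SharedRotations(d));
  }

  /** Applies sign flips, then an in-place FWHT on a copy, scaled by 1/sqrt(n) to stay orthonormal. */
  float[] rotate(float[] v) {
    int n = signs.length;
    if (v.length != n) {
      throw new IllegalArgumentException("expected " + n + " dims, got " + v.length);
    }
    float[] out = new float[n];
    for (int i = 0; i < n; i++) {
      out[i] = v[i] * signs[i];
    }
    for (int len = 1; len < n; len <<= 1) {
      for (int i = 0; i < n; i += len << 1) {
        for (int j = i; j < i + len; j++) {
          float a = out[j];
          float b = out[j + len];
          out[j] = a + b;
          out[j + len] = a - b;
        }
      }
    }
    float scale = (float) (1.0 / Math.sqrt(n));
    for (int i = 0; i < n; i++) {
      out[i] *= scale;
    }
    return out;
  }
}
```

Because the seed is a function of the dimension alone, two fields (or two segments) with the same dimension resolve to the same instance, and documents and queries rotated through it keep their dot products up to floating-point error.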

@benwtrent
Member

It's a nice idea. I think we should strive to have a general "precondition vectors" interface. I am not sure about the idea of having it integrated via field infos... I need to do some thinking here.

Two big issues besides the API that are bothering me:

  • Keeping a bunch of rotation matrices on heap for every segment is unnecessarily expensive
  • One precondition for all segments is critical.

I don't know of another API in Lucene that has lazy state cached that is global over segments...this would be a fairly new thing here. Maybe we can "hack it" and add something to the KnnFormat reader interface, e.g. "globalPreconditioning" or something that queries can iterate and utilize...ugh, but then queries don't have the very nice API of just "search", and need to do this other step of "precondition"... we don't want to make things super complex for ALL other vector queries that Lucene has or that others wrote :(

this is a tough one.
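
For discussion, a general "precondition vectors" interface might be as small as the sketch below. None of these types exist in Lucene: `VectorPreconditioner` and `SignFlipPreconditioner` are hypothetical names, and the sign-flip implementation is only a stand-in for a real rotation.

```java
/** Hypothetical interface, not an existing Lucene API: one transform for documents and queries. */
interface VectorPreconditioner {
  /** Returns a preconditioned copy of the vector; the input is left untouched. */
  float[] apply(float[] vector);

  /** No-op preconditioner for fields that do not opt in. */
  VectorPreconditioner IDENTITY = vector -> vector.clone();
}

/** Stand-in implementation: fixed sign flips, which are orthogonal and preserve dot products. */
final class SignFlipPreconditioner implements VectorPreconditioner {
  private final boolean[] flip;

  SignFlipPreconditioner(boolean[] flip) {
    this.flip = flip.clone();
  }

  @Override
  public float[] apply(float[] vector) {
    if (vector.length != flip.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float[] out = vector.clone();
    for (int i = 0; i < out.length; i++) {
      if (flip[i]) {
        out[i] = -out[i];
      }
    }
    return out;
  }
}

final class PreconditionMath {
  /** Plain dot product, used to check that preconditioning both sides preserves similarity. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

The point of the sketch is the contract rather than the transform: the same instance has to be applied to indexed vectors in every segment and to each query, otherwise scores computed across segments are not comparable.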

@mccullocht
Contributor

Maybe this information should appear as part of FieldInfo? That would solve the "global" aspect of configuration as it would be uniform across all segments, and rotation could be handled in IndexChain and KnnQuery.

@shubhamvishu
Contributor Author

@benwtrent @mccullocht Currently we derive the rotation seed from the field name and cache the rotation in each segment reader, so it is calculated once and reused for a field. I agree with Ben that there could be benefits to keeping it off heap. Alternatively, we could drop the field name from the seed so that it is driven only by the dimension (i.e. one seed per unique dimension among the indexed vector fields). That way we wouldn't need this per segment, just one global object per dimension. Thoughts?

> Maybe this information should appear as part of FieldInfo?

Right, I like the approach of doing the rotation upfront in the KnnQuery using the FieldInfos setting. That way it would be global and less intrusive.
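
Rotating the query upfront from a per-field setting could look something like this. It is a sketch under assumptions: a plain map stands in for the per-field attributes, `ROTATION_KEY` is an invented key, and the rotation is passed in as a function so the example stays independent of any Lucene class.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: decide per field, from index-wide settings, whether to rotate a query. */
final class QueryRotation {
  /** Invented attribute key; the PR's actual FieldInfos representation is not shown here. */
  static final String ROTATION_KEY = "rotation.enabled";

  /**
   * Returns the query to hand to the unchanged search path: rotated when the field opted in,
   * otherwise the original array.
   */
  static float[] prepareQuery(
      String field,
      float[] query,
      Map<String, Map<String, String>> fieldAttributes,
      UnaryOperator<float[]> rotation) {
    Map<String, String> attrs = fieldAttributes.getOrDefault(field, Map.of());
    // Boolean.parseBoolean(null) is false, so fields without the key are left unrotated.
    boolean rotate = Boolean.parseBoolean(attrs.get(ROTATION_KEY));
    return rotate ? rotation.apply(query) : query;
  }
}
```

Since the decision is made once before the search starts, the per-segment readers never need to know that a rotation happened.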

Labels

module:core/codecs skip-changelog Apply to PRs that don't need a changelog entry, stopping the automated changelog check.

3 participants