STAR-1693: Changes from OSS - DO NOT MERGE #552

Closed
jacek-lewandowski wants to merge 372 commits into ds-trunk from STAR-1693-merge-ds-trunk
Conversation

@jacek-lewandowski

No description provided.

Piotr Kołaczkowski and others added 30 commits May 27, 2022 11:53
(cherry picked from commit d834519)
(cherry picked from commit 2e3828b)
(cherry picked from commit ad9a9e6)
…g in query path (#222)

(cherry picked from commit d40f9e3)
(cherry picked from commit 41b5b8f)
(cherry picked from commit d83a0ae)
…227)

(cherry picked from commit 0969332)
(cherry picked from commit 91a753d)
(cherry picked from commit 56a9b56)
Add page size in bytes flag to protocol
Introduce PageSize object
Protocol version changes
No support for describe statement yet
Simplify SecondaryIndexManager page calculation
Add page size in bytes to DataLimits
Refactor pagers
Add / pull some tests
Add some toString implementations
Add PageSize to expected classes in DatabaseDescriptorRefTest

Fix AggregationPartitionIterator
So far we were passing the main page size to the AggregationPartitionIterator, which:
- was pointless, because there is no paging when we aggregate everything
- was actually harmful: AggregationPartitionIterator is a subclass of GroupByPartitionIterator, and the latter updates the subPager's limits to the minimum of the main page size and the number of remaining rows. That is correct with group-aware limits, where the count applies to whole groups. But when we aggregate everything, plain CQL limits are used and the count limit applies to rows. Without this fix we would therefore limit the number of aggregated rows to the main page size, which is not what we want.

(cherry picked from commit e11d716)
(cherry picked from commit 13d4569)
(cherry picked from commit 4f65564)
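The limit interaction described above can be sketched as follows. The names and signatures here are hypothetical simplifications, not the actual DataLimits API:

```java
// Sketch of the sub-pager limit update described above (hypothetical, simplified API).
public class SubPagerLimits {
    // With group-aware limits, the count applies to whole groups, so capping the
    // sub-pager's count at min(mainPageSize, remaining) is correct.
    static int groupAwareLimit(int mainPageSize, int remainingGroups) {
        return Math.min(mainPageSize, remainingGroups);
    }

    // With plain CQL limits (aggregate-everything), the count applies to rows.
    // Reusing the same min() would cap the number of aggregated rows at the main
    // page size -- the bug this commit fixes. The fix: don't apply the main page
    // size to the sub-pager at all when aggregating everything.
    static int cqlRowLimit(int remainingRows) {
        return remainingRows; // no page-size cap for full aggregation
    }

    public static void main(String[] args) {
        // Buggy behavior: aggregating 10,000 rows with a page size of 100
        // would have stopped after only 100 rows.
        System.out.println(groupAwareLimit(100, 10_000)); // 100
        System.out.println(cqlRowLimit(10_000));          // 10000
    }
}
```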
This was failing because off-heap native clustering keys were
used in stats metadata without being copied, referencing memory
that could be overwritten.

Also fixes a problem creating retainable/minimized versions of
clustering bounds and boundaries.

(cherry picked from commit f00e340)
(cherry picked from commit f0904a3)
(cherry picked from commit 80a0383)
STAR-823: Refactor background compactions

CompactionManager.BackgroundCompactionCandidate task is scheduled on
compaction executor and when it detects there are compaction tasks
to run, it starts each compaction task as a separate job on the same
compaction executor and blocks until all tasks are finished.

When the pool size of the executor is n, and n background tasks are
submitted in parallel, and all of them find compactions to run, they
schedule those compactions and block until they finish. However, the
compaction tasks cannot start because the pool is full: all n threads
are occupied by background tasks waiting for compaction tasks that can
never begin.

Another issue (perhaps minor) is that we use getActiveCount() on
the executor to check how many tasks it is currently running, and
based on that information decide whether to schedule new tasks.
The problem with this method is that it returns an approximate
result and should not be used for making such decisions.

To address those problems, the running of background compactions was
refactored. The whole logic for background compactions was extracted
into a distinct class, BackgroundCompactionsRunner. It allows flagging
CFSs for compaction and schedules scans through the flagged CFSs
on a dedicated executor, so that scans and compaction tasks no
longer share the same executor.

Co-authored-by: Branimir Lambov <branimir.lambov@datastax.com>
(cherry picked from commit 96bf61c)
(cherry picked from commit 804b885)
(cherry picked from commit 2c0dcd7)
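The starvation scenario and the fix can be sketched with two executors. The class and method names below are illustrative, not the actual BackgroundCompactionsRunner API:

```java
import java.util.concurrent.*;

// Sketch of the fix described above: scans that schedule and wait for compaction
// tasks run on a dedicated executor, so they can never occupy all the threads
// that the compaction tasks themselves need.
public class CompactionSchedulingSketch {
    // If scans and compactions shared ONE executor, n blocked scans could fill
    // the pool and the compaction tasks would never start (starvation deadlock).
    static final ExecutorService compactionExecutor = Executors.newFixedThreadPool(2);
    static final ExecutorService scanExecutor = Executors.newFixedThreadPool(2);

    static int runScan() throws Exception {
        // The scan finds work, submits it to the *other* executor, and blocks.
        Future<Integer> compaction = compactionExecutor.submit(() -> 42);
        return compaction.get(); // safe: the scan consumes no compaction thread
    }

    public static void main(String[] args) throws Exception {
        Future<Integer> a = scanExecutor.submit(CompactionSchedulingSketch::runScan);
        Future<Integer> b = scanExecutor.submit(CompactionSchedulingSketch::runScan);
        System.out.println(a.get() + b.get()); // 84
        scanExecutor.shutdown();
        compactionExecutor.shutdown();
    }
}
```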
added more language tests

added brazilian

cql test passes

added support for setting a lucene analyzer

cql json test passes

fixed up some things

cleanup

added query analyzer

cleanup; added constants

added exception handling in unit test

added bad options unit tests

added char filter

removed comments and extra code

added illegal arg ex to LuceneAnalyzer#hasNext

added stop word support; prior to removal

reworked, no more stop words

added lowercase filter test

added ngram filter test

added simplepattern test; snowball off

added czech and porter

fixed alloc

removed commented out code

removed extra code

fixed minor issues

maybe fixed setMinMax

cleanup

reverted

reverted to a new byte[] per tokenized term

cleanup

cleanup

fixed sasi test

fixed unit test bug

cleanup

refactored for npe

addressed review comments

fixed npe bug

fixed a couple of bugs

removed json_ from options names; applied sonar comments

fixed sonar comments

fixed unit test bug

changed exception thrown

get -> create

fixed minor issue

(cherry picked from commit 3227a57)
(cherry picked from commit add6b8d)
(cherry picked from commit 1c60b2d)
The failure detector is now configurable via cassandra.custom_failure_detector_class

(cherry picked from commit 8127d43)

# The commit message #2 will be skipped:

# STAR-842 - fix up
(cherry picked from commit d9625b6)
… port (#245)

(cherry picked from commit adf34e3)
(cherry picked from commit f2a9b43)
…oken metadata) (#242)

Introduced TokenMetadataProvider to abstract access to TokenMetadata and make it pluggable.

(cherry picked from commit 8c0a970)
(cherry picked from commit 58b15c2)
Partition key ByteBuffer and columns btree were not taken
into account and some ByteBuffers were not measured correctly.

Also fixes flakes in MemtableSizeTest caused by including
allocator pool in measurements and updates it to test all
memtable allocation types.

(cherry picked from commit d8d3e8b)
(cherry picked from commit f8963ca)
* STAR-865: Porting metrics from cndb-884, riptano/bdp@03b23db6a5697baaf71d46d661c0ac1c908bc33e
and riptano/bdp/#19515

Co-authored-by: Zhao Yang <jasonstack.zhao@gmail.com>
Co-authored-by: Jake Luciani <tjake@users.noreply.github.com>

* STAR-865: Porting MicrometerChunkCacheMetrics from:
CNDB-161 Add MicrometerMetrics class
CNDB-780 Add Micrometer metrics for the chunk cache

Co-authored-by: Stefania Alborghetti <stefania.alborghetti@datastax.com>
(cherry picked from commit 5e0d889)

fix ConcurrencyFactorTest

the metrics require reset if we want to measure, especially the max values

(cherry picked from commit 3f89514)
This is a port of
- https://github.com/riptano/bdp/commit/b6f0a18cb832c62f05cdcbd9cdcc2923f2fa727f
- https://github.com/riptano/bdp/pull/19468

The first change set introduces QueryInfoTracker (QIT) interface and
hooks it to StorageProxy. The second adds ClientState to the interface.

The original QIT utilizes ReadReconciliationObserver in the ReadTracker
paths. Only onRow, onPartition and queried callbacks are utilized by
CNDB and thus only these methods are ported to Converged Cassandra
(CC). The callbacks are a bit different though:
- The callback methods are added directly to ReadTracker as CC doesn't
have ReadReconciliationObserver. The class was added as a part of
NodeSync effort and it is rather superfluous. Porting the whole class would
add unnecessary complexity. Adding the required methods directly to the
ReadTracker makes the interface cleaner and easier to understand.
- CC operates on ReplicaPlans instead of plain host lists, which is why queried
was changed to onReplicaPlan.

(cherry picked from commit c32e91f)
(cherry picked from commit 093bc63)
Add support for:

- Registering new verbs at runtime
- Decorating existing verb handlers with another method
- Running a callback after the MessagingService sends a message

(cherry picked from commit c7d0ba0)
(cherry picked from commit e8e13ed)
(cherry picked from commit 7373d2f)
(cherry picked from commit 4134210)
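The three pluggability points listed above can be sketched with a toy registry. The class and method names are hypothetical stand-ins, not the actual MessagingService API:

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of the pluggability described above (hypothetical registry): register
// verbs at runtime, wrap an existing handler with a decorator, and run a
// callback after a message is "sent".
public class VerbRegistrySketch {
    final Map<String, Consumer<String>> handlers = new HashMap<>();
    final List<Runnable> postSendCallbacks = new ArrayList<>();

    // Register a new verb and its handler at runtime.
    void register(String verb, Consumer<String> handler) { handlers.put(verb, handler); }

    // Decorate an existing verb handler with extra behavior that runs first.
    void decorate(String verb, Consumer<String> before) {
        Consumer<String> existing = handlers.get(verb);
        handlers.put(verb, before.andThen(existing));
    }

    // Dispatch a message, then fire the post-send callbacks.
    void send(String verb, String payload) {
        handlers.get(verb).accept(payload);
        postSendCallbacks.forEach(Runnable::run);
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        VerbRegistrySketch ms = new VerbRegistrySketch();
        ms.register("PING", p -> log.add("handled " + p));
        ms.decorate("PING", p -> log.add("decorated " + p));
        ms.postSendCallbacks.add(() -> log.add("after send"));
        ms.send("PING", "x");
        System.out.println(log); // [decorated x, handled x, after send]
    }
}
```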
FailingRepairTest uses serialization to pass Verbs back and forth
between the nodes during the test. Unfortunately, Verbs aren't
serializable anymore because they're no longer enums and this broke
the test.

Instead of passing a verb around, pass the verb id and look up the
verb inside the test method.

(cherry picked from commit 47d0719)
(cherry picked from commit 6ed9a94)
…A-16663) (#262)

Ported from OSS commit d220d24.

(cherry picked from commit b762e3c)
(cherry picked from commit 34374c1)
The main objective of this refactoring is to enable compaction strategies
to operate on a lean abstraction of an sstable and the compaction space
instead of the full-blown open SSTableReader and ColumnFamilyStore. The
compaction process itself must still operate on SSTableReaders which
provide the mechanisms for reading the data; switching between the two
representations is done when compaction signals it is ready to start
compaction on a set of sstables via the realm's tryModify method.

Most files in the compaction package have been changed to rely solely on
CompactionSSTable and CompactionRealm, with the exception of
CompactionManager and BackgroundCompactionsRunner, which are part of the
CFS implementation.

Also does some small fixes and simplifications identified during the
refactoring:

- Fixes bloom filter size in Upgrader calculated for splitting to
  compaction strategy's sstable size limit while files weren't actually
  split.
- Stops checking an sstable's bloom filter if its minTimestamp is already
  above the current min for purge functions.
- Some collection construction/processing simplifications.
- Breaks up compaction -> CFS -> compaction reference cycles.
- Refactors some methods to lower their complexity as requested by sonarcloud.
- Changes some remaining ...LatencyPerKb names to ...TimePerKb.

(cherry picked from commit 943ae99)
(cherry picked from commit e0fd645)
This replaces the Node-based walks and transformations. The result is
drastically less intermediate object creation, improved performance
and somewhat simpler code at the expense of the concept being a little
harder to understand initially.

Adds further documentation and expands tests for sliced tries.

(cherry picked from commit 2b3c4c5)
(cherry picked from commit 43c5206)
(cherry picked from commit e0982c6)
(cherry picked from commit aff4ab0)
(cherry picked from commit d3f271c)
* STAR-894 Port [CASSANDRA-16926] CEP-10 Phase 1: Mockable Filesystem

Co-authored-by: Benedict Elliott Smith <benedict@apache.org>
Co-authored-by: Aleksey Yeschenko  <aleksey@apache.org>
(cherry picked from commit 477fda8)
(cherry picked from commit c459c65)
Makes Snapshot class of DecayingEstimatedHistogramReservoir public, so
its API is accessible by external components such as CNDB. Adds public
getter to retrieve the array of bucket offsets.

(cherry picked from commit e57753c)
(cherry picked from commit 572fa86)
…oks (#271)

(cherry picked from commit c6cd4ee)
(cherry picked from commit fb5c241)
* STAR-909 Port DynamicSnitchSeverityProvider from bdp/cndb

Co-authored-by: Zhao Yang <jasonstack.zhao@gmail.com>
(cherry picked from commit 150e432)
(cherry picked from commit 21a4dd3)
* LogTransaction: add ILogTransactionsFactory to provide custom log
transaction

* UCS: Port CNDB-2134 to disable shards on UCS L0

* UCS: add CompactionAggregatePrioritizer to prioritize sstables based on remote file cache

* NativeLibrary: Add INativeLibrary interface to provide custom implementation

* SSTableWatcher: to discover custom component before opening sstables

* StorageProvider: support custom file system and change Descriptor to use URI

* StorageFeatureFlags: disable features that are not supported by custom file system

* StorageHandler: to reload sstable from custom file system

(cherry picked from commit e27ee69)
(cherry picked from commit e98d05a)
(cherry picked from commit 7d5184d)
(cherry picked from commit 54627bd)
(cherry picked from commit 41cb66e)
(cherry picked from commit 8b71940)
Add methods to PathUtils:
* deleteContent method that recursively deletes the contents of a directory, leaving the directory empty;
* listPaths methods to list all the paths in a directory, optionally using a provided filter.
Add method to Descriptor:
* validFilenameWithComponent to return the Component from an sstable file name

(cherry picked from commit 4f1c86b)
(cherry picked from commit ef840e5)
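A minimal sketch of what such helpers might look like using java.nio.file. These are hypothetical standalone versions, not the actual PathUtils code:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.*;

public class PathUtilsSketch {
    // Recursively delete the contents of a directory, leaving the directory empty.
    static void deleteContent(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            // Deepest paths first, so files are deleted before their parent directories.
            List<Path> paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path p : paths)
                if (!p.equals(dir))
                    Files.delete(p);
        }
    }

    // List the direct children of a directory, filtered by the provided predicate.
    static List<Path> listPaths(Path dir, Predicate<Path> filter) throws IOException {
        try (Stream<Path> list = Files.list(dir)) {
            return list.filter(filter).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sketch");
        Files.createFile(dir.resolve("a-Data.db"));
        Files.createDirectories(dir.resolve("sub"));
        Files.createFile(dir.resolve("sub").resolve("b.txt"));
        System.out.println(listPaths(dir, p -> p.toString().endsWith(".db")).size()); // 1
        deleteContent(dir);
        try (Stream<Path> rest = Files.list(dir)) {
            System.out.println(rest.count()); // 0: the directory survives, empty
        }
        Files.delete(dir);
    }
}
```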
Extends the API of NativeLibrary to create a directory given the
path as a string, so specialized file system implementations don't
need to do an additional conversion into a Cassandra File, which is
then converted back into a string representation.

(cherry picked from commit 3ba1a16)
(cherry picked from commit aff175b)
JeremiahDJordan and others added 4 commits September 29, 2022 13:42
In STAR-1335 we ported most of CNDB-4090, but we missed a call to
StorageProvider.invalidateFileSystemCache(), which is required to
invalidate the remote storage cache in CNDB whenever we encounter
corruption.

This was discovered because the RemoteFileCacheCorruptedPageTest unit
test is failing.
Co-authored-by: Stefania Alborghetti <stef1927@users.noreply.github.com>
@jacek-lewandowski jacek-lewandowski marked this pull request as ready for review October 11, 2022 06:53
@jacek-lewandowski jacek-lewandowski force-pushed the STAR-1693-merge-ds-trunk branch 2 times, most recently from e8758e5 to 9b7dc0d Compare October 11, 2022 12:36
mfleming and others added 10 commits October 12, 2022 17:03
#555)

* STAR-1697: Port keyspace renaming (KeyspaceMetadata::rename) from DB-3896

Required by CNDB-3170 and CNDB-4909.
This patch adds a way to customize the compaction overhead, i.e. the transient
amount of space required by a compaction whilst both input and output sstables
are present. In BDP this is just estimated to be the size of the input sstables.

It's unclear if we can improve this in CNDB, but I kept the refactoring because
initially I got confused thinking that in CNDB we could just waive this requirement
since the input sstables are in the file cache. So I think it's good to spell
out why we use the input sstable sizes by encapsulating the calculation in a
method with javadoc.

The patch also adds a warning to the logs: if a compaction cannot be performed
because the space overhead is larger than the space available, then the logs now
contain this information. Without this, troubleshooting why compaction tasks are
skipped is quite hard. This warning was already present in BDP but was missing
for CNDB.

Port CNDB-4385
…fication

This commit changes the API of UCS as follows:

- The Bucket inner class is now public
- The method for extracting shards with buckets is now public, and it
  accepts a custom list of sstables

These changes are required for CNDB, so that we can classify all live sstables
and visualize their corresponding shards and buckets in a diagnostic tool such as
Autobot.

The comments for warnIfSizeAbove have been clarified and moved to the method Javadoc.

Port of CNDB-4385
Port of CNDB-5113
Fixed DroppedColumn#toCQLString by using the CQL String version of the
column name, which also double quotes the name if it's in mixed case.

Co-authored-by: Massimiliano Tomassi <max.tomassi@datastax.com>
Co-authored-by: Stefania Alborghetti <stef1927@users.noreply.github.com>
…arn about it (#558)


Co-authored-by: Matt Fleming <mfleming@users.noreply.github.com>
…le txn bug

Port BDP part of CNDB-4035: restore sstables if they cannot be dropped and fix a lifecycle txn bug so that SSTables are added back to the live set if we fail to drop them.
Port CNDB-4855
Fixed streaming to connect back using peer preferred address instead of
Channel#remoteAddress
There are a couple of things here:
- The `unsafeFree` method in `BufferPool` did not do what it was probably expected to do: the direct buffer was not released properly, because `allocateDirectAligned` actually returns a slice of the original buffer, and the only reference to the original buffer is in the `attachment` field of the returned slice. This is mitigated by a new cleaning method, which can release the parent buffer by recursively walking the attachment hierarchy.
- For in-jvm dtests, releasing all buffers in buffer pools was added as the very last step of instance shutdown; this fixes memory leaking between subsequent instance restarts. In a production run we just stop the JVM and the buffers go with it, but for these dtests we need to handle it explicitly.

@blambov blambov left a comment


The Allocator/Cloner changes look good to me.

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: C (2 bugs)
Vulnerabilities: A (0 vulnerabilities)
Security Hotspots: A (0 security hotspots)
Code Smells: A (83 code smells)

Coverage: 85.6%
Duplication: 0.0%

michaeljmarshall added a commit that referenced this pull request Feb 20, 2026
…2042)

### What is the issue

Fixes: https://github.com/riptano/cndb/issues/15527
CNDB test PR: https://github.com/riptano/cndb/pull/16797

### What does this PR fix and why was it fixed

This PR upgrades jvector, which brings several improvements. Here are
the git commits brought in:

```
8b3e93cf (tag: 4.0.0-rc.8) chore: update changelog for 4.0.0-rc.8 (#627)
9d0488e5 release 4.0.0-rc.8 (#626)
570bd118 Refactor parallel writer (#608)
20c348ec Move buffer position in ByteBufferIndexWriter#writeFloats (#607)
d9ddce51 Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAINING_SET_SIZE (#610)
d663b4f7 add config options for regression testing (#609)
7e493eee On-disk index cache for the Grid benchmark harness (#612)
e263cc80 Improved dataset loading; fixes, safeties, diagnostics, and better feedback (#613)
6b235ce7 bump to next SNAPSHOT (#605)
84bf5708 (tag: 4.0.0-rc.7) chore: update changelog for 4.0.0-rc.7 (#604)
fceeb885 release 4.0.0-rc.7 (#603)
51807cba add protection against bad ordinal mappings (#602)
6ca3b5e2 adding memory and disk usage stats to bench tests (#591)
a66fd914 Fix OnDiskGraphIndex#ramBytesUsed NPE (#588)
0ca5a392 Move float bulk-write into IndexWriter to enforce endianness (#577)
a6c6c09b Add diversityScoreFunctionFor to avoid creation of wrapper object (#592)
977c21d4 Relax the threshold of a flaky test related to an experimental feature (#598)
fa808d69 adding average nodes visited to benchmark tests (#552)
3bd15e70 Virtualize and Modularize DataSetLoader logic (#593)
42259e9f Speed up ivec reads by buffering (#584)
f967f1c9 virtualize DataSet (#589)
55f902f4 turn off parallel writes in grid (#582)
019a241d Parallelize graph writes (#542)
02fea879 Save allocation of a large array in PQVectors.encodeAndBuild (#574)
32a51821 javadoc for base [graph] (#548)
4eb607f8 javadoc for base [disk,exceptions] (#547)
30e8932c Enable the fused graph index  (#561)
d8848fc6 Start development on 4.0.0-rc.7-SNAPSHOT (#573)
c57f3a62 (tag: 4.0.0-rc.6) chore: update changelog for 4.0.0-rc.6 (#572)
214b7c20 release 4.0.0-rc.6 (#571)
e3686999 fix javadoc error (#570)
88669887 Ignoring testIncrementalInsertionFromOnDiskIndex_withNonIdentityOrdinalMapping and adding a TODO in buildAndMergeNewNodes (#569)
29a943e1 Computation of reconstruction errors for vector compressors (#567)
d8e9cb16 Add NVQ paper in README (#560)
d5cbe658 Add ImmutableGraphIndex.isHierarchical (#563)
b484dae2 Harden tests for heap graph reconstruction (#543)
9471c57d Make the thresholds in TestLowCardinalityFiltering tighter (#559)
21e4a226 Begin development on 4.0.0-rc.6 (#558)
4f661d99 Revert "Start development on 4.0.0-rc.6-SNAPSHOT"
fdee5779 Start development on 4.0.0-rc.6-SNAPSHOT
```

### SAI Version Bump

Adds a new SAI on-disk version: `fa`

### Fused PQ

With this version, we add a new, experimental feature that writes the PQ
vectors fused into the graph. This lets us skip writing the PQ vectors
to the PQ file, which results in significant memory savings, since the
PQ vectors in the `CassandraDiskAnn` graph searcher consume `O(n)`
memory based on the number of vectors and their quantized size. The
fused PQ vectors mostly fit within the page cache as we read a node and
its neighbors from disk, so we see minimal latency reduction from this
change, though further testing is required to see the real impact.

In order to enable fused PQ, the runtime needs
`cassandra.sai.latest.version=fa` or greater and
`cassandra.sai.vector.enable_fused=true`. Note that because this feature
is still experimental, `cassandra.sai.vector.enable_fused` defaults to
`false`.

Another experimental feature introduced in this commit via the jvector
upgrade is parallel graph encoding and writing to disk. Writing the
fused graph requires increased CPU time to encode the graph node and we
write more bytes to disk, so this parallelism is likely necessary to
keep vector index creation/compaction times down. The key configurations
available with their associated defaults:

```java
    // When building a compaction graph, encode layer 0 nodes in parallel and subsequently use async io for writes.
    // This feature is experimental, so defaults to false.
    SAI_ENCODE_AND_WRITE_VECTOR_GRAPH_IN_PARALLEL_ENABLED("cassandra.sai.vector.encode_and_write_graph_in_parallel.enabled", "false"),
    // When parallel graph encoding is enabled, the number of threads to use for encoding. Defaults to 0, meaning
    // use all available processors as reported by the JVM.
    SAI_ENCODE_AND_WRITE_VECTOR_GRAPH_IN_PARALLEL_NUM_THREADS("cassandra.sai.vector.encode_and_write_graph_in_parallel.num_threads", "0"),
    // When parallel graph encoding is enabled, whether to use direct buffers. Defaults to false, meaning heap
    // buffers are used. A buffer will be allocated per encoding thread. The size of each buffer is the size
    // of the encoded graph node at layer 0, which varies based on graph feature settings.
    SAI_ENCODE_AND_WRITE_VECTOR_GRAPH_IN_PARALLEL_USE_DIRECT_BUFFERS("cassandra.sai.vector.encode_and_write_graph_in_parallel.use_direct_buffers", "false"),
```

### OnDiskVectorValues and OnDiskVectorValuesWriter

`OnDiskVectorValues` is now in its own file and is now thread safe in
order to account for some necessary implementation details within
jvector. Added `OnDiskVectorValuesWriter` to improve test coverage and
to abstract away the flush issues associated with
`BufferedRandomAccessWriter` as described in
datastax/jvector#562.

### Verification

This PR also introduces new benchmarks as well as improved unit testing.
The new benchmarks verify the performance of the `OnDiskVectorValues`
and `OnDiskVectorValuesWriter` to confirm (at least directionally) the
time associated with read and write operations.

New tests have been added to verify that when we iterate over an
sstable's rows, we are able to assert that the sstable's vector value's
similarity to the one stored in the vector graph is ~1. This testing is
valuable in that it confirms the row id to ordinal mapping is correct at
every node. Previously, we relied on recall results to verify this for
us. This new pattern allows us to confirm _every_ node, which is more
thorough and removes most edge cases that might have led to partially
correct graphs that may have achieved acceptable recall.
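The verification idea above — compare each row's stored vector against the vector fetched from the graph through the row-id-to-ordinal mapping, and assert near-perfect similarity at every row — can be sketched as follows. The data layout and arrays here are hypothetical stand-ins, not the actual test code:

```java
public class SimilarityCheckSketch {
    // Cosine similarity between two vectors.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: vectors as stored in the sstable's rows, and the
        // same vectors as recovered from the graph via the rowId -> ordinal mapping.
        float[][] rowVectors = {{1f, 0f, 2f}, {0.5f, 0.5f, 0f}};
        int[] rowIdToOrdinal = {1, 0};                    // a non-identity mapping
        float[][] graphVectors = {{0.5f, 0.5f, 0f}, {1f, 0f, 2f}};

        for (int rowId = 0; rowId < rowVectors.length; rowId++) {
            double sim = cosine(rowVectors[rowId], graphVectors[rowIdToOrdinal[rowId]]);
            // If the mapping is correct, similarity is ~1 at EVERY row, not just on
            // average -- stronger than the recall-based checks used before.
            if (sim < 0.999) throw new AssertionError("bad mapping at row " + rowId);
        }
        System.out.println("mapping verified");
    }
}
```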
driftx pushed a commit that referenced this pull request Apr 27, 2026
driftx pushed a commit that referenced this pull request Apr 28, 2026