Conversation

@maheshrajamani

Thanks for sending a pull request! Here are some tips if you're new here:

  • Ensure you have added or run the appropriate tests for your PR.
  • Be sure to keep the PR description updated to reflect all changes.
  • Write your PR title to summarize what this PR proposes.
  • If possible, provide a concise example to reproduce the issue for a faster review.
  • Read our contributor guidelines
  • If you're making a documentation change, see our guide to documentation contribution

Commit messages should follow the following format:

<One sentence description, usually Jira title or CHANGES.txt summary>

<Optional lengthier description (context on patch)>

patch by <Authors>; reviewed by <Reviewers> for CASSANDRA-#####

Co-authored-by: Name1 <email1>
Co-authored-by: Name2 <email2>

The Cassandra Jira

szymon-miezal and others added 30 commits March 6, 2025 12:38
### What is the issue
Node crashes during node replacements result in hibernated nodes that
cannot join the cluster anymore due to a lack of SYN messages from seeds.

### What does this PR fix and why was it fixed
Port DB-1482, which allows using a JMX endpoint on a seed to bring the
hibernated node back to the gossiping candidate list.

Tested via: datastax/cassandra-dtest#75.
Fixes: riptano/cndb#13196

### What is the issue
We were incorrectly sharing a file reader across threads in BM25
queries, which can lead to invalid results, as I reproduced in the test.

### What does this PR fix and why was it fixed
The PR fixes the issue by creating a `DocLengthsReader` object per
query.
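As a rough illustration of the change (the constructor and scoring below are placeholders, not the actual SAI API), each query now opens and closes its own reader instead of sharing one across threads:

```java
import java.io.Closeable;
import java.io.IOException;

final class Bm25QuerySketch
{
    // Stand-in for the real DocLengthsReader: it reads per-document lengths from an
    // index component file and is not safe to share across concurrent queries.
    static final class DocLengthsReader implements Closeable
    {
        DocLengthsReader(String componentPath) { /* open the component file */ }
        int docLength(int rowId) { return 1; /* placeholder */ }
        @Override public void close() { /* release the file handle */ }
    }

    // One reader per query: no positional state is shared between query threads.
    static double scoreRow(String componentPath, int rowId) throws IOException
    {
        try (DocLengthsReader reader = new DocLengthsReader(componentPath))
        {
            return 1.0 / Math.max(1, reader.docLength(rowId));
        }
    }
}
```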
### What is the issue
Fixes riptano/cndb#13203

The underlying problem is that the `OutboundConnection` needs to store
the message version from the peer. Otherwise, when the local
`current_version` is higher than the remote `current_version`, we store
the wrong value (we were storing the local value, not the agreed-upon
value).

The PR also fixes an issue in the ANN_OPTIONS serialization logic that
can be hit in rare cases where the peers haven't yet connected, so the
validation logic that would prevent serialization (due to a peer having
too low a message version) is skipped.

### What does this PR fix and why was it fixed
This PR fixes several issues related to upgrades by improving handshake
logic.

---------

Co-authored-by: Sergio Bossa <sergio.bossa@gmail.com>
### What is the issue
CNDB-12599

### What does this PR fix and why was it fixed
Exposes the space requirements of compaction picks and compaction tasks
for CNDB's compaction space management.
Fixes riptano/cndb#12683

Table names are used in file names. Since table names were not
validated for length, creating or flushing a table with a too-long name
fails with a too-long file name.
There are two cases where table names are used in file names:
- keyspace_name.table_name-controller-config.JSON
- table_name-32chars_table_id

The maximum allowed file name size is 255 chars. Thus,
- keyspace and table names that are together longer than 231 chars
will fail.
- a table name that is longer than 222 chars will fail.

This PR checks that newly created tables do not have too-long names.
New tables are identified through the query's client state, which should
not be internal.
The limit is 222 chars for the combined keyspace and table names. This is
more restrictive than necessary, but it is easier to explain in the
documentation.

The change is tested in CNDB, verifying that creating new tables with long
names is prevented, but existing tables still work.
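A minimal sketch of the kind of check described above (the class and method are illustrative; the 255 and 222 limits are the ones from this description):

```java
final class TableNameLengthCheck
{
    static final int MAX_FILE_NAME_LENGTH = 255;
    // table_name plus '-' plus the 32-char table id must still fit in a file name
    static final int MAX_KEYSPACE_AND_TABLE_NAME_LENGTH = 222;

    // Applied only to newly created, non-internal tables, as described above.
    static void validate(String keyspace, String table)
    {
        int combined = keyspace.length() + table.length();
        if (combined > MAX_KEYSPACE_AND_TABLE_NAME_LENGTH)
            throw new IllegalArgumentException(
                String.format("Combined keyspace and table name length %d exceeds the maximum of %d",
                              combined, MAX_KEYSPACE_AND_TABLE_NAME_LENGTH));
    }
}
```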
…essaging version (#1624)

This makes sure the proper version is used, as, following CNDB-13203,
each connection maintains its own version,
separate from the one recorded in the MessagingService.

### What is the issue
`ANNOptions#validate()` was previously checking endpoint versions via
`MessagingService#versions`. But following #13203, in order to resolve a
race condition, each endpoint connection tracks its own
`EndpointMessagingVersions` object with the version negotiated during
handshake, while `MessagingService#versions` is updated with the max
version of each endpoint. It follows that checking
`MessagingService#versions` might not reflect the actual version used for
the connection.

### What does this PR fix and why was it fixed
This PR introduces a new `MessagingService` method to check the
connection version, and uses it inside `ANNOptions#validate()`.
Trying to create an SAI or other index on a column whose name is either
long or contains
non-alphanumeric characters fails, even though the table was successfully
created with such a name. This happens if an index name is not provided.
The reason for this limitation is that the column name is used to
generate the default index name, which is used in the file names.

Fixes riptano/cndb#12503, fixes
riptano/cndb#12609

This PR removes the check on column names during index creation.
Thus it allows creating indexes on existing columns with any names,
including qualified names and long names.

1. The check was preventing qualified column names containing characters
that cannot be used in file names. This was never an issue, as the index
name construction from table and column names already removes any
characters that are not alphanumeric or underscores.
2. The check was preventing long column names. This limitation is
removed by
  - calculating the maximum number of characters that can be used in the
index name, so that file names can be constructed without exceeding
the limit of 255 chars (255 is added as a constant), and
  - truncating the index name accordingly (see the sketch after this list).
3. The uniqueness of index names is already ensured by the existing
implementation, which adds a counter prefix when needed.
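The sketch below illustrates items 1 and 2: stripping disallowed characters and truncating the generated name so the resulting file names stay under the 255-char limit. The names, the reserved-suffix parameter, and the exact name pattern are assumptions, not the actual implementation:

```java
final class DefaultIndexNameSketch
{
    static final int MAX_FILE_NAME_LENGTH = 255;

    static String defaultIndexName(String table, String column, int reservedSuffixLength)
    {
        // Strip anything that is not alphanumeric or an underscore, as the existing
        // name construction already does for qualified column names.
        String base = (table + "_" + column + "_idx").replaceAll("[^A-Za-z0-9_]", "");
        // Truncate so that the name plus the longest file-name suffix still fits.
        int maxLength = MAX_FILE_NAME_LENGTH - reservedSuffixLength;
        return base.length() <= maxLength ? base : base.substring(0, maxLength);
    }
}
```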

This change to the generated index name does not affect existing indexes, as
the existing name from `IndexMetadata` is always serialized to JSON,
thus avoiding re-generating the name.

This fix also required fixing the CQL tester, since it assumed that the
restriction was in place. The column name regex pattern matcher in
`createIndex` is fixed to allow non-restricted column names. There might still
be corner cases where some column names used for generating index
names will not work with `createIndex`. In such cases `execute`
should be used.

The PR also includes several minor fixes in the affected files:
- Improve the naming and usage of constants.
- Fix warnings with minor improvements.
The goal of this commit is to allow low-level code, typically at the
level of file reads (say, a custom `FileChannel` implementation), to
obtain context on which higher-level operation it corresponds to. A
typical use case may be tiered storage, where `FileChannel`
implementations may forward reads to remote storage, and where having
context on the overall operation could help, say, account for prior work
when setting timeouts, or aggregate metrics per "user-level" operation (rather
than per `FileChannel` operation).

To do this, we reuse the existing `ExecutorLocals` mechanism, and simply
have the top-level operation set up the proper context as an
`ExecutorLocal`, allowing it to be accessed by any low-level operations
acting on behalf of that operation. This is, in many ways, similar to
tracing, but instead of a `TraceState` that collects what happens during
the operation, it is a relatively flexible notion of an `OperationContext`.

As of this patch, this feature is fairly barebones and mostly exists for
extensions, in the following sense:
1. only user reads (`ReadCommand` execution) currently set up such a
   context. The code is written in such a way that adding support for
   other operations should be easy, but this is not done here.
2. the context set by reads has, by default, barely any information: it
   merely has a `toString()` method that can roughly tell what the
   operation itself is, and so could be useful for debugging, but not
   much more. Further, that context is not read by anything in this
   patch. However, said contexts are created through a "factory" and
   the factory class is configurable through a system property. So
   extensions can override the factory in order to create contexts
   with more information/state, and they can fetch/use those contexts where
   they see fit.
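A very rough sketch of the mechanism follows; it uses a plain `ThreadLocal` as a stand-in for the `ExecutorLocals` plumbing, and all concrete names other than `OperationContext` (including the system property) are illustrative assumptions:

```java
import java.util.concurrent.Callable;

final class OperationContextSketch
{
    // Minimal context carrying only a human-readable description of the operation,
    // matching the barebones context described above.
    static class OperationContext
    {
        private final String description;
        OperationContext(String description) { this.description = description; }
        @Override public String toString() { return description; }
    }

    // Stand-in for ExecutorLocals: the real mechanism propagates the value across
    // executor boundaries, not just within a single thread.
    private static final ThreadLocal<OperationContext> CURRENT = new ThreadLocal<>();

    // Low-level code (e.g. a custom FileChannel) can consult the current context.
    static OperationContext current() { return CURRENT.get(); }

    // The top-level operation (e.g. ReadCommand execution) sets the context up
    // around its work, so everything running on its behalf can see it.
    static <T> T withContext(OperationContext ctx, Callable<T> operation) throws Exception
    {
        OperationContext previous = CURRENT.get();
        CURRENT.set(ctx);
        try
        {
            return operation.call();
        }
        finally
        {
            CURRENT.set(previous);
        }
    }
}
```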
patch by Berenguer Blasi; reviewed by Andres de la Peña for CASSANDRA-17333
### What is the issue
When a node gets replaced, it announces itself as hibernated (one of
the silent-shutdown states). When the node replacement fails, the
other nodes still see the replacing node in that state, so the replacing node
won't receive any gossip messages from the seed at a subsequent startup,
which ends up with an exception.

### What does this PR fix and why was it fixed
This patch adds an explicit shutdown announcement via gossip to let
other nodes know that the node was explicitly shut down, as it was due
to the exception. That allows other nodes (seeds in particular) to
contact the replacing node at its next startup, thus allowing it to retry
the replacement.
When a wildcard selection is used, the score column
must be included in the selection.

Fixes:
NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 10.87.217.86:9042 us-central1>: <Error from server: code=0000 [Server error] message="java.lang.ClassCastException:
class org.apache.cassandra.serializers.UTF8Serializer cannot be cast to class org.apache.cassandra.db.marshal.VectorType$VectorSerializer
(org.apache.cassandra.serializers.UTF8Serializer and org.apache.cassandra.db.marshal.VectorType$VectorSerializer are in unnamed module of loader org.apache.felix.framework.BundleWiringImpl$BundleClassLoader @446c3920)">})
Reject creating indexes with analysis options on frozen collections.
We don't have a way to correctly index them, and we don't support querying them either.
Part of riptano/cndb#13375

### What is the issue

Using the keyspace name length in the error message for a too-long table name
is confusing, since the length might include CNDB's generated prefix, which is
invisible to users.

### What does this PR fix and why was it fixed

Reformulates the error message for a too-long table name to avoid using
the keyspace name length, since it might contain a CNDB-generated prefix
that is not visible to users.
Creating an index with a too-long index name leads to file errors or
incorrect results.

This PR prevents creating new (i.e., non-internal) indexes with provided
index names longer than 182 characters. This constant is based on the
most restrictive requirement among the different index types. It comes from
SAI, by subtracting the longest file name addition from the file name
limit of 255 characters.
InetAddressAndPort fields were previously InetAddress, and serialised as either byte[] addresses or text hostnames.

Adds two unit tests, simple serdes of InetAddress to InetAddressAndPort, and parsing of DSE-6 files (kept in `test/data/legacy-metadata/dse-6.8/nodes`).
The 4th component can be used to distinguish versions which introduce
breaking changes; in particular, we want to retain the appropriate version
ordering.
…1641)

### What is the issue
Fixes riptano/cndb#13403 by implementing one
of the solutions.

### What does this PR fix and why was it fixed

The current synthetic column logic relies on convention to know that the
synthetic column's source column is the SAI ORDER BY column. This seems
fragile and will break if we find a reason to introduce a new synthetic
column. I propose that we serialize the source column name so that we
can more easily support multiple synthetic columns.

The urgency for this change is that we want to get it in before we are
locked into a given serde protocol.

Note however, that I haven't solved the primary issue described in
riptano/cndb#13402.

CNDB tests riptano/cndb#13418
…re to use SSTableReader and add more logging to index tracking (#1644)
…NOT AVAILABLE when they cannot query because an index is building.

We add a new reason, `RequestFailureReason.INDEX_BUILD_IN_PROGRESS`, to differentiate indexes that are currently building in error messages

Co-authored-by: Sergio Bossa <sergio.bossa@gmail.com>
…ead of the query analyzer (#1615)

Fix `RowFilter.AnalyzableExpression#numFilteredValues`, which is
used by the `query_filters` guardrail, to use the query analyzer instead
of the index analyzer. That prevents spurious triggering of the guardrail.
### What is the issue
The customer requested an upgrade for some libraries (CVEs)

### What does this PR fix and why was it fixed
This PR upgrades them to the latest stable versions:

- snappy to 1.1.10.7 fixing CVE-2023-43642
- guava to 33.4.0-jre fixing CVE-2023-2976 and CVE-2020-8908
- jackson to 2.18.0 fixing CWE-400
- snakeyaml to 2.4 fixing CVE-2022-1471
…ing a row (#1653)

### What is the issue
see CNDB-13480


### What does this PR fix and why was it fixed
Exclude deleted rows while computing term frequencies and preparing to
compute the scores.
…on queries (#1638)

Queries with index-based ordering (ANN, BM25, generic `ORDER BY`) don't
support CQL's aggregation queries, such as `GROUP BY` or `SELECT
COUNT(*)`. The reason is that the coordinator internally uses paging to
fetch more results from the replicas until the aggregated limit is
satisfied. Since top-k index-based ordering doesn't support paging,
these queries will fail.

We should reject these queries until we have a way to support them.
…1200)

Add support for updating a row's terms in the `TrieMemoryIndex`.

There are two general cases for row-to-term mappings: 1:1 and 1:many.

In the 1:1 case, the update is trivial: we check before insertion that we
have a new value, then we insert the new value and remove the old value.
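Roughly, the 1:1 update looks like the following sketch (plain maps and names stand in for the real trie internals; the synchronization mirrors the note below about synchronizing additions to the trie):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class OneToOneUpdateSketch
{
    private final Map<Integer, String> termByRow = new HashMap<>();

    // Check before insertion that the value actually changed, then insert the new
    // value and remove the old one. Guarded like the trie additions are.
    synchronized void update(int rowId, String newTerm)
    {
        String oldTerm = termByRow.get(rowId);
        if (Objects.equals(oldTerm, newTerm))
            return;                              // unchanged value, nothing to do

        termByRow.put(rowId, newTerm);
        addPosting(newTerm, rowId);              // add rowId to the new term's postings
        if (oldTerm != null)
            removePosting(oldTerm, rowId);       // drop rowId from the old term's postings
    }

    private void addPosting(String term, int rowId) { /* update postings for term */ }
    private void removePosting(String term, int rowId) { /* update postings for term */ }
}
```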

In the 1:many case, the solution is somewhat bespoke, but it is well
documented and encapsulated. The main challenge is deciding when to
add/remove terms without increasing the
overall time complexity of the update operation. In this design, the
majority of the complexity is in the PrimaryKeys object. There, we
track seen keys and use whether we've seen a key to decide whether
to reset the frequency counter, and then subsequently to decide whether
to leave the key or remove it (the term could be in both the old and the
new values).

This whole design hinges on the fact that we synchronize additions to
the trie.

The one drawback in terms of performance is that collections are added
and removed within the sync block. This change is necessary to deal with
encoding nuances unless we want to largely refactor the class.

The motivation for needing updates is to ensure that we have unique
primary keys in the postings list for scalar values and for a given map
key. Supporting updates means that the order of terms within an index is
internally consistent.
### What is the issue
Fixes riptano/cndb#13446

### What does this PR fix and why was it fixed
This PR fixes an issue in the search-then-sort path where we never applied
the query term filter. The change is quite simple because we were
already scoring these terms, so we just need to filter out the rows that
do not have a frequency of at least 1 for each query term.
…be indexed with a non-analyzed index (#1656)

### What is the issue
ORDER BY queries return wrong results if we have an analyzed index, as
reported in CNDB-13464

### What does this PR fix and why was it fixed
ORDER BY should not work with an analyzed index at all. This PR fixes it
and improves the error message.
### What is the issue
The original AA SAI format was writing to disk with a different
byte-comparable version depending on the version of the sstable files.
This causes problems with the ByteComparable version checking that was
recently introduced.

### What does this PR fix and why was it fixed
- Introduces a new `ByteComparable.Preencoded` interface meant to store
and surface the byte-comparable version with which something was encoded.
- Changes tries to always return the `Preencoded` type, so that callers may
easily query the version.
- Changes SAI code to use the correct byte-comparable version when that
is necessary, and to ignore or use the `Preencoded` version when it is
not feasible or important to do so.
Work done in HCD-82 was further refined in CASSANDRA-20368 and
CASSANDRA-20450.
This backports C* changes in order to minimize the diff in future
backports.
adelapena and others added 27 commits November 26, 2025 16:29
…ows/tombstones (#2132)

The metrics partitionsRead, rowsFiltered, rowsPreFiltered and shadowedPrimaryKeyCount,
which don't always work as expected, are replaced by these metrics:

* keysFetched: Number of partition/row keys that will be used to fetch rows from the base table.
* partitionsFetched: Number of live partitions fetched from the base table, before post-filtering and sorting.
  Note that currently ANN fetches one partition per index key, without any grouping of same-partition clusterings.
* partitionsReturned: Number of live partitions returned to the coordinator, after post-filtering and sorting.
* partitionTombstonesFetched: Number of deleted partitions found when reading the base table.
* rowsFetched: Number of live rows fetched from the base table, before post-filtering and sorting.
* rowsReturned: Number of live rows returned to the coordinator, after post-filtering and sorting.
* rowTombstonesFetched: Number of deleted rows or row ranges found when reading the base table.

StorageAttachedIndexSearcher is modified to use the command timestamp, rather than getting the current
time every time a query timestamp is needed, which was possibly buggy and inefficient.
While working on CNDB-15608, IntelliJ lint complaints were noticed that
are not related to the actual changes in the patch port. Thus I fix them
in this separate commit to avoid unnecessary noise while working on the
actual patch port.

Many of the changes align with what I have already merged earlier in
other PRs.
Some of the changes might not match the preferences of others, and I am
open to discussion.

The changes include:
- Remove unused imports
- Use the formatter of the logger instead of string concatenation
- Use method instead of lambda
- Remove unnecessary suppression of resource warnings
- Simplify Boolean conditions
- Remove unnecessary modifiers in interfaces
- Fix typos
- Fix links in javadoc comments
- Add static modifier to nested classes
- Remove class fields when not used
- Remove unnecessary throws in method signatures
- Use final when recommended
- Remove unused method arguments
- Replace single char strings with chars
- Remove unnecessary null variable initialization
- Replace assert true with assert equal
- Change order of assert arguments to have expected value first
- Remove unnecessary explicit casting
PartitionAwarePrimaryKeyMap implements an overcomplicated `ceiling` method
calling `exactRowIdOrInvertedCeiling`.

This commit simplifies PartitionAwarePrimaryKeyMap.ceiling to use the corresponding
correct method from the reader directly.
This can be seen as a follow-up to
https://github.com/datastax/cassandra/pull/1096/files#diff-c5011580ab9b0d99d9e504570c4cccb152221d3dbe62c8a956e83fce9070b380
…ands (#2147)

Add information about fetched/returned/deleted partitions and rows
to the logging reports for non-SAI slow queries.

There is a new system property named `cassandra.monitoring_execution_info_enabled`
to disable this feature, which is enabled by default.
Extends IndexQuerySupport framework with new test scenarios
mixing row-aware and non-row-aware index versions.

In the added scenarios, rows are first inserted and flushed
into AA indexes, then the node(s) are upgraded to a row-aware index
SAI version and finally more rows are inserted and flushed.
Then existing workloads are executed against those mixed version
indexes. Compaction of mixed version indexes is also tested.

This commit does not add new test queries or new data models, but
only reuses the existing tests we had that were executed against
a single version at a time.
…2150)

### What is the issue
CNDB-16135: Descriptor uses different instances of `Path directory` for
the same table as constructor parameter

### What does this PR fix and why was it fixed

Use the table directory from `Directories` as the `Descriptor`
constructor parameter. `Descriptor#directory` is still a different
instance in C* due to `directory#toCanonical()`, which always creates a new
instance on the local file system.
### What is the issue
UCS settings files are not dropped after the table gets dropped. Instead,
they are supposed to be cleared after a node restart. The cleanup is
faulty, though, and it prevents the node from starting up.

Root Cause:
The cleanupControllerConfig() method in CompactionManager attempts to
verify if a table exists by calling getColumnFamilyStore(). When the
table is dropped, this method throws IllegalArgumentException, which was
not being caught. The existing catch block only handled
NullPointerException (for missing keyspace).

### What does this PR fix and why was it fixed
Extended the exception handler to catch both NullPointerException and
IllegalArgumentException, allowing orphaned controller-config.JSON files
to be properly identified and deleted during node restart.
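In sketch form, the widened handler looks roughly like this (the surrounding cleanup logic and the lookup helper are simplified stand-ins for the CompactionManager code):

```java
import java.io.File;

final class ControllerConfigCleanupSketch
{
    // A missing keyspace surfaces as NullPointerException and a dropped table as
    // IllegalArgumentException; both now mean the config file is orphaned.
    static void cleanupControllerConfig(File configFile, String keyspace, String table)
    {
        try
        {
            lookupColumnFamilyStore(keyspace, table);   // throws if the keyspace/table is gone
        }
        catch (NullPointerException | IllegalArgumentException e)
        {
            configFile.delete();                        // orphaned controller-config.JSON
        }
    }

    // Stand-in for the getColumnFamilyStore() call mentioned above.
    static void lookupColumnFamilyStore(String keyspace, String table) { }
}
```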
We don't need all the counters in QueryContext to be LongAdder, since they are
only accessed by the query's thread.

They used to be plain longs until VECTOR-SEARCH-29 parallelized vector searches,
making them accessed by multiple threads. However, that parallelization was
removed by the BM25 patch.

The system properties USE_PARALLEL_INDEX_READ and PARALLEL_INDEX_READ_NUM_THREADS
are removed, since they have been no-ops since the introduction of BM25. Since we don't
reject unknown system properties, setting them will remain a no-op.
Ports CASSANDRA-19483, which added an option to group invalid legacy
protocol magic exceptions with a no-spam logger.

The test was slightly modified: it adds a timeout for log watching
and ensures the logs start from a mark taken before the
exception. This ensures the test does not hang when run without the fix it covers.

The motivation behind the port is to fix
https://datastax.jira.com/browse/DSP-24639
…ose rowIdIterator (#2183)

### What is the issue
Fixes: riptano/cndb#16336

### What does this PR fix and why was it fixed

In SAI, when we handle already created iterators, we need to close them
on the exceptional path. This is not the most aesthetic practice, but it
is the current one, so this change follows the current standard of
catching throwable, calling close, and throwing the throwable.

In this case, we hit an exception when reading from disk for `hasNext`,
but there are a couple of spots where exceptions could occur, so I
wrapped the whole block.
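The convention referred to above, in sketch form (the iterator type and the disk-reading call are placeholders):

```java
import java.util.Iterator;

final class CloseOnErrorSketch
{
    interface RowIdIterator extends Iterator<Integer>, AutoCloseable { }

    // If anything throws once the iterator exists, close it before propagating,
    // so the underlying resources are not leaked on the exceptional path.
    static RowIdIterator primeIterator(RowIdIterator rowIdIterator)
    {
        try
        {
            rowIdIterator.hasNext();               // may read from disk and throw
            return rowIdIterator;
        }
        catch (Throwable t)
        {
            try { rowIdIterator.close(); }
            catch (Exception e) { t.addSuppressed(e); }
            throw t;                               // rethrow the original failure
        }
    }
}
```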
…tionGraph to index multiple columns (#2177)

### What is the issue
Fixes riptano/cndb#16252

### What does this PR fix and why was it fixed
Add a new method to the `IndexComponents.ForWrite` interface named
`tmpFileFor`. The new method is expected to create temporary files that
are namespaced to the index build. This fixes an issue in the
`CompactionGraph` where we incorrectly created temp files with the same
name in such a way that they collided. Now the files have the column
name and the build id, so they will be properly namespaced while
maintaining some form of meaningful name.
ULID-based SSTable ID generation can fail with an NPE when generating a new ID. The root cause is that the underlying ULID generator can generate an empty Optional when the clock is moved backwards to before the previously generated ID, or in certain rare overflow conditions when timestamps collide. If it's our first time through the generation loop, we prematurely exit with a null newVal.

Top of the error stack:
```
java.lang.NullPointerException
	at org.apache.cassandra.utils.TimeUUID.approximateFromULID(TimeUUID.java:58)
	at org.apache.cassandra.io.sstable.ULIDBasedSSTableId.<init>(ULIDBasedSSTableId.java:52)
	at org.apache.cassandra.io.sstable.ULIDBasedSSTableId$Builder.lambda$generator$0(ULIDBasedSSTableId.java:129)
```
This can cause a flush to fail.

The fix is to continue looping until newVal gets a value. The loop can spin until the corrected time catches up to the timestamp of the most recently generated ULID. This should be a short duration in a healthy cluster without large time corrections from clock sync.
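A sketch of the retry loop (the generator interface returning an `Optional` stands in for the actual ULID library call):

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

final class UlidRetrySketch
{
    // Stand-in for the real generator: may return empty when the clock moved
    // backwards past the previous value, or on a rare timestamp collision/overflow.
    interface MonotonicGenerator
    {
        Optional<Long> next(long previous);
    }

    // Keep looping until a value is produced, instead of leaving the loop on the
    // first iteration with a null result as the buggy code did.
    static long generate(AtomicLong lastGenerated, MonotonicGenerator generator)
    {
        while (true)
        {
            long previous = lastGenerated.get();
            Optional<Long> candidate = generator.next(previous);
            if (candidate.isPresent() && lastGenerated.compareAndSet(previous, candidate.get()))
                return candidate.get();
            // empty Optional or a lost race: retry until the clock catches up
        }
    }
}
```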

Tests are added in ULIDBasedSSTableIdGeneratorTest
A package-protected constructor is introduced for ULIDBasedSSTableIdGeneratorTest.testGeneratorRetryOnEmptyOptional()

Cassandra Applicability:
 upstream doesn't have ULIDBasedSSTableId (and won't because CASSANDRA-17048).
The patch in CNDB-15608 will remove the token from the partition key
serialization, thus the case when only a token is provided in the primary
key needs to be treated separately. Currently the token-only primary key
case is conflated with complete or deferred primary keys.

This commit implements a clear separation for token-only primary keys
by introducing a dedicated `TokenOnlyPrimaryKey` class.
It also implements an `isOnlyToken()` method, which is used instead of
comparing the partition key with null in the appropriate cases (e.g., when a deferred
partition key was loaded). New tests are added to exercise
`TokenOnlyPrimaryKey` more and improve code coverage.
…2156)

### What is the issue

The ConcurrentHashMap backing the ChunkCache sometimes falls into treeification
(O(log n) access time instead of O(1)).

Full context in https://github.com/riptano/cndb/issues/16243

### What does this PR fix and why was it fixed

* Improve the hashCode of the Key to prevent collisions
* Implement Comparable<Key> in case the ConcurrentHashMap falls back into
tree mode (see the sketch below)
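The shape of the change, as a sketch (the key fields and the mixing constant are assumptions; the real Key identifies a file and a chunk position):

```java
final class ChunkKeySketch implements Comparable<ChunkKeySketch>
{
    final long fileId;
    final long position;

    ChunkKeySketch(long fileId, long position)
    {
        this.fileId = fileId;
        this.position = position;
    }

    @Override
    public int hashCode()
    {
        // Mix both fields so keys differing only in position or only in file
        // don't collide into the same bucket.
        long h = fileId * 0x9E3779B97F4A7C15L + position;
        h ^= h >>> 32;
        return (int) h;
    }

    @Override
    public boolean equals(Object o)
    {
        if (!(o instanceof ChunkKeySketch)) return false;
        ChunkKeySketch other = (ChunkKeySketch) o;
        return fileId == other.fileId && position == other.position;
    }

    @Override
    public int compareTo(ChunkKeySketch other)
    {
        // Used by ConcurrentHashMap when a bin treeifies, keeping lookups O(log n)
        // even under heavy hash collisions.
        int cmp = Long.compare(fileId, other.fileId);
        return cmp != 0 ? cmp : Long.compare(position, other.position);
    }
}
```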
…sort (#2184)

### What is the issue
Fixes: riptano/cndb#16344

### What does this PR fix and why was it fixed
When `ANN_USE_SYNTHETIC_SCORE` is true, we send the score from the
replica to the coordinator. We need that score to be the full-precision
score to ensure correct ordering. When a query supplies `rerank_k: 0`,
though, the score is approximate and can produce out-of-order results, as
the test demonstrates. The fix is to configure the
`NodeQueueRowIdIterator` appropriately to indicate when the score is in
fact approximate. The full-precision score is only computed when
`ANN_USE_SYNTHETIC_SCORE` is true.
### What is the issue
Exclusion of *.SF files from jars that get packaged into the dtest-jar
does not work as expected.
As a result, if a jar contains a *.SF file, like for instance the
org.eclipse.jdt:ecj dependency, then the `ECLIPSE_.SF` file gets packaged
into the dtest-jar. In dependent projects like HCD, this results in the
dtest-jar being missing from the classpath, which causes runtime exceptions.

### What does this PR fix and why was it fixed
This patch ports the dtest jar packaging code from C* 5, which uses an
approach of unpacking all the jars, then filtering the files to remove
the undesired metadata, and finally repackaging the classes into a fat
jar.

CNDB PR: riptano/cndb#16382
Add execution information to the log reports for aborted queries.

The extension of the log messages is identical to what we did for slow successful queries 
in CNDB-15260 and CNDB-15260, but applied to aborted queries. For regular queries this 
information is the number of fetched/returned/deleted partitions and rows. For SAI queries,
this information is the SAI query metrics and a tree view of the SAI query plan.

This also renames the CassandraRelevantProperties related to monitoring, 
to align them with ASF/CC5. The renames are:

SLOW_QUERY_LOG_MONITORING_REPORT_INTERVAL_IN_MS -> MONITORING_REPORT_INTERVAL_MS
SLOW_QUERY_LOG_MONITORING_MAX_OPERATIONS -> MONITORING_MAX_OPERATIONS
SLOW_QUERY_LOG_EXECUTION_INFO_ENABLED -> MONITORING_EXECUTION_INFO_ENABLED
SAI_SLOW_QUERY_LOG_EXECUTION_INFO_ENABLED -> SAI_MONITORING_EXECUTION_INFO_ENABLED

Please note that the real underlying system properties are unchanged, 
and this only renames their enum names.

The feature is toggled with the cassandra.monitoring_execution_info_enabled system property
for general queries, and cassandra.sai.monitoring_execution_info_enabled for SAI queries. 
The per-writer endpoints to enable/disable slow query logging are 
http://<HOST>:<PORT>/writer/api/v0/nodetool/cassandra_relevant_property?property=MONITORING_EXECUTION_INFO_ENABLED&new_value=<false|true>
for general queries, and http://<HOST>:<PORT>/writer/api/v0/nodetool/cassandra_relevant_property?property=SAI_MONITORING_EXECUTION_INFO_ENABLED&new_value=<false|true> for SAI queries.
When a memory index contains very few rows and is split into
many shards, we can expect a lot of variance in the number of rows
between the shards. Hence, if we took only one
shard to estimate the number of matched rows, and
extrapolated that to all shards to compute the estimated matching rows
for the whole index, we would risk making a huge estimation error.

This commit changes the algorithm to take as many shards as needed
to collect enough returned or indexed rows. For very
tiny datasets it's likely to use all shards for estimation.
For big datasets, one shard will likely be enough, speeding up
estimation.
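In sketch form, the sampling loop described above (the shard interface, threshold, and extrapolation are illustrative, not the actual planner code):

```java
import java.util.List;

final class ShardEstimateSketch
{
    interface Shard
    {
        long indexedRows();
        long matchingRows();   // rows matched by the predicate in this shard
    }

    static double estimateMatchingRows(List<Shard> shards, long minSampledRows)
    {
        long sampledIndexed = 0;
        long sampledMatching = 0;
        int sampledShards = 0;
        for (Shard shard : shards)
        {
            sampledIndexed += shard.indexedRows();
            sampledMatching += shard.matchingRows();
            sampledShards++;
            if (sampledIndexed >= minSampledRows)
                break;          // enough rows sampled for a stable estimate
        }
        // Extrapolate the average per sampled shard to the shards we didn't look at.
        return sampledShards == 0 ? 0 : (double) sampledMatching / sampledShards * shards.size();
    }
}
```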

This change also allows removing one estimation method:
we no longer need to manually choose between estimating
from the first shard and from all shards.

Additionally, the accuracy of estimating NOT_EQ rows has been
improved by letting the planner know that the union generated by NOT_EQ
is disjoint, so the result set cardinality is the sum of the cardinalities
of the subplans.

The commit also contains a fix for a bug that caused some
non-hybrid queries to be counted as hybrid by the query metrics.

Unused keyRange parameters have been removed from the methods
for estimating row counts in the index classes.
…ut. (#2194)

### What is the issue
There are two different ways in which network timeouts are detected
during repair operations: there is a configured count-down latch which
will fail if the requests take too long, and there is a separate timeout
returned by the underlying messaging system. Unfortunately, when the
underlying messaging system timed out, it was being treated as a general
network error instead of a timeout. The net result is that _very rarely_
the network will time out before the count-down latch, and some tests in
the CI build will fail with an incorrect error message.

### What does this PR fix and why was it fixed
This resolves riptano/cndb#14687. 

The main motivation is flaky unit tests, but the end user will also see
a more consistent error message in the event of a network timeout that
is detected at a lower level than the latch timeout.
…Written (#2181)

### What is the issue

riptano/cndb#16330

### What does this PR fix and why was it fixed

This removes the lock in `PartialLifecycleTransaction::trackNewWritten`
around the main transaction's `trackNewWritten`, as
`RemoteLogTransaction::trackNewWritten` performs the upload of SSTables
to remote storage, which takes a long time and thus leads to
contention and prevents parallel uploads to remote storage.
…ost (#2189)

### What is the issue
Fixes: riptano/cndb#16350

### What does this PR fix and why was it fixed

ChronicleMap gives us several ways to use lower level APIs to avoid
deserializing keys/values and the associated allocation that comes with
them. The first key thing to mention is that iteration is very expensive
as these maps get big, so we want to avoid it if possible. The second is
that if we use the typical map iteration methods, they deserialize the
key and the value eagerly. Since the key is typically a high dimensional
vector, it is valuable to avoid such deserialization. This change:

* Removes unnecessary iteration leveraging the fact that compaction is
additive
* Replaces `forEach` with `forEachEntry`, which gives better semantics
* Updates the `maybeAddVector` method to avoid serializing the vector
key twice by using the `searchContext`. The `ChronicleMap#put` method
uses this pattern internally

I added two sets of benchmarks; however, `VectorCompactionBench` doesn't
seem to register the benefit of the ChronicleMap change, likely because
the ChronicleMap cost isn't as expensive as graph construction. I am
leaving `VectorCompactionBench` in place since it is still useful.

Here are some of the benchmark results. They show between a 50x and 100x
improvement. The improvement seems to increase as we build larger
graphs.

benchmark results before change:

```
     [java] Benchmark                                                   (dimension)  (numVectors)  Mode  Cnt      Score     Error  Units
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping            768        100000  avgt    5    271.569 ±   3.473  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping            768       1000000  avgt    5   5452.393 ± 227.905  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping           1536        100000  avgt    5   1392.607 ±  30.388  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping           1536       1000000  avgt    5  11496.696 ± 345.886  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany          768        100000  avgt    5    242.049 ±  20.708  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany          768       1000000  avgt    5   2365.691 ±  84.173  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany         1536        100000  avgt    5    265.395 ±   4.167  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany         1536       1000000  avgt    5   3641.557 ± 130.649  ms/op
```
after change:

```
     [java] Benchmark                                                   (dimension)  (numVectors)  Mode  Cnt    Score    Error  Units
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping            768        100000  avgt    5    5.721 ±  1.727  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping            768       1000000  avgt    5  124.536 ± 22.464  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping           1536        100000  avgt    5    5.662 ±  0.610  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping           1536       1000000  avgt    5  122.671 ±  3.343  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany          768        100000  avgt    5    5.364 ±  1.194  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany          768       1000000  avgt    5  119.449 ±  4.809  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany         1536        100000  avgt    5    5.379 ±  0.552  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany         1536       1000000  avgt    5  121.293 ±  3.040  ms/op
```
…porting repair message timeouts. (#2185)

### What is the issue
A repair task in CNDB would hang forever if a timeout for validate or sync
request is reached. This was happening because the repair initiator (the
repair service in the primary region) would consider the peer repair
service as not supporting timeouts, and therefore it would not fail the
repair task, leaving it hanging. This behavior is due to the way C* used
to detect if a peer supports repair timeouts, which is by checking the
C* version of the peer in the `Nodes.peers()` table. In CNDB, though,
the repair services (or external services in general) are not added to
the peers as they are not handled by the ServiceTracker.

### What does this PR fix and why was it fixed
Add a system property to always consider the remote peer as supporting repair
message timeouts.
Right now the peer's version is checked to figure out if timeouts are
supported, but that doesn't work in CNDB, as repair services are not
added to the Nodes.peers() table, so there's no clear way to check what
version a remote peer is running. The newly introduced system
property allows skipping this version check and always considering timeouts
supported by the remote peer. The default value of the property is `false`
to provide backward compatibility.
This commit adds new metrics related to the operation of the SAI query
planner. The metrics should help check whether the query planner makes
proper decisions, by correlating them with the other metrics,
e.g. the metrics of the actual query execution.

Per-query metrics (histograms):
- `RowsToReturnEstimated`: the estimated number of rows to be returned
  by the query
- `RowsToFetchEstimated`: the estimated number of rows the query
  is going to fetch from storage
- `KeysToIterateEstimated`: the estimated number of primary keys to
  read from the indexes when executing the query to completion
- `CostEstimated`: the abstract cost of query execution
- `LogSelectivityEstimated`: the negative decimal logarithm of the query
  selectivity, before applying the query LIMIT (0 means the query
  selects all rows, 5 means it selects a 10^(-5) = 0.00001 subset
  of rows)
- `IndexReferencesInQuery`: the number of index references in the
  unoptimized query execution plan (the same index may
  be referenced multiple times and counts separately)
- `IndexReferencesInPlan`: the number of index references in the
  optimized query execution plan (the same index may
  be referenced multiple times and counts separately)

Per-table:
- `TotalRowsToReturnEstimated`: the sum of all estimates of returned
  rows from all completed queries
- `TotalRowsToFetchEstimated`: the sum of all estimates of fetched
  rows from all completed queries
- `TotalKeysToIterateEstimated`: the sum of all estimates of iterated
  primary keys from all completed queries
- `TotalCostEstimated`: counts the sum of all cost estimates from
  all completed queries
…w id mapping (#2192)

The PrimaryKeyMap interface provides an important API to retrieve
exact-with-inverted-ceiling, ceiling, or floor row IDs for given primary keys, which
are used in the index searcher. One of the API methods was not checking all
out-of-range conditions. This bug does not affect current functionality;
however, it is an issue for riptano/cndb#15608,
where it was discovered. Since there were no unit tests on the API, the bug
was difficult to discover.

The bug in the `indexOf` method of `AbstractBlockPackedReader` is fixed by
checking the out-of-bounds condition on a non-inverted value.
Furthermore, since the same condition is repeated in several places, it
is moved into a separate method to help prevent re-introducing the bug.

Unit tests are added on the primary key map API for the exact-with-inverted-ceiling,
ceiling, and floor methods, for both the partition-aware and row-aware versions, and
they cover the bug case. `MapWalker` helper classes are introduced in each
unit test to make test cases more readable. While they look very similar,
they differ due to differences in the test databases and behaviors
between partition-aware and row-aware formats and between skinny and wide
tables.

The added unit tests also help prevent regressions during changes
to implementations of the primary key map API. Such changes are being
implemented in riptano/cndb#15608.
### What is the issue
Fixes: riptano/cndb#16484

### What does this PR fix and why was it fixed

When a partition restriction filters down to rows without vectors, we
can get a list of 0 rows, and then we attempt to create a bounded heap
of size 0, which is invalid. The added test catches this error. The
solution is to exit the method early in the event of a zero length
`SegmentRowIdOrdinalPairs` object.

```
java.lang.IllegalArgumentException: initialSize must be > 0 and < 2147483630; got: 0
	at io.github.jbellis.jvector.util.AbstractLongHeap.<init>(AbstractLongHeap.java:52)
	at io.github.jbellis.jvector.util.BoundedLongHeap.<init>(BoundedLongHeap.java:47)
	at io.github.jbellis.jvector.util.BoundedLongHeap.<init>(BoundedLongHeap.java:43)
	at org.apache.cassandra.index.sai.disk.v2.V2VectorIndexSearcher.orderByBruteForce(V2VectorIndexSearcher.java:328)
	at org.apache.cassandra.index.sai.disk.v2.V2VectorIndexSearcher.orderByBruteForce(V2VectorIndexSearcher.java:283)
	at org.apache.cassandra.index.sai.disk.v2.V2VectorIndexSearcher.searchInternal(V2VectorIndexSearcher.java:246)
	at org.apache.cassandra.index.sai.disk.v2.V2VectorIndexSearcher.orderBy(V2VectorIndexSearcher.java:172)
	at org.apache.cassandra.index.sai.disk.v1.Segment.orderBy(Segment.java:160)
```
…upported in UI scripting for multibranch but IBM TAAS needs it in the repository
maheshrajamani deleted the ds-jenkinsfile branch January 22, 2026 15:23
maheshrajamani restored the ds-jenkinsfile branch January 22, 2026 15:24