Skip to content

Fix LWT prepared statement routing failure in PRESERVE_REPLICA_ORDER mode#2084

Closed
dkropachev wants to merge 415 commits intoapache:4.xfrom
scylladb:fix-lwt-prepared-statement-routing-830
Closed

Fix LWT prepared statement routing failure in PRESERVE_REPLICA_ORDER mode#2084
dkropachev wants to merge 415 commits intoapache:4.xfrom
scylladb:fix-lwt-prepared-statement-routing-830

Conversation

@dkropachev
Copy link
Copy Markdown
Contributor

Summary

Fixes issue #830 where LWT prepared statements fail with "No node was available" error when using PRESERVE_REPLICA_ORDER routing method (default in 4.19.0.5).

The root cause was that the new LWT routing logic returned empty query plans when replica information wasn't available (common for prepared statements before parameter binding).

Changes

Core Fix

  • Enhanced newQueryPlanPreserveReplicas to implement proper node prioritization:

    1. Local DC replicas (order preserved)
    2. Remote DC replicas (order preserved)
    3. Local DC non-replicas (rotated by routing key or randomly)
    4. Remote DC non-replicas (rotated by routing key or randomly)
  • Added proper handling for both localDc == null and localDc != null cases

  • Implemented routing-key-based consistent rotation for deterministic behavior

  • Replaced diceRoll1d4() with universal randomNextInt(bound) for better testability

Test Updates

  • Updated existing LWT routing tests to match new behavior expectations
  • Added controlled randomness testing using Mockito spy to mock randomNextInt()
  • Added tests for both consistent (with routing key) and random (without routing key) rotation
  • Updated all test references from diceRoll1d4() to randomNextInt(4)

Behavior Changes

Before: LWT queries with PRESERVE_REPLICA_ORDER could fail with empty query plans
After: LWT queries always include all available nodes with proper DC prioritization

Backward Compatibility

  • Non-breaking: Only adds nodes that weren't previously included
  • When replica information is available, behavior is enhanced but compatible
  • Maintains all existing performance optimizations
  • Configuration remains the same

Test Plan

Resolves #830

dkropachev and others added 30 commits September 9, 2024 11:25
* Pull CCM image before the test

* Remove RC images from test list
Log message notifying of closing the previous ControlConnection's channel
wrongly mentions current channel. Switch it to the previous instead.
* Changed Cassandra to ScyllaDB where applicable
* Updated `DefaultLoadBalancingPolicy` documentation
* Updated links to point to ScyllaDB documentations
Removes old Resolver implementations.
Removes Resolver "hooks" that were added in core code.
Adds org.burningwave:tools dependency as the preferred solution.
Adds MultimapHostResolver implementation allowing for one-to-many mapping of
hostnames to ips.
Adds MultimapHostResolverProvider as a replacement for ResolverProvider.
Adjusts MockResolverIT to use new replacement.
* Add `MockResolverIT#cannot_reconnect_with_resolved_socket()`

Adds a method for testing the issues that surface after cluster
replacements. Due to the variable, sometimes long runtime it is not added
to any of the test groups.

* Add optional fallback for `ControlConnection#reconnect()`

Adds an experimental option to allow `ControlConnection` to try
reconnecting to the original contact points held by `MetadataManager`,
in case of getting empty query plan from the load balancing policy.

In order to separate this logic from query plans of other queries
`LoadBalancingPolicyWrapper#newControlReconnectionQueryPlan()` was introduced
and is called during reconnection in place of `newQueryPlan()`.
docs: update workflows

fix: java version

fix: java version

fix: java version
Previously the tablet map would be completely emptied on node removal,
necessitating full rebuild of it. With this change the TabletMap will
be scanned through on node removals (`RemoveNodeRefresh`) in order to delete
only those tablets, that contain removed node as one of the replicas.

Additionally we introduce `TabletMapSchemaChangeListener` which will be
automatically registered on context initialization as long as schema metadata
is enabled. It won't be added if the schema metadata is enabled later at
runtime. This listener ensures that relevant tablets will be removed
on removals and updates of both keyspaces and tables.
The `TabletMapSchemaChangesIT` tests its behaviour.

Addresses #377.
Bumps [ch.qos.logback:logback-classic](https://github.com/qos-ch/logback) from 1.2.3 to 1.2.13.
- [Commits](qos-ch/logback@v_1.2.3...v_1.2.13)

---
updated-dependencies:
- dependency-name: ch.qos.logback:logback-classic
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [com.google.guava:guava](https://github.com/google/guava) from 25.1-jre to 32.0.0-jre.
- [Release notes](https://github.com/google/guava/releases)
- [Commits](https://github.com/google/guava/commits)

---
updated-dependencies:
- dependency-name: com.google.guava:guava
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [org.testng:testng](https://github.com/cbeust/testng) from 7.3.0 to 7.5.1.
- [Release notes](https://github.com/cbeust/testng/releases)
- [Changelog](https://github.com/testng-team/testng/blob/master/CHANGES.txt)
- [Commits](testng-team/testng@7.3.0...7.5.1)

---
updated-dependencies:
- dependency-name: org.testng:testng
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [org.json:json](https://github.com/douglascrockford/JSON-java) from 20230227 to 20231013.
- [Release notes](https://github.com/douglascrockford/JSON-java/releases)
- [Changelog](https://github.com/stleary/JSON-java/blob/master/docs/RELEASES.md)
- [Commits](https://github.com/douglascrockford/JSON-java/commits)

---
updated-dependencies:
- dependency-name: org.json:json
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* CcmBridge methods adjustments

Adds `startWithArgs` methods that allow avoiding the default arguments like
"--wait-other-notice". This wait causes timeout in multi-dc test
utilizing `join_ring: false` so it needs to be avoided.

Adds `addWithoutStart(n, dc)`. As the name suggests it's an add that does not
immediately start the node.

Adds `updateNodeConfig` in order to allow adding new nodes with different
configurations.

* Add ZeroTokenNodesIT

Adds a test class verifying the behavior of the driver when dealing with
zero-token Scylla nodes.

Adds a configurable option which controls whether to allow zero-token nodes as
valid peers. Until now such nodes were considered invalid. Default value of
this option does not change the driver behavior.
…equest-timeout` (#404)

Updates the default value of `advanced.metadata.schema.request-timeout`
corresponding to `METADATA_SCHEMA_REQUEST_TIMEOUT` TypedDriverOption
to 20 seconds.

Extends the documentation with information about additional function in
Scylla clusters.
Adopt is not maintaned anymore
CI started picking up 2024.2.2 but the feature is not there yet.
This causes test failures.
Increasing minEnterprise to 2024.2.3.

Fixes #413
Previous version has a bug causing an IOException during
reading of /etc/os-release. See netty's issue 14479 for details.
* Fetch additional keyspace metadata from scylla_keyspaces

Extends schema related classes with methods for querying scylla_keyspaces
in order to determine whether keyspace is tablets-enabled.

Adjusts schema queries tests and several other integration tests
that mock responses from system tables.

Adds `isUsingTablets()` and `setUsingTablets()` to KeyspaceMetadata.

* Use keyspace metadata in `BasicLoadBalancingPolicy#getReplicas`

Modifies the getReplicas method to do the tablet lookup if and only if the
keyspace metadata indicates that it's a tablets-based keyspace. Otherwise refer
to the token map.
Previous behavior was to try tablet map lookup first regardless of the
keyspace configuration.
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.3 to 1.8.4.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](scylladb/sphinx-scylladb-theme@1.8.3...1.8.4)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Until now, the versions passed through ccm.version property would be
considered invalid if cannot be parsed to one of the known formats.
Now if the parsing fails, the property will be passed to ccm commands
"as is" which ccm will treat as directory name in local repository.
Numbered version will then be fetched through setting up a makeshift cluster
and asking ccm for it.
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.4 to 1.8.5.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](scylladb/sphinx-scylladb-theme@1.8.4...1.8.5)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The changes related to this test are not included in released
enterprise versions as of this moment.
It should be safe to increase minEnterprise requirement to 2025.1 for now.
)

Just like other drivers (Python, GoCQL), the Java 4.x driver can query with this simple WHERE clause added, to improve
query performance.

Fixes: #282

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Full scan requests slower even when only one record there is
Bumps [io.netty:netty-handler](https://github.com/netty/netty) from 4.1.116.Final to 4.1.118.Final.
- [Commits](netty/netty@netty-4.1.116.Final...netty-4.1.118.Final)

---
updated-dependencies:
- dependency-name: io.netty:netty-handler
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…thub.com/apache/cassandra-java-driver into scylla-4.x-merge-4.18.1

Following conflict resolutions and additional changes were made:
- Version was changed to 4.18.1.0-SNAPSHOT
- deep-lic-scan.yaml was removed. Likewise we don't have a fossa subscription.
- added with modifications previously skipped distribution-source module.
  Changed groupId, versions and names.
- added with modifications previously skipped distribution-tests module.
  Changed groupId, versions and names.
- Skipped license-maven-plugin which was checking for ASF license headers.
- removed `<oldArtifacts>` section of revapi plugin configuration pointing to
  datastax.
- Skipped addition of maven-remote-resources-plugin (pom.xml).
- Skipped removal of distributionManagement section (pom.xml). We are not using
  `org.apache` parent pom to define that part.
- Switched Scylla specific uses of `AttributeKey.newInstance` to `.valueOf` too
- Added `WHERE key='local'` to `system.local` queries in `DefaultTopologyMonitorTest`
- Removed irrelevant to Scylla version of the driver parts of the documentation
- Merged in translation between older and newer variations of config keys
  and values.
- Discarded some adjustments for upstream's Jenkins setup which conflicted
  with adjustments for our CI.
- Added extra stubs in BasicLoadBalancingPolicyPreferredRemoteDcsTest
  to match Scylla's modifications.
nikagra and others added 28 commits January 27, 2026 14:33
- Fix Javadoc positioning: Move @nonnull annotations after doc comments
  in DefaultLoadBalancingPolicy methods (per Java conventions)
- Add missing @nonnull annotation to StatementBuilderTest mock builder
- Add @nullable annotation to NodeStateIT query plan method signature
- Standardize test infrastructure:
  * Add @RunWith(MockitoJUnitRunner.Silent.class) to 7 test classes
  * Update LoadBalancingPolicyTestBase to stub LWT routing config option
  * Convert base class from @RunWith to abstract (subclasses now declare runner)
- Standardize integration test naming: ccmRule→CCM_RULE, sessionRule→SESSION_RULE
- Update test mocks with RequestRoutingType.REGULAR parameter for compatibility
- Improve LWT integration tests:
  * BatchStatementIT: Fix variable references, enhance LWT batch assertions
  * LWTLoadBalancingIT: Change from single-node to replica-set validation
  * Add LWTLoadBalancingMultiDcIT: New multi-DC LWT routing test coverage

No functional changes to production code—purely code quality and test improvements.

Apply suggestions from code review

Co-authored-by: Dmitry Kropachev <dmitry.kropachev@gmail.com>
commons-lang3 was pulled in transitively via the optional gremlin-core
dependency but used directly in production code (DefaultLoadBalancingPolicy).
Since gremlin-core is optional, consumers would not get commons-lang3
transitively, risking NoClassDefFoundError at runtime.

Replace Pair<Integer, Integer> with int[] in moveReplicasToFront() and
ArrayUtils.addAll() with standard Arrays.copyOf()/System.arraycopy()
in VectorCodecTest.
Fix pre-existing xml-format-maven-plugin check failure.
Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.4 to 0.3.7.

---
updated-dependencies:
- dependency-name: sphinx-multiversion-scylla
  dependency-version: 0.3.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Extract hardcoded threadCountClasses=8 into a configurable Maven
property test.parallel.threads. Default remains 8 (no behavior change).
CI can now override with -Dtest.parallel.threads=16.
Most integration tests are I/O-bound (waiting on CCM/Scylla), not
CPU-bound, so higher concurrency helps reduce total test time.
Add test-group matrix (parallelizable, serial, isolated) to both
Cassandra and Scylla integration test jobs in GitHub Actions. Each
group runs in a separate parallel job, reducing wall-clock time from
the sum of all three test phases to the duration of the slowest one.

- Add MAVEN_EXTRA_ARGS support to Makefile integration test targets
- Pass -DskipSerialITs/-DskipParallelizableITs/-DskipIsolatedITs via
  the workflow matrix to select which test group each job runs
- Update artifact names and check names to include test-group
- Remove unused run-tests.sh changes (script is not used by any CI)
Replace CustomCcmRule.builder().build() with CcmRule.getInstance() in
4 test classes that use zero customization, avoiding unnecessary cluster
creation/destruction per test class. Add @category(ParallelizableTests.class)
as required by CcmRule.

Skipped ExecutionInfoWarningsIT because it requires custom Cassandra
configuration (batch_size_warn_threshold_in_kb).
This test uses a mocked Mockito log appender to verify no resource
leaks occur. When run in parallel, other tests' log output contaminates
the appender, causing spurious NeverWantedButInvoked failures.
It must remain isolated (separate JVM fork).
Add @category(ParallelizableTests.class) to 7 integration tests that are
safe to run in parallel:

- LineStringIT, PointIT, PolygonIT: DSE geometry tests using shared
  CcmRule.getInstance() with their own keyspace
- DseAggregateMetadataIT, DseFunctionMetadataIT: DSE schema metadata
  tests using shared CcmRule.getInstance() with their own keyspace
- DriverExecutionProfileReloadIT: Uses isolated SimulacronRule
- DefaultRetryPolicyIT: Uses isolated SimulacronRule

Tests kept serial (not reclassified):
- All graph tests: use CustomCcmRule with DSE graph workloads
- Auth tests: use CustomCcmRule with authentication config
- SSL tests: use CustomCcmRule with SSL config
- Token/metadata tests: use CcmRule.builder() with custom partitioners
- Session add/remove node tests: manipulate CCM nodes
- Metrics tests: explicitly marked as not parallelizable due to
  unsynchronized access to AbstractMetricUpdater.MIN_EXPIRE_AFTER
- HeartbeatDisabledIT: explicitly marked as not parallelizable
- Delete*IT: use CustomCcmRule
DefaultRetryPolicyIT verifies log messages via a mocked appender, which
gets contaminated by other tests running in parallel, causing spurious
TooManyActualInvocations / NeverWantedButInvoked failures.
…e copilot instructions with formatting and build guidelines 🤖
Extract hardcoded semaphore permit count (5) in SchemaChangeSynchronizer
into a configurable system property test.max.concurrent.ddl. This allows
tuning DDL concurrency when increasing parallel test thread count,
without modifying source code.
The test-integration-* Makefile targets previously ran `mvn verify`
across all modules, re-compiling everything unnecessarily. Now we
first install all modules to the local repo with -DskipTests, then
run verify only on the integration-tests module with -pl.
The workflow never passed SCYLLA_VERSION from the matrix to the
resolve-scylla-version make target, so all three Scylla IT variants
(LTS-LATEST, LTS-PRIOR, LATEST) silently ran against the LATEST
version due to the Makefile default.

Fix broken dollar-sign escaping in the LTS-PRIOR scylla-enterprise
fallback regex in the Makefile ($  -> $$), which prevented resolution
of enterprise-only LTS versions.

Additionally, fix the enterprise tablets version threshold from
2023.1.0 to 2024.2.0 in DefaultLoadBalancingPolicyIT and
LWTLoadBalancingMultiDcIT. Tablets were introduced in ScyllaDB
Enterprise 2024.2.0, not 2023.1.0.
…IT to reuse sessions across tests

Convert @rule to @ClassRule for CcmRule and SessionRule so that the
CCM cluster and CQL session are created once per test class instead of
once per test method. This eliminates repeated cluster startup/teardown
overhead across all tests in these classes.

For PreparedStatementIT specifically:
- Replace DROP+CREATE TABLE with TRUNCATE between tests that don't
  modify schema, since most tests only need a clean table
- Track schema-altering tests (ALTER TABLE, DROP TABLE) with a flag
  so the table is only recreated when actually needed
- Use regular session.execute() for INSERT statements instead of
  routing them through the slow DDL execution profile
- Use try-with-resources in cross-session test to avoid prepare cache
  interference from the shared session

Retain CcmRule as a per-method @rule so that @BackendRequirement,
@ScyllaOnly, and @ScyllaSkip annotations are still evaluated per test.
Replace two identical 26-line Thread.sleep(1000) polling loops with
a shared waitForAllNodesUp() helper using Awaitility (500ms poll
interval, 20s timeout). Reduces timeout from 60s to 20s and uses
idiomatic Awaitility consistent with other integration tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mode

Addresses issue #830 where LWT prepared statements fail with "No node was available"
error when using PRESERVE_REPLICA_ORDER routing (default in 4.19.0.5).

Changes:
- Enhanced newQueryPlanPreserveReplicas to include non-replica nodes with proper DC prioritization
- Added fallback logic to prevent empty query plans for prepared statements
- Implemented routing-key-based consistent rotation for non-replica nodes
- Replaced diceRoll1d4() with universal randomNextInt(bound) method
- Updated tests to validate new behavior including randomness testing

The fix maintains replica order preservation while ensuring query plans always
contain nodes, resolving the regression introduced in 4.19.0.5.
@dkropachev
Copy link
Copy Markdown
Contributor Author

Wrong repository - moving to scylladb/java-driver

@dkropachev dkropachev closed this Mar 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants