Cassandra 18954 #2839

jacek-lewandowski · 2023-10-24T12:19:40Z

No description provided.

Members of the ClusterMetadataService (CMS) replicate the global log table, and are responsible for linearizing inserts into the log. Log entries contain transformations which are applied to ClusterMetadata in the prescribed order. Log entries are replicated to non-CMS members only after being ordered and inserted into the log. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>
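As an illustration of "applied in the prescribed order", here is a minimal sketch of a local log that buffers entries arriving out of order and applies them strictly by epoch. Everything in it (`MetadataLogSketch`, `append`, `createKeyspace`, a list of keyspace names standing in for ClusterMetadata) is hypothetical and is not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

/** Hypothetical, simplified model of an ordered metadata log (not Cassandra's classes). */
final class MetadataLogSketch {
    // Entries received but not yet applicable because an earlier epoch is missing.
    private final TreeMap<Long, UnaryOperator<List<String>>> pending = new TreeMap<>();
    // Stand-in for ClusterMetadata: here just a list of keyspace names.
    private List<String> keyspaces = List.of();
    private long epoch = 0;

    /** Buffers the entry, then applies every entry whose epoch is now consecutive. */
    void append(long entryEpoch, UnaryOperator<List<String>> transformation) {
        if (entryEpoch <= epoch)
            return; // already applied; ignore the duplicate
        pending.put(entryEpoch, transformation);
        while (!pending.isEmpty() && pending.firstKey() == epoch + 1) {
            keyspaces = pending.pollFirstEntry().getValue().apply(keyspaces);
            epoch++;
        }
    }

    long epoch() { return epoch; }

    List<String> keyspaces() { return keyspaces; }

    /** Example transformation: returns a copy of the list with one more keyspace. */
    static UnaryOperator<List<String>> createKeyspace(String name) {
        return current -> {
            List<String> next = new ArrayList<>(current);
            next.add(name);
            return List.copyOf(next);
        };
    }

    public static void main(String[] args) {
        MetadataLogSketch log = new MetadataLogSketch();
        log.append(2, createKeyspace("ks_b")); // arrives early, so it is buffered
        log.append(1, createKeyspace("ks_a")); // fills the gap; both entries are then applied in order
        System.out.println(log.epoch() + " " + log.keyspaces());
    }
}
```

Because every peer applies the same transformations in the same epoch order, each one should arrive at the same metadata regardless of the order in which entries were delivered.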

Schema itself is a component of ClusterMetadata and all DDL updates are applied by the CMS inserting a log entry containing a schema transformation. As the log entries are disseminated around the cluster, each peer applies the transformation to its local ClusterMetadata, enacting the schema change. This entails some changes to the way the database objects represented in schema are initialised (db objects refers to classes like Keyspace, ColumnFamilyStore, etc). Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Adds a new Directory component to ClusterMetadata to manage member identity, state, location and addressing. This duplicates some of the functions of TokenMetadata, Topology et al but with updates performed consistently via the global log. Although it isn't actually used for anything yet, it is a prerequisite for managing data ownership through TCM, which will eventually replace TokenMetadata completely. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Introduce new classes for representing placement of data ranges on replicas, along with the movement of data via transitions from one placement to the next. Eventually, these placements will be statically calculated in response to events which alter either the topology of the cluster (i.e. adding/removing/moving nodes) or the replication profile of the data itself (i.e. creating/altering keyspaces). These triggering events will be distributed and enacted consistently using the global log. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>
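One way to picture a transition between two placements is as a per-range delta of replicas to add and replicas to remove. The sketch below is only an illustration under simplifying assumptions: ranges are plain string labels with identical boundaries in both placements, replicas are node names, and `PlacementDeltaSketch`, `Delta` and `diff` are hypothetical names rather than classes from this patch.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: derive the replica changes needed to move from one placement to another. */
final class PlacementDeltaSketch {
    /** Replicas gaining and losing a single range. */
    static final class Delta {
        final Set<String> additions;
        final Set<String> removals;

        Delta(Set<String> additions, Set<String> removals) {
            this.additions = additions;
            this.removals = removals;
        }

        @Override
        public String toString() { return "+" + additions + " -" + removals; }
    }

    /** Maps each range whose replica set changes to the additions and removals for that range. */
    static Map<String, Delta> diff(Map<String, Set<String>> from, Map<String, Set<String>> to) {
        Set<String> ranges = new TreeSet<>(from.keySet());
        ranges.addAll(to.keySet());
        Map<String, Delta> deltas = new TreeMap<>();
        for (String range : ranges) {
            Set<String> before = from.getOrDefault(range, Set.of());
            Set<String> after = to.getOrDefault(range, Set.of());
            Set<String> additions = new TreeSet<>(after);
            additions.removeAll(before); // replicas present only in the target placement
            Set<String> removals = new TreeSet<>(before);
            removals.removeAll(after);   // replicas present only in the starting placement
            if (!additions.isEmpty() || !removals.isEmpty())
                deltas.put(range, new Delta(additions, removals));
        }
        return deltas;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> from = Map.of("(0,100]", Set.of("n1", "n2", "n3"));
        Map<String, Set<String>> to = Map.of("(0,100]", Set.of("n2", "n3", "n4"));
        System.out.println(diff(from, to));
    }
}
```

Range splitting and merging, which a real topology change would involve, are deliberately left out of this sketch.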

Minimal modifications to AbstractReplicationStrategy implementations to support the production of DataPlacements using ClusterMetadata while retaining calculateNaturalReplicas. Also adds tests to compare the output of both methods and assert their equivalence. Eventually, the original implementations based on TokenMetadata will be retired and will be retained in the test source to guard against regressions. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Introduces transformations to modify ClusterMetadata with the effect of modifying data ownership and placement, i.e. join, replace, move & decommission. These operations do not simply modify metadata however, so simple atomic updates to cluster metadata are not sufficient. Streaming data in and out of nodes must also occur and obviously read and write operations must continue whilst this is in progress. These operations then are performed in phases, planned in advance and properly linearized using the global log. For example, to join a new node, the full set of phased range movements required is calculated, generating an actionable plan which is then serialised into ClusterMetadata. Concurrent operations are permitted as long as they only affect disjoint token ranges, ensuring that concurrent range movements remain safe and cluster invariants are preserved at all times, including in the case of the failure to complete any operation (i.e. failed bootstrap). This commit only adds the transformations and supporting TCM components (LockedRanges, InProgressSequences etc); the implementation of actually performing the operations follows in subsequent commits. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>
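The rule that concurrent operations must touch disjoint token ranges can be sketched as an all-or-nothing range lock. This is a rough illustration only, not the LockedRanges implementation from the patch: `RangeLockSketch`, `Range`, `tryLock` and `unlock` are hypothetical names, and ranges are modelled as non-wrapping `(left, right]` intervals over `long` tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of admitting an operation only if its ranges are disjoint from those in progress. */
final class RangeLockSketch {
    /** Token range (left, right]; wrap-around ranges are not modelled here. */
    static final class Range {
        final long left;
        final long right;

        Range(long left, long right) {
            if (left >= right)
                throw new IllegalArgumentException("wrapping or empty ranges are not modelled");
            this.left = left;
            this.right = right;
        }

        /** Two half-open ranges overlap when each one starts before the other ends. */
        boolean intersects(Range other) {
            return left < other.right && other.left < right;
        }
    }

    private final List<Range> locked = new ArrayList<>();

    /** Locks all requested ranges, or none of them if any overlaps an in-progress operation. */
    boolean tryLock(List<Range> wanted) {
        for (Range w : wanted)
            for (Range l : locked)
                if (w.intersects(l))
                    return false;
        locked.addAll(wanted);
        return true;
    }

    /** Releases ranges when an operation finishes or is cancelled (matched by object identity). */
    void unlock(List<Range> held) {
        locked.removeAll(held);
    }

    public static void main(String[] args) {
        RangeLockSketch locks = new RangeLockSketch();
        List<Range> join = List.of(new Range(0, 100));
        List<Range> move = List.of(new Range(50, 150));
        System.out.println(locks.tryLock(join)); // nothing is locked yet
        System.out.println(locks.tryLock(move)); // overlaps the join, so it is rejected
        locks.unlock(join);
        System.out.println(locks.tryLock(move)); // the join has released its ranges
    }
}
```

In the design described above the lock state would itself live in ClusterMetadata and change only through log entries, so that every node agrees on which operations are in flight; the sketch keeps it in local memory for brevity.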

WIP commit (i.e. does not compile) beginning the process of removing gossip as the source of truth regarding membership, ownership, topology and data placement. This task will be split over multiple commits. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

WIP commit (i.e. does not compile) replacing initial toy implementation of CMS membership with proper implementation. Membership of the CMS is determined by ownership of keyspaces with the META replication strategy (more precisely, by being a member of the _read_ placements for meta strategy keyspaces, a node is considered a member of the CMS). Also implements more of the "real" [pre]initialization of the CMS, in preparation for supporting upgrading a running cluster from a gossip based system. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Part 1 of 7 commits applying the main changes to migrate StorageService away from managing state using TokenMetadata with updates propagated using gossip. This commit makes the initial bulk changes to StorageService itself and thoroughly breaks compilation. Subsequent commits in the series fix the main build before test code is updated later. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Part 2 of 7 moves most of the data placement/ownership code over to the TCM structures. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Part 3 of 7 adds the ability to detect version mismatches between peers on the read/write path and to handle such divergence. Lagging peers will attempt to catch up from the CMS if the coordinator in a r/w operation has seen newer metadata. Coordinators may fail writes if the cluster metadata changes while the write is in flight, if the consistency level can no longer be satisfied by the original replica plan. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>
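The two checks described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch's code: `EpochCheckSketch`, `onRequest` and `writeStillValid` are hypothetical names, replicas are plain strings, and the consistency level is reduced to a required number of acknowledgements. The commit message does not say what happens when the coordinator is the one lagging, so that case is not modelled.

```java
import java.util.Set;

/** Hypothetical sketch of epoch comparison on the read/write path. */
final class EpochCheckSketch {
    enum Action { PROCEED, CATCH_UP_FIRST }

    /** A replica that is behind the coordinator's epoch catches up before serving the request. */
    static Action onRequest(long localEpoch, long coordinatorEpoch) {
        return coordinatorEpoch > localEpoch ? Action.CATCH_UP_FIRST : Action.PROCEED;
    }

    /**
     * After a metadata change during a write, checks whether enough of the replicas
     * in the original plan are still replicas to reach the required number of acks.
     */
    static boolean writeStillValid(Set<String> originalPlan, Set<String> currentReplicas, int requiredAcks) {
        long usable = originalPlan.stream().filter(currentReplicas::contains).count();
        return usable >= requiredAcks;
    }

    public static void main(String[] args) {
        System.out.println(onRequest(5, 7));
        System.out.println(writeStillValid(Set.of("n1", "n2", "n3"), Set.of("n2", "n3", "n4"), 2));
    }
}
```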

Part 4 of 7 modifications to ColumnFamilyStore, mostly related to:
* ShardBoundaries
* DiskBoundaries

Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Part 5 of 7. After this commit, the only remaining compilation errors in non-test code are those directly related to TokenMetadata. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Part 6 of 7. Completely removes TokenMetadata; the intention is to bring it back in a stripped down form, available to tests only, so we can continue to verify equivalence between old and new code. Test code is still extremely broken at this point, but non-test code is buildable again, though almost certainly not actually runnable. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Part 7 of 7 brings StorageService.operationMode back into sync with previous behaviour. Many external coordination tools depend on accessing this state via JMX, so this is an important external interface. This commit also adds a virtual version of the system.local table, as we can fully construct the data for this from ClusterMetadata, meaning we no longer need the on-disk system table, though this is retained for now. In future, more system tables can be virtualised (system.peers, system_schema, etc). Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Adds new nodetool commands to:
* list members of the CMS
* initiate a snapshot of ClusterMetadata via submitting a SealPeriod operation

Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Adds a handful of implementations to subclasses in the org.apache.cassandra.utils.concurrent package Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Adds a property for use in tests and debugging which preserves the stacktrace of when a thread is created by NamedThreadFactory. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Following an upgrade, nodes in an existing cluster will enter a minimal modification mode. In this state, the set of allowed cluster metadata modifications is constrained to include only the addition, removal and replacement of nodes, to allow failed hosts to be replaced during the upgrade. In this mode the CMS has no members and each peer maintains its own ClusterMetadata independently. This metadata is initialised at startup from system tables and gossip is used to propagate the permitted metadata changes. When the operator is ready, one node is chosen for promotion to the initial CMS, which is done manually via nodetool. At this point, the candidate node will propose itself as the initial CMS and attempt to gain consensus from the rest of the cluster. If successful, it verifies that all peers have an identical view of cluster metadata and initialises the distributed log with a snapshot of that metadata. Once this process is complete, all future cluster metadata updates are performed via the CMS using the global log and reverting to the previous method of metadata management is not supported. Further members can and should be added to the CMS via the nodetool command. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>
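The restriction in minimal modification mode amounts to a whitelist that is lifted, one way only, once the CMS has been initialised. The sketch below illustrates that reading of the paragraph; `UpgradeModeSketch`, the `Change` enum and its constants are hypothetical, and treating token moves and schema changes as blocked is an inference from "only the addition, removal and replacement of nodes" rather than something the commit message spells out.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of which metadata changes are accepted before and after CMS initialisation. */
final class UpgradeModeSketch {
    enum Change { ADD_NODE, REMOVE_NODE, REPLACE_NODE, MOVE_TOKEN, ALTER_SCHEMA }

    // Changes permitted while the cluster is still in minimal modification mode.
    private static final Set<Change> MINIMAL_MODE =
        EnumSet.of(Change.ADD_NODE, Change.REMOVE_NODE, Change.REPLACE_NODE);

    private boolean cmsInitialised = false;

    /** Every change is allowed once the CMS exists; before that, only the whitelist. */
    boolean isPermitted(Change change) {
        return cmsInitialised || MINIMAL_MODE.contains(change);
    }

    /** One-way transition: there is deliberately no method to go back. */
    void initialiseCms() {
        cmsInitialised = true;
    }

    public static void main(String[] args) {
        UpgradeModeSketch cluster = new UpgradeModeSketch();
        System.out.println(cluster.isPermitted(Change.REPLACE_NODE));
        System.out.println(cluster.isPermitted(Change.ALTER_SCHEMA));
        cluster.initialiseCms();
        System.out.println(cluster.isPermitted(Change.ALTER_SCHEMA));
    }
}
```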

Minimal changes to IEndpointSnitch implementations to have them pull location info from Directory. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Alter CassandraDaemon initialization to accommodate TCM and replay of the cluster metadata log. This is something of a WIP and there is clearly scope to further clean up this part of the code. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

Updates the existing unit and dtests to work with TCM. In the vast majority of cases, this just means changes to initialization or to slightly updated method signatures. In CEP-21 generally, the intention has been not to modify existing public interfaces at all and to limit any changes to code on the boundaries of internal subsystems. In addition, care has been taken to only make minimal modifications to existing tests, and to preserve their invariants. So although this commit is fairly large in terms of number of files, its changes should be semantically quite light. Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

jacek-lewandowski · 2023-10-24T12:25:05Z

src/java/org/apache/cassandra/tcm/transformations/AlterSchema.java

            // pause capture and resume after in applying the schema change.
-            schemaTransformation.enterExecution();
+            if (!isReplay)
+                schemaTransformation.enterExecution();


I know I screwed this; will fix it

beobal and others added 30 commits September 29, 2023 17:26

[CEP-21] Include current epoch in internode header

5c78676

Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

[CEP-21] Test / build config changes

cc520f6

Co-authored-by: Marcus Eriksson <marcuse@apache.org> Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

TMP - use bundled version of harry

fb1c223

[CEP-21] Correctly represent bootstrapping nodes in StorageService

58c7d85

patch by Marcus Eriksson; reviewed by Alex Petrov and Sam Tunnicliffe for CASSANDRA-18409

[CEP-21] Fix nodetool ring and effective ownership

acac57d

patch by Marcus Eriksson; reviewed by Alex Petrov and Sam Tunnicliffe for CASSANDRA-18410

[CEP-21] Secondary indexes should not be rebuilt on restart

a21ebd2

patch by Marcus Eriksson; reviewed by Alex Petrov and Sam Tunnicliffe for CASSANDRA-18412

[CEP-21] Re-enable stdout/stderr redirection at startup

b07681e

patch by Marcus Eriksson; reviewed by Alex Petrov and Sam Tunnicliffe for CASSANDRA-18414

[CEP-21] Ensure that global log replication factor is maintained afte…

fce7b46

…r decommission patch by Alex Petrov; reviewed by Marcus Eriksson and Sam Tunnicliffe for CASSANDRA-18416

beobal and others added 23 commits September 29, 2023 17:30

[CEP-21] Don't trigger client warnings or guardrails when executing A…

ce7511f

…lterSchema transformations Only when a coordinator is preparing to submit an AlterSchemaStatement to the CMS.

[CEP-21] Remove redundant Keyspaces arg from SchemaTransformation::apply

d036718

[CEP-21] Handle case where removenode requires no streaming to restor…

47769f0

…e RF

[CEP-21] Implement versioning for ranges

ef1ad17

Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org> Co-authored-by: Marcus Eriksson <marcus_eriksson@apple.com>

[CEP-21] Retry indefinitely for STARTUP messages.

5924404

Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>

[CEP-21] Fix flaky distributed log test. While it fails very infreque…

6ea5042

…ntly on CI, it consistently fails locally.

[CEP-21] Remove LEFT peers from system tables and exclude them during…

2aef6f6

… startup
* Don't try to connect to them with StartupClusterConnectivityChecker
* Don't pre-emptively mark them as DOWN in Gossiper::waitToSettle

[CEP-21] fix nodetool bootstrap resume

2265c4b

Co-authored-by: Sam Tunnicliffe <samt@apache.org>

[CEP-21] Implement replacement with same address

c1e94aa

Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com> Co-authored-by: Sam Tunnicliffe <samt@apache.org>

[CEP-21] Upgrading a one node cluster to TCM fails attempting Gossip …

8bcd581

…shadow round

[CEP-21] serialize MemtableParams when writing TableParams

e81ddb4

[CEP-21] remove authsetup

e5a8ac2

[CEP-21] fix cqlshlib tests

a90cc1b

update dtest repo for cci

8a216a0

[CEP-21] CASSANDRA-18816 rebase fixes

4d06d73

[CEP-21] fix GossiperTest - this test now matches trunk

4d6aab0

Add implementation overview doc

357ca7c

Use pinned Harry version

97bf0f1

Add isReplay parameter to the SchemaTransformation.apply method

b6e5ae5

Update AlterTableStatement (example - will apply to other statements …

1787632

…once agreed)

Add isReplay parameter to Transformation.execute method

5152b24

compute isReplay in the local log

e69e952

wip - fix the data-loss problem when dropping/creating columns

2fcc6ae

asfgit force-pushed the cep-21-tcm branch 2 times, most recently from 2211ddf to ece8e96 Compare November 22, 2023 21:04

asfgit force-pushed the cep-21-tcm branch 2 times, most recently from c6a6822 to 1c5c548 Compare November 24, 2023 09:16

asfgit deleted the branch apache:cep-21-tcm March 27, 2024 13:25

asfgit closed this Mar 27, 2024

5 participants
