@jacek-lewandowski
Contributor

No description provided.

beobal and others added 30 commits September 29, 2023 17:26
Members of the ClusterMetadataService (CMS) replicate the global log
table, and are responsible for linearizing inserts into the log.
Log entries contain transformations which are applied to ClusterMetadata
in the prescribed order. Log entries are replicated to non-CMS members
only after being ordered and inserted into the log.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
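
The ordering guarantee described in the commit message above can be sketched as follows. This is a minimal illustration, not the code from this patch: `LogSketch`, `append` and `add` are hypothetical names, the metadata is reduced to a list of strings, and buffering of out-of-order entries is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: not the ClusterMetadata or log classes from this patch.
final class LogSketch {
    // Epoch of the last entry applied to the local metadata.
    private long appliedEpoch = 0;
    // Toy stand-in for ClusterMetadata.
    private List<String> metadata = new ArrayList<>();
    // Entries received ahead of their turn wait here, keyed by epoch (an assumption of this sketch).
    private final TreeMap<Long, UnaryOperator<List<String>>> pending = new TreeMap<>();

    // Accept an entry, then apply every entry that is contiguous with appliedEpoch.
    void append(long epoch, UnaryOperator<List<String>> transformation) {
        if (epoch <= appliedEpoch)
            return; // already applied, ignore duplicate delivery
        pending.put(epoch, transformation);
        while (!pending.isEmpty() && pending.firstKey() == appliedEpoch + 1) {
            metadata = pending.pollFirstEntry().getValue().apply(metadata);
            appliedEpoch++;
        }
    }

    long appliedEpoch() { return appliedEpoch; }

    List<String> metadata() { return metadata; }

    // Builds a toy transformation that returns a copy of the metadata with one more item.
    static UnaryOperator<List<String>> add(String item) {
        return current -> {
            List<String> next = new ArrayList<>(current);
            next.add(item);
            return next;
        };
    }
}
```

In this sketch an entry with epoch 2 that arrives before epoch 1 is held back, so every peer ends up applying the same transformations in the same order.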
Schema itself is a component of ClusterMetadata and all DDL updates are
applied by the CMS inserting a log entry containing a schema
transformation. As the log entries are disseminated around the cluster,
each peer applies the transformation to its local ClusterMetadata,
enacting the schema change. This entails some changes to the way the
database objects represented in schema are initialised (db objects refers
to classes like Keyspace, ColumnFamilyStore, etc).

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Adds a new Directory component to ClusterMetadata to manage member
identity, state, location and addressing. This duplicates some of the
functions of TokenMetadata, Topology et al but with updates performed
consistently via the global log. Although it isn't actually used for
anything yet, it is a prerequisite for managing data ownership through
TCM, which will eventually replace TokenMetadata completely.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Introduce new classes for representing placement of data ranges on
replicas, along with the movement of data via transitions from one
placement to the next. Eventually, these placements will be statically
calculated in response to events which alter either the topology of the
cluster (i.e. adding/removing/moving nodes) or the replication profile
of the data itself (i.e. creating/altering keyspaces). These triggering
events will be distributed and enacted consistently using the global log.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Minimal modifications to AbstractReplicationStrategy implementations to
support the production of DataPlacements using ClusterMetadata while
retaining calculateNaturalReplicas. Also adds tests to compare the
output of both methods and assert their equivalence. Eventually, the
original implementations based on TokenMetadata will be retired from the
main source and retained only in the test source to guard against regressions.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Introduces transformations to modify ClusterMetadata with the effect of
modifying data ownership and placement i.e. join, replace, move &
decommission. These operations do not simply modify metadata however,
so simple atomic updates to cluster metadata are not sufficient.
Streaming data in and out of nodes must also occur and obviously read
and write operations must continue whilst this is in progress. These
operations then are performed in phases, planned in advance and properly
linearized using the global log. e.g. to join a new node, the full set
of phased range movements required is calculated, generating an
actionable plan which is then serialised into ClusterMetadata.
Concurrent operations are permitted as long as they only affect disjoint
token ranges, ensuring that concurrent range movements remain safe and
cluster invariants are preserved at all times, including in the case of
the failure to complete any operation (i.e. failed bootstrap).

This commit only adds the transformations and supporting TCM components
(LockedRanges, InProgressSequences etc.); the implementation that actually
performs the operations follows in subsequent commits.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
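
The rule that concurrent operations must affect disjoint token ranges can be sketched roughly as below. This is an assumption-laden illustration and not the `LockedRanges` class added by the commit: `RangeLockSketch`, `Range`, `tryLock` and `unlock` are hypothetical names, and ranges are simplified to non-wrapping half-open intervals `(left, right]`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the LockedRanges class added by this patch.
final class RangeLockSketch {
    // Half-open token range (left, right]; wrap-around ranges are not modelled here.
    static final class Range {
        final long left;
        final long right;

        Range(long left, long right) {
            if (left >= right)
                throw new IllegalArgumentException("wrapping or empty ranges are not modelled");
            this.left = left;
            this.right = right;
        }

        // Two (left, right] ranges share at least one token iff each starts before the other ends.
        boolean intersects(Range other) {
            return left < other.right && other.left < right;
        }
    }

    // Ranges held by operations that are currently in progress.
    private final List<Range> locked = new ArrayList<>();

    // Locks every range for an operation, or none if any of them overlaps a held range.
    boolean tryLock(List<Range> affected) {
        for (Range candidate : affected)
            for (Range held : locked)
                if (candidate.intersects(held))
                    return false;
        locked.addAll(affected);
        return true;
    }

    // Releases ranges when an operation completes or is cancelled.
    // Matching is by object identity, so pass the same Range instances used in tryLock.
    void unlock(List<Range> affected) {
        locked.removeAll(affected);
    }
}
```

With this sketch, a join touching `(0, 100]` and a move touching `(100, 200]` can proceed together, while an operation touching `(50, 150]` is rejected until both have released their ranges.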
WIP commit (i.e. does not compile) beginning the process of removing
gossip as the source of truth regarding membership, ownership, topology
and data placement. This task will be split over multiple commits.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
WIP commit (i.e. does not compile) replacing initial toy implementation
of CMS membership with proper implementation. Membership of the CMS is
determined by ownership of keyspaces with the META replication strategy
(more precisely, by being a member of the _read_ placements for meta
strategy keyspaces, a node is considered a member of the CMS).

Also implements more of the "real" [pre]initialization of the CMS, in
preparation for supporting upgrading a running cluster from a gossip
based system.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Part 1 of 7 commits applying the main changes to migrate StorageService
away from managing state using TokenMetadata with updates propagated
using gossip.

This commit makes the initial bulk changes to StorageService itself and
thoroughly breaks compilation. Subsequent commits in the series fix the
main build before test code is updated later.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Part 2 of 7 moves most of the data placement/ownership code over to the
TCM structures.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Part 3 of 7 adds the ability to detect version mismatches between peers
on the read/write path and to handle such divergence. Lagging peers will
attempt to catch up from the CMS if the coordinator in a r/w operation
has seen newer metadata. Coordinators may fail writes if the cluster
metadata changes while the write is in flight, if the consistency level
can no longer be satisfied by the original replica plan.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
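
The two checks described in the commit message above might look roughly like this. It is a sketch under stated assumptions, not the patch's implementation: `EpochCheckSketch` and both method names are hypothetical, epochs are plain `long` values, and replicas are identified by strings.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: not the read/write path code from this patch.
final class EpochCheckSketch {
    // Replica side: true when the coordinator has seen newer metadata than this node,
    // in which case the node should try to catch up from the CMS.
    static boolean replicaMustCatchUp(long localEpoch, long coordinatorEpoch) {
        return coordinatorEpoch > localEpoch;
    }

    // Coordinator side (assumed rule for this sketch): after metadata changed mid-write,
    // only acknowledgements from nodes that are still replicas count towards the
    // consistency level.
    static boolean writeStillSatisfied(Set<String> acked, Set<String> currentReplicas, int required) {
        Set<String> valid = new HashSet<>(acked);
        valid.retainAll(currentReplicas);
        return valid.size() >= required;
    }
}
```

If `writeStillSatisfied` returns false, the coordinator in this sketch would fail the write rather than report success against a replica set that no longer owns the data.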
Part 4 of 7 modifications to ColumnFamilyStore, mostly related to:
* ShardBoundaries
* DiskBoundaries

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Part 5 of 7: the only remaining compilation errors in non-test code are
those directly related to TokenMetadata

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Part 6 of 7 completely removes TokenMetadata. The intention is to bring
it back in a stripped down form, available to tests only, so we can
continue to verify equivalence between old and new code.

Test code is still extremely broken at this point, but non-test code is
buildable again, though almost certainly not actually runnable.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Part 7 of 7 brings StorageService.operationMode back into sync with
previous behaviour. Many external coordination tools depend on accessing
this state via JMX, so this is an important external interface.

This commit also adds a virtual version of the system.local table, as we
can fully construct the data for this from ClusterMetadata, meaning we no
longer need the on-disk system table, though this is retained for now. In
future, more system tables can be virtualised (system.peers,
system_schema, etc).

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Adds new nodetool commands to:
* list members of the CMS
* initiate a snapshot of ClusterMetadata via submitting a SealPeriod
  operation

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Adds a handful of implementations to subclasses in the
org.apache.cassandra.utils.concurrent package

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Adds a property for use in tests and debugging which preserves
the stacktrace of when a thread is created by NamedThreadFactory.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Following an upgrade, nodes in an existing cluster will enter a minimal
modification mode. In this state, the set of allowed cluster metadata
modifications is constrained to include only the addition, removal and
replacement of nodes, to allow failed hosts to be replaced during the
upgrade.

In this mode the CMS has no members and each peer maintains its
own ClusterMetadata independently. This metadata is initialised at
startup from system tables and gossip is used to propagate the permitted
metadata changes.

When the operator is ready, one node is chosen for promotion to the initial
CMS, which is done manually via nodetool. At this point, the candidate node
will propose itself as the initial CMS and attempt to gain consensus from
the rest of the cluster. If successful, it verifies that all peers have an
identical view of cluster metadata and initialises the distributed log with
a snapshot of that metadata.

Once this process is complete all future cluster metadata updates are performed
via the CMS using the global log and reverting to the previous method of
metadata management is not supported. Further members can and should be added
to the CMS via the nodetool command.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
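
The agreement check that precedes initialisation of the distributed log could be sketched as below. The commit message does not say how views are compared, so the use of a per-peer digest string is purely an assumption, and `InitialCmsSketch` and `allPeersAgree` are hypothetical names.

```java
import java.util.Map;

// Hypothetical illustration only: not the CMS initialisation code from this patch.
final class InitialCmsSketch {
    // Returns true only when every peer reports the same metadata digest as the candidate.
    // Comparing digests is an assumption of this sketch, not a documented mechanism.
    static boolean allPeersAgree(String candidateDigest, Map<String, String> digestByPeer) {
        for (String digest : digestByPeer.values())
            if (!candidateDigest.equals(digest))
                return false;
        return true;
    }
}
```

Only when the check passes would the candidate go on to write the initial snapshot; a single mismatching peer is enough to stop the promotion in this sketch.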
Minimal changes to IEndpointSnitch implementations to have them pull
location info from Directory.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Alter CassandraDaemon initialization to accommodate TCM and replay of the
cluster metadata log. This is something of a WIP and there is clearly
scope to further clean up this part of the code.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Updates the existing unit tests and dtests to work with TCM. In the vast
majority of cases, this just means changes to initialization or to
slightly updated method signatures.

In CEP-21 generally, the intention has been not to modify existing
public interfaces at all and to limit any changes to code on the
boundaries of internal subsystems. In addition, care has been taken to
only make minimal modifications to existing tests, and to preserve their
invariants. So although this commit is fairly large in terms of
number of files, its changes should be semantically quite light.

Co-authored-by: Marcus Eriksson <marcuse@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
patch by Marcus Eriksson; reviewed by Alex Petrov and Sam Tunnicliffe
for CASSANDRA-18409
patch by Marcus Eriksson; reviewed by Alex Petrov and Sam Tunnicliffe
for CASSANDRA-18410
patch by Marcus Eriksson; reviewed by Alex Petrov and Sam Tunnicliffe
for CASSANDRA-18412
patch by Marcus Eriksson; reviewed by Alex Petrov and Sam Tunnicliffe
for CASSANDRA-18414
…r decommission

patch by Alex Petrov; reviewed by Marcus Eriksson and Sam Tunnicliffe
for CASSANDRA-18416
beobal and others added 23 commits September 29, 2023 17:30
…lterSchema transformations

Only when a coordinator is preparing to submit an AlterSchemaStatement
to the CMS.
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Co-authored-by: Marcus Eriksson <marcus_eriksson@apple.com>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
… startup

* Don't try to connect to them with StartupClusterConnectivityChecker
* Don't pre-emptively mark them as DOWN in Gossiper::waitToSettle
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
Co-authored-by: Alex Petrov <oleksandr.petrov@gmail.com>
Co-authored-by: Sam Tunnicliffe <samt@apache.org>
// pause capture and resume after in applying the schema change.
schemaTransformation.enterExecution();
if (!isReplay)
    schemaTransformation.enterExecution();
Contributor Author

I know I screwed this; will fix it

@asfgit asfgit force-pushed the cep-21-tcm branch 2 times, most recently from 2211ddf to ece8e96 Compare November 22, 2023 21:04
@asfgit asfgit force-pushed the cep-21-tcm branch 2 times, most recently from c6a6822 to 1c5c548 Compare November 24, 2023 09:16
@asfgit asfgit deleted the branch apache:cep-21-tcm March 27, 2024 13:25
@asfgit asfgit closed this Mar 27, 2024

5 participants