(WIP) CDC reader in Apache Cassandra Sidecar - CASSANDRASC-27 #18

tharanga · 2020-10-20T22:03:27Z

This is a WIP version of a Cassandra change stream emitter based on the CDC feature of Cassandra 4.0-beta2.

New dependencies:

Cassandra 4.0-beta2 Jar

New config:

cdc: configPath: Path to the Cassandra server config file

Pre-read:
https://cassandra.apache.org/doc/latest/operating/cdc.html

How to use:

Enable CDC in Cassandra through cassandra.yaml: cdc_enabled: true
Set commitlog_sync_period_in_ms (for example, 10000) according to how quickly you want to observe changes (100 ms lower limit)
Enable CDC on a table: ALTER TABLE <your table> WITH cdc=true;
Change the sidecar config cdc: configPath: to point to the cassandra.yaml
Start the sidecar and insert data into the CDC-enabled table; the changes are emitted to the log

Current limitations:

Restart the sidecar upon schema changes
Other unknown bugs due to the absence of unit tests

Tasks of the initial version:

Maxwell-Guo · 2020-10-27T06:48:53Z

src/main/dist/conf/sidecar.yaml

@@ -24,3 +24,6 @@ sidecar:

 healthcheck:
  - poll_freq_millis: 30000
+
+cdc:


I think this key could be renamed from "cdc" to something like "cassandra config file", because the cassandra.yaml path may be useful for more than CDC.

I also think this config path should go under "cassandra:", the top-level section of sidecar.yaml.

I'd wait for a general need and then refactor it out of the cdc section. We can do it now if there's such a need.

src/main/java/org/apache/cassandra/sidecar/CQLSession.java

src/main/java/org/apache/cassandra/sidecar/Configuration.java

Maxwell-Guo · 2020-10-27T08:48:46Z

src/main/java/org/apache/cassandra/sidecar/MainModule.java

@@ -151,6 +163,7 @@ public Configuration configuration() throws ConfigurationException, IOException
                    .setTrustStorePath(yamlConf.get(String.class, "sidecar.ssl.truststore.path", null))
                    .setTrustStorePassword(yamlConf.get(String.class, "sidecar.ssl.truststore.password", null))
                    .setSslEnabled(yamlConf.get(Boolean.class, "sidecar.ssl.enabled", false))
+                    .setCassandraConfigPath(yamlConf.get(String.class, "cdc.configPath"))


As I said before, the Cassandra config path could also live under the cassandra section of sidecar.yaml.

src/main/java/org/apache/cassandra/sidecar/cdc/CDCBookmark.java

Maxwell-Guo · 2020-10-27T12:23:47Z

src/main/java/org/apache/cassandra/sidecar/CassandraSidecarDaemon.java

    }

    public void start()
    {
        banner(System.out);
        validate();
        logger.info("Starting Cassandra Sidecar on {}:{}", config.getHost(), config.getPort());
+        cdcReaderService.start();


Add a log statement for the CDC reader service?

You mean a log saying the CDCReaderService started? There is such a log statement in that class: logger.info("Successfully started the CDC reader");

Maxwell-Guo · 2020-10-28T09:35:14Z

src/main/java/org/apache/cassandra/sidecar/CassandraSidecarDaemon.java

    }

    public void start()
    {
        banner(System.out);
        validate();
        logger.info("Starting Cassandra Sidecar on {}:{}", config.getHost(), config.getPort());
+        cdcReaderService.start();


Besides, do you think we should add a flag to enable or disable the CDC reader service?

Good suggestion. Let me add that, so users who don't need this can just keep it disabled.

Maxwell-Guo · 2020-10-28T09:45:23Z

src/main/java/org/apache/cassandra/sidecar/CassandraSidecarDaemon.java

    }

    public void start()
    {
        banner(System.out);
        validate();
        logger.info("Starting Cassandra Sidecar on {}:{}", config.getHost(), config.getPort());
+        cdcReaderService.start();


I also think we could add a common method, such as startInitService(), in which all such services are started.
To me, cdcReaderService is a service started at sidecar initialization, and a flag can control whether each service is started.
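The two suggestions in this thread (an enable flag and a common start method) could be combined roughly as follows. This is only a minimal sketch, not the sidecar's code: `OptionalService`, `ServiceStarter`, and `startInitServices` are hypothetical names, and each flag is passed in as a plain `BooleanSupplier` rather than read from the real `Configuration` class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical abstraction over a service started with the sidecar (e.g. the CDC reader).
interface OptionalService
{
    String name();

    void start();
}

// Hypothetical helper: registers services with an enable flag and starts only the enabled ones.
final class ServiceStarter
{
    private static final class Entry
    {
        final OptionalService service;
        final BooleanSupplier enabled;

        Entry(OptionalService service, BooleanSupplier enabled)
        {
            this.service = service;
            this.enabled = enabled;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    ServiceStarter register(OptionalService service, BooleanSupplier enabled)
    {
        entries.add(new Entry(service, enabled));
        return this;
    }

    // Starts every enabled service in registration order and returns the names that were started.
    List<String> startInitServices()
    {
        List<String> started = new ArrayList<>();
        for (Entry entry : entries)
        {
            if (entry.enabled.getAsBoolean())
            {
                entry.service.start();
                started.add(entry.service.name());
            }
        }
        return started;
    }
}
```

With something like this, the daemon's start() would call startInitServices() once, and users who do not need CDC would simply leave its flag disabled.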

Maxwell-Guo · 2020-10-28T09:58:43Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCReaderService.java

+                logger.info("Waiting for Cassandra server to start. Retrying after {} milliseconds",
+                        retryIntervalMs);
+                Thread.sleep(retryIntervalMs);
+                retryIntervalMs *= 2;


Do you think we should add a retry limit? As written, the program will keep trying to connect to Cassandra forever.
I think we could add a default retry count such as 3 with an initial retryIntervalMs of 10 ms; if the cluster is still null after 3 retries, we can throw an exception.

I modified the code:
long retryIntervalMs = 10;
int defaultRetryTime = 5;
// Declared outside the loop so the count is not reset on every iteration.
int retryTime = 0;
Cluster cluster = null;

while (cluster == null)
{
    if (this.session.getLocalCql() != null)
    {
        cluster = session.getLocalCql().getCluster();
    }
    if (cluster != null)
    {
        break;
    }
    logger.warn("Waiting for Cassandra server to start. Retrying after {} milliseconds", retryIntervalMs);
    if (retryTime++ >= defaultRetryTime)
    {
        throw new InterruptedException(String.format("Cannot connect to Cassandra after %s retries", defaultRetryTime));
    }
    Thread.sleep(retryIntervalMs);
    retryIntervalMs *= 2;
}

Not sure whether a retry limit is helpful. If the sidecar has to do something useful, it has to wait until Cassandra starts. If this throws an exception after a certain number of retries, does that mean the sidecar has to stop? Or are the other services started but not the CDC reader?

What I mean is that letting the process sleep in a loop indefinitely is not a good choice; we could instead throw an exception after some retries so users can check what is wrong. If the daemon just sleeps, users may think the process is healthy when in fact something is wrong with the Cassandra daemon and the CDC reader is only sleeping.

We should let the user know what is going on.
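One way to reconcile the two positions in this thread is a bounded, capped backoff that fails loudly instead of sleeping forever. The following is a minimal sketch under stated assumptions: `BoundedRetry` and `awaitNonNull` are hypothetical names, the probe stands in for `session.getLocalCql()`, and the choice of `IllegalStateException` is illustrative rather than anything the patch prescribes.

```java
import java.util.function.Supplier;

// Hypothetical helper illustrating a retry limit with exponential backoff.
final class BoundedRetry
{
    private BoundedRetry()
    {
    }

    // Polls `probe` until it returns a non-null value, doubling the delay after each
    // failed attempt up to `maxDelayMs`. Throws IllegalStateException once
    // `maxAttempts` probes have all returned null.
    static <T> T awaitNonNull(Supplier<T> probe, int maxAttempts, long initialDelayMs, long maxDelayMs)
            throws InterruptedException
    {
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            T value = probe.get();
            if (value != null)
            {
                return value;
            }
            if (attempt < maxAttempts)
            {
                System.err.printf("Attempt %d of %d failed; retrying after %d ms%n",
                                  attempt, maxAttempts, delayMs);
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw new IllegalStateException("Could not connect to Cassandra after " + maxAttempts + " attempts");
    }
}
```

Whether the caller then stops the sidecar or only leaves the CDC reader disabled is the open question raised in the thread; the sketch only makes the failure visible.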

Maxwell-Guo · 2020-10-28T10:11:36Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCReaderService.java

+            // to ensure CDC reader doesn't accidentally step on Cassandra data.
+            this.cassandraConfig.init();
+            // TODO : Load metadata from the CQLSession.
+            Schema.instance.loadFromDisk(false);


Why do we load from disk? All Cassandra schema information is available from the Cassandra driver: we can make the first connection to Cassandra and get it from the driver, including the table metadata we need to check whether CDC is enabled on a table.

I also saw your "TODO".

I think I can help with this TODO; I am working on it now.

Thanks Max. Added a skeleton CDCSchemaChangeListener. Go ahead with the change.

Maxwell-Guo · 2020-10-29T07:00:13Z

src/main/java/org/apache/cassandra/sidecar/cdc/output/Output.java

+ */
+public interface Output extends Closeable
+{
+    void emitChange(Change change) throws Exception;


Besides, I want to add a method like emitChange whose return type is PartitionUpdate.

What would the method signature look like in that case?

Maxwell-Guo · 2020-10-29T07:06:44Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCReaderService.java

+                }
+                for (TableMetadata tableMetadata : keyspaceMetadata.tablesAndViews())
+                {
+                    logger.info("Table : {}, CDC enabled ? {}", tableMetadata.name,


After reading the code, I think this block is unnecessary: whether CDC is enabled can be determined from the mutation in the MutationHandler. So this code is not needed, and we may not need to load the schema metadata from disk either.

Yes, this is just dummy code for interim development work. I will remove or move these to debug logs.

Maxwell-Guo · 2020-10-29T07:41:40Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCReaderService.java

+    private final CassandraConfig cassandraConfig;
+
+    @Inject
+    public CDCReaderService(CDCIndexWatcher cdcIndexWatcher, CDCRawDirectoryMonitor monitor, CQLSession session,


CDCReaderService has an @Inject constructor, but I do not see where the object is injected. The same goes for CDCIndexWatcher and CDCRawDirectoryMonitor.

To the CassandraSidecarDaemon?

Maxwell-Guo · 2020-10-29T08:20:34Z

src/main/java/org/apache/cassandra/sidecar/cdc/MutationHandler.java

+    Future<CommitLogPosition> mutationFuture = null;
+    private ExecutorService executor;
+    private CDCReaderMonitor monitor;
+    private CDCBookmark bookmark;


We could also add "private Configuration conf;" here and assign the constructor's Configuration argument to it, so that the configuration can be used.

Or we could just delete the Configuration conf input parameter.

+1, dead code, removing it.

Maxwell-Guo · 2020-10-29T08:22:31Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCBookmark.java

+    private final Configuration conf;
+
+    @Inject
+    CDCBookmark(Configuration conf)


Configuration conf is not used in this class. Will it be used in the future? The same applies to the MutationHandler class.

Maxwell-Guo · 2020-10-29T09:18:35Z

src/main/java/org/apache/cassandra/sidecar/MainModule.java

+    {
+        // TODO: Make the output type configurable
+        bind(CDCReaderMonitor.class).to(CDCReaderMonitorLogger.class);
+        bind(Output.class).to(ConsoleOutput.class);


Can we remove this method and instead use @Provides, as is done for HttpServer, Vertx, VertxRequestHandler, Router, and Configuration, so that the output can be made configurable?
Here is my code:
@Provides
@Singleton
public Output outPut(Configuration conf)
{
    String outPutClass = conf.getOutPut();
    if (!outPutClass.contains("."))
        outPutClass = "org.apache.cassandra.sidecar.common.output." + outPutClass; // I moved the output classes to the common dir
    Output output = FBUtilities.construct(outPutClass, "output");
    return output;
}

Output is an interface. Let me see whether there's a benefit to refactoring it like this.

I'm not seeing an advantage to the request, @Maxwell-Guo. The approach @tharanga took is pretty standard for a Guice binding.
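For reference, the name-resolution step in the configurable-output proposal can be sketched in plain Java without Guice or `FBUtilities`. This is an illustration only: `ClassNameResolver`, `qualify`, and `construct` are hypothetical names, and the sketch assumes the target class has an accessible no-argument constructor.

```java
// Hypothetical helper: resolves a short or fully qualified class name and instantiates it.
final class ClassNameResolver
{
    private ClassNameResolver()
    {
    }

    // Names without a '.' are treated as short names inside `defaultPackage`.
    static String qualify(String name, String defaultPackage)
    {
        return name.contains(".") ? name : defaultPackage + "." + name;
    }

    // Instantiates the class through its no-argument constructor and checks its type.
    static <T> T construct(String name, String defaultPackage, Class<T> expectedType)
    {
        String className = qualify(name, defaultPackage);
        try
        {
            Class<?> clazz = Class.forName(className);
            return expectedType.cast(clazz.getDeclaredConstructor().newInstance());
        }
        catch (ReflectiveOperationException | ClassCastException e)
        {
            throw new IllegalArgumentException(
                "Cannot construct " + className + " as " + expectedType.getName(), e);
        }
    }
}
```

A static Guice binding, as in the patch, is checked at compile time; a reflective lookup like this trades that check for runtime configurability.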

Maxwell-Guo · 2020-10-29T09:41:29Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCReaderService.java

+            this.cassandraConfig.init();
+            // TODO : Load metadata from the CQLSession.
+            Schema.instance.loadFromDisk(false);
+            this.cassandraConfig.muteConfigs();


Why do we mute the loaded configuration? Right now we only use the CDC location, so we could load the configuration once, read what we need, and set cassandraConfig to null if the concern is memory usage.

We want to ensure no code path accidentally modifies the Cassandra data. Yes, we don't intentionally do it today, but this code guarantees that never happens.

Maxwell-Guo · 2020-10-29T10:00:39Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCReaderService.java

+                    continue;
+                }
+                for (TableMetadata tableMetadata : keyspaceMetadata.tablesAndViews())
+                {


Here I think we could add a loop that checks the schema on each pass for tables with CDC enabled. If no table has CDC enabled, we sleep for a while (for example, the commit log sync period), which saves resources for users who do not use CDC.
The schema should also be refreshed on every pass (I think the Cassandra driver already does this, so we can just use it).
Once a table has CDC enabled, we move on to the next step.
My code here is:

// Ensure Cassandra config is valid and remove mutable data paths from the config
// to ensure CDC reader doesn't accidentally step on Cassandra data.
this.cassandraConfig.init();

while (true)
{
    int cdcEnableTables = 0;
    Metadata metadata = this.session.getLocalCql().getCluster().getMetadata();
    List<KeyspaceMetadata> keyspaceMetadatas = metadata.getKeyspaces();
    for (KeyspaceMetadata keyspaceMetadata : keyspaceMetadatas)
    {
        if (keyspaceMetadata == null)
        {
            continue;
        }
        for (TableMetadata tableMetadata : keyspaceMetadata.getTables())
        {
            if (tableMetadata.getOptions().isCDC())
                ++cdcEnableTables;
        }
    }
    if (cdcEnableTables != 0)
        break;
    // SLF4J placeholders are "{}", not "%".
    logger.warn("No table has CDC enabled; sleeping for {} ms", DatabaseDescriptor.getCommitLogSyncPeriod());
    Thread.sleep(DatabaseDescriptor.getCommitLogSyncPeriod());
}
this.cassandraConfig.muteConfigs();
// Start monitoring the cdc_raw directory
this.cdcRawDirectoryMonitor.startMonitoring();

This code was just added for debugging. Users can alter tables to enable CDC at any time, so the sidecar shouldn't wait. Also, whatever the user does is not visible to us without reloading metadata. This is where the CDCSchemaChangeListener is helpful.
+1 for the suggestion. When CDCSchemaChangeListener is working, we can add a guard rail to CDCIndexWatcher so it won't read commit log entries when there are no tables with CDC enabled.
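The guard rail mentioned here might look roughly like the sketch below. It is an assumption-laden illustration, not the patch's design: `CdcTableRegistry` and `GuardedCommitLogReader` are hypothetical names, the schema change listener is assumed to call `onTableChanged` and `onTableDropped`, and the actual commit log read is represented by a `Runnable`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry that a schema change listener could keep up to date.
final class CdcTableRegistry
{
    private final Set<String> cdcEnabledTables = ConcurrentHashMap.newKeySet();

    // Called when a table is created or altered; `table` is "keyspace.table".
    void onTableChanged(String table, boolean cdcEnabled)
    {
        if (cdcEnabled)
        {
            cdcEnabledTables.add(table);
        }
        else
        {
            cdcEnabledTables.remove(table);
        }
    }

    void onTableDropped(String table)
    {
        cdcEnabledTables.remove(table);
    }

    boolean hasCdcEnabledTables()
    {
        return !cdcEnabledTables.isEmpty();
    }
}

// Hypothetical guard around the commit log read step.
final class GuardedCommitLogReader
{
    private final CdcTableRegistry registry;
    private final Runnable readCommitLog;

    GuardedCommitLogReader(CdcTableRegistry registry, Runnable readCommitLog)
    {
        this.registry = registry;
        this.readCommitLog = readCommitLog;
    }

    // Returns true if the read step ran, false if it was skipped
    // because no table currently has CDC enabled.
    boolean readIfNeeded()
    {
        if (!registry.hasCdcEnabledTables())
        {
            return false;
        }
        readCommitLog.run();
        return true;
    }
}
```

Because the registry is updated from listener callbacks, enabling CDC on a table takes effect on the next read attempt without the sidecar sleeping or restarting.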

Maxwell-Guo · 2020-10-29T11:30:37Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCRawDirectoryMonitor.java

+        }
+        // TODO : Don't be someone who just complains, do some useful work, clean files older than
+        //  the last persisted bookmark.
+        this.monitor.reportCdcRawDirectorySizeInBytes(getCdcRawDirectorySize());


Don't we clean up the CDC data? I saw your TODO. Why don't we delete the CDC logs once their changes have been emitted and they are no longer needed?

We do delete commit logs after reading them : https://github.com/apache/cassandra-sidecar/pull/18/files#diff-32afd2c4bf3fe7d4c3268bec08a902cf36fb92b079d9c99fe47bafe0b073d5ceR289. However, if the output is blocked/slow for some reason (e.g. a Kafka node is down), then commit logs can pile up and eventually halt Cassandra writes. This is the default behavior, but we don't want that.

Maxwell-Guo · 2020-11-17T04:03:50Z

src/main/java/org/apache/cassandra/sidecar/cdc/CDCReaderService.java

+            // to ensure CDC reader doesn't accidentally step on Cassandra data.
+            this.cassandraConfig.init();
+            // TODO : Load metadata from the CQLSession.
+            Schema.instance.loadFromDisk(false);


I now understand why you need to load the schema from disk: the commit log reader needs the schema in order to deserialize mutations from the log.

rustyrazorblade

My biggest concern with this patch is organization, not the code itself. I spent a lot of time (months) trying to future-proof the codebase when it comes to supporting multiple versions. I left a couple of notes, but not an exhaustive review, since structural changes may significantly impact the codebase.

I'm happy to go into any of the details of the prior work I did, since it's only 1 commit old it might not be obvious where I was going with it. Let me know if there's anything I can explain in detail that'll help you get more familiar with the codebase.

rustyrazorblade · 2020-11-18T23:12:18Z

build.gradle

@@ -179,6 +179,7 @@ dependencies {

    compile project(":common")
    compile project(":cassandra40")
+    compile 'org.apache.cassandra:cassandra-all:4.0-beta2'


If you take a look at the last commit I added, I spent a lot of time trying to decouple the sidecar from using a specific version of Cassandra. Each version we decide to support can (and will) have an adapter, allowing us to maintain a single sidecar project that can work with different versions of Cassandra each of which has different implementations. There's no assurance that C* 5.0 will have the same CDC implementation as the 4.0 version. Could you please move the version specific logic into the cassandra40 subproject?

In addition, we may want to have the user point to their cassandra lib directory as well in order to not ship every version of C* with the sidecar. That will give us the flexibility for folks to use their own builds (private or public) as well as ship a smaller artifact. Since everyone has to run Cassandra I think this is a fair ask. Using a compileOnly dependency would allow us to test against each version of C* without shipping the jars.

@rustyrazorblade +1 for both suggestions. I was thinking of punting this to a future commit, but I see the work you've done at a4805a910904019698ae373ac33f88855cf67f3d. Let me refactor this code to address both points.

@rustyrazorblade +1

rustyrazorblade · 2020-11-18T23:20:56Z

src/main/java/org/apache/cassandra/sidecar/MainModule.java

+    {
+        // TODO: Make the output type configurable
+        bind(CDCReaderMonitor.class).to(CDCReaderMonitorLogger.class);
+        bind(Output.class).to(ConsoleOutput.class);


I'm not seeing an advantage to the request, @Maxwell-Guo. The approach @tharanga took is pretty standard for a Guice binding.

WIP version of the Cassandra change stream emitter.

81a7030

Maxwell-Guo reviewed Oct 28, 2020

Maxwell-Guo reviewed Oct 29, 2020

tharanga added 3 commits November 4, 2020 16:12

WIP version of the Cass schema change listener that can refresh the metadata as it changes in the server.

0642027

Adding Apache licensing headers.

b6eece2

Using the common CQLSession class.

a2ebd3d

Maxwell-Guo reviewed Nov 17, 2020

rustyrazorblade requested changes Nov 18, 2020

tharanga changed the title ~~CDC reader in Apache Cassandra Sidecar - CASSANDRASC-27~~ (WIP) CDC reader in Apache Cassandra Sidecar - CASSANDRASC-27 Nov 24, 2020
