DBZ-607 Add Cassandra connector to debezium-incubator #98
The Cassandra connector is an implementation of change data capture for Cassandra. It is intended to be deployed on each Cassandra node, acting as an agent that parses the commit logs in the cdc_raw directory and generates change events to Kafka (a rough sketch of this loop follows below).
Note: currently it does not handle deduplication, ordering of events, or hydration -- guidelines will be provided on how each case should be handled.
Detailed documentation will be updated/linked in the following days/weeks.
Update: see debezium/debezium.github.io#325 for documentation (still a WIP as well).
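For context, here is a minimal sketch of the agent loop described above, not the actual implementation: it assumes the default cdc_raw location, a local Kafka broker, a hypothetical topic name cassandra.changes, and leaves the actual segment parsing as a placeholder (the real connector deserializes the commit log mutations, e.g. via Cassandra's own reader classes).

```java
import java.nio.file.*;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CdcRawAgent {

    public static void main(String[] args) throws Exception {
        // Default cdc_raw location; adjust per node.
        Path cdcRawDir = Paths.get("/var/lib/cassandra/cdc_raw");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             WatchService watcher = FileSystems.getDefault().newWatchService()) {

            // Watch for commit log segments that Cassandra hands off to cdc_raw.
            cdcRawDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

            while (true) {
                WatchKey key = watcher.take(); // block until a new segment shows up
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path segment = cdcRawDir.resolve((Path) event.context());
                    for (String changeEvent : parseCommitLogSegment(segment)) {
                        producer.send(new ProducerRecord<>("cassandra.changes", changeEvent));
                    }
                    // cdc_raw has a bounded size, so processed segments must be
                    // cleaned up; whether to delete or move them is a design choice.
                    Files.delete(segment);
                }
                key.reset();
            }
        }
    }

    // Placeholder for the actual commit log deserialization.
    private static Iterable<String> parseCommitLogSegment(Path segment) {
        throw new UnsupportedOperationException("placeholder");
    }
}
```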
Wow, this is freakin' awesome, @jgao54! I'm diving into this right now.
Some very high-level questions upfront:
Thanks a lot! I'll get back to you with more feedback in a bit.
gunnarmorling left a comment
Thank you so much @jgao54, this is great work!
I did a first review pass and added some comments inline. So far it's mostly related to aligning this with existing concepts/code from the debezium-core module. It seems there are some pieces which are very similar to existing code (e.g. the
I'll keep diving into this, probably coming back with some more questions related to the Cassandra specifics once I've understood them better :)
Really excited about this and am looking forward to this connector very much!
@jgao54, on the first two: would it be an option to set up one node in the Cassandra cluster so that it receives the changes from all tables and then set up the agent just on this single node? Just throwing out the idea -- I don't have much experience with Cassandra, so it might not be doable.
And WDYM by "hydration" in this context? Thanks!
Cool. Also sent a PR against your fork with some tiny fixes related to dependencies.
I also believe now I get what you mean by "hydration": recovering the full row state if there are only partial update events (only containing affected columns). I think ideally that'd just be done on the sink side. E.g. SQL naturally would let you update the given columns only, although I just learned that apparently the existing JDBC sink connector always does full row updates. Elasticsearch would be another case AFAIK.
IIRC I saw the idea of using KStreams as a solution to recover full state if not supported by sinks themselves; I could definitely see that being one tool in the box. Interestingly, the same question just recently came up for MySQL (via non-full binlog row mode). So this would be another case benefitting from this. In any case it's a separate discussion, but I wanted to get out my thoughts around it :)
@gunnarmorling addressed your questions inline.
Given that a lot of the work involves replacing custom classes with DBZ classes, I will create subsequent PRs to address each of them.
Also RE: comments above:
Odd, all tests passed for me. What is the error you are seeing?
Right now it is hard-wired, yes. I think it would be nice to use the Schema class in Kafka Connect to support various serializations, as long as it's not too coupled with the rest of Kafka Connect. I haven't looked into it in detail (see the sketch after these replies).
See inline comment.
Not right now; tests have been done manually so far. Definitely going to add some integration tests down the road.
Thanks! I'll make the change.
Yep, that's what I meant.
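Sketching the idea from the reply above about Kafka Connect's Schema class: a minimal, hypothetical example of describing a row with SchemaBuilder and Struct, so that a pluggable Connect converter (JSON, Avro, ...) can take over the actual serialization. The keyspace, table, and field names here are made up.

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;

public class SchemaSketch {

    public static void main(String[] args) {
        // Hypothetical value schema for a two-column table.
        Schema valueSchema = SchemaBuilder.struct()
                .name("test_keyspace.test_table.Value")
                .field("id", Schema.INT64_SCHEMA)
                .field("name", Schema.OPTIONAL_STRING_SCHEMA)
                .build();

        // A change event value carrying the row's columns, independent of the
        // wire format; a converter handles the serialization downstream.
        Struct value = new Struct(valueSchema)
                .put("id", 42L)
                .put("name", "example");

        System.out.println(value);
    }
}
```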
This is interesting. I think one tricky part is where to put the state store. Ideally it shouldn't be local, because that's the same instance Cassandra runs on. If it's remote, reads could be slow, which could potentially lead to a growing backlog.
One thing @criccomini and I were discussing today was that in Cassandra 4.0 there is a new consistency level called
Yes, so that's what you'd get with Kafka Streams. The connectors would write the data to Kafka topics, Kafka Streams would read from there, hydrate the complete state and write back to another Kafka topic. State stores would be local to the Kafka Streams node(s) by default via RocksDB.
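To make that topology concrete, here is a hedged sketch. It assumes String-serialized events, hypothetical topic names, and a placeholder merge() for folding partial updates into the full row state; the state store lands in a local RocksDB instance on the Kafka Streams node by default, which is exactly the point of moving hydration off the Cassandra hosts.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class HydrationTopology {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cassandra-hydrator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("cassandra.events.partial")
               .groupByKey()
               .aggregate(
                   () -> "{}",                                      // empty row state
                   (rowKey, partial, full) -> merge(full, partial), // fold each partial update in
                   Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("row-state"))
               .toStream()
               .to("cassandra.events.hydrated");

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder: merge the changed columns of a partial event
    // into the accumulated full row state.
    private static String merge(String full, String partial) {
        throw new UnsupportedOperationException("placeholder");
    }
}
```

Sinks that can't apply partial updates would then consume the hydrated topic instead of the raw one.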