DBZ-2514 #43
Conversation
(force-pushed 55828bc to aa65e0e)
Hey @smiklosovic, thanks a lot, in particular for doing the suggested commit reorganization. Will try to take a look asap. On a very high level, I am surprised by the high ratio of added vs. removed lines; it seems to indicate that there's quite a lot of code which isn't shared between 3.x and 4.x? Maybe this will get clearer for me when actually diving into the code. @bingqinzhou, @fuxiao224, if you have some time, can you take a look as well? I think the most interesting part for you will probably be: will it break things with Cassandra 3.x? If I understand correctly, it does not; this change does propose a significant reorganization of the connector's code base though, with two separate connectors for Cassandra 3.x and 4.x, and a core module for the parts shared between them.
The high ratio of changed lines is because almost everything from 3 went to core. It has nothing to do with 4, as 4 is just pulling stuff from core after the extraction was done. To reiterate: all common stuff for 3 and 4 was moved to core, and 3 and 4 just depend on it. There is virtually nothing more to extract; core is Cassandra version agnostic (minus a dependency on DatabaseDescriptor and some classes which need to be present in core in order to compile). I think that the current 3 impl is just a drop-in replacement of what is there now.

FYI, as I moved to the Cassandra driver of version 4, you need to use a different way to connect to a Cassandra node: the family of properties related to the connection to that node was removed, and everything is done via the configuration file mechanism in the Cassandra 4 driver (this is configurable via your properties mechanism too; you will find it if you take a closer look at the configuration). https://docs.datastax.com/en/developer/java-driver/4.2/manual/core/configuration/
I'll take a look later this week or early next week. Currently I'm a bit concerned about …
Hi guys, no offence, but regardless of the other changes this branch incorporates, once I started with the extraction I very soon realised that the previous code base was not prepared for that extraction at all. It was not "elastic" enough to just move the code here and there; I had to figure out how to make it happen, because things were just wired in there without any significant effort that would make such an extraction easier in the future. So yeah, that is why there are a lot of changes, but they are truly mostly about imports and reorganisation of the code to make this happen. By the way, the 119 files changed in the first commit are literally just about moving files from one directory to the other; what's the deal there? All the real changes are in the second commit instead.

I have changed the driver version from 3 to 4 (details are in the ticket). There were also some problems with mappers when one wanted to reuse them across Cassandra versions 3 and 4 (this is very low level and I could go into it if you want). I have summarised all the details in the ticket itself in a very elaborate manner, so I do not see why I should repeat it all here, honestly. Everything is there, in detail. Next, it seems that the code was not / is not prepared for better unit testing. One reason I just didn't provide …

In my opinion, the outcome of this set of changes should be whether we agree on the general approach I took, and if we do, I will need to focus more on the testing part and on implementing the Cassandra 4 way of handling things. Yes, this is more about Cassandra 3 and Cassandra 4 than about supporting Java 11, but I do not see any problem in having this effort put together. I can definitely investigate how to make the Cassandra 3 / 4 impls build / run with Java 11. If you think the ticket does not fit the changes (or the other way around), why don't we create a new ticket for Cassandra 4 support and put these changes under it, if that is your concern? Thanks
(force-pushed 422ab4d to 8bbc8e6)
For people who use the latest version of this connector, will they be able to run it on a Cassandra 3.x cluster or not?
Hi @bingqinzhou, the cassandra-3 impl should be a drop-in replacement; the only user-facing change is that this is on the DataStax Cassandra driver of version 4.11.1. That means that the configuration of the connection is not done in the Debezium properties file but in a dedicated file the driver can understand. A new configuration property was introduced which points to the file path of the driver configuration where everything is put. Example configuration, this is
Reference: https://docs.datastax.com/en/developer/java-driver/4.2/manual/core/configuration/reference/
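For illustration only, a driver-4 configuration file (HOCON format, per the DataStax reference linked above) might look roughly like this; the contact point, datacenter name, and credentials are made-up placeholder values:

```hocon
datastax-java-driver {
  basic {
    # Placeholder node address and datacenter -- replace with your cluster's values.
    contact-points = [ "127.0.0.1:9042" ]
    load-balancing-policy.local-datacenter = "datacenter1"
  }
  advanced.auth-provider {
    # Placeholder credentials for illustration only.
    class = PlainTextAuthProvider
    username = cassandra
    password = cassandra
  }
}
```

The connector's new property then just points at the path of such a file, so all connection tuning lives in the driver's own configuration mechanism rather than in the Debezium properties.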
Hey @bingqinzhou, do you think you folks could take a closer look at this after the holidays? Would be great if we could converge on the path forward here for the upcoming 1.9 release (Alpha1 due some time in January).
Can you link the ticket you mentioned which includes the full context about this PR? I read your comments above but they don't answer the following questions clearly:
This PR includes a major refactor of the original code base, which I don't think is necessary at this moment. If I understand the situation correctly, the goal of this PR is to upgrade the Cassandra driver version for this connector. In that case, I only want to keep the code changes which are closely related to the version upgrade. Currently, many of us have been running this connector stably in PRD for a while. Major refactoring of the code base brings the risk of breaking the original functions or debugging patterns for us. If you feel it's necessary to make such a major refactor, please create a separate ticket and PR for us to discuss later.
Hi @bingqinzhou, I was on leave the whole of January.
No, the goal is not to update the Cassandra driver. You have that part wrong. The whole philosophy of how commit logs are processed was changed in Cassandra 4, and this needs to be accommodated; the internals of Cassandra 4 are simply different from those of Cassandra 3, so the existing code is just not applicable to Cassandra 4 anymore, and we need to make a distinction by having two modules. More to it: in order to support a Cassandra 4 Debezium connector, one needs to use the Cassandra driver of version 4. But since the internals of this plugin are so hard-wired, that basically forces us to use the driver of version 4 for Cassandra 3 as a side effect.

I think we are moving in circles here; I have explained what I did with the refactoring, and it is done like that. This change introduces a refactoring into cassandra-3 and cassandra-4 modules, with one core module both implementations depend on. You cannot do it in such a way that one module would cover both Cassandra versions. That is just not possible. So we need two modules for that. If you do not want to proceed with this, feel free to reject it and we will implement it on our own. I have a feeling we are actually doing you a favor, and if no route is viable ... well, what should we do?
Hi @ALL, personally, I have started testing the new implementation from the upstream and hope that it will be discussed and merged soon here into the main project. Cassandra 4 was released a few months ago and a lot of users have already started to upgrade to this new version; for those who use Debezium, it's very important to have a new version that supports Cassandra 4. Cheers
BTW, I don't understand why this project cannot just branch off the current main to support the original plugin if there is a concern about a "too big change". What is wrong with taking care of 1.7.0 or what have you, while having a plugin of version 2.0.0 where this refactoring would take place? I just don't consider the argument "this is too big a change" relevant. That is what Git is for. It is not like people cannot go back to the older plugin and patch things if they find it necessary. Nobody prevents people from checking out the current code and applying fixes. So if the older plugin works for you, let's just stick with that one, I guess? One just does not have to use "the latest and the greatest".
@gunnarmorling @bingqinzhou We, like you @bingqinzhou, have Debezium that works well on production with Cassandra-3 but with this version of connector we will not have the "chance" to plan an upgrade to Cassandra-4 soon... |
Hey all, sorry for the silence on my end. I need to dive into this again, as it has been a while. I'll get back to this by the end of this week. From my PoV, having a) support for Cassandra 4.x is a top priority, b) not breaking the existing 3.x support is equally important, and c) the ability to build with Java 11 is also critical but a bit less important than a) and b). I'll reply with some more thoughts on possible branching/versioning strategies once I've looked more closely and hopefully remembered the details. What I can say is that I'd love to have support for Cassandra 4.x in Debezium 1.9 (i.e. the current minor version that's being worked on and which is due by end of Q1). |
Hi @smiklosovic, thanks for the explanation. I'm not able to test the change at this moment, since we haven't upgraded and don't have a plan to upgrade to Cassandra 4.x yet. But if @ahmedjami has tested it, I'm good with merging this PR if it looks good to @gunnarmorling as well.
I am cooperating closely with @ahmedjami to get it merge-worthy. I'll keep you guys informed. Thank you very much.
Hey @smiklosovic, so I'm trying to build this locally, but I'm getting several exceptions, indicating ZooKeeper is started more than once ("java.net.BindException: Address already in use"). The CI build here is failing too (see above). Could you rebase this PR to the current main branch, resolving the merge conflict in the POM too? I'll comment in a sec on some things in the POM which will be needed to make this work with the latest main core (1.9.0-SNAPSHOT). Thanks! |
@smiklosovic, some minor comments inline for now, just to make that thing build with the current version of Debezium core:
- Use Debezium 1.9.0-SNAPSHOT
- Use Logback for testing instead of log4j 1.2.x
You can take a logback config from the connectors in the debezium-core repo as a template for changing the log config. Thx!
@gunnarmorling interesting, are you sure there isn't another ZooKeeper running? I am just building it fine locally. I'll update it to 1.9-SNAPSHOT.
LOL, yes, I do have another ZK running indeed. Not sure why I didn't check that. Thx for pointing it out :) Never mind, I see we test this differently than the other connectors, by starting AK and ZK in embedded mode. |
Yeah, I am not sure I am doing that right, but my idea was to have everything embedded (Kafka, ZooKeeper, Cassandra), so you do not need to run any helper Docker containers or so ...
Yeah, so I think that's ok for now. Eventually, I'd prefer though if this connector also used containers for starting up Cassandra, similar to what we do for all the other connectors. It'll avoid issues like we currently have with the Cassandra 3.x connector, which can only be built with JDK 1.8 as it runs Cassandra 3.x in embedded mode. Assuming the driver itself can be used with later Java versions, we could move Debezium entirely beyond Java 1.8 by running Cassandra in a container (which then would have Java 1.8, but fully isolated from Debezium). But we can take that separately, so as not to further increase the scope of this change, which is big already.
@bingqinzhou, @smiklosovic, @ahmedjami, some thoughts on the overall strategy here:
This all is to say I personally don't see any strong reason for not moving forward with this change (though I still need to dive a bit further into the code changes, and I may have some suggestions for adjustments); on the contrary, it's an important next step for this connector. As said, getting some testing of the reworked 3.x connector would be great for increased confidence. Thanks, all! |
(force-pushed 8bbc8e6 to 5dead1b)
I am letting you know that driver 4.14.0 was released with the fix. Is there any movement on this matter from your side?
I tested it with Cassandra-4 and I confirm that it works :) DBZ detects new CDC tables when it's already running. I still have a few more tests on staging env to do next week then I think @gunnarmorling can merge this PR. Greetings, |
Hi @smiklosovic @gunnarmorling
The PR can be merged :) @bingqinzhou, in our case, the delay with Cassandra 3.x is about 10 min, since we force a … Greetings
Ok, cool. So I'll take another look and I think we'll have this in for the CR1 release due in three weeks (Beta1 is due tomorrow, and I won't find the time before then to merge it), so it should be part of the 1.9 release train. Thanks a lot for reporting back, @ahmedjami! |
@ahmedjami Great news! Thank you for confirming this :) |
@gunnarmorling any progress here? :) |
Hi @gunnarmorling, |
Hey, yes, it should.
Rebased locally and applied. Huge thank you for this massive piece of work, @smiklosovic! Really cool to see that Debezium does support Cassandra 4.x with that now. Also thanks a lot for your patience with this, it took quite a while for getting this in (apologies for any delays caused by me). I will log a follow-up issue for reworking the 3.x tests so that they use Cassandra in a container, allowing us to move the build of all the Debezium connectors to Java 11. |
Wow, this is great!
Hi @ahmedjami ! Thanks for sharing this! May I know why we still need to force run |
Hi @fuxiao224, please read this blog post: https://cassandra.apache.org/_/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.html As it explains: "Cassandra can only delete a segment after all its mutations are persisted in SSTables. Knowing if a file does not hold any mutations that haven't been flushed yet requires a bit of bookkeeping." Deletion of a segment = saving the commit log to the cdc_raw dir. All mutations persisted in SSTables = everything is flushed. Since a commit log contains mutations from all tables, flushing one particular table to an SSTable does not mean that there will not be other mutations in the commit log which are not flushed yet. I think that you might achieve what you want if you set that …
Hi, you have to check that you inserted sufficient data, so that the commitLog will be discarded and the index file (<segment_file>_cdc.idx in your cdc_raw directory) will contain the word COMPLETED. So we need a completed commitLog before seeing new messages in the Kafka topic :)
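To make the idx-file check above concrete, here is a minimal, self-contained sketch (not Debezium's actual code) of inspecting a `<segment_file>_cdc.idx` file, assuming the format described in the Cassandra CDC docs: a byte offset on the first line, optionally followed by the literal COMPLETED:

```java
import java.util.List;

public class CdcIdxParser {

    // The cdc idx file holds the byte offset up to which the commit log
    // segment has been synced; once the segment is final, the word
    // COMPLETED appears on the following line.
    static long syncedOffset(List<String> idxLines) {
        return Long.parseLong(idxLines.get(0).trim());
    }

    static boolean isCompleted(List<String> idxLines) {
        return idxLines.size() > 1 && "COMPLETED".equals(idxLines.get(1).trim());
    }

    public static void main(String[] args) {
        // Simulated file contents; in reality you would use Files.readAllLines(path).
        List<String> inProgress = List.of("4096");
        List<String> finished = List.of("1048576", "COMPLETED");

        System.out.println(syncedOffset(inProgress)); // prints 4096
        System.out.println(isCompleted(inProgress));  // prints false
        System.out.println(isCompleted(finished));    // prints true
    }
}
```

Grepping for that COMPLETED line in cdc_raw is an easy way to confirm a segment has actually been handed over for CDC processing.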
Hi @ahmedjami, thanks again for sharing! I wonder what the current delay is after removing the …
It depends on traffic and how much time the commitLog takes to reach the limit defined by the parameter commitlog_segment_size: https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L544 Note that you could also change the value of this parameter, In our test environment we set it to 8M since we have less traffic than the production. |
I haven't tested it yet but this new parameter seems to be useful for you as DBZ shouldn't wait for the COMPLETED into the index file and process commitLog in "real time" : https://github.com/debezium/debezium-connector-cassandra/blob/main/core/src/main/java/io/debezium/connector/cassandra/CassandraConnectorConfig.java#L310 |
Thank you very much @ahmedjami ! The info ^ is very helpful. I'll run some tests and dig more into it. |
This is not completely true. It works differently.
This is not what it does; there is this work merged: #69

What it does is that it does not wait for COMPLETED to be there. Instead, Cassandra updates the cdc index file and writes there the position in the file where the mutations are. And then, after some time when new mutations are added, it again updates that position. So what we did was start to parse this index file, and as soon as the position changed, we read from the last position to the new one (you can seek). So we are basically reading this near-real-time. We read it every 10 seconds by default:

This number should not be abused by setting it to something like 1 second. There is some performance penalty in seeking so often; 10s is a nice compromise. So, as soon as your data comes to the commit log and the index file position changes, we scan the new mutations no later than 10 seconds after that.

What you described is what we did initially. But there was always this room for improvement, and I am glad somebody implemented it. You do not need to call flush with this nearly-real-time feature. Every single mutation goes to the commit log first, then to a memtable. If you flush, the memtable goes to an SSTable and the commit log is "changed" so it knows which mutations were flushed. Flushing in a lot of cases writes all data to SSTables in such a way that the cdc file is eligible for deletion, but that is not always the case. If you want to process mutations in a more realtime-ish way, you should try the stuff I posted.
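The seek-from-last-position idea can be sketched like this. This is a self-contained illustration with a temp file standing in for a commit log segment and plain strings standing in for mutations; it is not the connector's actual implementation:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NearRealTimeRead {

    // Read only the bytes appended between the previously seen idx offset
    // and the newly reported one, instead of re-reading the whole segment.
    static String readDelta(Path segment, long lastOffset, long newOffset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(segment.toFile(), "r")) {
            raf.seek(lastOffset); // skip mutations we already processed
            byte[] buf = new byte[(int) (newOffset - lastOffset)];
            raf.readFully(buf);
            return new String(buf);
        }
    }

    public static void main(String[] args) throws IOException {
        Path segment = Files.createTempFile("CommitLog-7-1", ".log");
        Files.writeString(segment, "mutation-1;");
        long firstIdxOffset = Files.size(segment);   // offset after the first sync

        Files.writeString(segment, "mutation-2;", StandardOpenOption.APPEND);
        long secondIdxOffset = Files.size(segment);  // offset after the next sync

        // Only the newly appended data is read.
        System.out.println(readDelta(segment, firstIdxOffset, secondIdxOffset));
    }
}
```

Because each poll only reads the delta since the last recorded offset, the cost per poll stays proportional to new traffic, which is why a 10-second interval is a reasonable default.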
Hey @smiklosovic @fuxiao224 Also there is debezium documentation around this feature at: https://debezium.io/documentation/reference/stable/connectors/cassandra.html#reading-the-commitlog |
Hi @smiklosovic, what I understand is that this new implementation (real-time processing) works only when … How does DBZ work without the real-time option? Is it reading the index file every 10s, waiting for COMPLETED before processing the complete commitLog? Or am I missing something? Thanks again for explaining ;)
As I described above already: it does not wait for COMPLETED to be there. Instead, Cassandra updates the cdc index file and writes there the position in the file where the mutations are. And then, after some time when new mutations are added, it again updates that position. So what we did was start to parse this index file, and as soon as the position changed, we read from the last position to the new one (you can seek). So we are basically reading this near-real-time. We read it every 10 seconds by default:
Thanks @smiklosovic for your reply. Is what you described also what DBZ does with DEFAULT_COMMIT_LOG_REAL_TIME_PROCESSING_ENABLED = false? We did not change this parameter, and we thought that we had not enabled the real-time processing.
It is false by default, so it acts as previously. You need to set it to true to enable this "realtime-ish" behavior. As to "How does DBZ work without the real-time option? Is it reading the index file every 10s, waiting for COMPLETED before processing the complete commitLog?": yes, this is true. Only if it is set to true do you get that realtime behavior.
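If I read the linked CassandraConnectorConfig correctly, enabling the behavior is a single connector property. The property key below is my guess derived from the constant name DEFAULT_COMMIT_LOG_REAL_TIME_PROCESSING_ENABLED, so please verify it against the source before relying on it:

```properties
# Hypothetical key derived from the constant's name -- check
# CassandraConnectorConfig for the authoritative property name
# and for the companion poll-interval setting.
commit.log.real.time.processing.enabled=true
```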
Hi @smiklosovic @ahmedjami |
@jpechane ^^^^ |
Hi @smiklosovic ! |
@fuxiao224 Please use Debezium chat and mailing list. |