Implement cdc backfill skeleton #109
Conversation
Sample run
Nice work.
I think we can commit this patch as an initial step.
I have a couple of questions:
OSS license
Can anyone use this tool, or do they need a license for DSBulk?
Cassandra compatibility
We must support C3, C4, and DSE. IIRC you mentioned that we need different dependencies. That's not a big deal, but do you know if DSBulk will work with all the supported versions?
Dry run mode
We should add a "--dry-run" mode in which we read from Cassandra without writing to Pulsar. A sketch follows.
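A minimal sketch of what such a mode could look like, reusing the read pipeline that appears later in this patch (`connector` and `Resource` are the patch's own types; the flag wiring is omitted and the output message is illustrative):

```java
// Hedged sketch: in --dry-run mode, read and decode records exactly as in a
// real run, but only count them instead of handing them to the Pulsar sender.
long wouldSend = Flux.from(connector.read())
        .flatMap(Resource::read)
        .count()
        .block();
System.out.println("Dry run: would have sent " + wouldSend + " records to Pulsar");
```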
Disk space and retention
We should make it clear that we are storing some files locally, for two reasons:
- The local machine must have enough space
- The user MUST remember to delete everything, because it may contain sensitive data
I suggest adding a flag to auto-delete the data at the end of the procedure (and enabling the feature by default).
What happens if you run the tool on a non-empty directory and you have data from other tables?
```java
//String md5Digest = DigestUtils.md5Hex(dataOutputBuffer.getData());
TableMetadata.Builder builder = TableMetadata.builder(exportedTable.keyspace.getName().toString(), exportedTable.table.getName().toString());
exportedTable.table.getPartitionKey().forEach(k -> builder.addPartitionKeyColumn(k.getName().toString(), UTF8Type.instance));
TableMetadata t = builder.build();
```
We can cache this as follow-up work.
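A possible shape for that follow-up (a minimal sketch, assuming the `exportedTable` type from this patch; the field and method names are illustrative, not the final design):

```java
// Memoize the rebuilt TableMetadata per exported table so it is not
// reconstructed for every record. The key format is an assumption.
private final ConcurrentHashMap<String, TableMetadata> metadataCache = new ConcurrentHashMap<>();

private TableMetadata metadataFor(ExportedTable exportedTable) {
    String key = exportedTable.keyspace.getName() + "." + exportedTable.table.getName();
    return metadataCache.computeIfAbsent(key, k -> {
        TableMetadata.Builder builder = TableMetadata.builder(
                exportedTable.keyspace.getName().toString(),
                exportedTable.table.getName().toString());
        exportedTable.table.getPartitionKey().forEach(
                col -> builder.addPartitionKeyColumn(col.getName().toString(), UTF8Type.instance));
        return builder.build();
    });
}
```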
+1. Will send a separate patch with Pulsar-ready support.
```java
AgentConfig config = AgentConfig.create(AgentConfig.Platform.PULSAR, tenantInfo);
PulsarMutationSender sender = new PulsarMutationSender(config, true);
List<CompletableFuture<?>> futures = new ArrayList<>();
```
Follow-up work: we have to limit the size of this list and also limit the number of pending writes.
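One way to do that (a sketch only; the permit count is an arbitrary assumption) is to gate each send on a `Semaphore`, which bounds the number of pending writes and, indirectly, how fast the futures list can grow:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Bound the number of in-flight sends; 1024 is an illustrative limit.
final Semaphore pendingWrites = new Semaphore(1024);

CompletableFuture<?> boundedSend(Supplier<CompletableFuture<?>> send) throws InterruptedException {
    pendingWrites.acquire();                                  // blocks once the limit is reached
    return send.get().whenComplete((r, e) -> pendingWrites.release());
}
```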
+1. Will send a separate patch with Pulsar-ready support.
```java
List<CompletableFuture<?>> futures = new ArrayList<>();
long c = Flux.from(connector.read()).flatMap(Resource::read).map(record -> {
    //System.out.println("Sending to pulsar " + record.toString());
    Object[] pkValues = new Object[]{record.getFieldValue(new DefaultIndexedField(0))};
```
It is not clear to me how we deal with composite PKs and UDTs.
I'll verify; my assumption is that by reusing the CSVConnector from DSBulk, it handles those types.
Please note that it is discouraged to use a UDT as a PK, but I'll verify that it is supported: https://www.datastax.com/blog/cql-improvements-cassandra-21
> Please note however that there is relatively little advantage to be gained in using a UDT on a PRIMARY KEY column; avoid abusing such possibility just because it's available.
On second thought, I'll rely completely on the agent code for the conversion:

cdc-apache-cassandra/agent-c4/src/main/java/com/datastax/oss/cdc/agent/PulsarMutationSender.java, line 59 in a03137c:

```java
public class PulsarMutationSender extends AbstractPulsarMutationSender<TableMetadata> {
```

The input will be the CQL types as I read them from the TableMetadata of the source table (I have this info outside the CSV file on disk), and I will use the agent to map them to Avro. I expect to add some adapter logic to map the CQL types between what I get from the TableMetadata and what the agent logic gets from the CommitLog processor, but it should be feasible.
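For illustration only, such an adapter could look roughly like this; the method name and the set of types covered are assumptions, and the real mapping would have to cover the full CQL type system:

```java
import org.apache.cassandra.db.marshal.*;

// Hypothetical adapter: translate a driver-side CQL type name (read from the
// exported table's metadata) into the server-side AbstractType the agent's
// Avro conversion expects. Coverage here is deliberately partial.
static AbstractType<?> toServerType(String cqlTypeName) {
    switch (cqlTypeName.toLowerCase()) {
        case "text":
        case "varchar":   return UTF8Type.instance;
        case "int":       return Int32Type.instance;
        case "bigint":    return LongType.instance;
        case "uuid":      return UUIDType.instance;
        case "timestamp": return TimestampType.instance;
        default: throw new IllegalArgumentException("Unmapped CQL type: " + cqlTypeName);
    }
}
```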
Yes. DSBulk switched to Apache License v2 a few years ago, which permits distribution of the software.
Yes, DSBulk supports DSE and C*2.1 and later. That covers the table export part. There is also C*/DSE version-dependent logic which translates the schema of the PK to Avro before sending records to the events topic. That part will be covered by reusing the agent code. In this patch I hardcoded it to use DSE. With some refactoring, we should be able to fully reuse the agent code, so the net result is that C3, C4, and DSE will be supported.
+1. I'll capture this requirement.
+1. Will add to the README. Please note that we can add a flag to automatically delete the local snapshot after the backfilling completes, but we need to highlight that enough disk space should be available. We could consider an optimization later on: send data in batches, delete each batch from disk, then continue with the next one, though that could be over-engineering for now.
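A minimal sketch of that cleanup step (the method name and path handling are illustrative):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Recursively delete the local snapshot directory once backfilling completes.
static void deleteSnapshotDir(Path snapshotDir) throws IOException {
    try (Stream<Path> walk = Files.walk(snapshotDir)) {
        walk.sorted(Comparator.reverseOrder())   // children before parents
            .forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
    }
}
```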
Yes!
Each table will be stored in its own directory following a keyspace/tablename hierarchy. If the user has data in a directory with the same keyspace and table name, but the files don't actually belong to that table, the tool will not know. If the PK mismatches, publishing to the events topic will fail, but otherwise the tool cannot tell that the data has been manipulated. One way to mitigate this would be to generate a random folder name and a new snapshot every time the tool is run, but that would complicate the ability to resume work after the tool is paused, because it would require a manifest file (and the user could still manipulate that file). Encryption is another thing we can consider, but for V0 I think the existing model suffices.
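For reference, the random-folder mitigation described above amounts to something like this (an illustrative sketch under those assumptions, not what the patch does):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Give each run its own random snapshot root so stale or foreign data under
// keyspace/tablename from a previous run can never be picked up by mistake.
static Path newTableDir(Path baseDir, String keyspace, String table) throws IOException {
    Path runDir = baseDir.resolve("snapshot-" + UUID.randomUUID());
    Path tableDir = runDir.resolve(keyspace).resolve(table);
    Files.createDirectories(tableDir);
    return tableDir;
}
```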
Initial skeleton for the CDC backfill CLI tool: