changefeedccl: consider adding additional sinks via go-cloud/pubsub #36982
This would be super useful, as we think Google Pub/Sub would be enough for our microservices infrastructure. Mainly because:
Do you think this is something we could push forward?
+1 for this
@danhhz Would you mind taking a look at the GCP pubsub ordering documentation to confirm whether it'd be possible to reliably use pubsub to materialize a table sent through CDC? It doesn't appear that pubsub natively provides any ordering guarantees, but they have a metric that the consumer can monitor to determine how caught-up a pubsub feed is. I think this means that you could buffer data until that metric advances, then reorder the data based on updated/resolved timestamps and process it in strongly-ordered batches. Do you agree? Or are there subtleties that I'm missing that would prevent this from working as expected? (To be clear, this question is specifically about the GCP pubsub product, not the go-cloud pubsub library, which provides an interface to various cloud and self-hosted messaging systems.)
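The buffer-and-reorder idea described above can be sketched as follows. Everything here (the `event` struct, integer timestamps, the resolved-message trigger) is illustrative and simplified, not CockroachDB's actual changefeed types:

```go
package main

import (
	"fmt"
	"sort"
)

// event is a simplified CDC message: either a row update or a
// resolved-timestamp marker. These names are illustrative only.
type event struct {
	key      string
	ts       int64 // updated timestamp (an HLC, simplified to an int)
	resolved bool  // true if this is a RESOLVED message
}

// reorderBuffer holds out-of-order deliveries until a resolved
// timestamp proves everything at or below it has arrived, then emits
// those rows sorted by timestamp as a strongly-ordered batch.
type reorderBuffer struct {
	pending []event
}

func (b *reorderBuffer) add(e event) []event {
	if !e.resolved {
		b.pending = append(b.pending, e)
		return nil
	}
	// A resolved timestamp: everything at or below e.ts is complete.
	var ready, rest []event
	for _, p := range b.pending {
		if p.ts <= e.ts {
			ready = append(ready, p)
		} else {
			rest = append(rest, p)
		}
	}
	b.pending = rest
	sort.Slice(ready, func(i, j int) bool { return ready[i].ts < ready[j].ts })
	return ready
}

func main() {
	var b reorderBuffer
	b.add(event{key: "a", ts: 3})
	b.add(event{key: "b", ts: 1}) // delivered out of order
	for _, e := range b.add(event{resolved: true, ts: 3}) {
		fmt.Println(e.key, e.ts)
	}
}
```

This omits the hard part the thread goes on to discuss: knowing when a resolved message can actually be trusted to cover all earlier deliveries.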
I think you'd need some way to tie that "how caught-up a pubsub feed is" metric back to our resolved timestamps. Basically, when some consumer sees a resolved timestamp message, it needs a way to wait for every message before that one to have been provably flushed out of the pubsub. To do this with the metric you mention (an age in seconds), we'd need a way for a pubsub consumer to compute

That said, I have two thoughts here.
(Interestingly, all of the above metadata is already now present for the cloud storage sink.) cc @ajwerner any thoughts on above? ^^
Sounds right to me.
Technically this is true given the complete lack of ordering guarantees, but I suspect that in the common case the amount of state which needs to be tracked will be quite small. Most pub/sub systems provide some sort of latency expectation, so I think it's reasonable to treat these systems as offering at least something of a synchrony guarantee: I don't expect a message to take 5 days to arrive and then eventually arrive. Maybe a message doesn't arrive at all; that'd be a problem, but it would be a problem for all of the sinks.

The state tracking is sort of a pain. It effectively boils down to keeping track of the frontier, except you probably need to do it in a distributed way. It wouldn't be too bad to do if the client were talking to a scalable, consistent SQL database in the course of processing the message.

I think it's a good idea to play around with implementing the GCP sink on top of the generic interface with a prototype, maybe side-stepping the guarantees on a first pass just to get a feel for it. Then we should add the necessary metadata to do frontier tracking and verify that we can make it work. I agree with your assessment about the additional metadata it would take, and that we already have it. My guess is that while the code is non-trivial, the amount of state you practically need to maintain isn't going to be that large in use cases that don't have high throughput.
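A minimal, single-process sketch of the frontier tracking described above (a real version would need to be distributed and durable, as the comment notes; the span names and integer timestamps are placeholders):

```go
package main

import "fmt"

// frontier tracks, per key/span, the highest timestamp seen, and
// reports the minimum across all of them -- the point up to which the
// feed is provably complete. In-memory sketch only.
type frontier struct {
	byKey map[string]int64
}

func newFrontier(keys ...string) *frontier {
	f := &frontier{byKey: map[string]int64{}}
	for _, k := range keys {
		f.byKey[k] = 0
	}
	return f
}

// forward advances a single key's timestamp; it never moves backward.
func (f *frontier) forward(key string, ts int64) {
	if ts > f.byKey[key] {
		f.byKey[key] = ts
	}
}

// resolved returns the minimum timestamp across all tracked keys.
func (f *frontier) resolved() int64 {
	first := true
	var min int64
	for _, ts := range f.byKey {
		if first || ts < min {
			min, first = ts, false
		}
	}
	return min
}

func main() {
	f := newFrontier("span1", "span2")
	f.forward("span1", 10)
	fmt.Println(f.resolved()) // span2 has seen nothing yet, so still 0
	f.forward("span2", 7)
	fmt.Println(f.resolved()) // now bounded by span2's 7
}
```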
Quoting @glerchundi:
Quoting @danhhz:
He's right, there are a lot of use cases where you don't need ordering, just an "at least once delivery" guarantee. Thanks for pushing this forward @rolandcrosby
Yeah, I think you're right that I'm being unfairly pessimistic here. The worst case is unbounded but that's certainly not the common case.
I had the same thought! 😆 I guess by definition you do have access to one.
Agreed, I also think these are the next steps. The library mentioned in the issue even has an in-mem implementation, so maybe it's easy to hook up to testFeed.
Quoting directly from the Pub/Sub documentation:
https://cloud.google.com/pubsub/docs/subscriber

So there would be a couple of options here:
This is a WIP PR to implement a changefeed sink for the generic pubsub API exposed by gocloud.dev/pubsub. It does not currently deal with lack of ordering for the nemesis test. Fixes cockroachdb#36982. Release note (enterprise change): Add support for gcp pubsub as a CHANGEFEED sink.
Having worked a lot with RabbitMQ in the past, within the Cloud Foundry realm, I'm a big fan and would love to see it as an option here.
Not sure if y'all have seen this or not: https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector could be a helpful example of how to set up CDC with Dataflow. Specifically, I'm trying to make sure that I can get the changes that are in CockroachDB into BigQuery.
Any news on this? We are using
I don't know if the proposed solution would vary with the ordering guarantees they provide in Pre-GA. Just pinging you all so it can be taken into account. |
Thanks for the update. We can take this into account when we do finish this off. |
Here's a first draft of a CDC -> PubSub bridge that uses an HTTP feed: https://github.com/bobvawter/cdc-pubsub
I'm looking into moving this forward. Below are some notes, many of which just re-iterate ideas presented above.

Google Pub/Sub Ordering Keys

Google Cloud Pub/Sub provides opt-in ordered message delivery (https://cloud.google.com/pubsub/docs/ordering):
We'll likely want to think about the UX around topic & subscription creation, since a subscription only receives messages published after the subscription is created.
Ordering Concerns

When thinking about ordering for this feature, there are at least two things we might care about:
The method outlined by @danhhz in #36982 (comment) would work without ordering turned on at all. This is nice because (1) it may be that some users want to be able to use the RESOLVED messages but don't want the restrictions that may come with turning on ordered delivery, and (2) it could be reused for other sinks that don't provide ordering. The summary of that method as I understand it:
If we wanted to rely on ordered delivery, I believe we would need to send a RESOLVED message for every ordering key we are using, as we do for Kafka partitions. We would likely want to provide some option for how many "ordering keys" we shard rows over.
I am fully on board with not doing anything related to ordering at this point. Adding ordering would be purely additive if we have testing and support without it. Adding it can also be a very small change where we'd do something like set the hash of the row's key as the ordering key.
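The "hash of the row's key as the ordering key" idea might look like this; the shard count and key format are hypothetical knobs, not anything defined by the project:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// orderingKey maps a row's primary key to one of n ordering keys, so
// per-row ordering is preserved (all updates to a row share a key)
// while publish throughput can still spread across shards.
func orderingKey(rowKey []byte, shards uint32) string {
	h := fnv.New32a()
	h.Write(rowKey) // never returns an error
	return fmt.Sprintf("shard-%d", h.Sum32()%shards)
}

func main() {
	// The same row always lands on the same ordering key.
	fmt.Println(orderingKey([]byte("user/42"), 8))
	fmt.Println(orderingKey([]byte("user/42"), 8))
	// A different row may land on a different shard.
	fmt.Println(orderingKey([]byte("user/7"), 8))
}
```

More shards mean more parallelism but more per-key RESOLVED bookkeeping if ordered delivery is relied upon, which is the trade-off the comment above alludes to.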
FWIW, the per-row ordering guarantee can be very useful even in the absence of an easier-to-use resolved timestamp. Imagine you're synchronizing rows to some other system like, say, elasticsearch: you can do an atomic CAS on a value using a version integer. I'd do something like read the existing row and then insert the new one if the timestamp on my changefeed event is higher than the one that's in there. In that world, I'd get correctness out of per-key ordering without ever needing to think about a resolved timestamp.
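A sketch of that per-key compare-and-set pattern: apply an incoming event only if its changefeed timestamp beats what the store already holds, so out-of-order or duplicate delivery of a given key is harmless. The plain map here stands in for a downstream store like elasticsearch, and the names are illustrative:

```go
package main

import "fmt"

// row is a versioned value in some downstream store (e.g. an
// elasticsearch document carrying a version field).
type row struct {
	value string
	ts    int64
}

// applyIfNewer writes the incoming change only if its changefeed
// timestamp is higher than the stored one; stale or duplicate
// events are dropped. A real store would do this as an atomic CAS.
func applyIfNewer(store map[string]row, key, value string, ts int64) bool {
	if cur, ok := store[key]; ok && cur.ts >= ts {
		return false // stale or duplicate event
	}
	store[key] = row{value: value, ts: ts}
	return true
}

func main() {
	store := map[string]row{}
	applyIfNewer(store, "k", "v1", 10)
	applyIfNewer(store, "k", "v0", 5) // older event arriving late: ignored
	fmt.Println(store["k"].value)     // still v1
}
```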
+1 I am starting with some per-row ordering and going from there. |
@amruss this one is a bit old... Should we close this issue? |
Since we announced CDC support targeting Kafka, we've had a handful of requests to support different message queue systems for CDC: Google Cloud pubsub, NATS.io, AWS Kinesis, RabbitMQ, ActiveMQ, and others. @gedw99 pointed me to the Go CDK pubsub implementation, an extensible module which aims to support a variety of sinks across cloud providers, including many of the ones we've gotten questions about. We'd need to do more analysis to make sure the delivery and ordering guarantees offered by individual sinks are appropriate for CockroachDB, but aside from that overhead this seems like a relatively cheap and painless way to add new sink types.
Note: This issue has been scoped down to focus on the go-cloud/pubsub package
Jira issue: CRDB-4467