
How to store processed log to Kafka or Cassandra with Manual Commit? #51

Closed
evans-ye opened this issue Nov 17, 2015 · 5 comments

@evans-ye

Sorry, I can't find a proper place to ask, so I'm opening an issue here (close it if you think that's improper).

I'm currently doing a PoC based on this awesome reactive kafka module with the manual commit feature, and I'm struggling to add a Sink that stores the log in a permanent storage system such as Kafka or Cassandra.
In your sample code, messages are processed on the fly in the processMessage function. If I need to store data in Kafka, I have to replace offsetCommitSink with another Kafka Sink, but then I can't use offsetCommitSink to stream back for committing.
Another approach is to use a saveToKafka function to store the processed log in Kafka (shown below), which is the current implementation of my PoC.

Source(consumerWithOffsetSink.publisher)
  .map(processMessage(_)) // your message processing
  .map(saveToKafka(_))
  .to(consumerWithOffsetSink.offsetCommitSink) // stream back for commit
  .run()

Do you think this is the best practice to achieve my goal?

@kciesielski
Contributor

@evans-ye Have you considered using Broadcast from Akka Streams' DSL? It allows "forking" the stream so that you could try to send your processed messages to two "branches": your topic where you save and the commit Sink.
Example: http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0-M2/scala/stream-quickstart.html
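
For illustration, a minimal, self-contained sketch of that forking idea (not reactive-kafka's actual API: Msg, saveToMyTopic and offsetCommitSink below are hypothetical stand-ins, and it uses the later GraphDSL style rather than the 1.0-M2 FlowGraph from the linked docs):

import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, GraphDSL, RunnableGraph, Sink, Source}

object BroadcastSketch extends App {
  implicit val system = ActorSystem("broadcast-sketch")
  implicit val materializer = ActorMaterializer()

  case class Msg(value: String) // hypothetical stand-in for the committable message type

  val messages = Source(List(Msg("a"), Msg("b")))
  val saveToMyTopic = Sink.foreach[Msg](m => println(s"saved ${m.value}"))        // stand-in for a Kafka producer sink
  val offsetCommitSink = Sink.foreach[Msg](m => println(s"committed ${m.value}")) // stand-in for the commit sink

  RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[Msg](2))
    messages ~> bcast.in
    bcast.out(0) ~> saveToMyTopic    // branch 1: save the processed message
    bcast.out(1) ~> offsetCommitSink // branch 2: commit the offset
    ClosedShape
  }).run()
}

Note that Broadcast on its own only forks the stream; it does not by itself order the commit after the save.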

@evans-ye
Author

Yeah, thanks for your prompt response.
CMIIW, but my thinking is that when using broadcast there's no guarantee the commit will be executed after the data has been saved to my topic, so there's a chance of losing data.

Let's say the branch saving data to my topic is slower and the committing branch is faster.
If a failure happens after the offsets have been committed back, but while the other branch is still busy saving data to my topic, the in-flight data is lost: the next time we fetch, we start from an offset that has already been advanced past it.

@jasongilanfarr

You definitely can do this (and I do) by using a broadcast and some flow stages so you'll commit only after the persist to Cassandra future completes. As a side benefit, this pseudo flow only sends the message itself to your persist method.

Source(publisher) ~> broadcast ~> Flow.map(_.message).mapAsync(1)(persist) ~> zip.in0
                     broadcast ~> zip.in1
zip.out ~> Flow.map(_._2) ~> commitSink
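
Spelled out as a runnable sketch (assumptions: Msg, persist, and commitSink are hypothetical stand-ins for the committable message type, the Cassandra write, and the offset commit sink, and the code uses the later GraphDSL style):

import scala.concurrent.Future
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source, Zip}

object CommitAfterPersist extends App {
  implicit val system = ActorSystem("commit-after-persist")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  case class Msg(message: String) // hypothetical stand-in for the committable message type

  def persist(payload: String): Future[Unit] = Future(println(s"persisted $payload")) // stand-in for the Cassandra write
  val commitSink = Sink.foreach[Msg](m => println(s"committed ${m.message}"))         // stand-in for the offset commit sink
  val messages = Source(List(Msg("a"), Msg("b")))

  RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[Msg](2))
    val zip   = b.add(Zip[Unit, Msg]())

    messages ~> bcast.in
    // Branch 1: extract the payload and persist it; mapAsync(1) only emits an
    // element once the Future returned by persist has completed, in order.
    bcast.out(0) ~> Flow[Msg].map(_.message).mapAsync(1)(persist) ~> zip.in0
    // Branch 2: pass the original message through unchanged.
    bcast.out(1) ~> zip.in1
    // Zip pairs element i of both branches, so a message reaches the commit sink
    // only after its own persist future has completed.
    zip.out ~> Flow[(Unit, Msg)].map(_._2) ~> commitSink
    ClosedShape
  }).run()
}

Because both branches preserve element order and Zip emits only when both inputs have produced an element, message i cannot be committed before persist has completed for it.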

@evans-ye
Author

OKAY! I got your point. This looks perfect to me.
Thank you so much for the great help and the great module!

I have just one more question, which I think gets at the benefit of using your module:
comparing reactive kafka with akka streams + akka persistence, both of which provide an at-least-once message guarantee, reactive kafka is much more efficient since we only write back offsets, while akka streams + akka persistence has to persist every log entry to disk. Am I correct?

@13h3r
Member

13h3r commented Apr 30, 2016

Looks like we're done with this.

@13h3r 13h3r closed this as completed Apr 30, 2016
@ennru ennru added this to the invalid milestone Jun 7, 2018