Conversation
Scanned through briefly. Overall LGTM.
Adding @nlu90 as a reviewer, who has more experience with the Kafka Spout.
@simingweng I'm having trouble following what you are doing when there is a failed tuple. It's not that your code is confusing; I am just missing something and would like an explanation. Are you replaying all tuples from the most recent failed offset? From what I have read in the code, I have come to believe that is your process; is this correct?
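(For readers unfamiliar with that pattern, here is a minimal sketch of what "replaying from a failed offset" typically looks like with a plain Kafka consumer. It is only an illustration of the general technique, not a claim about how this PR implements it.)

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplaySketch {
  // Rewind a partition so the failed record, and everything after it,
  // is delivered again on the next poll() call.
  static void replayFrom(KafkaConsumer<String, String> consumer,
                         TopicPartition partition,
                         long failedOffset) {
    // seek() only moves the in-memory fetch position; nothing is committed,
    // so all records from failedOffset onwards are fetched and re-emitted.
    consumer.seek(partition, failedOffset);
  }
}
```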
The code looks good. I plan on using this shortly and will hopefully have more feedback after I get into it. Thanks!
The Travis CI build failed due to the following test failure: it seems the actual JSON output just had a different element order in the JSON array compared to the expected output, which, I'm pretty sure, has nothing to do with the Maven test I added for the Kafka Spout. Can someone familiar with the integration test code base help with this?
@worlvlhole
@simingweng Thank you so much for the in-depth explanation!
Overall LGTM. A good start for iteration. Thanks!
Maybe this information should be included in the documentation?
Hi, thanks for this PR, happy to finally see a Kafka Spout implementation in Heron. I am planning on using this once it is merged, but I have a question. One major difference between this implementation and the Storm one is that Storm's spout allows emitting to different streams, using the org.apache.storm.kafka.spout.RecordTranslator interface. This implementation is missing that particular functionality, which is quite useful. Is there a reason for not keeping this functionality in this Kafka Spout implementation? Or is there another way to achieve similar functionality (other than creating a map-function-like bolt for this purpose)? It's really useful for sending data from different topics to different downstream bolts.
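(For context on the Storm behaviour being referenced: a translator routes each record to a named stream. The sketch below is a rough illustration against storm-kafka-client's RecordTranslator as I recall it; the exact method signatures should be checked against the Storm version in use.)

```java
import java.util.Arrays;
import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.storm.kafka.spout.KafkaTuple;
import org.apache.storm.kafka.spout.RecordTranslator;
import org.apache.storm.tuple.Fields;

// Routes each record to a stream named after its topic, e.g. records from
// "topic1" are emitted on stream "topic1", so downstream bolts can subscribe
// only to the topics they care about.
public class TopicToStreamTranslator implements RecordTranslator<String, String> {
  @Override
  public List<Object> apply(ConsumerRecord<String, String> record) {
    // KafkaTuple.routedTo(...) tells the spout which declared stream to use.
    return new KafkaTuple(record.key(), record.value()).routedTo(record.topic());
  }

  @Override
  public Fields getFieldsFor(String stream) {
    return new Fields("key", "value");
  }

  @Override
  public List<String> streams() {
    return Arrays.asList("topic1", "topic2");
  }
}
```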
Slightly unrelated, but I think the need to set a list of spout requirements is still a big discussion that's needed.
Yeah. The requirements will evolve. Feel free to add things to the README file.
Interesting. Is topic -> stream 1-to-1? Why not have multiple spout components, each responsible for one topic?
Very good question. I actually started with a "one-record-to-many-tuple" implementation, but gave it a deeper thought during implementation. We also face a design choice: should the KafkaSpout itself decide the uniqueness of a set of Message IDs coming from the same ConsumerRecord, or should we open that choice up to the developer? So, a neater choice is to use multiple KafkaSpouts, each dedicated to an output stream. But I do agree "one-record-to-many-tuple" is pretty useful and cost-effective in terms of resource consumption. I have no objection to putting it back in, but then it becomes the developer's responsibility to make sure to avoid emitting multiple tuples out of one ConsumerRecord, which matters ONLY in ATLEAST_ONCE mode.
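(To make the "multiple KafkaSpouts, each dedicated to an output stream" alternative concrete, a rough wiring sketch follows. The KafkaSpout constructor, the configForTopic helper, and the bolt classes are placeholders for illustration, since the PR's actual API is not quoted in this thread; the TopologyBuilder usage mirrors standard Heron/Storm spout-and-bolt wiring.)

```java
// Package name assumed from the Apache Heron API; older releases used com.twitter.heron.
import org.apache.heron.api.topology.TopologyBuilder;

public class OneSpoutPerTopicSketch {
  public static TopologyBuilder build() {
    TopologyBuilder builder = new TopologyBuilder();

    // Placeholder constructors: one spout instance per topic, each emitting
    // on its own (default) stream to its own downstream bolt.
    builder.setSpout("topic1-spout", new KafkaSpout(configForTopic("topic1")), 1);
    builder.setSpout("topic2-spout", new KafkaSpout(configForTopic("topic2")), 1);

    builder.setBolt("topic1-bolt", new Topic1Bolt(), 1).shuffleGrouping("topic1-spout");
    builder.setBolt("topic2-bolt", new Topic2Bolt(), 1).shuffleGrouping("topic2-spout");
    return builder;
  }
}
```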
Topic -> Stream is 1 -> 1 in our use case, but in Storm's implementation, it can be configured to be many -> 1 as well.
Agreed that one record -> many tuples is really useful, but it does make tracking offsets a lot harder. In any case, I don't think that is possible using the current Kafka spout implementation from Storm - it only allows 1 record -> 1 tuple emits, but to any declared stream.
Got it. Thanks. Many -> 1 does sound useful in a resource-constrained environment. It should be doable, but there might be extra logic and config. I am thinking that maybe it can be created as a separate spout that shares code with this 1:1 version.
Are these streams the same, or are they different (e.g., applying different filters/transforms)? If they are the same, what is the difference between that and a spout that has one output stream which multiple bolts register to? Different offsets?
I think @rohanag12's use case is that he has multiple topics, and the record from each topic will only be translated into one tuple and emitted to one stream. Basically, it is "topic1" -> "stream 1", "topic2" -> "stream 2", etc. So, he would like to use one KafkaSpout, subscribing to multiple topics, and have each record routed to the stream corresponding to its topic. This pattern can save the user one extra "distributor" bolt connected to the KafkaSpout; basically, the routing logic becomes embedded in the KafkaSpout itself. But again, the trade-off of this flexibility is that the developer needs to be careful not to emit multiple tuples out of one single record in ATLEAST_ONCE mode. So, as long as one record is only translated into one emitted tuple, the current KafkaSpout implementation shall work in ATLEAST_ONCE mode.
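(A hypothetical sketch of that topic-to-stream routing, keeping each record as exactly one tuple. The class below does not reproduce the PR's actual ConsumerRecordTransformer interface, which is not quoted in this thread; its method names and signatures are assumptions for illustration.)

```java
import java.util.Arrays;
import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;

// Hypothetical transformer: "topic1" -> "stream1", "topic2" -> "stream2", etc.
// Method names are illustrative only and are not the PR's real interface.
public class TopicRoutingSketch {
  // Pick the output stream based solely on the record's source topic.
  public String outputStream(ConsumerRecord<String, String> record) {
    return record.topic().replace("topic", "stream");
  }

  // One record -> exactly one tuple, which keeps ATLEAST_ONCE acking simple.
  public List<Object> toTuple(ConsumerRecord<String, String> record) {
    return Arrays.<Object>asList(record.key(), record.value());
  }
}
```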
allow defining multiple output streams in ConsumerRecordTransformer
@nwangtw @rohanag12 I've just pushed one more commit to add support for multiple output streams, and the spout is now also able to support EFFECTIVE_ONCE mode.
I see. So the N -> 1 and N -> N cases, not the 1 -> N case. Thanks.
* initial commit
* remove dependency on heron-storm compatibility library
* insert Copyright header
* use constant in ConsumerConfig for client configuration; add sample topology
* change to default Kafka Broker port number
* add Javadoc documentation
* add documentation 1
* include unit tests for heron kafka spout in Travis
* add documentation 2
* support EFFECTIVE_ONCE mode; allow defining multiple output streams in ConsumerRecordTransformer
This is a KafkaSpout implemented using the Heron Spout API.
It supports both AT_MOST_ONCE and AT_LEAST_ONCE message delivery guarantees, and there's a sample project showcasing how it is used in the simplest topology on a Simulator. The sample topology requires a local Kafka broker in order to run and test.
The project is currently a Maven project rooted at incubator-heron/contrib/spouts/kafka/java/heron-kafka-spout-parent, and it depends on the publicly available Maven dependency heron-api. When it is converted to a Bazel target, it should cross-reference the in-tree Heron API target.
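(For reference, a topology using this spout on the Simulator would look roughly like the sketch below. The KafkaSpout construction and PrintBolt are placeholders, since the spout's actual configuration API lives in the PR's code rather than in this thread, and the org.apache.heron package names are assumed from the Apache Heron API.)

```java
// Package names assumed from the Apache Heron API; adjust for the release in use.
import org.apache.heron.api.Config;
import org.apache.heron.api.topology.TopologyBuilder;
import org.apache.heron.simulator.Simulator;

public class LocalKafkaSampleSketch {
  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();

    // Placeholder construction: the real spout takes its own config object
    // (bootstrap servers, topic, delivery guarantee, ...).
    builder.setSpout("kafka-spout", new KafkaSpout(), 1);
    builder.setBolt("printer", new PrintBolt(), 1).shuffleGrouping("kafka-spout");

    // Runs the topology in-process; a Kafka broker must be running locally.
    new Simulator().submitTopology("kafka-sample", new Config(), builder.createTopology());
  }
}
```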