Instrumentation support #54

Closed
JHK opened this issue Nov 14, 2018 · 21 comments

@JHK

JHK commented Nov 14, 2018

To do deeper introspection into what happens when receiving or publishing messages, it would be useful to have an instrumentation interface compatible with Active Support Instrumentation. The default might be a NullInstrumenter that simply discards information. For an idea of what might actually be useful to instrument, take inspiration from ruby-kafka:

  • message producing
  • message delivery
  • message polling
  • join/leave consumer group
  • (re-)assign partitions within consumer group
  • offset changes
  • consumer heartbeat
  • connection updates
  • probably more...
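
A minimal sketch of what such an interface could look like, assuming an instrumenter is any object that responds to `instrument(name, payload) { ... }` in the same shape as `ActiveSupport::Notifications.instrument`. All class names, the event name, and the `FakeProducer` below are hypothetical and are not part of this gem:

```ruby
# Hypothetical sketch: none of these classes exist in the gem.
# An instrumenter only needs to respond to #instrument(name, payload) { ... },
# mirroring the shape of ActiveSupport::Notifications.instrument.
class NullInstrumenter
  # Discards the event and only runs the block, if one is given.
  def instrument(_name, payload = {})
    yield payload if block_given?
  end
end

# Records every event with its duration, e.g. for tests or ad-hoc debugging.
class RecordingInstrumenter
  Event = Struct.new(:name, :payload, :duration)

  attr_reader :events

  def initialize
    @events = []
  end

  def instrument(name, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield payload if block_given?
    result
  ensure
    # Record the event even if the block raised.
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @events << Event.new(name, payload, duration)
  end
end

# Hypothetical client code that accepts any object with the same interface.
class FakeProducer
  def initialize(instrumenter: NullInstrumenter.new)
    @instrumenter = instrumenter
  end

  def produce(topic:, payload:)
    event_payload = { topic: topic, bytesize: payload.bytesize }
    @instrumenter.instrument("produce_message.rdkafka", event_payload) do
      :enqueued # stand-in for handing the message to the client library
    end
  end
end

recorder = RecordingInstrumenter.new
FakeProducer.new(instrumenter: recorder).produce(topic: "greetings", payload: "hello")
recorder.events.each { |e| puts "#{e.name} #{e.payload} #{e.duration}s" }
```

Because the contract is only the `instrument` method, `ActiveSupport::Notifications` itself or another engine with a compatible method could be passed in the same way.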
@thijsc
Collaborator

thijsc commented Nov 14, 2018

Thanks for asking. And also for using this gem in racecar! :-)

I have considered integrating AS instrumentation, but given the nature of the underlying C lib I don't see a way in which that approach works well. Did you see the statistics callback we added? #40

I just noticed the docs on rubydocs are not properly regenerated for some reason, so you might have missed that.

@thijsc
Collaborator

thijsc commented Nov 14, 2018

Also, some callbacks would definitely make sense to add, especially for partition assignment changes.

@mensfeld
Member

It would be really good if the instrumentation engine was not AS Notif based but rather AS Notif compatible, so other engines can be plugged in (like dry-monitor, which we use in Karafka).

@JHK
Author

JHK commented Nov 14, 2018

@mensfeld I updated the ticket description to make it clearer that this should not rely on ActiveSupport, but rather use the same interface for instrumentation.

@JHK
Author

JHK commented Nov 16, 2018

The statistics endpoint goes in the right direction, but it is not what I meant by this issue. This is about being able to connect the instrumentation to, e.g., the datadog agent, to introspect what happened on each and every request (that got recorded). There it is quite handy to know which branch the code took, how often, and how long it took.

@thijsc
Collaborator

thijsc commented Nov 16, 2018

I've been thinking about this quite a bit, especially since I work on a monitoring product all day.

The thing is that I'm not sure there actually is something to measure. Librdkafka does a lot of buffering in the background. Actually consuming a message from Ruby pops something off an internal buffer, which is always super fast. I think what you're talking about mainly happens inside librdkafka. The stats for that are present in the statistics callback.

Can you give an example of where you'd like to see hooks? What would these hooks really allow you to measure?
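
For reference, a rough sketch of how the statistics mentioned above could be reduced to a few metrics. It assumes the callback receives the librdkafka statistics as a parsed Hash with string keys; the field names (`txmsgs`, `rxmsgs`, `msg_cnt`, `brokers`, `rtt.avg` in microseconds) follow librdkafka's statistics documentation, while `summarize_stats` and the sample payload are made up for illustration:

```ruby
require "json"

# Hypothetical helper: flattens one statistics payload into a few metrics.
# Assumes string keys as found in librdkafka's statistics JSON.
def summarize_stats(stats)
  brokers = stats.fetch("brokers", {})
  # Per-broker average round-trip time, reported in microseconds.
  rtts = brokers.values.map { |broker| broker.dig("rtt", "avg") }.compact

  {
    messages_sent: stats.fetch("txmsgs", 0),
    messages_received: stats.fetch("rxmsgs", 0),
    queued_messages: stats.fetch("msg_cnt", 0),
    avg_broker_rtt_ms: rtts.empty? ? nil : rtts.sum / rtts.size.to_f / 1000.0
  }
end

# Made-up sample payload with only the fields used above.
sample = JSON.parse(<<~STATS)
  {"txmsgs": 120, "rxmsgs": 0, "msg_cnt": 3,
   "brokers": {"b1": {"rtt": {"avg": 2000}}, "b2": {"rtt": {"avg": 4000}}}}
STATS

summary = summarize_stats(sample)
puts summary

# With the gem loaded, the assumption is that this could be wired up roughly as
#   Rdkafka::Config.statistics_callback = ->(stats) { report(summarize_stats(stats)) }
# where `report` is a placeholder for whatever forwards metrics to a monitoring agent.
```

Statistics are only emitted when `statistics.interval.ms` is set in the client configuration.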

@JHK
Author

JHK commented Nov 19, 2018

Looking at the instrumentation of ruby-kafka it provides a notification one can subscribe to whenever a message produce gets called. It provides some meta information (code).

This can then be used, for example, in the datadog-agent or (as in my case) in time_bandits to determine the call frequency per request or similar metrics.

@thijsc
Collaborator

thijsc commented Nov 19, 2018

Right, I think I understand the use case better. You're not so much interested in the performance of the produce call. But you do want to get hooks and see the volume?

@mensfeld
Member

@thijsc I am interested in the produce performance. Having instrumentation for it would also cover the volume, at least for DD, by incrementing a counter for the messages sent to a particular topic.

@thijsc
Collaborator

thijsc commented Nov 19, 2018

> I am interested in the produce performance.

What do you see yourself measuring exactly?

@mensfeld
Member

> What do you see yourself measuring exactly?

How many messages can I send per second depending on the ack level plus where do they go (to which topic).
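
As an illustration of that measurement, here is a small sketch of a per-topic throughput counter that a produce hook could feed. `ProduceThroughput` and its methods are hypothetical names; the ack level is not modeled, though it could be made part of the counting key:

```ruby
# Hypothetical subscriber: counts produce events per topic and derives a rate.
# Not thread-safe as written; wrap access in a Mutex if hooks fire concurrently.
class ProduceThroughput
  # The clock is injectable so the rate calculation can be tested deterministically.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @started_at = clock.call
    @counts = Hash.new(0)
  end

  # Call once per produced message, e.g. from an instrumentation hook.
  def record(topic)
    @counts[topic] += 1
  end

  # Messages per second for each topic since this object was created.
  def rates
    elapsed = @clock.call - @started_at
    return {} unless elapsed.positive?

    @counts.transform_values { |count| count / elapsed }
  end
end

throughput = ProduceThroughput.new
3.times { throughput.record("orders") }
throughput.record("payments")
puts throughput.rates
```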

@mensfeld
Member

mensfeld commented Aug 8, 2019

@thijsc any reason for the statistics_callback to be global? What if I would want to have different callback handling in various consumers/producers?

@thijsc thijsc added this to the Feature complete milestone Aug 15, 2019
@thijsc thijsc self-assigned this Aug 15, 2019
@thijsc
Collaborator

thijsc commented Aug 15, 2019

> @thijsc any reason for the statistics_callback to be global? What if I would want to have different callback handling in various consumers/producers?

#82 was opened for this question.

@thijsc
Collaborator

thijsc commented Aug 15, 2019

I'm trying to get this done, but not making a lot of progress because I don't have a clear picture in my mind what this looks like. I can see how events for assignment changes and so forth can work.

I can also see how emitting an event for producing a message could work. I don't see how emitting an event for a delivered message would be useful. AS notifications assumes that things happen in sync, and that's not going to be the case here. I think you're going to get a lot of out-of-order events.

I also don't see how we can do hooks for message delivery. The C lib pops them off a buffer, so the moment they arrive on the Ruby side says little about how the network is doing, for example. The stats in the statistics callback do tell us that. Maybe I'm missing a useful use case here?

I think we need to spend some time coming up with a spec of which events should be emitted and write up some use cases on how one would benefit from them. That'll make it a more manageable project to get this done.

@JHK and @mensfeld which events do you think should be emitted and could you write up a short description of when they would trigger and which information they would emit?

@JHK
Author

JHK commented Aug 19, 2019

I cannot say exactly what needs to be in such a message; instead, have a look at what racecar already provides:

Those are instrumentations built from the need to measure details within racecar. The statistics callback already provides much of that information, but not the hook itself. So I'd suggest including what makes sense to you in that hook. If someone needs more, we can still extend it through individual PRs. The general idea of hooks will be in place by then, and the parameters can be discussed on a case-by-case basis.

@dasch

dasch commented Aug 23, 2019

We have a pretty clear need to measure the number of successful / failed message deliveries per producer process.
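
A minimal sketch of such a counter, assuming the producer exposes a delivery callback that is invoked with a report object carrying an `error` that is nil on success. `DeliveryCounter` and `FakeDeliveryReport` are hypothetical names, and a real report object may expose different fields:

```ruby
# Hypothetical stand-in for a delivery report; a real client's report
# object may expose different fields.
FakeDeliveryReport = Struct.new(:topic, :partition, :offset, :error)

# Counts successful and failed deliveries for one producer process.
class DeliveryCounter
  attr_reader :succeeded, :failed

  def initialize
    @mutex = Mutex.new
    @succeeded = 0
    @failed = 0
  end

  # Intended to be called from a delivery callback, which may run on a
  # different thread than the one that produced the message.
  def call(report)
    @mutex.synchronize do
      if report.error.nil?
        @succeeded += 1
      else
        @failed += 1
      end
    end
  end
end

counter = DeliveryCounter.new
counter.call(FakeDeliveryReport.new("orders", 0, 42, nil))
counter.call(FakeDeliveryReport.new("orders", 0, nil, "message timed out"))
puts "succeeded=#{counter.succeeded} failed=#{counter.failed}"
```

Since the counter responds to `call`, it could be assigned wherever a lambda-style callback is accepted.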

@mensfeld
Member

@dasch but you can do that yourself now: https://github.com/karafka/waterdrop/pull/106/files#diff-d179c7dee2064c1622d2d3da2b03c44dR32

@thijsc
Collaborator

thijsc commented Sep 18, 2019

Thanks all for the input! I'm going to work on it.

@emersonpriceiv

Hello! I'm curious what became of this work. We're currently going through the process of updating Racecar and we've been leveraging the consumer heartbeat instrumentation for monitoring our consumer health. Are there any plans to implement something similar? If not we would love to see it!

@mensfeld
Member

mensfeld commented Jun 5, 2021

@emersonpriceiv the current API allows you to do that. Please see the waterdrop PR above, which has full instrumentation support.

@thijsc
Collaborator

thijsc commented Apr 24, 2023

Closing this one. I think it's not clear how we can improve on rdkafka's internal capabilities.

@thijsc thijsc closed this as completed Apr 24, 2023