
Add caching for the topics metadata#75

Merged
danielqsj merged 4 commits into danielqsj:master from jorgelbg:metadata-cache
Jan 31, 2019

Conversation

@jorgelbg
Contributor

This implements the feature mentioned in #74, adding a configurable caching interval for the topics metadata. Only the metadata is cached; the offset/lag is requested/calculated every time, so this should not affect the metrics.

Depending on the interval configured (30s by default), it could be the case that a new topic is added and is not picked up immediately.

@jorgelbg jorgelbg changed the title Metadata cache Add caching for the topics metadata Nov 14, 2018
@jorgelbg
Contributor Author

Sorry for the big PR. I upgraded sarama to v1.19.0 because the client metadata was being cleared, even if a larger RefreshFrequency was set.

Setting the RefreshFrequency (now that it works) could render most of the other code unnecessary (the client already keeps an internal timer, so we would be counting the time twice). I'll update the PR with the change.

@jorgelbg
Contributor Author

Setting config.Metadata.RefreshFrequency could avoid the explicit call to RefreshMetadata entirely. But we would need to remove that call from Collect, because e.client.RefreshMetadata() doesn't check the RefreshFrequency and executes the request every time.
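For context, the approach discussed here can be sketched with sarama's client configuration. Config.Metadata.RefreshFrequency is the field being discussed; the 30s value and the broker address are illustrative assumptions, not values from this PR:

```go
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Let sarama refresh metadata in the background on its own schedule,
	// instead of the exporter forcing client.RefreshMetadata() on every Collect.
	config := sarama.NewConfig()
	config.Metadata.RefreshFrequency = 30 * time.Second // illustrative; sarama's default is 10 minutes

	// "localhost:9092" is a placeholder broker address.
	client, err := sarama.NewClient([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Collect would then read cached metadata (e.g. client.Topics()) without
	// triggering a fresh request, because an explicit RefreshMetadata() call
	// bypasses RefreshFrequency and always hits the brokers.
}
```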

@bojleros

bojleros commented Jan 31, 2019

@jorgelbg do you think this can fix the problem of gaps in the data series? I am now testing kafka-exporter in our staging environment and this happens over and over again. What I have found is that these gaps occur in roughly the same timeframes as the metadata refreshes (note that the i/o timeouts below also happen occasionally):

[sarama] 2019/01/31 12:05:26 Failed to connect to broker kafka-sea-1:9092: dial tcp 172.20.x.x:9092: i/o timeout
[sarama] 2019/01/31 12:05:26 Failed to connect to broker kafka-sea-2:9092: dial tcp 172.20.x.x:9092: i/o timeout
[sarama] 2019/01/31 12:05:26 client/metadata fetching metadata for [xxx] from broker kafka-sea-1:9092
[sarama] 2019/01/31 12:05:26 client/metadata fetching metadata for [yyy] from broker kafka-sea-1:9092
[sarama] 2019/01/31 12:05:26 client/metadata fetching metadata for [zzz] from broker kafka-sea-1:9092
[sarama] 2019/01/31 12:05:26 client/metadata fetching metadata for [aaa] from broker kafka-sea-1:9092

[image: screenshot of the metric series showing the gaps]

For me this is a major issue that prevents me from using this exporter inside our alerting chain. I've posted similar comments but have seen no activity for a long time ...

@danielqsj
Owner

Looks good to me. Sorry for the delayed response.

@danielqsj
Owner

Thanks @jorgelbg
@bojleros can you try it and see whether your issues are fixed?

@danielqsj danielqsj merged commit 56c1d47 into danielqsj:master Jan 31, 2019
@jorgelbg
Contributor Author

@danielqsj don't worry, thanks for taking the time!

@bojleros this could help (I think) because it will reduce the additional time that refreshing the metadata takes, especially if you have a big cluster.

Increasing the scrape interval in Prometheus could also help. Reading the lag/offset data and sending the response back can also be time-consuming if you have a lot of topics.

@bojleros

@danielqsj @jorgelbg

I am going to start my tests tomorrow morning.

Our clusters are rather small, with tightly monitored performance. I am sure we have the resources and are not hitting the fd limit. We also have our own tool (https://github.com/msales/kage) running next to the kafka_exporters; these gaps are what held us back from using kafka_exporter in production. I am not a kage developer, but maybe you will be interested in comparing the code of both projects in search of possible improvements.

Apart from that, I saw that kafka_exporter produces new values every minute. Is it feasible to have it refresh values every 15 seconds in order to fit our default scrape interval?

Kind Regards,
