gaps when plotting metrics in Grafana #4697
Comments
@hagen1778 Thanks, it looks like it's relevant. There are a few other linked issues too. How is VM calculating staleness? I remember it was fine when I first deployed VM and only became a problem when I tried to deploy multiple vmagents. The problem seems to have persisted even though I reverted that change days ago. Could VM just be hanging on to a bad staleness value?
I would have to investigate, but I don't believe the data is becoming irregular. vmagent has plenty of resources and there are no reports of scrape errors or dropped metrics. I have a sneaking suspicion that it would resolve itself if I were to reset the vmstorage nodes. Regardless, this seems to be a major problem lots of people hit. VM should just work out of the box without this quirk, or else I have a hard time recommending it as a Prometheus alternative. I'd really like to work together to figure out how to fix this.
So, it looks like setting `search.minStalenessInterval` to 5m fixed this. I am struggling to understand why this is needed, though. The vmagent is configured to scrape every 30s and there are no reports of any scrape errors or dropped rows. One exception aside, there should be no reason at all for VictoriaMetrics to mark the metrics as stale. I'm happy to help provide more info if possible, but from what I can see this is purely a bug with how VictoriaMetrics considers metrics stale.
@uhthomas may I request some additional info? Thanks!
I deployed a test cluster with one vmagent in an isolated namespace and let it run for a few hours: https://github.com/uhthomas/automata/tree/a025f2b501c9857d7ad92ce4ef5bc42439bd9bf5/k8s/unwind/vm4697 The results show the same gaps in the time series. Stale samples metric: (screenshot)
I've exported the time series as requested and thought it would be easier to upload it to GitHub than email it.
https://github.com/uhthomas/vm4697/blob/fca6deb33972389386e75c0b92fcbabd2275dcb8/vm4697.json Hope this is helpful.
Thanks for the detailed info!
The gaps are there, indeed. So I checked the raw data with a custom script which goes through the timestamps and compares the intervals between adjacent samples.
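For context, a minimal sketch of such a check is below. This is an illustration, not the original script from this thread; it assumes the newline-delimited JSON produced by the `/api/v1/export` endpoint and an expected 30s scrape interval.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// series mirrors one line of /api/v1/export output:
// {"metric":{...},"values":[...],"timestamps":[...]}
type series struct {
	Metric     map[string]string `json:"metric"`
	Values     []float64         `json:"values"`
	Timestamps []int64           `json:"timestamps"` // unix milliseconds
}

func main() {
	const expectedStepMs = 30_000 // assumed 30s scrape interval

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1024*1024), 64*1024*1024) // export lines can be long

	for sc.Scan() {
		var s series
		if err := json.Unmarshal(sc.Bytes(), &s); err != nil {
			fmt.Fprintf(os.Stderr, "skipping malformed line: %v\n", err)
			continue
		}
		name := s.Metric["__name__"]
		// Walk adjacent timestamps and report duplicates and irregular steps.
		for i := 1; i < len(s.Timestamps); i++ {
			step := s.Timestamps[i] - s.Timestamps[i-1]
			switch {
			case step == 0:
				fmt.Printf("%s: duplicate sample at %d\n", name, s.Timestamps[i])
			case step != expectedStepMs:
				fmt.Printf("%s: irregular step %dms at %d\n", name, step, s.Timestamps[i])
			}
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Run it as `go run stepcheck.go < vm4697.json`; duplicate timestamps and irregular steps are exactly what skews the staleness detection discussed below.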
Some of the lines turned out to be duplicates with identical timestamps. So apparently there is more than one vmagent or other metrics collector in this setup, and they do produce duplicates. Duplicates, as I mentioned above, skew the staleness detection algorithm. I decided to apply deduplication and check again, but still, there are gaps. Let's check what the step between adjacent samples looks like: now we see that the step fluctuates between 15s and 45s, so it is still not stable. However, it shows how staleness detection is affected by the raw data VM gets ingested with. Now let's get back to the question of why the raw data looks like this: are you sure there is only one vmagent (or other collector) writing these series?
This is really interesting, thank you for investigating @hagen1778. I promise there is only one vmagent pushing to VM here. Could there be an issue with vmagent? Thanks for pointing out the dedupMinInterval; I have this set on both vmstorage and vmselect already and it works okay. So, to conclude: why is vmagent collecting metrics so inconsistently and producing duplicate data? As I mentioned before, there is definitely only one vmagent active with the data I exported.
It seems to be the cAdvisor metrics which are the problem. The weirdest part is the results for those series. It really seems like vmagent is doing something wrong when consuming some cAdvisor metrics?
ah! I should have remembered that... A typical scrape target returns metrics in the following form:
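(For illustration only; the metric name and value below are made up, not taken from the original comment:)

```
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
```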
where the 1st word is the metric name and the 2nd is its value at the moment of scrape. cadvisor, however, returns:
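(Again an illustrative, made-up line; the third field is a Unix timestamp in milliseconds:)

```
container_cpu_usage_seconds_total{id="/"} 56.78 1690891200000
```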
where the 3rd field is the timestamp of when cadvisor last cached the value. And vmagent respects this timestamp and stores it as-is. This cache time on the cadvisor side is controlled via its housekeeping interval. So if vmagent scrapes cadvisor more frequently than cadvisor updates its cache, you get duplicated values. Or you get an unstable step, because cadvisor's cache updates are not synchronized with the times when vmagent scrapes it. In Prometheus this didn't bother you, because Prometheus always sets the staleness interval to 5m; it is static. If you scrape a metric every 5s and it then disappears, Prometheus will continue plotting it for 5m, while VM will stop plotting it in 10s. The opposite will happen with scrape intervals > 5m. So what you can do here is to manually set `-search.minStalenessInterval` (as you already did), or configure vmagent to ignore the timestamps exposed by cadvisor.
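A hedged sketch of both workarounds (the flag value and the scrape-config snippet are illustrative; the `cadvisor` job name and the discovery config are assumptions, adjust them to the actual setup):

```yaml
# Option 1: relax staleness detection on vmselect / single-node VictoriaMetrics:
#   -search.minStalenessInterval=5m
#
# Option 2: make vmagent ignore timestamps exposed by cadvisor:
scrape_configs:
  - job_name: cadvisor            # assumed job name
    honor_timestamps: false       # use the scrape time instead of target-provided timestamps
    kubernetes_sd_configs:
      - role: node                # assumed discovery config
```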
Yep, this makes sense. I was just looking into the cAdvisor implementation and realised the metrics are exported with explicit timestamps. I'm glad we were able to figure out what is happening and why, but this then leaves the question of what we can do to fix it. I recognise the two solutions you proposed will work, but they are not the default behaviour. VictoriaMetrics should be able to handle this case without any additional configuration.
I see no good way to solve this and I'm open to suggestions.
If Prometheus had a lower staleness default, or cadvisor a higher housekeeping default, everything would break. This won't be a problem for anyone if cadvisor just gives up exposing timestamps by default. I'd be glad to change the defaults here.
I think, unfortunately, custom timestamps are here to stay and we just have to support them correctly. Many other projects also make use of them, for example when forwarding metrics from other sources. Even the official Prometheus exporters do this sometimes. I don't think I have enough domain knowledge to help with ideas on how to fix this. The only suggestion I have is to set the minScrapeInterval to 5m by default, like Prometheus does.
Forwarding metrics from other sources doesn't produce duplicates. It is unlikely an application will push the same time series with the same timestamp twice.
This exporter has it disabled by default. But anyway, this doesn't mean adding timestamps to the pull model is the right thing to do. Take a look at this comment, for example.
This only hides the problem, doesn't it? You'll still have duplicates with identical timestamps stored in the TSDB, or have some of them silently dropped when deduplication is applied. I think we should advocate for changing the cadvisor default behavior in the first place. And thanks for not giving up on this issue and bringing more light to this problem!
@valyala it's not a …
FYI, starting from v1.93.0, single-node VictoriaMetrics and vmagent ignore timestamps provided by scrape targets by default, unless `honor_timestamps: true` is set explicitly in the scrape config. This should remove gaps from graphs built from metrics obtained from scrape targets which improperly set timestamps for the exported metrics (such as cadvisor). This should also reduce disk space usage for such metrics and increase query performance. Closing the issue as fixed then.
… `honor_timestamps: true` is set explicitly. This fixes the case with gaps for metrics collected from cadvisor, which exports invalid timestamps that break staleness detection on the VictoriaMetrics side. See #4697, #4697 (comment) and #4697 (comment). Updates #1773
FYI, the bugfix described here has been back-ported to the v1.87.7 long-term support release.
I know this issue is closed, but I hit a similar issue myself and actually root-caused it to vmagent inserting staleness markers if a series didn't have new data on every scrape. I worked around it by disabling staleness markers in vmagent.
The staleness markers logic isn't related to the handling of timestamps provided by scrape targets (aka `honor_timestamps`).
@hagen1778 Should we discuss this on #5576? I am running vmsingle, so vmselect is not relevant here.
vmsingle reuses the same codebase as the cluster version, so both versions were vulnerable to the mentioned bug.
Thanks, I will try v1.96.0 later. However, the vmsingle instance is not restarting at all during those times, so I am not confident this will fix the issue.
You can try it on 1.95.1 with cache disabled (via GET param or by using vmui).
The mentioned bug isn't related to restarts; that was a coincidence.
Ah, this does seem to fix it. Thank you. I will try the new version to verify later also.
Describe the bug
I'm running a highly available VictoriaMetrics cluster and multiple VMAgents using the Kubernetes operator.
Following the guides on deduplication, I have the deduplication argument set as follows:
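(The exact value from the original report was not preserved; the flag below is an illustrative example in line with the 30s scrape interval mentioned in the comments, passed to both vmselect and vmstorage:)

```
-dedup.minScrapeInterval=30s
```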
Unfortunately, I see lots of gaps in the metrics.
Intuitively, it would seem the scrape intervals are mismatched. I am pretty confident, however, that this is not the case.
To Reproduce
Deploy VM as I have done and observe the reported behaviour.
Version
docker.io/victoriametrics/vmstorage:v1.91.0-cluster
Logs
No response
Screenshots
No response
Used command-line flags
No response
Additional information
No response