
gaps when plotting metrics in Grafana #4697

Closed
uhthomas opened this issue Jul 23, 2023 · 36 comments
Labels: bug, vmagent

@uhthomas
Contributor

Describe the bug

I'm running a highly available VictoriaMetrics cluster and multiple VMAgents using the Kubernetes operator.

Following the guides on deduplication, I have the deduplication argument set as follows:

extraArgs:
  dedup.minScrapeInterval: "30s"
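
For reference, a minimal sketch of where this sits in the operator-managed VMCluster resource (illustrative only, not a copy of my actual manifests):

```yaml
# Illustrative sketch - the operator passes extraArgs entries to the
# corresponding components as command-line flags.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: vm
spec:
  vmstorage:
    extraArgs:
      dedup.minScrapeInterval: "30s"
  vmselect:
    extraArgs:
      dedup.minScrapeInterval: "30s"
```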

Unfortunately, I see lots of gaps in the metrics.

[screenshot]

Intuitively, it would seem the scrape intervals are mismatched. I am pretty confident, however, that this is not the case, given the following:
[screenshot]

To Reproduce

Deploy VM as I have done and observe the reported behaviour.

Version

docker.io/victoriametrics/vmstorage:v1.91.0-cluster

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

@uhthomas uhthomas added the bug Something isn't working label Jul 23, 2023
@Amper Amper self-assigned this Jul 24, 2023
@uhthomas
Contributor Author

uhthomas commented Jul 24, 2023

Actually, I see this even without deduplication. I disabled deduplication and switched to using a single agent and still see this.

[screenshot]

[screenshot]

I don't see any logs from vminsert or vmstorage.

@uhthomas uhthomas changed the title gaps in deduplicated metrics gaps in metrics Jul 24, 2023
@hagen1778
Collaborator

Hello @uhthomas! Could you please go through #3878 and see if it is relevant to your case?

hagen1778 added a commit that referenced this issue Jul 25, 2023
#3878
#4697
#3849
#3451
Signed-off-by: hagen1778 <roman@victoriametrics.com>
@hagen1778 hagen1778 changed the title gaps in metrics gaps when plotting metrics in Grafana Jul 25, 2023
@hagen1778 hagen1778 assigned hagen1778 and unassigned Amper Jul 25, 2023
@uhthomas
Contributor Author

uhthomas commented Jul 25, 2023

@hagen1778 Thanks, it looks like it's relevant. There are a few other linked issues too.

#1773 and #426

How is VM calculating staleness? I remember it was fine when I first deployed VM and only became a problem when I tried to deploy multiple vmagents. The problem seems to have persisted even though I reverted that change days ago. Could VM just be hanging on to a bad staleness value?

@hagen1778
Collaborator

How is VM calculating staleness?

@uhthomas please see the change in b51436b. It is supposed to address questions like this and other similar issues.

Could VM just be hanging on to a bad staleness value?

It is unlikely. I believe the data just becomes irregular for some reason, which skews staleness detection.

@uhthomas
Contributor Author

I would have to investigate, but I don't believe the data is becoming irregular. vmagent has plenty of resources and there are no reports of scrape errors or dropped metrics. I have a sneaking suspicion that it would resolve itself if I were to reset the vmstorage nodes.

Regardless, this seems to be a major problem that lots of people hit. VM should just work out of the box without this quirk; otherwise I have a hard time recommending it as a Prometheus alternative. I'd really like to work together to figure out how to fix this.

@uhthomas
Contributor Author

So, it looks like setting `search.minStalenessInterval` to `5m` fixed this. I am struggling to understand why this is needed, though. The vmagent is configured to scrape every 30s and there are no reports of any scrape errors or dropped rows.

[screenshot]

The only exception is vm-operator, which I will fix.

[screenshot]

There should be no reason at all for VictoriaMetrics to mark these metrics as stale. I'm happy to provide more info if needed, but from what I can see this is purely a bug in how VictoriaMetrics decides that metrics are stale.

@hagen1778
Collaborator

@uhthomas may I request the following info:

  1. Ensure you're using a single instance of vmagent (for better isolation)
  2. Along with the data gaps on the panel, plot the rate(vm_promscrape_stale_samples_created_total) expression and attach a screenshot
  3. Locate one time series which has gaps when plotted, export it via the /api/v1/export API on a limited time range (approximately a couple of hours where the issue can be reproduced) with the reduce_mem_usage=1 query param, and send it to roman@victoriametrics.com. Feel free to obfuscate sensitive data - I'm interested only in reproducibility

Thanks!

@uhthomas
Contributor Author

Sure, let me set some stuff up and I'll get back to you.

FWIW, the orange in this screenshot is, I believe, with one vmagent and without any changes to min staleness.

[screenshot]

uhthomas added a commit to uhthomas/automata that referenced this issue Jul 26, 2023
@uhthomas
Contributor Author

uhthomas commented Jul 26, 2023

I deployed a test cluster with one vmagent in an isolated namespace and let it run for a few hours.

https://github.com/uhthomas/automata/tree/a025f2b501c9857d7ad92ce4ef5bc42439bd9bf5/k8s/unwind/vm4697

The results show the same gaps in the time series:

[screenshot]

[screenshot]

Stale samples metric:

rate(vm_promscrape_stale_samples_created_total{prometheus="vm4697/vm4697"})

[screenshot]

I've exported the time series as requested and thought it would be easier to upload it to GitHub than email it.

❯ curl http://localhost:8481/select/0/prometheus/api/v1/export -d 'match[]=container_memory_working_set_bytes&reduce_mem_usage=1' > vm4697.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 63.5M    0 63.5M  100    61  50.1M     48  0:00:01  0:00:01 --:--:-- 50.2M

https://github.com/uhthomas/vm4697/blob/fca6deb33972389386e75c0b92fcbabd2275dcb8/vm4697.json

Hope this is helpful.

valyala pushed a commit that referenced this issue Jul 26, 2023
#3878
#4697
#3849
#3451
Signed-off-by: hagen1778 <roman@victoriametrics.com>
@hagen1778
Collaborator

hagen1778 commented Jul 27, 2023

Thanks for the detailed info!
I used the exported data from https://github.com/uhthomas/vm4697/blob/fca6deb33972389386e75c0b92fcbabd2275dcb8/vm4697.json and selected one specific time series with gaps when plotted:


container_memory_working_set_bytes{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cluster="unwind",container="ceph-crash",feature_node_kubernetes_io_cpu_cpuid_AESNI="true",feature_node_kubernetes_io_cpu_cpuid_AVX="true",feature_node_kubernetes_io_cpu_cpuid_AVX2="true",feature_node_kubernetes_io_cpu_cpuid_CMPXCHG8="true",feature_node_kubernetes_io_cpu_cpuid_FMA3="true",feature_node_kubernetes_io_cpu_cpuid_FXSR="true",feature_node_kubernetes_io_cpu_cpuid_FXSROPT="true",feature_node_kubernetes_io_cpu_cpuid_LAHF="true",feature_node_kubernetes_io_cpu_cpuid_MOVBE="true",feature_node_kubernetes_io_cpu_cpuid_OSXSAVE="true",feature_node_kubernetes_io_cpu_cpuid_SYSCALL="true",feature_node_kubernetes_io_cpu_cpuid_SYSEE="true",feature_node_kubernetes_io_cpu_cpuid_VMX="true",feature_node_kubernetes_io_cpu_cpuid_X87="true",feature_node_kubernetes_io_cpu_cpuid_XSAVE="true",feature_node_kubernetes_io_cpu_cpuid_XSAVEOPT="true",feature_node_kubernetes_io_cpu_hardware_multithreading="true",feature_node_kubernetes_io_cpu_model_family="6",feature_node_kubernetes_io_cpu_model_id="63",feature_node_kubernetes_io_cpu_model_vendor_id="Intel",instance="talos-avz-rb5",job="kubelet",namespace="rook-ceph",node="talos-avz-rb5",pod="rook-ceph-crashcollector-talos-avz-rb5-7d57c857db-zhv8m"}
[screenshot]

The gaps are there, indeed. So I checked the raw data via a custom script which goes through the timestamps and compares timestamp N to timestamp N-1. Here's a sample of the script output:

step 0s
step 26s
step 0s
step 18s
step 0s
step 35s

Lines with step 0s basically mean that timestamp N == timestamp N-1, i.e. a duplicate.
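
The exact script isn't important - a rough equivalent over the exported JSON lines (assuming the /api/v1/export format with a millisecond timestamps array) would be:

```sh
# Print the step between consecutive timestamps for the first exported series.
head -n 1 vm4697.json \
  | jq -r '.timestamps as $t | range(1; $t | length) | "step \(($t[.] - $t[. - 1]) / 1000)s"'
```
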
With my own eyes, I can see duplicates in the raw data:
[screenshot]

So apparently there is more than one vmagent or other metrics collector in this setup, and they do produce duplicates. Duplicates, as I mentioned above, skew the staleness detection algorithm:
[screenshot]

I decided to apply -dedup.minScrapeInterval=1ms to remove the duplicates - there are far fewer gaps now:
[screenshot]

But still, there are gaps. Let's check what scrape_interval reports now:
[screenshot]

Now we see that the step fluctuates between 15s and 45s and is still not stable. However, it shows how staleness detection is affected by the raw data VM ingests.

Now let's get back to the question of why the raw data looks like this:

  1. From the exported data it is clear that there is more than one vmagent scraping the same targets. I see you set dedupMinInterval on the storage level, but you should also set it on the vmselect level - see https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#deduplication and the flag sketch after this list.
  2. The rate of rate(vm_promscrape_stale_samples_created_total{prometheus="vm4697/vm4697"}) shows frequent changes. A stale sample is a marker generated by vmagent or Prometheus when a target goes off (becomes unreachable). This marker is needed to skip staleness detection and immediately interrupt the line continuation for the target's metrics. From what I see, the targets are unstable - otherwise the vm_promscrape_stale_samples_created_total metric wouldn't change so frequently. If you don't want vmagent to react to these events, you can disable this behavior by passing the -promscrape.noStaleMarkers flag to vmagent.
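
To make the flag placement concrete, roughly (example values, not taken from your manifests):

```sh
# Deduplication should be configured on both the select and storage layers
# of the cluster; match the value to the scrape interval.
vmstorage -dedup.minScrapeInterval=30s
vmselect  -dedup.minScrapeInterval=30s

# Optional: stop vmagent from writing staleness markers when targets flap.
vmagent -promscrape.noStaleMarkers -promscrape.config=/path/to/scrape.yaml
```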

@uhthomas
Contributor Author

uhthomas commented Jul 27, 2023

This is really interesting, thank you for investigating @hagen1778. I promise there is only one vmagent pushing to vm here. Could there be an issue with vmagent?

Thanks for pointing out the dedupMinInterval, I have this set on both vmstorage and vmselect already and it works okay.

So, to conclude, why is vmagent collecting metrics so inconsistently and producing duplicate data? As I mentioned before there is definitely only one vmagent active with the data I exported. There is only one VMServiceScrape object, and the target discovery from the vmagent shows there are no duplicates.

[screenshot]

@uhthomas
Contributor Author

For what it's worth, I do observe the same behaviour with scrape_interval for that metric.

[screenshot]

Other metrics seem perfect, though...

[screenshot]

@uhthomas
Contributor Author

uhthomas commented Jul 27, 2023

It seems that the cAdvisor metrics are the problem.

NodeScrape config

[screenshot]

The weirdest part is that the results of scrape_interval for individual cAdvisor metrics aren't even consistent. All the graphs look different, and some look perfect.

[screenshot]

It really seems like vmagent is doing something wrong with consuming some cAdvisor metrics?

@hagen1778
Collaborator

hagen1778 commented Jul 27, 2023

Ah! I should have remembered that...
cadvisor sets timestamps on the exported time series, while 99% of exporters return data like the following:

metric1 10 
metric2 25

where the 1st field is the metric name and the 2nd is its value at the moment of the scrape.

cadvisor returns:

metric1 10 1690480863
metric2 25 1690480863

where the 3rd field is the timestamp of when cadvisor last cached the value. vmagent respects this timestamp and stores it as is. The cache time on the cadvisor side is controlled via the housekeeping_interval param.

So if vmagent scrapes cadvisor more frequently than cadvisor updates its cache, you get duplicated values. Or you get an unstable step, because cadvisor's cache updates aren't synchronized with the times when vmagent scrapes it.

In Prometheus this didn't bother you, because Prometheus always sets the staleness interval to 5m; it is static. If you scrape a metric every 5s and then it disappears, Prometheus will continue plotting it for 5m, while VM will stop plotting it after 10s. The opposite happens with scrape intervals > 5m.

So what you can do here is manually set search.minStalenessInterval to a bigger interval, and it will behave like Prometheus. Or ignore cadvisor timestamps by setting honor_timestamps: false in the promscrape config for this target on the vmagent side - see the sketch below.
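
For illustration, the second option in plain Prometheus-style scrape config terms looks roughly like this (job name and target are placeholders for the real cadvisor/kubelet scrape job):

```yaml
scrape_configs:
  - job_name: cadvisor              # placeholder job name
    honor_timestamps: false         # drop the timestamps cadvisor attaches to samples
    static_configs:
      - targets: ["kubelet:10250"]  # placeholder target
```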

@uhthomas
Contributor Author

Yep, this makes sense. I was just looking into the cAdvisor implementation and realised the metrics are exported with explicit timestamps.
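
For reference, the explicit timestamps are visible directly in the raw exposition, e.g. via the API server proxy (a sketch; the node name is just one of my nodes and the path may differ per setup):

```sh
# Fetch cadvisor's raw exposition and show a few samples; the third field on
# each line is the explicit timestamp in milliseconds.
kubectl get --raw "/api/v1/nodes/talos-avz-rb5/proxy/metrics/cadvisor" \
  | grep '^container_memory_working_set_bytes' \
  | head -n 3
```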

I'm glad we were able to figure out what was happening and why, but that leaves the question of what we can do to fix it. I recognise the two solutions you proposed will work, but they are not the default behaviour. VictoriaMetrics should be able to handle this case without any additional configuration.

@hagen1778
Collaborator

hagen1778 commented Jul 27, 2023

VictoriaMetrics should be able to handle this case without any additional configuration.

I see no good way to solve this and I'm open to suggestions.
So far I consider this behavior of cadvisor incorrect. Exposing timestamps in a pull model is bad practice. Consider the following issues:

Should Prometheus ship a lower staleness default, or cadvisor a higher housekeeping default, everything would break. This wouldn't be a problem for anyone if cadvisor just gave up exposing timestamps by default.

I'd be glad to make honor_timestamps default to false on the vmagent side, but this can't be done for compatibility reasons with Prometheus. For now I'm going to document this detail, so other users can figure out the problem faster.

@uhthomas
Contributor Author

I think, unfortunately, custom timestamps are here to stay and we just have to support them correctly. Many other projects also make use of them, for example when forwarding metrics from other sources. Even the official Prometheus exporters do this sometimes.

https://github.com/prometheus/influxdb_exporter/blob/4d655874d2f77c2e00ca09cd3e561cfc92fec126/main.go#L289-L291

I don't think I have enough domain knowledge to help with ideas on how to fix this. The only suggestion I have is to set search.minStalenessInterval to 5m by default, like Prometheus does.

@hagen1778
Collaborator

Many other projects also make use of them, for example when forwarding metrics from other sources

Forwarding metrics from other sources doesn't produce duplicates. It is unlikely that an application will push the same time series with the same timestamp twice.
Dynamic staleness detection was added to VictoriaMetrics specifically because of native push model support: VM had to figure out how to detect stale series coming from various sources with various resolutions. Staleness is calculated for each time series independently, which makes VM usable as a central TSDB that can store and process time series whether they have millisecond resolution (high-frequency data, like that produced by netdata) or hourly resolution (heavy recording rules calculated periodically, or cronjobs).
Of course, Prometheus allows you to configure the staleness interval as well, but it does so globally. As VM is supposed to be used as central storage, there are better approaches to follow.

Even the official Prometheus exporters do this sometimes.

This exporter has it disabled by default. But anyway, this doesn't mean adding timestamps to a pull model is the right thing to do. Take a look at this comment, for example.

The only suggestion I have is to set search.minStalenessInterval to 5m by default, like Prometheus does.

This only hides the problem, doesn't it? You'll still have duplicates with identical timestamps stored in the TSDB. And when you apply rate in Prometheus, it will capture an inconsistent number of data points for the calculation, because the resolution of the metric is spiky. The fact that Prometheus doesn't plot gaps in this case only hides the real problem.

I think we should advocate for changing the cadvisor default behavior in the first place.
I'll try to bring more attention to this topic among the community and my VM teammates, since my opinion could differ from others'.

And thanks for not giving up on this issue and shedding more light on this problem!

valyala added a commit that referenced this issue Jul 29, 2023
valyala added a commit that referenced this issue Jul 29, 2023
@f41gh7
Contributor

f41gh7 commented Jul 29, 2023

@f41gh7 and @Amper, could you check whether cadvisor is scraped with honor_timestamps: false in VictoriaMetrics operator and VictoriaMetrics helm charts? If this isn't the case, then it must be fixed - see this comment for details.

grep -r 'honor_timestamps' helm-charts/charts/ - shows nothing.

@valyala it's not a BUGFIX. It's a breaking change that we must mention in the changelog. It may break users' configurations.

hagen1778 added a commit that referenced this issue Aug 10, 2023
#4697
Signed-off-by: hagen1778 <roman@victoriametrics.com>
@hagen1778
Collaborator

@f41gh7 see the update note here c36259f

valyala added a commit that referenced this issue Aug 11, 2023
valyala pushed a commit that referenced this issue Aug 11, 2023
#4697
Signed-off-by: hagen1778 <roman@victoriametrics.com>
valyala added a commit that referenced this issue Aug 11, 2023
@valyala
Collaborator

valyala commented Aug 12, 2023

FYI, starting from v1.93.0, single-node VictoriaMetrics and vmagent ignore timestamps provided by scrape targets by default, unless the honor_timestamps: true option is set in the corresponding scrape_configs.
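
If some target genuinely needs its self-reported timestamps to be kept, the old behaviour can be restored per scrape job, e.g. (job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: needs-own-timestamps   # placeholder job name
    honor_timestamps: true           # keep timestamps provided by the target
    static_configs:
      - targets: ["exporter:9100"]   # placeholder target
```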

This should remove gaps from graphs built from metrics obtained from scrape targets which improperly set timestamps on the exported metrics (such as cadvisor). It should also reduce disk space usage for such metrics and increase query performance. Closing the issue as fixed, then.

@valyala valyala closed this as completed Aug 12, 2023
valyala added a commit that referenced this issue Aug 12, 2023
… `honor_timestamps: true` is set explicitly

This fixes the case with gaps for metrics collected from cadvisor,
which exports invalid timestamps, which break staleness detection at VictoriaMetrics side.

See #4697 ,
#4697 (comment)
and #4697 (comment)

Updates #1773
valyala added a commit that referenced this issue Aug 12, 2023
@valyala
Collaborator

valyala commented Aug 12, 2023

FYI, the bugfix described here has been backported to the v1.87.7 long-term support release.

@GrahamDennis

GrahamDennis commented Sep 24, 2023

I know this issue is closed, but I hit a similar issue myself and actually root-caused it to vmagent inserting staleness markers if a series didn't have new data every scrape. Passing -promscrape.noStaleMarkers to vmagent resolved the problem for me. This feels like a better solution here (or making vmagent a little less aggressive at inserting staleness markers by default) than discarding timestamps from source systems?

@valyala
Collaborator

valyala commented Oct 3, 2023

This feels like a better solution here (or making vmagent a little less aggressive at inserting staleness markers by default) than discarding timestamps from source systems?

The staleness markers logic isn't related to the handling of timestamps provided by scrape targets (aka the honor_timestamps option). Currently vmagent follows the Prometheus logic for handling staleness markers, as described here. This logic could probably be made less aggressive by default.

@uhthomas
Contributor Author

Hmm, I still see some strangeness with v1.95.1. I am using VMSingle and don't see any reason this would happen. There is a single vmagent.

[screenshot]

[screenshot]

@uhthomas
Contributor Author

These metrics don't seem to export timestamps.

[screenshot]

@hagen1778
Collaborator

Hey @uhthomas! Could this be related to #5388?
Could you try running your queries with the cache disabled? Or try upgrading vmselect to 1.96?

@uhthomas
Contributor Author

@hagen1778 Should we discuss on #5576?

I am running vmsingle, so vmselect is not relevant here.

@hagen1778
Collaborator

I am running vmsingle, so vmselect is not relevant here.

vmsingle reuses the same codebase as cluster. So both versions were vulnerable to the mentioned bug.

@uhthomas
Contributor Author

I am running vmsingle, so vmselect is not relevant here.

vmsingle reuses the same codebase as cluster. So both versions were vulnerable to the mentioned bug.

Thanks, I will try v1.96.0 later. However, the vmsingle instance is not restarting at all during those times, so I am not confident the issue will be fixed.

@hagen1778
Collaborator

Thanks, I will try v1.96.0 later

You can try it on 1.95.1 with cache disabled (via GET param or by using vmui).
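
For example, something like this (a sketch; assuming the nocache=1 query arg is supported, and adjusting the URL, query and time range to your setup):

```sh
# Re-run the same range query with the rollup result cache disabled.
curl -G 'http://vmsingle:8428/api/v1/query_range' \
  --data-urlencode 'query=container_memory_working_set_bytes' \
  --data-urlencode 'start=2023-12-01T00:00:00Z' \
  --data-urlencode 'end=2023-12-01T06:00:00Z' \
  --data-urlencode 'step=30s' \
  --data-urlencode 'nocache=1'
```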

the vmsingle instance is not restarting at all

The mentioned bug isn't related to restarts; that was a coincidence.

@uhthomas
Contributor Author

Thanks, I will try v1.96.0 later

You can try it on 1.95.1 with cache disabled (via GET param or by using vmui).

the vmsingle instance is not restarting at all

The mentioned bug isn't related to restarts; that was a coincidence.

Ah, this does seem to fix it. Thank you. I will try the new version to verify later also.

[screenshot]

[screenshot]
