Vmstorage Dedup handles NaN #5587
Comments
See #5587 Signed-off-by: hagen1778 <roman@victoriametrics.com>
Hello @okzheng and thank you for a good question! This is expected behavior: a Prometheus stale marker should be treated like any other value. Consider a case where vmagent is configured to scrape a target with a 1m interval, and on the 4th minute vmagent reports a stale marker for this target. If the user then configures a deduplication interval of 5m, the stale marker should be preserved as the last value in the interval. As a workaround for your case, I'd propose disabling stale markers on the vmagent side.
Thanks for your persuasive explanation and proposal! @hagen1778
Is your question related to a specific component?
vmstorage
Describe the question in detail
I have 2 vmagent pods in a k8s deployment and these pods are configured to scrape metrics from the same targets, which include a service backed by two pods. A ServiceMonitor has been set up to discover and scrape these metrics.
At the same time, I configured relabeling in the ServiceMonitor to drop the labels that differ between the two vmagent pods, so the metrics scraped by both of them carry the same label set.
However, I have observed an issue where random gaps appear in the metrics during a rolling update of the service.
I suspect that when a pod is deleted, its metrics are marked as stale by vmagent, and that the stale-marker values and the metrics collected from the healthy pod are not handled correctly by the deduplication logic.
Reading the deduplication function in lib/storage/dedup.go, I don't find any special handling of NaN. When NaN is compared with a regular value, the preserved value may depend on the order of arrival. This contradicts the design intent (preserving the maximum value when the timestamps are the same).
I'm not sure if my understanding is wrong or if it is the expected behavior.
Thanks for explaining it.
Troubleshooting docs