sum(rate(metricselector[rangevector])) over-represents short lived timeseries #1215
Comments
That's one of the best reports I've ever seen! Thank you!
TL;DR: VictoriaMetrics calculates … Let's look at the following time series:

It contains … VictoriaMetrics uses the following calculations:

where:

So VictoriaMetrics calculates … Prometheus uses a similar formula for …

Side note: Prometheus uses the same algorithm for … I'd recommend using …
Thank you for looking at this bug, much appreciated. I don't believe it is due to the extrapolation change, which we don't really have a view on, and which I think can be kept without the problem occurring. I think the tl;dr from the VM perspective is:
So far so good, except that this means that for range vectors with relatively stable metrics and short-lived unstable metrics we cannot combine them, and this matters because we want to put recording rules in place around this to support SLO calculations, where we need the same units for all the covered metrics. Mathematically … In Prometheus … What would happen if, instead, VictoriaMetrics took …?
I think this would not re-introduce the interpolation that is such a hot topic here, while correcting the dimensionality issue. There's probably another underlying issue in …

Regarding the Percona blog post - it's an interesting post, to be sure; when I read it the first time I wondered, and revisiting it I still do: fairly basic information theory tells us that we can't reconstruct a frequency higher than the Nyquist rate, so for a gauge any lookback window less than 2x the sample interval is just fabricating results: some sort of fitting to the actual data has to be taking place, but there can be an arbitrary polynomial underlying it, so.... I'd love to have a function to provide that inferred metric, and then calculate … For counters a somewhat better story can be made, because negative values aren't permitted, but we're still inferring that no spike + reset occurred whenever we fill in data, so smoothing data on input to … carries the same inference.
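To make the dimensional point concrete, here is a minimal Python sketch contrasting the two denominators under discussion. The exact proposal text is truncated above; the reading here, supported by the next comment's "Prometheus always uses d", is: keep the engine's `increase()` calculation but always divide by the full window duration d. Function names and inputs are illustrative, not any real API:

```python
# Hedged sketch, not VictoriaMetrics' actual code: contrasting two possible
# denominators for rate() over a range vector of duration d.

def rate_over_sample_span(increase, first_ts, last_ts):
    # Behaviour reported in this issue: dividing by the span actually covered
    # by the samples over-weights series that exist for only part of the window.
    return increase / (last_ts - first_ts)

def rate_over_window(increase, window_seconds):
    # Prometheus-like (and, as read here, the proposed) denominator: always
    # the full range-vector duration d, keeping units comparable across series.
    return increase / window_seconds

# The 1-minute-lived pod from the example in the report, inside a 1000m window:
print(rate_over_sample_span(60_000, 0, 60))   # 1000.0 per second
print(rate_over_window(60_000, 1000 * 60))    # 1.0 per second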
The key thing we're looking for is keeping a comparable denominator in SLOs: this requires a recording rule that records the `rate` - the unit-change-per-second - of the denominator and the numerator of the separate clusters, which we can then combine. If `rate` behaves as this bug report demonstrates - with multiple-order-of-magnitude swings - the denominators are not consistent and the fractions cannot be compared.
I agree that Prometheus always uses d :), and extrapolates at the edges (specifically, up to `1.1 * sampleInterval/(points-1)`) - and sampleInterval is `t6-t1` in your example above: this won't affect the time series we're worried about here, since they are ones in the middle of the range, so we can discard that concern.
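For readers following along, here is a small Python sketch of that extrapolation clamp - a simplified rendering of Prometheus's `extrapolatedRate`; counter resets and the counter-specific zero clamp are ignored, so treat it as illustrative only:

```python
def prom_rate(samples, range_start, range_end):
    """samples: sorted (timestamp_seconds, value) pairs; needs >= 2 points."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    sampled_interval = t_last - t_first
    avg_between = sampled_interval / (len(samples) - 1)
    threshold = 1.1 * avg_between  # the `1.1 * sampleInterval/(points-1)` clamp

    extrapolate_to = sampled_interval
    for gap in (t_first - range_start, range_end - t_last):
        # Extrapolate all the way to a window edge only when the edge is close
        # to a sample; otherwise assume the series started/ended and extend by
        # half the average interval (the `averageDurationBetweenSamples/2` term).
        extrapolate_to += gap if gap < threshold else avg_between / 2

    increase = (v_last - v_first) * extrapolate_to / sampled_interval
    return increase / (range_end - range_start)
```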
And yes, fewer than 2 samples means no `rate` - in the Prometheus model (functions must execute within the range vector given, for performance) that seems logical, and it's great that VictoriaMetrics has a more capable execution model, where a rate with samples adjacent to the range vector can be utilised to get a correct result. The Shannon sampling theorem will kick in at some point though, but since we sample at a higher rate than 1/2 the range vectors we are using, that's not a problem for us (though I do wonder at folks who think they get accurate results when the laws of physics are in their way....)
Let's consider, though, the following table, using … (mentally modelled - I haven't done a backfill test of this, though I can if you think different results would be obtained)... this assumes the duration to {start,end} is ~= 0 for t=10 and 11, so we have just `averageDurationBetweenSamples/2` applying for the most part.

So, regarding Prometheus returning incomplete or incorrect results for increase, in the case of a time series that only exists for part of the time range - say increase[6]@t=6 - we can see that it returns a mathematically plausible result, and the only time it might not is when the first sample is within that …

So at 2 points right near d the maximum error is substantial, but capped at 3.2x: not orders of magnitude. The error for a … For …

or ~2%. So, out, but not terribly so. Drop the interval closer to the sampling rate, and it will get worse rapidly; expand the interval and it will get better.
The proposed solution looks interesting: …

Unfortunately, the solution returns different results for time series starting inside the time range …

Re the Percona blog post - they struggle with the usability and consistency issues in Prometheus. Many users are surprised when part of their graphs disappears after zooming in, while the rest of the graphs remain visible. As you could guess, the disappearing graphs use …

TL;DR: Prometheus and VictoriaMetrics have different … As I understood, …
Update: the Prometheus folks decided to fix …
It is great to see the …
I think I just encountered this exact issue @rbtcollins. Thanks a lot for reporting it in such detail; it helped a lot. In my case, the scraping job for the Kubernetes apiservers was renamed, resulting in duplicated metrics for ~1min. Apart from the spike at 10:15, the two series perfectly coincide. I will be going with …
I think we are experiencing the same problem described here, but we are using …

We are using version v1.95.1.

We have a metrics endpoint for an application that is scraped by Prometheus and by VictoriaMetrics. This is how they compare, on the same PromQL: …

Prometheus counts too low, due to the lack of a baseline with a value of 0. With some experimentation we found that adding a colon … Again, I have no idea why. The query then looks like this: …
Hey @jenskloster! Could you try following https://docs.victoriametrics.com/Troubleshooting.html#unexpected-query-results and locating the exact series producing the most outlying result? If yes, can you show how it looks on a range graph? Can you share its raw data retrieved via /api/v1/export?
Hi @hagen1778. This is the query we perform (copy-pasted from Grafana): …

This produces the result of …
I'm sorry for the late reply @jenskloster! And thanks for the provided data! I've checked the data and noticed some time series have null values: …

Moreover, this specific counter starts to fluctuate afterwards: …

A counter should always grow or remain at the same value. This specific time series behaves like a gauge. Do you know why that is?
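For anyone hitting the same symptom, a hedged Python sketch of the kind of check applied here: scan the raw exported samples for negative deltas, which a true counter should never have between resets (assuming one sorted (timestamp, value) list per series, as /api/v1/export provides):

```python
def non_monotonic_points(samples):
    """samples: sorted (timestamp, value) pairs for a single series."""
    # A counter must be non-decreasing; negative deltas mean the series
    # behaves like a gauge (or mixes duplicate sources / unhandled resets).
    return [
        (t_prev, v_prev, t, v)
        for (t_prev, v_prev), (t, v) in zip(samples, samples[1:])
        if v < v_prev
    ]
```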
Thanks @hagen1778, based on what you found, our data is the problem - not your product :)
Closing this issue, since the …
We ran into a problem with SLI/SLO calculation using

```
sum(rate(good-selector[window])) / sum(rate(valid-selector[window]))
```

where `window` is constant for any one calculation, but we'd either use it for a 28d rolling calculation for error budget management, or use the same formula with a narrower window for e.g. burn-rate alerts.

For one of our services that needs to report SLIs from the application itself, rather than from the relatively long-lived load balancer metrics, we found a discrepancy where (relatively) short-lived metrics from each service pod resulted in skewed calculations.
Mathematically in PromQL (and VictoriaMetrics defers to the PromQL docs for its definition of rate insofar as this goes), `rate()` is defined as the 'average change per second of the metric over the range vector', that is:

```
increase(metric) / rangevector.duration.as_seconds()
```

This is obviously only one possible definition. However, this definition has the important property that

```
increase(metricA[windowT]) > increase(metricB[windowT])
  -> rate(metricA[windowT]) > rate(metricB[windowT])
```

and conversely as well. This doesn't seem to be the case with the current implementation in VictoriaMetrics.
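A tiny Python illustration of why that property holds under the definition above: with the window fixed, rate is increase scaled by a shared constant, so ordering is preserved (the numbers are made up):

```python
window_seconds = 300.0  # a hypothetical fixed [5m] range vector

def rate_from_increase(increase):
    # rate = increase / window: a monotone function of increase.
    return increase / window_seconds

increase_a, increase_b = 1200.0, 900.0  # hypothetical increases over the window
assert increase_a > increase_b
assert rate_from_increase(increase_a) > rate_from_increase(increase_b)
```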
Concretely, consider: given 15 long-lived healthy pods for a service with a constant rate R each, and a single new unhealthy pod that lives for a total of 1 minute and generates 1000x the normal rate during that period:
```
sum(rate(all16metrics[1m]))
```

at the end of that bad pod's lifetime will return 1015R in both the Prometheus and the VictoriaMetrics models: the 15 pods that are stable and exist beyond the range vector contribute 15R; the bad pod contributes an increase of 1000R * 60 seconds / 60.

```
sum(rate(all16metrics[1000m]))
```

at the end of that bad pod's lifetime will return different results, however.

For Prometheus: the bad pod's increase of 1000R * 60s is divided by the full 1000m window, so it contributes ~R and the sum is ~16R.

For VictoriaMetrics: the bad pod's rate is computed over roughly its own 1-minute lifetime, so it contributes ~1000R and the sum is ~1015R.
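Working the arithmetic of that example through in Python (R normalised to 1; the two denominators reflect the behaviours described above):

```python
R = 1.0
window_s = 1000 * 60                 # the [1000m] range vector, in seconds
bad_pod_increase = 1000 * R * 60     # 1000x the normal rate, for 60 seconds

# Dividing the bad pod's increase by the full window (Prometheus-style):
prometheus_sum = 15 * R + bad_pod_increase / window_s   # 16.0 -> ~16R
# Dividing by the bad pod's own ~60s lifetime (behaviour reported here):
victoriametrics_sum = 15 * R + bad_pod_increase / 60    # 1015.0 -> ~1015R

print(prometheus_sum, victoriametrics_sum)
```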
To Reproduce

Attached is a sample set of time series which were convenient when I was isolating the problem. The `metrics.json` file is a VictoriaMetrics JSON export; the `metrics.tsdb` file is an OpenMetrics exposition suitable for backfilling Prometheus with, to verify the difference in behaviour.

metrics.zip
Expected behavior

On the supplied data:

```
sum(increase(envoy_listener_http_downstream_rq_xx{envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0.0.0.0_8443",envoy_response_code_class="5",kubernetes_cluster!="cognitedata-development", service="ambassador", kubernetes_cluster="azure-dev"}[180d]))
```

= 873112.3640387581

```
sum(rate(envoy_listener_http_downstream_rq_xx{envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0.0.0.0_8443",envoy_response_code_class="5",kubernetes_cluster!="cognitedata-development", service="ambassador", kubernetes_cluster="azure-dev"}[180d]))
```

= 0.056141484313191756

(873112.3640387581 / 15552000 = 0.056141484313191756, where 15552000 is 180d in seconds)
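A one-off sanity check of that division in Python (180d = 15,552,000 seconds):

```python
window_s = 180 * 24 * 3600      # 15_552_000 seconds in 180d
increase = 873112.3640387581
print(increase / window_s)      # ~0.056141484313191756
```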
But in VictoriaMetrics we get 1.2... for the rate result.
Version
1.58.0