sum(rate(metricselector[rangevector])) over-represents short lived timeseries #1215

Closed
rbtcollins opened this issue Apr 14, 2021 · 14 comments
Labels: question

@rbtcollins commented Apr 14, 2021

We ran into a problem with SLI/SLO calculation using sum(rate(good-selector[window]))/sum(rate(valid-selector[window])), where window is constant for any one calculation, but we'd either use it for a 28d rolling calculation for error budget management, or use the same formula over a narrower window for e.g. burn-rate alerts.

For one of our services that needs to report SLIs from the application itself, rather than from the relatively long-lived load balancer metrics, we found a discrepancy: (relatively) short-lived metrics from each service pod resulted in skewed calculations.

Mathematically, in PromQL (and VictoriaMetrics defers to the PromQL docs for its definition of rate insofar as this goes), rate() is defined as the 'average change per second of the metric over the range vector': that is, increase(metric)/rangevector.duration.as_seconds().

This is obviously only one possible definition. However, this definition has the important property that increase(metricA[windowT]) > increase(metricB[windowT]) -> rate(metricA[windowT]) > rate(metricB[windowT]), and conversely as well.

This doesn't seem to be the case with the current implementation in VictoriaMetrics.

Concretely, consider: given 15 long-lived healthy pods for a service, each with a constant rate R, and a single new unhealthy pod that lives for a total of 1 minute and generates 1000x the normal rate during that period.
sum(rate(all16metrics[1m])) at the end of that bad pod's lifetime will return 1015R in both the Prometheus model and the VictoriaMetrics model: the 15 pods that are stable and exist beyond the range vector contribute 15R; the bad pod contributes 1000R * 60s / 60s = 1000R.

sum(rate(all16metrics[1000m])) at the end of that bad pod's lifetime will, however, return different results.
For Prometheus:

  • the rate of the stable pods is constant, so 15R.
  • The short-lived pod's rate is calculated over the entire range vector, so 1000R * 60s / 60000s = 1R.
  • for a total of 16R.

For VictoriaMetrics:

  • the rate of the stable pods is constant, so 15R.
  • The short-lived pod's rate is calculated from the start of the metric to the end of the metric, so 1000R * 60s / 60s = 1000R.
  • for a total of 1015R (a quick numeric check of this arithmetic follows below).
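For illustration, here is a quick back-of-the-envelope check of the arithmetic above in Python. It is a simplified model only (it ignores extrapolation, scrape jitter and counter resets); R, the pod counts and the windows are the numbers from the example:

# 15 stable pods at rate R, plus one bad pod that lives for 60 seconds at 1000*R.
R = 1.0
bad_pod_increase = 1000 * R * 60   # total increase produced by the bad pod
bad_pod_lifetime = 60              # seconds the bad pod's series exists

def prom_style_rate(increase, window_seconds):
    # Prometheus-style: divide by the full lookbehind window.
    return increase / window_seconds

def vm_style_rate(increase, series_seconds):
    # VictoriaMetrics-style: divide by the span the series actually covers.
    return increase / series_seconds

for window_minutes in (1, 1000):
    w = window_minutes * 60
    prom_total = 15 * R + prom_style_rate(bad_pod_increase, w)
    vm_total = 15 * R + vm_style_rate(bad_pod_increase, bad_pod_lifetime)
    print(f"[{window_minutes}m] Prometheus ~{prom_total}R, VictoriaMetrics ~{vm_total}R")
# [1m]    -> Prometheus ~1015R, VictoriaMetrics ~1015R
# [1000m] -> Prometheus ~16R,   VictoriaMetrics ~1015R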

To Reproduce
Attached is a sample set of timeseries which were convenient when I was isolating the problem. The metrics.json file is a VictoriaMetrics JSON export; metrics.tsdb is an OpenMetrics exposition suitable for backfilling Prometheus to verify the difference in behaviour.
metrics.zip

Expected behavior
On the supplied data:
sum(increase(envoy_listener_http_downstream_rq_xx{envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0.0.0.0_8443",envoy_response_code_class="5",kubernetes_cluster!="cognitedata-development", service="ambassador", kubernetes_cluster="azure-dev"}[180d]))
= 873112.3640387581
sum(rate(envoy_listener_http_downstream_rq_xx{envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0.0.0.0_8443",envoy_response_code_class="5",kubernetes_cluster!="cognitedata-development", service="ambassador", kubernetes_cluster="azure-dev"}[180d]))
= 0.056141484313191756

873112.3640387581/15552000 = 0.056141484313191756
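(As a quick Python sanity check, assuming 180d = 180 * 24 * 3600 seconds, the Prometheus rate is exactly the reported increase divided by the window length:)

seconds_180d = 180 * 24 * 3600
print(seconds_180d)                         # 15552000
print(873112.3640387581 / seconds_180d)     # ~0.0561414843..., matching the rate above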

But in Victoria Metrics we get 1.2... for the rate result.

Screenshots

Version
1.58.0

@hagen1778 added the question label on Apr 22, 2021
@hagen1778 (Collaborator)

That's one of the best reports I've ever seen! Thank you!
I lack knowledge of the metricsql engine, so I'll defer to @valyala for help.

@valyala self-assigned this on Apr 30, 2021
@valyala (Collaborator) commented May 28, 2021

TL;DR: VictoriaMetrics calculates rate() and increase() slightly differently than Prometheus because Prometheus may return unexpected results in some cases mentioned in the MetricsQL docs.

Let's look at the following time series m:

v1   v2   v3   v4   v5   v6
t1   t2   t3   t4   t5   t6
        ^             ^
        |  <-- d -->  |
                      t

It contains (v_i, t_i) samples. Let's calculate rate(m[d]) at a timestamp t with the lookbehind window d.

VictoriaMetrics uses the following calculations:

 rate(m[d]) = (vCurr - vPrev) / (tCurr - tPrev)

where:

  • vCurr is the real sample value for the closest point before the given timestamp t. This is v5 in the example above.
  • vPrev is the real sample value for the closest point before the timestamp t - d. This is v2 in the example above.
  • tCurr is the timestamp for vCurr. This is t5 in the example above.
  • tPrev is the timestamp for vPrev. This is t2 in the example above.

So VictoriaMetrics calculates rate(m[d]) at a timestamp t as (v5 - v2) / (t5 - t2). If a time series starts in the middle of (t - d ... t], then the first sample from the time series is used as (vPrev, tPrev). For example, if the v1 and v2 samples are missing, then the rate is calculated as (v5 - v3) / (t5 - t3). This allows returning the expected (real) rate starting from the beginning of the time series, and this rate doesn't depend on d. Prometheus in this case returns an incorrect, constantly increasing rate on the time range [t3 ... t3 + d] and starts returning the correct rate only after t3 + d.
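For illustration only, a minimal Python sketch of the selection rule described above (this is not the MetricsQL source; it assumes samples are sorted by timestamp, uses "at or before" where the text says "before", and ignores counter resets and staleness handling):

from bisect import bisect_right

def vm_style_rate(samples, t, d):
    # samples: list of (timestamp, value) tuples sorted by timestamp.
    ts = [s[0] for s in samples]
    i_curr = bisect_right(ts, t) - 1      # closest sample at/before t      -> (t5, v5) above
    i_prev = bisect_right(ts, t - d) - 1  # closest sample at/before t - d  -> (t2, v2) above
    if i_prev < 0:
        i_prev = 0                        # series starts inside (t-d ... t]: use its first sample
    if i_curr <= i_prev:
        return None                       # fewer than two usable samples
    t_prev, v_prev = samples[i_prev]
    t_curr, v_curr = samples[i_curr]
    return (v_curr - v_prev) / (t_curr - t_prev)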

Prometheus uses similar formula for rate() calculations with the following exceptions:

  • It uses the first real sample on the time range (t-d ... t] for vPrev selection, so it selects v3 in the example above. This makes it impossible to calculate rate() on (t-d ... t] intervals with fewer than two samples, so Prometheus returns an empty graph if d is smaller than 2x the scrape_interval. It also completely misses series changes between v2 and v3. This is not what most users expect. See https://www.percona.com/blog/2020/02/28/better-prometheus-rate-function-with-victoriametrics/ for details.

  • It may extrapolate vCurr and vPrev under certain conditions, which can change the result in an unexpected way.

  • It always uses d instead of (tCurr - tPrev), i.e. rate(m[d]) = (vCurr - vPrev) / d. This obviously leads to unexpected, incorrect results if a time series starts in the middle of the (t-d ... t] interval.

Side note - Prometheus uses the same algorithm for increase() calculations except for the last step. This means that Prometheus returns incomplete / unexpected / incorrect results from increase() in the majority of cases. See prometheus/prometheus#3746 . See the corresponding source code.
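For comparison, a rough Python sketch of the Prometheus-style calculation described above (a simplification of Prometheus' extrapolated rate, not the actual code; counter resets and the counter zero-point clamp are deliberately omitted):

def prom_style_rate(samples, t, d, is_rate=True):
    # samples: (timestamp, value) tuples sorted by timestamp.
    window = [(ts, v) for ts, v in samples if t - d <= ts <= t]
    if len(window) < 2:
        return None                                  # Prometheus returns nothing here
    (t_first, v_first), (t_last, v_last) = window[0], window[-1]
    result = v_last - v_first                        # raw increase between first and last sample
    sampled_interval = t_last - t_first
    avg_between = sampled_interval / (len(window) - 1)
    threshold = 1.1 * avg_between                    # extrapolation threshold
    duration_to_start = t_first - (t - d)
    duration_to_end = t - t_last
    extrapolate_to = sampled_interval
    extrapolate_to += duration_to_start if duration_to_start < threshold else avg_between / 2
    extrapolate_to += duration_to_end if duration_to_end < threshold else avg_between / 2
    result *= extrapolate_to / sampled_interval      # extrapolate towards the window edges
    return result / d if is_rate else result         # rate() always divides by the full window d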

I'd recommend using increase() instead of rate() in the original query, i.e. substituting sum(rate(...)) with sum(increase(...)). This should give the expected results. The increase() function works OK in most cases, but sometimes it may return unexpected results; for instance, if a time series starts with a big value and you want to catch it in the increase() results. In that case, just use the increase_pure() function from MetricsQL.

@rbtcollins (Author)

Thank you for looking at this bug, much appreciated. I don't believe the problem is due to the extrapolation change, which we don't really have a view on and which I think can be kept without the problem occurring.

I think the tl;dr from the VM perspective is:

  • with rate(), metrics are always divided by a divisor that represents the period of time the data was gathered over, always in units/second, with some caveats around cases where a metric exists for longer than the range vector (d in your description).

So far so good, except that this means that, for range vectors covering relatively stable metrics and short-lived unstable metrics, we cannot combine them; and this matters because we want to put recording rules in place around this to support SLO calculations, where we need the same units for all the covered metrics.

Mathematically, sum(v@t-28days..v@t)/28-days-in-seconds does not have the same dimension as sum(v@t-5days..v@t-2days)/3-days-in-seconds, and the rate definition in VictoriaMetrics produces both of these dimensions in a single function call.

In Prometheus sum_over_time(irate(m[28d]))/count_over_time(irate(m[28d])) probably has the same dimension as rate in VictoriaMetrics.

What would happen if instead, VictoriaMetrics took

rate(m[d]) = (tCurr - tPrev) > d  ? (vCurr - vPrev) / (tCurr - tPrev) : increase(m[d])/d

I think this would not re-introduce the extrapolation that is such a hot topic here, while correcting the dimensionality issue.
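To make the proposal concrete, here is a rough Python sketch of that hybrid (hypothetical, not implemented anywhere: it keeps the VictoriaMetrics-style sample selection, but normalises a series that only covers part of the lookbehind window by the full window d, so every series in a sum() shares the same denominator):

def hybrid_rate(samples, t, d):
    # samples: (timestamp, value) tuples sorted by timestamp.
    window = [(ts, v) for ts, v in samples if ts <= t]
    if not window:
        return None
    prev = [(ts, v) for ts, v in window if ts <= t - d]
    t_prev, v_prev = prev[-1] if prev else window[0]   # fall back to the series' first sample
    t_curr, v_curr = window[-1]
    if t_curr == t_prev:
        return None
    if t_curr - t_prev > d:
        return (v_curr - v_prev) / (t_curr - t_prev)   # series spans the whole window: VM behaviour
    return (v_curr - v_prev) / d                       # series starts inside the window: divide by d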

There's probably another underlying issue in increase but I haven't gotten close enough to your redefinition of it to evaluate the situation.

Regarding the Percona blog post - it's an interesting post, to be sure; when I read it the first time I wondered, and revisiting it I still do: fairly basic information theory tells us that we can't reconstruct a frequency higher than the Nyquist rate, so for a gauge any lookback window less than 2x the sample interval is just fabricating results: some sort of fitting to the actual data has to be taking place, but the underlying signal could be an arbitrary polynomial, so... I'd love to have a function that provides that inferred metric, and then calculate rate on it, so that users have control over speculation vs concrete behaviour.

For counters a somewhat better story can be made, because negative values aren't permitted, but we're still inferring that no spike + reset occurred whenever we fill in data, so smoothing data on input to rate() seems risky. I'm not entirely clear what problem Percona had, but it seems like clamping the upper resolution of the range vector to 2x the sample interval would guarantee a function result as long as Prometheus had data, while still permitting arbitrarily fine-grained zoom-in of the graph: it's still well defined even if you zoom in so far that only 2 samples are shown in the window.

The rambles below are really about cross-checking my understanding of the extrapolation error magnitude - I hope that's ok!

The key thing we're looking for is keeping a comparable denominator in SLOs: this requires a recording rule that records the `rate` - the unit-change-per-second of the denominator and the numerator, of the separate clusters - which we can then combine. If `rate` behaves as this bug report demonstrates - with multiple-order-of-magnitude swings - the denominators are not consistent and the fractions cannot be compared.

I agree that Prometheus always uses d :), and extrapolates at the edges (specifically, by up to `1.1 * sampledInterval/(points-1)`, where sampledInterval is `t6-t1` in your example above). This won't affect the time series we're worried about here, since they are ones in the middle of the range, so we can discard that concern. And yes, fewer than 2 samples means no `rate` - in the Prometheus model (functions must execute within the range vector given, for performance) that seems logical, and it's great that VictoriaMetrics has a more capable execution model, where a rate with samples adjacent to the range vector can be utilised to get a correct result. The Shannon sampling theorem will kick in at some point, but since we sample at a higher rate than 1/2 the range vectors we are using, that's not a problem for us (though I do wonder at folk who think they get accurate results when the laws of physics are in their way...).

Let's consider, though, the following table (mentally modelled - I haven't done a backfill test of this, though I can if you think different results would be obtained). This assumes the duration to {start,end} is ~= 0 for t=10 and 11, and so for the most part we just have `averageDurationBetweenSamples/2` applying.
t    v   t@d   increase[6]               rate[6]
0    -   -6    -                         -
1    -   -5    -                         -
2    -   -4    -                         -
3    -   -3    -                         -
4    0   -2    -                         -
5    1   -1    = (1-0)*1.5/1 = 1.5       = 1.5/6
6    2    0    = (2-0)*2.5/2 = 2.5       = 2.5/6
7    3    1    = (3-0)*3.5/3 = 3.5       = 3.5/6
8    4    2    = (4-0)*4.5/4 = 4.5       = 4.5/6
9    5    3    = (5-0)*5.5/5 = 5.5       = 5.5/6
10   6    4    = (6-0)*6/6   = 6         = 6/6
11   7    5    = (7-1)*6/6   = 6         = 6/6

So, regarding Prometheus returning incomplete or incorrect results for increase: in the case of a timeseries that only exists for part of the time range - say increase[6]@t=6 - we can see that it returns a mathematically plausible result, and the only time it might not is when the first sample is within the 1.1 * averageDurationBetweenSamples extrapolation threshold of the edge of d; and there the error is bounded:

newIncrease = resultValue * (extrapolateToInterval / sampledInterval)
            = resultValue * (extrapolateToInterval / (tlast - tfirst))

# worst case: durationToStart and durationToEnd are both just under the
# extrapolation threshold of 1.1 * averageDurationBetweenSamples, which is
# larger than averageDurationBetweenSamples/2, so both edges get padded by it

            = resultValue * ((tlast - tfirst) + 2 * 1.1 * (tlast - tfirst)/(len(points) - 1)) / (tlast - tfirst)
            = resultValue * ((len(points) - 1) * (tlast - tfirst) + 2.2 * (tlast - tfirst)) / ((len(points) - 1) * (tlast - tfirst))
            = resultValue * (len(points) + 1.2) / (len(points) - 1)

errRatio = newIncrease / resultValue = (len(points) + 1.2) / (len(points) - 1)
errRatio at len(points) = 2:  (2 + 1.2) / 1 = 3.2
lim errRatio as len(points) -> ∞  = 1

So at 2 points right near d the maximum error is substantial, but capped at 3.2x: not orders of magnitude.
As the number of points increases, the error ratio approaches 1.
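A quick numeric check of that bound, just evaluating the errRatio expression derived above:

def err_ratio(n_points):
    # Worst case: both window edges are padded by just under the
    # 1.1 * averageDurationBetweenSamples extrapolation threshold.
    return (n_points + 1.2) / (n_points - 1)

for n in (2, 3, 5, 10, 100):
    print(n, round(err_ratio(n), 3))
# 2 -> 3.2, 3 -> 2.1, 5 -> 1.55, 10 -> 1.244, 100 -> 1.022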

The error for a rate is then obviously tied to the sampling frequency: the maximum error occurs with [d] and 2 samples whose distance to the window borders is within 1.1 times the distance between the samples, i.e. while d < 3.2 * (tlast - tfirst). With a 15s scrape frequency, this can occur at a d as small as [45s].

For rate on [1m], probably fairly common: 4 points (3 intervals) spanning tlast - tfirst = 45s, with an average inter-sample gap of 15s. Start and end will both be within the threshold, so the extrapolation extends to the full window:

increaseErrRatio = 1.33                                     # extrapolateToInterval = 60s; 60/45 = 4/3
effective rate   = (4/3) * increase / 60 = increase / 45    # vs the nominal increase / 60

or ~2.2% of the increase per second instead of ~1.7%.

So the result is off, but not terribly so.

Drop the interval closer to the sampling rate, and it will get worse rapidly; expand the interval and it will get better.

@valyala (Collaborator) commented Jun 9, 2021

The proposed solution looks interesting:

rate(m[d]) = (tCurr - tPrev) > d  ? (vCurr - vPrev) / (tCurr - tPrev) : increase(m[d])/d

Unfortunately, the proposed solution returns different results for time series starting inside the time range (t-d ... t]; the results depend on both the d and t values. For example, if a time series m has the values V(t_i) = K * t_i, while t0 belongs to (t-d ... t], then rate(m[d]) won't match K on the time range (t0 ... t0+d], as most users expect. Instead, it will vary from 0 to K on this time range. This may result in false alerts for rate(m[d]) < threshold. It may also result in hard-to-understand graphs.
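For illustration, a tiny numeric example of that objection (hypothetical numbers; the "divide by d" rule is the hybrid proposed above):

# A counter m with V(t_i) = K * t_i that first appears at t0, evaluated with
# the proposed rule: divide by d while the series is younger than the window.
K, d, t0 = 2.0, 300, 1000
for age in (30, 60, 150, 300, 600):   # seconds since the series appeared
    t = t0 + age
    increase = K * age                # V(t) - V(t0)
    rate = increase / d if age < d else increase / age
    print(t, rate)                    # ramps 0.2, 0.4, 1.0, then settles at K = 2.0
# rate(m[d]) stays below K for the first d seconds of the series, even though the true rate is K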

Re the Percona blog post - they struggle with a usability and consistency issue in Prometheus. Many users are surprised when a part of their graphs disappears after zooming in, while the rest of the graphs remain visible. As you could guess, the disappearing graphs use rate(). That's why the Prometheus folks designed an ugly hack and put it into Grafana instead of fixing the original issue in Prometheus.

TL;DR: Prometheus and VictoriaMetrics have different rate() and increase() implementations because they solve different issues users may face. Unfortunately it is impossible to solve all the issues in a single implementation :( So I'd recommend using sum(increase(m[d])) instead of sum(rate(m[d])) if d significantly exceeds the expected lifetime of the m series.

As I understand it, envoy_listener_http_downstream_rq_xx is a counter exposing the number of bytes (or packets) received by a particular listener. Then it is natural to calculate the total number of bytes received by all the listeners during the last 180 days with sum(increase(envoy_listener_http_downstream_rq_xx[180d])). On the other hand, it isn't very clear what sum(rate(envoy_listener_http_downstream_rq_xx[180d])) means for series shorter than 180 days. If you need the average receive rate for all the listeners, then it would be better to use a clearer query: sum(increase(envoy_listener_http_downstream_rq_xx[180d])) / (180*24*3600). But IMHO this complicates the meaning of the query (or recording rule). So, from a simplicity PoV, I'd prefer just sum(increase(envoy_listener_http_downstream_rq_xx[180d])).

@valyala (Collaborator) commented Jan 23, 2022

Update: the Prometheus folks decided to fix the rate() and increase() implementations, so their behavior will be more closely aligned with the behavior of these functions in VictoriaMetrics.

@rbtcollins (Author)

It is great to see the xrate proposal. It looks like xrate may be merged or rate may be updated - either way, will VictoriaMetrics adopt the new definition? Mathematically, the Prometheus proposal doesn't seem to have the issue I reported here.

@clementnuss (Contributor) commented Apr 12, 2023

I think I just encountered this exact issue @rbtcollins. Thanks a lot for reporting it in such detail; it helped a lot.

In my case, the scraping job for the Kubernetes apiservers was renamed, resulting in duplicated metrics for ~1 min. rate is badly impacted by that, unlike increase/duration, which shows the behaviour I was expecting.

[screenshot: rate_vs_increase]

Apart from the spike at 10:15, the two series coincide perfectly.
While this might not seem too problematic for a rate[5m], it becomes much more cumbersome when trying to use rate[1w], in which case the spike lasts for a full week, i.e. rate[1w] will be wrong for a full week, just due to 1 min of duplicated time series.

I will be going with increase more often now, but are there plans to improve rate or sum(rate()) to prevent cases where short-lived counter spikes could skew long-term results?

@jenskloster commented Dec 13, 2023

I think we are experiencing the same problem described here, but we are using increase, not rate.
We chose VictoriaMetrics because you can count correctly without having a baseline value of 0.

We are using version v1.95.1

We have a metric endpoint for an application that is scraped by both Prometheus and VictoriaMetrics.
They are scraping at the same interval, so the metrics should be comparable.
The application also logs when it increases the counter, so we have a sense of what the correct count should be.

This is how they compare, on the same PromQL:
sum(increase(endk_custom_label_total_count{BusinessProcess="acknowledgmentsent", Status="200"}[$__range]))
both targeting the range api.
[screenshot]

Prometheus counts too low, due to the lack of a baseline value of 0.
VictoriaMetrics counts too high, way too high, and I can't figure out why.

With some experimentation we found that adding a colon (:) to the range in increase(..[1m:]) fixes the problem.
[screenshot]

Again, I have no idea why.

The query then looks like this:
sum(increase(endk_custom_label_total_count{BusinessProcess="acknowledgmentsent", Status="200"}[$__range:]))

@hagen1778 (Collaborator)

Hey @jenskloster! Could you try following https://docs.victoriametrics.com/Troubleshooting.html#unexpected-query-results and locating the exact series producing the most outlying result? If yes, can you show how it looks on a range graph? Can you share its raw data retrieved via /api/v1/export?

@jenskloster commented Jan 5, 2024

Hi @hagen1778
Thanks for the response - we have exported the data.
data.json.gz

This is the query we perform (copy pasted from grafana):

queries:Array[1]
0:Object
datasource:Object
editorMode:"code"
exemplar:false
expr:"sum(increase(endk_custom_label_total_count{BusinessProcess=\"acknowledgmentsent\", exported_job=\"prod-mcz-resourceschedule-app\", Status=\"200\"}[$__range]))"
instant:true
legendFormat:"__auto"
range:false
refId:"A"
requestId:"34A"
utcOffsetSec:0
interval:""
datasourceId:26
intervalMs:120000
maxDataPoints:689
queryCachingTTL:undefined
from:"1698098399000"
to:"1698184798000"

This produces a result of 3331, where the expected result is 947.

@hagen1778 (Collaborator) commented Jan 19, 2024

I'm sorry for the late reply @jenskloster! And thanks for the provided data!

I've checked the data and noticed some time series have null values:

...23,23,23,null,23,23,null,23,23,null,null,23,23,null,null,23,23,null,23,23,null,...

Moreover, this specific counter starts to fluctuate afterwards:

...,23,23,23,24,23,24,23,24,23,24,23,24,23,24,23,24,23,24,23,24,23,24,23,24,23,24,23,24,25,24,25,24,25,24,25,24,25,24,25,24,25,24,25,24,25...

A counter should always grow or remain at the same value. This specific time series behaves like a gauge. Do you know why that is?
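(For context, a rough generic sketch of why a fluctuating "counter" inflates increase()-style calculations: every drop looks like a counter reset, so the subsequent rise gets counted again. This illustrates the general counter-reset handling, not the exact VictoriaMetrics algorithm.)

def naive_counter_increase(values):
    # Treat every decrease as a counter reset, as counter functions generally do.
    total = 0.0
    for prev, curr in zip(values, values[1:]):
        total += curr - prev if curr >= prev else curr   # after a "reset", count from 0 up to curr
    return total

flapping = [23, 24, 23, 24, 23, 24, 23, 24]
print(naive_counter_increase(flapping))   # 73.0 -- far more than the real growth of 1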

@jenskloster

Thanks @hagen1778,
That explains why VictoriaMetrics is getting a higher count.

Based on what you found, our data is the problem - not your product :)
Thanks for your help 🙏

@valyala (Collaborator) commented Feb 8, 2024

Closing this issue, since the rate() behavior described in this comment works better in most practical cases than the rate() behavior in Prometheus.

@valyala closed this as completed on Feb 8, 2024