
sum and sum_over_time returning incorrect value when increasing period #563

Closed
omerlin opened this issue Jun 15, 2020 · 14 comments
Labels: bug, question

omerlin commented Jun 15, 2020

Describe the bug
When querying data up to the current timestamp with the sum() function and increasing the aggregation period (7 days, 30 days, 90 days, ...), the count becomes incorrect.

To Reproduce
Simply provision data in VM up to now (7 days' worth, for instance). Then print out the total value of the data and look at the same data in a bigger window (30 days, 90 days, 1 year, ...).
You will see that the total value - which should be invariant - decreases, because the last interval is no longer taken into account.
Please note that this does not happen if the data are provisioned in the past (2 weeks ago, for instance) - meaning this is entirely related to the last time interval.

For example:
This query over 7 days:
api/v1/query_range?query=sum%20by(profile_type)%20(sum_over_time(euicc_total%7Boperator%3D%22STC%22%20%2C%20profile_type%3D~%22profile3%22%2C%20mno%3D~%22TMNO3%22%7D))&start=1591567200&end=1592172000&step=86400
returns:

1:Array[1591740000,194]
2:Array[1591826400,104]
3:Array[1591912800,217]
4:Array[1591999200,67]
5:Array[1592085600,165]
6:Array[1592172000,85]

The same query over 90 days:
api/v1/query_range?query=sum%20by(profile_type)%20(sum_over_time(euicc_total%7Boperator%3D%22STC%22%20%2C%20profile_type%3D~%22profile3%22%2C%20mno%3D~%22TMNO3%22%7D))&start=1584396000&end=1592172000&step=86400
returns this:

0:Array[1591660800,112]
1:Array[1591747200,194]
2:Array[1591833600,104]
3:Array[1591920000,217]
4:Array[1592006400,67]
5:Array[1592092800,165]

As you can see, the last interval's timestamps have changed, and the last interval (up to now) is missing.
This has been reproduced on many screens, always with the same side effect.

Expected behavior
The count must not change when the aggregation period increases.

Screenshots
I could attach screenshots, but I think the investigation I did on the queries is clearer.

Version

/opt/victoriametrics/victoria-metrics-prod --version
victoria-metrics-20200505-082139-tags-v1.35.2-0-g0c7cddfc

Used command-line flags
/opt/victoriametrics/victoria-metrics-prod -storageDataPath /u01/reporting/victoriametrics/data -retentionPeriod 36 -influxSkipSingleField -search.maxStalenessInterval=5m


valyala commented Jun 18, 2020

VictoriaMetrics may adjust the returned timestamps if the number of returned data points exceeds 50 - see the corresponding comment in the code for details.

This behaviour can be disabled by passing the -search.disableCache command-line flag to VictoriaMetrics. Another option is to pass the nocache=1 query arg to /api/v1/query_range.
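For illustration, the second option might look like this sketch (a local instance on the default port 8428 is assumed; the selector mirrors the query from the issue description):

import requests

params = {
    "query": 'sum by(profile_type) (sum_over_time(euicc_total{operator="STC", profile_type=~"profile3", mno=~"TMNO3"}))',
    "start": 1591567200,
    "end": 1592172000,
    "step": 86400,
    "nocache": 1,  # disables response caching, so start/end are not adjusted
}
r = requests.get("http://localhost:8428/api/v1/query_range", params=params)
print(r.json()["data"]["result"])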

valyala added the question label Jun 18, 2020

omerlin commented Jun 21, 2020 via email


omerlin commented Jun 25, 2020

Hello,
I have narrowed down the issue - it is related to UTC handling.
I'm located in France, so local time is UTC+2.
I have loaded my data over the last 7 days using the backfilling API, so all my data are there.
I'm doing sum_over_time() with a granularity of 1 day.
I expect to always see the same sum() when I query over 30 days, 90 days, 1 year, and so on.
Let's see with a small Python program:

import requests
import time

def query_VM(nb_days):
	# Query the last nb_days days with a 1-day step and print the per-day sums.
	max_timestamp = 1593036000
	min_timestamp = max_timestamp - nb_days * 86400
	print('start={}, end={}, duration(days)={}'.format(time.ctime(min_timestamp), time.ctime(max_timestamp), (max_timestamp - min_timestamp) / 86400))
	query = "/api/v1/query_range?query=sum%20by(profile_id)%20(sum_over_time(operation_smdp_result%7Boperator%3D%22STC%22%20%2C%20operation%3D%22DP_DOWNLOAD_PROFILE%22%2C%20operation_result%3D~%22Succeeded%7CSucceed_With_Warning%22%2C%20target_mno%3D~%22MO4%7CSTC%7CTMNO3%22%2C%20profile_id%3D~%22profile1%22%7D))&start={}&end={}&step=86400".format(min_timestamp, max_timestamp)
	r = requests.get("http://localhost:28428%s" % query)
	s = 0
	for k in r.json()['data']['result'][0]['values']:
		# Each entry is a [timestamp, value] pair; values are returned as strings.
		print(time.ctime(k[0]), int(k[1]))
		s += int(k[1])
	print('SUM=%d' % s)

As you can see, I have a tunnel to localhost:28428 that accesses the VM API directly.
Let's query over different periods: 7 days, 30 days, 60 days, 90 days.

query_VM(7)
start=Thu Jun 18 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=7.0
Fri Jun 19 00:00:00 2020 27401
Sat Jun 20 00:00:00 2020 70934
Sun Jun 21 00:00:00 2020 77001
Mon Jun 22 00:00:00 2020 80119
Tue Jun 23 00:00:00 2020 74928
Wed Jun 24 00:00:00 2020 69448
Thu Jun 25 00:00:00 2020 74243
SUM=474074
query_VM(30)
start=Tue May 26 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=30.0
Fri Jun 19 00:00:00 2020 27401
Sat Jun 20 00:00:00 2020 70934
Sun Jun 21 00:00:00 2020 77001
Mon Jun 22 00:00:00 2020 80119
Tue Jun 23 00:00:00 2020 74928
Wed Jun 24 00:00:00 2020 69448
Thu Jun 25 00:00:00 2020 74243
SUM=474074

OK, this is what we expect - the same result as over 7 days, and the same intervals returned by the API.

query_VM(60)
start=Sun Apr 26 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=60.0
Fri Jun 19 02:00:00 2020 31454
Sat Jun 20 02:00:00 2020 73176
Sun Jun 21 02:00:00 2020 76248
Mon Jun 22 02:00:00 2020 80581
Tue Jun 23 02:00:00 2020 74453
Wed Jun 24 02:00:00 2020 70111
SUM=406023

Bingo! Here is the "bug" - the interval timestamps suddenly shifted by +02:00:00, changing the per-interval counts and even the number of intervals.
So the SUM is incorrect, because somewhere the UTC time reference changed when going further back in time.
I first thought it could be related to the time shift in France from UTC+1 (winter time) to UTC+2 (summer time), but as that was on the 28th of March, this is not the reason.
What do you think of this, please?
Thanks a lot for your attention.
Olivier


valyala commented Jun 25, 2020

@omerlin, thanks for the detailed investigation!

VictoriaMetrics should store and process all timestamps in UTC. I'll double-check this and get back with my findings.


omerlin commented Jul 3, 2020

@valyala, thanks for all your good work.
I know it's quite a lot of work to investigate such an issue, particularly reproducing it. I have all the materials for that: the backfilling injection tool and the reproduction script shown above.
I will try to do it myself, recompiling the code and debugging, even though I don't know Go. If you can advise on tooling or a method for debugging Go code efficiently, please let me know.


valyala commented Jul 3, 2020

@omerlin, see the build instructions here. I'd suggest exporting all the raw data used in the query via /api/v1/export (just pass the match[]=operation_smdp_result%7Boperator%3D%22STC%22%20%2C%20operation%3D%22DP_DOWNLOAD_PROFILE%22%2C%20operation_result%3D~%22Succeeded%7CSucceed_With_Warning%22%2C%20target_mno%3D~%22MO4%7CSTC%7CTMNO3%22%2C%20profile_id%3D~%22profile1%22%7D query arg together with the start and end args from /api/v1/query_range), then manually calculating sum_over_time for the returned data in a Python script and comparing the results with those returned by VictoriaMetrics. Note that /api/v1/export returns timestamps in milliseconds, not seconds, so don't forget to divide them by 1000. Note also that VictoriaMetrics always returns timestamps in UTC.
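A rough sketch of that verification, assuming a local instance (the selector is shortened here, and the bucketing is a simplification of sum_over_time with a 1-day step):

import json
import requests

base = "http://localhost:8428"  # adjust to your instance
match = 'operation_smdp_result{operator="STC", operation="DP_DOWNLOAD_PROFILE"}'
start, end, step = 1591747200, 1592352000, 86400

r = requests.get(base + "/api/v1/export",
                 params={"match[]": match, "start": start, "end": end})
buckets = {}
for line in r.text.splitlines():
    series = json.loads(line)  # one JSON-encoded series per line
    for ts_ms, v in zip(series["timestamps"], series["values"]):
        ts = ts_ms // 1000  # export returns timestamps in milliseconds
        # Assign each raw sample to the end of its step-aligned interval.
        bucket = start + ((ts - start) // step + 1) * step
        buckets[bucket] = buckets.get(bucket, 0) + v
for ts in sorted(buckets):
    print(ts, buckets[ts])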


omerlin commented Jul 22, 2020

@valyala, I'm working on it. Just one question about the /write API: we are using the InfluxDB data format with explicit timestamps for the backfilling. Do we need to use a local-TZ timestamp or a UTC timestamp?


valyala commented Jul 22, 2020

Do we need to use a local-TZ timestamp or a UTC timestamp?

VictoriaMetrics treats all timestamps in the ingested data as UTC.
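For illustration, a backfilling request might look like this sketch (the measurement, tags, and field are made up for the example; the trailing timestamp is nanoseconds since the Unix epoch and is interpreted as UTC regardless of the client's local timezone):

import requests

# 1592956800000000000 ns = Wed Jun 24 00:00:00 2020 UTC.
line = 'operation_smdp_result,operator=STC,operation=DP_DOWNLOAD_PROFILE value=1 1592956800000000000'
requests.post("http://localhost:8428/write", data=line)  # adjust host/port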


omerlin commented Jul 29, 2020

@valyala, I finally found the issue - it is not related to the timezone but to the modulo logic used for cache-bucket alignment.
PromQL query_range requests are processed in prometheus/prometheus.go - the start/end are altered by this code:

if mayCache {
	start, end = promql.AdjustStartEnd(start, end, step)
}

As I'm using Grafana, I cannot add &nocache=1 to my requests, so I go through the promql.AdjustStartEnd logic.
In my case, step=86400 (one day in seconds).
Then, if the number of points is considered small (fewer than 50 in your code), start and end are not changed:

func AdjustStartEnd(start, end, step int64) (int64, int64) {
	points := (end-start)/step + 1
	if points < minTimeseriesPointsForTimeRounding {
		// Too small number of points for rounding.
		return start, end
	}

At line 26 we have:

// The minimum number of points per timeseries for enabling time rounding.
// This improves cache hit ratio for frequently requested queries over
// big time ranges.
const minTimeseriesPointsForTimeRounding = 50

Then, above 50 points (i.e. more than 50 days with a one-day step), start/end are altered by the rounding logic inside AdjustStartEnd.
The problem is that this start/end alteration completely changes the sum() computation.
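To illustrate (a sketch, not the actual AdjustStartEnd code: snapping timestamps down to multiples of the step reproduces the +02:00:00 shift observed above for a UTC+2 midnight):

step = 86400

def align_down(ts, step):
    # Approximation of the rounding effect: snap a timestamp down to a
    # multiple of the step (the exact AdjustStartEnd logic may differ).
    return ts - ts % step

end = 1593036000              # midnight local time in France (UTC+2)
print(end % step)             # 79200 -> 22:00 UTC, i.e. not step-aligned
print(align_down(end, step))  # 1592956800 -> Jun 24 00:00 UTC, 02:00 local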

Do you think it could be possible to make the nocache activation configurable (as the server configuration search.nocache is not taken into account for this kind of query)?

Btw, when writing data using UTC and reading them back with Grafana, I didn't have the issue, because my start/end were multiples of 86400!
Happy to finally understand.
Thanks for your help
Olivier
PS: I have created a small patch with a new flag, search.disableQueryCache, but I'm sure you will propose something better, like extending search.disableCache to the query_range() code path as well.

valyala added a commit that referenced this issue Jul 30, 2020
…api/v1/query_range` when `-search.disableCache` command-line flag is set

Updates #563

valyala commented Jul 30, 2020

@omerlin, thanks for the investigation! The commit f0c678c stops adjusting the start and end args when VictoriaMetrics is started with the -search.disableCache command-line flag. Could you build VictoriaMetrics from this commit according to these docs and verify whether it works as expected for your case?

valyala added the bug label Jul 30, 2020

omerlin commented Jul 31, 2020

@valyala, thanks again for your reactivity! I have tested it, and it's working perfectly, as expected.
I suppose this fix will be part of the next release, 1.39.3, right?


valyala commented Jul 31, 2020

I suppose this fix will be part of the next release, 1.39.3, right?

Yes.


valyala commented Aug 5, 2020

The bugfix has been included in v1.39.3. It stops adjusting the start and end args when the -search.disableCache command-line flag is passed at VictoriaMetrics startup.

Closing the issue as fixed.

@omerlin, it would probably be better to align the start and end args to the step value (the way VictoriaMetrics does) on the client side before passing them to VictoriaMetrics. In this case the client can take advantage of response caching in VictoriaMetrics.
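For reference, a minimal sketch of such client-side alignment (plain Unix timestamps and a fixed step are assumed; the exact internal rounding in VictoriaMetrics may differ):

def align_range(start, end, step):
    # Floor start and ceil end to multiples of the step, so repeated queries
    # map onto the same cache buckets in VictoriaMetrics.
    start -= start % step
    if end % step:
        end += step - end % step
    return start, end

print(align_range(1591567200, 1592172000, 86400))  # (1591488000, 1592179200)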

valyala closed this as completed Aug 5, 2020

omerlin commented Aug 31, 2020

@omerlin, it would probably be better to align the start and end args to the step value (the way VictoriaMetrics does) on the client side before passing them to VictoriaMetrics. In this case the client can take advantage of response caching in VictoriaMetrics.

This is not easily possible, as the request is sent by Grafana using day granularity, so start and end are by default defined using the local TZ, while VictoriaMetrics uses UTC for all its calculations. So with the "last 30 days" / "last 90 days" buttons, start and end are not aligned with the internal caching.
