
sum and sum_over_time returning incorrect value when increasing period #563

Closed
omerlin opened this issue Jun 15, 2020 · 14 comments
Labels: bug, question

omerlin commented Jun 15, 2020

Describe the bug
When querying data up to the current timestamp with the sum() function and increasing the aggregation period (7 days, 30 days, 90 days, ...), the count becomes incorrect.

To Reproduce
Simply provision data in VM up to now (7 days' worth, for instance). Then print out the total value of the data and look at the same data in a bigger window (30 days, 90 days, 1 year, ...).
You will see that the total value - which should be invariant - decreases, because the last interval is no longer taken into account.
Please note that this does not happen if the data are provisioned in the past (2 weeks ago, for instance) - meaning this is entirely related to the last time interval.

For example:
This query over 7 days:
api/v1/query_range?query=sum%20by(profile_type)%20(sum_over_time(euicc_total%7Boperator%3D%22STC%22%20%2C%20profile_type%3D~%22profile3%22%2C%20mno%3D~%22TMNO3%22%7D))&start=1591567200&end=1592172000&step=86400
returns:

1:Array[1591740000,194]
2:Array[1591826400,104]
3:Array[1591912800,217]
4:Array[1591999200,67]
5:Array[1592085600,165]
6:Array[1592172000,85]

The same query over 90 days:
api/v1/query_range?query=sum%20by(profile_type)%20(sum_over_time(euicc_total%7Boperator%3D%22STC%22%20%2C%20profile_type%3D~%22profile3%22%2C%20mno%3D~%22TMNO3%22%7D))&start=1584396000&end=1592172000&step=86400
returns this:

0:Array[1591660800,112]
1:Array[1591747200,194]
2:Array[1591833600,104]
3:Array[1591920000,217]
4:Array[1592006400,67]
5:Array[1592092800,165]

As you can see, the last interval's timestamps have changed, and the last interval (up to now) is missing.
This has been reproduced on many screens, always with the same side effect.

Expected behavior
The count must not change when the aggregation period increases.

Screenshots
I could attach screenshots, but I think the investigation I did on the queries is clearer.

Version

/opt/victoriametrics/victoria-metrics-prod --version
victoria-metrics-20200505-082139-tags-v1.35.2-0-g0c7cddfc

Used command-line flags
/opt/victoriametrics/victoria-metrics-prod -storageDataPath /u01/reporting/victoriametrics/data -retentionPeriod 36 -influxSkipSingleField -search.maxStalenessInterval=5m


valyala commented Jun 18, 2020

VictoriaMetrics may adjust the returned timestamps if the number of returned data points exceeds 50 - see the corresponding comment in the code for details.

This behaviour can be disabled by passing the -search.disableCache command-line flag to VictoriaMetrics. Another option is to pass the nocache=1 query arg to /api/v1/query_range.
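For illustration, the second option might look like this sketch (a local instance on the default port 8428 is assumed; the selector mirrors the query from the issue description):

import requests

params = {
    "query": 'sum by(profile_type) (sum_over_time(euicc_total{operator="STC", profile_type=~"profile3", mno=~"TMNO3"}))',
    "start": 1591567200,
    "end": 1592172000,
    "step": 86400,
    "nocache": 1,  # disables response caching, so start/end are not adjusted
}
r = requests.get("http://localhost:8428/api/v1/query_range", params=params)
print(r.json()["data"]["result"])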

valyala added the question label Jun 18, 2020

omerlin commented Jun 21, 2020 via email


omerlin commented Jun 25, 2020

Hello,
I have narrowed down the issue - it is related to UTC handling.
I'm located in France, so local time is UTC+2.
I have loaded my data over the last 7 days using the backfilling API, so all my data are there.
I'm doing sum_over_time() with a granularity of 1 day.
I expect to always see the same sum() when I query over 30 days, 90 days, 1 year, and so on.
Let's see with a small Python program:

import requests
import time

def query_VM(nb_days):
	# Query the last nb_days days with a 1-day step and print the per-day sums.
	max_timestamp = 1593036000
	min_timestamp = max_timestamp - nb_days * 86400
	print('start={}, end={}, duration(days)={}'.format(time.ctime(min_timestamp), time.ctime(max_timestamp), (max_timestamp - min_timestamp) / 86400))
	query = "/api/v1/query_range?query=sum%20by(profile_id)%20(sum_over_time(operation_smdp_result%7Boperator%3D%22STC%22%20%2C%20operation%3D%22DP_DOWNLOAD_PROFILE%22%2C%20operation_result%3D~%22Succeeded%7CSucceed_With_Warning%22%2C%20target_mno%3D~%22MO4%7CSTC%7CTMNO3%22%2C%20profile_id%3D~%22profile1%22%7D))&start={}&end={}&step=86400".format(min_timestamp, max_timestamp)
	r = requests.get("http://localhost:28428%s" % query)
	s = 0
	for k in r.json()['data']['result'][0]['values']:
		# Each entry is a [timestamp, value] pair; values are returned as strings.
		print(time.ctime(k[0]), int(k[1]))
		s += int(k[1])
	print('SUM=%d' % s)

As you can see, I have a tunnel to localhost:28428 that accesses the VM API directly.
Let's query over different periods: 7 days, 30 days, 60 days, 90 days.

query_VM(7)
start=Thu Jun 18 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=7.0
Fri Jun 19 00:00:00 2020 27401
Sat Jun 20 00:00:00 2020 70934
Sun Jun 21 00:00:00 2020 77001
Mon Jun 22 00:00:00 2020 80119
Tue Jun 23 00:00:00 2020 74928
Wed Jun 24 00:00:00 2020 69448
Thu Jun 25 00:00:00 2020 74243
SUM=474074
query_VM(30)
start=Tue May 26 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=30.0
Fri Jun 19 00:00:00 2020 27401
Sat Jun 20 00:00:00 2020 70934
Sun Jun 21 00:00:00 2020 77001
Mon Jun 22 00:00:00 2020 80119
Tue Jun 23 00:00:00 2020 74928
Wed Jun 24 00:00:00 2020 69448
Thu Jun 25 00:00:00 2020 74243
SUM=474074

OK, this is what we expect - the same result as over 7 days, and the same intervals returned by the API.

query_VM(60)
start=Sun Apr 26 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=60.0
Fri Jun 19 02:00:00 2020 31454
Sat Jun 20 02:00:00 2020 73176
Sun Jun 21 02:00:00 2020 76248
Mon Jun 22 02:00:00 2020 80581
Tue Jun 23 02:00:00 2020 74453
Wed Jun 24 02:00:00 2020 70111
SUM=406023

Bingo! Here is the "bug" - the interval timestamps suddenly shifted by +02:00:00, changing the per-interval counts and even the number of intervals.
So the SUM is incorrect, because somewhere the UTC time reference changed when going further back in time.
I first thought it could be related to the time shift in France from UTC+1 (winter time) to UTC+2 (summer time), but as that was on the 28th of March, this is not the reason.
What do you think of this, please?
Thanks a lot for your attention.
Olivier


valyala commented Jun 25, 2020

@omerlin, thanks for the detailed investigation!

VictoriaMetrics should store and process all timestamps in UTC. I'll double-check this and get back with my findings.


omerlin commented Jul 3, 2020

@valyala, thanks for all your good work.
I know it's quite a lot of work to investigate such an issue, particularly reproducing it. I have all the materials for that: the backfilling injection tool and the reproduction script shown above.
I will try to do it myself, recompiling the code and debugging, even though I don't know Go. If you can advise on tooling or a method for debugging Go code efficiently, please let me know.


valyala commented Jul 3, 2020

@omerlin, see the build instructions here. I'd suggest exporting all the raw data used in the query via /api/v1/export (just pass the match[]=operation_smdp_result%7Boperator%3D%22STC%22%20%2C%20operation%3D%22DP_DOWNLOAD_PROFILE%22%2C%20operation_result%3D~%22Succeeded%7CSucceed_With_Warning%22%2C%20target_mno%3D~%22MO4%7CSTC%7CTMNO3%22%2C%20profile_id%3D~%22profile1%22%7D query arg together with the start and end args from /api/v1/query_range), then manually calculating sum_over_time for the returned data in a Python script and comparing the results with those returned by VictoriaMetrics. Note that /api/v1/export returns timestamps in milliseconds, not seconds, so don't forget to divide them by 1000. Note also that VictoriaMetrics always returns timestamps in UTC.
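A rough sketch of that verification, assuming a local instance (the selector is shortened here, and the bucketing is a simplification of sum_over_time with a 1-day step):

import json
import requests

base = "http://localhost:8428"  # adjust to your instance
match = 'operation_smdp_result{operator="STC", operation="DP_DOWNLOAD_PROFILE"}'
start, end, step = 1591747200, 1592352000, 86400

r = requests.get(base + "/api/v1/export",
                 params={"match[]": match, "start": start, "end": end})
buckets = {}
for line in r.text.splitlines():
    series = json.loads(line)  # one JSON-encoded series per line
    for ts_ms, v in zip(series["timestamps"], series["values"]):
        ts = ts_ms // 1000  # export returns timestamps in milliseconds
        # Assign each raw sample to the end of its step-aligned interval.
        bucket = start + ((ts - start) // step + 1) * step
        buckets[bucket] = buckets.get(bucket, 0) + v
for ts in sorted(buckets):
    print(ts, buckets[ts])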


omerlin commented Jul 22, 2020

@valyala, I'm working on it. Just one question about the /write API: we are using the InfluxDB data format with explicit timestamps for the backfilling. Do we need to use a local-TZ timestamp or a UTC timestamp?


valyala commented Jul 22, 2020

Do we need to use a local-TZ timestamp or a UTC timestamp?

VictoriaMetrics treats all timestamps in the ingested data as UTC.
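For illustration, a backfilling request might look like this sketch (the measurement, tags, and field are made up for the example; the trailing timestamp is nanoseconds since the Unix epoch and is interpreted as UTC regardless of the client's local timezone):

import requests

# 1592956800000000000 ns = Wed Jun 24 00:00:00 2020 UTC.
line = 'operation_smdp_result,operator=STC,operation=DP_DOWNLOAD_PROFILE value=1 1592956800000000000'
requests.post("http://localhost:8428/write", data=line)  # adjust host/port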


omerlin commented Jul 29, 2020

@valyala, I finally found the issue - it is not related to the timezone but to the modulo logic used for cache-bucket alignment.
PromQL query_range requests are processed in prometheus/prometheus.go - the start/end are altered by this code:

if mayCache {
	start, end = promql.AdjustStartEnd(start, end, step)
}

As I'm using Grafana, I cannot add &nocache=1 to my requests, so I go through the promql.AdjustStartEnd logic.
In my case, step=86400 (one day in seconds).
Then, if the number of points is considered small (fewer than 50 in your code), start and end are not changed:

func AdjustStartEnd(start, end, step int64) (int64, int64) {
	points := (end-start)/step + 1
	if points < minTimeseriesPointsForTimeRounding {
		// Too small number of points for rounding.
		return start, end
	}

At line 26 we have:

// The minimum number of points per timeseries for enabling time rounding.
// This improves cache hit ratio for frequently requested queries over
// big time ranges.
const minTimeseriesPointsForTimeRounding = 50

Then, above 50 points (i.e. more than 50 days with a one-day step), start/end are altered by the rounding logic inside AdjustStartEnd.
The problem is that this start/end alteration completely changes the sum() computation.
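To illustrate (a sketch, not the actual AdjustStartEnd code: snapping timestamps down to multiples of the step reproduces the +02:00:00 shift observed above for a UTC+2 midnight):

step = 86400

def align_down(ts, step):
    # Approximation of the rounding effect: snap a timestamp down to a
    # multiple of the step (the exact AdjustStartEnd logic may differ).
    return ts - ts % step

end = 1593036000              # midnight local time in France (UTC+2)
print(end % step)             # 79200 -> 22:00 UTC, i.e. not step-aligned
print(align_down(end, step))  # 1592956800 -> Jun 24 00:00 UTC, 02:00 local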

Do you think it could be possible to make the nocache activation configurable (as the server configuration search.nocache is not taken into account for this kind of query)?

Btw, when writing data using UTC and reading them back with Grafana, I didn't have the issue, because my start/end were multiples of 86400!
Happy to finally understand.
Thanks for your help
Olivier
PS: I have created a small patch with a new flag, search.disableQueryCache, but I'm sure you will propose something better, like extending search.disableCache to the query_range() code path as well.

valyala added a commit that referenced this issue Jul 30, 2020
…api/v1/query_range` when `-search.disableCache` command-line flag is set

Updates #563

valyala commented Jul 30, 2020

@omerlin, thanks for the investigation! The commit f0c678c stops adjusting the start and end args when VictoriaMetrics is started with the -search.disableCache command-line flag. Could you build VictoriaMetrics from this commit according to these docs and verify whether it works as expected for your case?

valyala added the bug label Jul 30, 2020

omerlin commented Jul 31, 2020

@valyala, thanks again for your reactivity! I have tested it, and it's working perfectly, as expected.
I suppose this fix will be part of the next release, 1.39.3, right?


valyala commented Jul 31, 2020

I suppose this fix will be part of the next release, 1.39.3, right?

Yes.


valyala commented Aug 5, 2020

The bugfix has been included in v1.39.3. It stops adjusting the start and end args when the -search.disableCache command-line flag is passed at VictoriaMetrics startup.

Closing the issue as fixed.

@omerlin, it would probably be better to align the start and end args to the step value (the way VictoriaMetrics does) on the client side before passing them to VictoriaMetrics. In this case the client can take advantage of response caching in VictoriaMetrics.
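For reference, a minimal sketch of such client-side alignment (plain Unix timestamps and a fixed step are assumed; the exact internal rounding in VictoriaMetrics may differ):

def align_range(start, end, step):
    # Floor start and ceil end to multiples of the step, so repeated queries
    # map onto the same cache buckets in VictoriaMetrics.
    start -= start % step
    if end % step:
        end += step - end % step
    return start, end

print(align_range(1591567200, 1592172000, 86400))  # (1591488000, 1592179200)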

valyala closed this as completed Aug 5, 2020

omerlin commented Aug 31, 2020

@omerlin, it would probably be better to align the start and end args to the step value (the way VictoriaMetrics does) on the client side before passing them to VictoriaMetrics. In this case the client can take advantage of response caching in VictoriaMetrics.

This is not easily possible, as the request is sent by Grafana using day granularity, so start and end are by default defined using the local TZ, while VictoriaMetrics uses UTC for all its calculations. So with the "last 30 days" / "last 90 days" buttons, start and end are not aligned with the internal caching.
