
Inconsistent stats from query_range with sparse metrics #996

Closed
johnseekins opened this issue Jan 6, 2021 · 2 comments
Labels
question (The question issue)

Comments

johnseekins (Contributor) commented Jan 6, 2021

Describe the bug

This may be similar to #977

We have a metric that is only produced when a host is rebooted (so ideally not that frequently). The stat lands in VictoriaMetrics just fine, but when we attempt to query it, we get very confusing results.

Here is the raw incoming data for the metric:

curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/export" -d "start=1608193891" -d "end=1609872310" -d 'match[]=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' | jq
{
  "metric": {
    "__name__": "cloudeng_fleet_rebooter_rebooted",
    "dc": "dc1",
    "host": "reporterhost",
    "cause": "automatic",
    "group": "unclassified",
    "hostname": "host1",
    "rack": "C",
    "row": "1"
  },
  "values": [
    1
  ],
  "timestamps": [
    1608590133000
  ]
}
{
  "metric": {
    "__name__": "cloudeng_fleet_rebooter_rebooted",
    "dc": "dc1",
    "host": "reporterhost",
    "cause": "manual",
    "group": "unclassified",
    "hostname": "host1",
    "rack": "C",
    "row": "1"
  },
  "values": [
    1,
    1,
    1,
    1,
    1,
    1,
    1,
    1,
    1
  ],
  "timestamps": [
    1608500183000,
    1608678730000,
    1608745638000,
    1608821531000,
    1608848515000,
    1608992928000,
    1609160664000,
    1609260922000,
    1609310867000
  ]
}

And here are a few attempts of getting the data through the query endpoint:

jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' | jq '.data.result[].values | length'
2496
1
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=1m' | jq '.data.result[].values | length'
12476
5
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=10m' | jq '.data.result[].values | length'
1248
1
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=1s' | jq '.data.result[].values | length'
748594
271
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=15s' | jq '.data.result[].values | length'
49906
18
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=1h' | jq '.data.result[].values | length'
206
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/export" -d "start=1608193891" -d "end=1609872310" -d 'match[]=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' | jq '.values|length'
1
9

We can see that in one case (the 1-hour step) one of the result sets doesn't even appear! It looks as though step handling in query_range isn't working correctly. I'm struggling to understand why 9 raw values can become over 12,000 values depending on the query step.

Version

# vmstorage-prod --version
vmstorage-20201228-002929-tags-v1.51.0-cluster-0-g66a19e007

Used command-line flags

$ curl -Ss "http://use1-victoria-access-test-1:9481/metrics" | grep flag
vm_log_messages_total{app_version="vmselect-20201228-002927-tags-v1.51.0-cluster-0-g66a19e007", level="info", location="VictoriaMetrics/lib/logger/flag.go:12"} 1
vm_log_messages_total{app_version="vmselect-20201228-002927-tags-v1.51.0-cluster-0-g66a19e007", level="info", location="VictoriaMetrics/lib/logger/flag.go:13"} 1
vm_log_messages_total{app_version="vmselect-20201228-002927-tags-v1.51.0-cluster-0-g66a19e007", level="info", location="VictoriaMetrics/lib/logger/flag.go:20"} 48
flag{name="cacheDataPath", value="/data/victoriametrics/cache/cluster_1"} 1
flag{name="dedup.minScrapeInterval", value="10ms"} 1
flag{name="enableTCP6", value="false"} 1
flag{name="envflag.enable", value="false"} 1
flag{name="envflag.prefix", value=""} 1
flag{name="fs.disableMmap", value="false"} 1
flag{name="graphiteTrimTimestamp", value="1s"} 1
flag{name="http.connTimeout", value="2m0s"} 1
flag{name="http.disableResponseCompression", value="false"} 1
flag{name="http.idleConnTimeout", value="1m0s"} 1
flag{name="http.maxGracefulShutdownDuration", value="7s"} 1
flag{name="http.pathPrefix", value=""} 1
flag{name="http.shutdownDelay", value="0s"} 1
flag{name="httpListenAddr", value=":8481"} 1
flag{name="loggerDisableTimestamps", value="false"} 1
flag{name="loggerErrorsPerSecondLimit", value="0"} 1
flag{name="loggerFormat", value="default"} 1
flag{name="loggerLevel", value="INFO"} 1
flag{name="loggerOutput", value="stderr"} 1
flag{name="loggerWarnsPerSecondLimit", value="0"} 1
flag{name="memory.allowedBytes", value="0"} 1
flag{name="memory.allowedPercent", value="60"} 1
flag{name="replicationFactor", value="2"} 1
flag{name="search.cacheTimestampOffset", value="5m0s"} 1
flag{name="search.denyPartialResponse", value="false"} 1
flag{name="search.disableCache", value="false"} 1
flag{name="search.latencyOffset", value="1m0s"} 1
flag{name="search.logSlowQueryDuration", value="5s"} 1
flag{name="search.maxConcurrentRequests", value="8"} 1
flag{name="search.maxExportDuration", value="720h0m0s"} 1
flag{name="search.maxLookback", value="0s"} 1
flag{name="search.maxPointsPerTimeseries", value="1000000000"} 1
flag{name="search.maxQueryDuration", value="3m0s"} 1
flag{name="search.maxQueryLen", value="16384"} 1
flag{name="search.maxQueueDuration", value="10s"} 1
flag{name="search.maxStalenessInterval", value="0s"} 1
flag{name="search.minStalenessInterval", value="0s"} 1
flag{name="search.queryStats.lastQueriesCount", value="20000"} 1
flag{name="search.queryStats.minQueryDuration", value="0s"} 1
flag{name="search.resetCacheAuthKey", value="secret"} 1
flag{name="search.storageTimeout", value="0s"} 1
flag{name="search.treatDotsAsIsInRegexps", value="false"} 1
flag{name="selectNode", value="10.40.146.160:8481,10.40.146.159:8481,10.40.146.178:8481,10.40.146.179:8481"} 1
flag{name="storageNode", value="10.40.146.163:8401,10.40.146.164:8401,10.40.146.162:8401,10.40.146.165:8401,10.40.146.169:8401,10.40.146.167:8401,10.40.146.166:8401,10.40.146.168:8401"} 1
flag{name="tls", value="false"} 1
flag{name="tlsCertFile", value=""} 1
flag{name="tlsKeyFile", value="secret"} 1
flag{name="version", value="false"} 1
johnseekins (Contributor, Author) commented:

After speaking with @hagen1778, it seems this is expected behavior in PromQL/MetricsQL. I suppose now I have to figure out how to properly represent this data for my users.

valyala (Collaborator) commented Jan 11, 2021

The context from @hagen1778 for future readers:

query_range does not return raw values. It returns N data points for the series, where N = time_range/step. So if you have 1 real data point every minute, query_range?step=10s&start=now()-60 will return 60/10=6 data points, while /export will return only the one raw data point.
The origin of this behaviour is the pull model: Prometheus never guaranteed that a metric would be scraped every 10s or so. Instead, Prometheus schedules the scrape job to run when possible and when there are resources to do so, but no more frequently than scrape_interval. Because of that, scrapes with scrape_interval=10s may happen on the 11th second, then on the 15th, and then on the 11th again. If you plot such data directly, it looks "strange". That's why the original implementation of query_range always returned perfectly aligned data points, where the ephemeral data points simply repeat the previous "real" data point in that range.

See also https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
