
Inconsistent stats from query_range with sparse metrics #996

Closed
johnseekins opened this issue Jan 6, 2021 · 2 comments
Labels
question (The question issue)

Comments

johnseekins (Contributor) commented Jan 6, 2021

Describe the bug

This may be similar to #977

We have a metric that is only produced when a host is rebooted (so ideally not that frequently). The stat lands in VictoriaMetrics just fine, but when we attempt to query it, we get very confusing results.

Here is the raw incoming data for the metric:

curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/export" -d "start=1608193891" -d "end=1609872310" -d 'match[]=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' | jq
{
  "metric": {
    "__name__": "cloudeng_fleet_rebooter_rebooted",
    "dc": "dc1",
    "host": "reporterhost",
    "cause": "automatic",
    "group": "unclassified",
    "hostname": "host1",
    "rack": "C",
    "row": "1"
  },
  "values": [
    1
  ],
  "timestamps": [
    1608590133000
  ]
}
{
  "metric": {
    "__name__": "cloudeng_fleet_rebooter_rebooted",
    "dc": "dc1",
    "host": "reporterhost",
    "cause": "manual",
    "group": "unclassified",
    "hostname": "host1",
    "rack": "C",
    "row": "1"
  },
  "values": [
    1,
    1,
    1,
    1,
    1,
    1,
    1,
    1,
    1
  ],
  "timestamps": [
    1608500183000,
    1608678730000,
    1608745638000,
    1608821531000,
    1608848515000,
    1608992928000,
    1609160664000,
    1609260922000,
    1609310867000
  ]
}

And here are a few attempts of getting the data through the query endpoint:

jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' | jq '.data.result[].values | length'
2496
1
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=1m' | jq '.data.result[].values | length'
12476
5
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=10m' | jq '.data.result[].values | length'
1248
1
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=1s' | jq '.data.result[].values | length'
748594
271
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=15s' | jq '.data.result[].values | length'
49906
18
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/query_range" -d "start=1608193891" -d "end=1609872310" -d 'query=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' -d 'step=1h' | jq '.data.result[].values | length'
206
jseekins@quietthinkpad:~$ curl -Ss "http://victoria-access-1:9481/select/0:0/prometheus/api/v1/export" -d "start=1608193891" -d "end=1609872310" -d 'match[]=cloudeng_fleet_rebooter_rebooted{hostname="host1"}' | jq '.values|length'
1
9

We can see that in one case (the 1-hour step) one of the result sets doesn't even appear! It looks as though step handling in query_range isn't working correctly. I'm struggling to understand why 9 raw values can become over 12,000 values depending on the query step.

Version

# vmstorage-prod --version
vmstorage-20201228-002929-tags-v1.51.0-cluster-0-g66a19e007

Used command-line flags

$ curl -Ss "http://use1-victoria-access-test-1:9481/metrics" | grep flag
vm_log_messages_total{app_version="vmselect-20201228-002927-tags-v1.51.0-cluster-0-g66a19e007", level="info", location="VictoriaMetrics/lib/logger/flag.go:12"} 1
vm_log_messages_total{app_version="vmselect-20201228-002927-tags-v1.51.0-cluster-0-g66a19e007", level="info", location="VictoriaMetrics/lib/logger/flag.go:13"} 1
vm_log_messages_total{app_version="vmselect-20201228-002927-tags-v1.51.0-cluster-0-g66a19e007", level="info", location="VictoriaMetrics/lib/logger/flag.go:20"} 48
flag{name="cacheDataPath", value="/data/victoriametrics/cache/cluster_1"} 1
flag{name="dedup.minScrapeInterval", value="10ms"} 1
flag{name="enableTCP6", value="false"} 1
flag{name="envflag.enable", value="false"} 1
flag{name="envflag.prefix", value=""} 1
flag{name="fs.disableMmap", value="false"} 1
flag{name="graphiteTrimTimestamp", value="1s"} 1
flag{name="http.connTimeout", value="2m0s"} 1
flag{name="http.disableResponseCompression", value="false"} 1
flag{name="http.idleConnTimeout", value="1m0s"} 1
flag{name="http.maxGracefulShutdownDuration", value="7s"} 1
flag{name="http.pathPrefix", value=""} 1
flag{name="http.shutdownDelay", value="0s"} 1
flag{name="httpListenAddr", value=":8481"} 1
flag{name="loggerDisableTimestamps", value="false"} 1
flag{name="loggerErrorsPerSecondLimit", value="0"} 1
flag{name="loggerFormat", value="default"} 1
flag{name="loggerLevel", value="INFO"} 1
flag{name="loggerOutput", value="stderr"} 1
flag{name="loggerWarnsPerSecondLimit", value="0"} 1
flag{name="memory.allowedBytes", value="0"} 1
flag{name="memory.allowedPercent", value="60"} 1
flag{name="replicationFactor", value="2"} 1
flag{name="search.cacheTimestampOffset", value="5m0s"} 1
flag{name="search.denyPartialResponse", value="false"} 1
flag{name="search.disableCache", value="false"} 1
flag{name="search.latencyOffset", value="1m0s"} 1
flag{name="search.logSlowQueryDuration", value="5s"} 1
flag{name="search.maxConcurrentRequests", value="8"} 1
flag{name="search.maxExportDuration", value="720h0m0s"} 1
flag{name="search.maxLookback", value="0s"} 1
flag{name="search.maxPointsPerTimeseries", value="1000000000"} 1
flag{name="search.maxQueryDuration", value="3m0s"} 1
flag{name="search.maxQueryLen", value="16384"} 1
flag{name="search.maxQueueDuration", value="10s"} 1
flag{name="search.maxStalenessInterval", value="0s"} 1
flag{name="search.minStalenessInterval", value="0s"} 1
flag{name="search.queryStats.lastQueriesCount", value="20000"} 1
flag{name="search.queryStats.minQueryDuration", value="0s"} 1
flag{name="search.resetCacheAuthKey", value="secret"} 1
flag{name="search.storageTimeout", value="0s"} 1
flag{name="search.treatDotsAsIsInRegexps", value="false"} 1
flag{name="selectNode", value="10.40.146.160:8481,10.40.146.159:8481,10.40.146.178:8481,10.40.146.179:8481"} 1
flag{name="storageNode", value="10.40.146.163:8401,10.40.146.164:8401,10.40.146.162:8401,10.40.146.165:8401,10.40.146.169:8401,10.40.146.167:8401,10.40.146.166:8401,10.40.146.168:8401"} 1
flag{name="tls", value="false"} 1
flag{name="tlsCertFile", value=""} 1
flag{name="tlsKeyFile", value="secret"} 1
flag{name="version", value="false"} 1
johnseekins (Contributor, Author) commented:

After speaking with @hagen1778, it seems this is expected behavior in PromQL/MetricsQL. I suppose now I have to figure out how to properly represent this data for my users.

valyala (Collaborator) commented Jan 11, 2021

The context from @hagen1778 for future readers:

query_range does not return raw values. It returns N data points for the series, where N = time_range/step. So if you have 1 real data point every minute, query_range?step=10s&start=now()-60 will return 60/10=6 data points, while /export will return only the one raw data point.
The origin of this behaviour is the pull model: Prometheus never guaranteed that a metric would be scraped every 10s or so. Instead, Prometheus schedules the scrape job to run when possible and when there are resources to do so, but no more frequently than scrape_interval. Because of that, scrapes with scrape_interval=10s may happen on the 11th second, then on the 15th, and then on the 11th again. If you plot such data directly, it looks "strange". That's why the original implementation of query_range always returned perfectly aligned data points, where the ephemeral data points simply repeat the previous "real" data point in that range.

See also https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
