code_verb:apiserver_request_total:increase30d is failing to evaluate #503

Closed
j-mie opened this issue Apr 18, 2020 · 11 comments

Comments

@j-mie

j-mie commented Apr 18, 2020

What happened?
I upgraded from an older version from a few months ago to the latest 0.5 release with jb and Prometheus is failing rule evaluations starting to fire immediately. I checked Prometheus and https://github.com/coreos/kube-prometheus/blob/dcc46c8aa8c242b845024188a66171b5f08b8513/manifests/prometheus-rules.yaml#L393 is in ERR state: query processing would load too many samples into memory in query execution

Did you expect to see something different?
The included rule shouldn't be failing.

How to reproduce it (as minimally and precisely as possible):

Environment
GKE

  • Prometheus Operator version:

    v0.38.1

  • Kubernetes version information:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-26T06:16:15Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-gke.5", GitCommit:"a5bf731ea129336a3cf32c3375317b3a626919d7", GitTreeState:"clean", BuildDate:"2020-03-31T02:49:49Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

    GKE

level=warn ts=2020-04-18T11:01:09.892Z caller=manager.go:525 component="rule manager" group=kube-apiserver.rules msg="Evaluating rule failed" rule="record: code_verb:apiserver_request_total:increase30d\nexpr: sum by(code, verb) (increase(apiserver_request_total{job=\"apiserver\"}[30d]))\n" err="query processing would load too many samples into memory in query execution"
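
For readability, the rule embedded in that log line is:

record: code_verb:apiserver_request_total:increase30d
expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[30d]))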

@valeneiko

I am having the same issue after updating from d1c9062 to dcc46c8.

Prometheus is deployed on Azure AKS

@brancz
Collaborator

brancz commented Apr 21, 2020

cc @metalmatze @povilasv

@smoke

smoke commented Apr 22, 2020

The Helm stable/prometheus-operator v8.13.0 is also affected by this issue since helm/charts#22003 was merged.

In addition, I can confirm this calculation leads to increased CPU load :(

@povilasv

I've seen this before; you need to bump the max samples limit.

I think I did something like --query.max-samples=100000000, which did it for me.
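
For reference, a minimal sketch of that change. With a self-managed Prometheus it is the --query.max-samples command-line flag; with the Prometheus Operator the equivalent setting (assuming your operator version exposes spec.query.maxSamples) would look roughly like this, with the metadata names being illustrative:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s            # illustrative name/namespace
  namespace: monitoring
spec:
  query:
    maxSamples: 100000000   # raise the per-query sample limit (Prometheus default is 50000000)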

@j-mie
Author

j-mie commented Apr 22, 2020

I did find the max-samples solution, but if it's required it should be set by kube-prometheus so that the default rules don't fail.

@smoke

smoke commented Apr 23, 2020

In addition, increasing max-samples increases the CPU load and the time needed to calculate such a metric, so it is not a straightforward option.

@metalmatze
Member

In addition, I can confirm this calculation leads to increased CPU load :(

One option to improve this would be to move the availability calculations into their own rule group. We would then be able to set a higher evaluation interval, like every 3min (taking the 5m staleness into account); a sketch follows below.
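
A rough sketch of that idea (the manifest metadata, group name, and exact interval are illustrative; the rule expression is the one failing above):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-apiserver-availability
  namespace: monitoring
spec:
  groups:
    - name: kube-apiserver-availability.rules
      interval: 3m   # evaluate the expensive 30d rules less often than the default
      rules:
        - record: code_verb:apiserver_request_total:increase30d
          expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[30d]))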

@XNorith

XNorith commented Apr 28, 2020

Might it be reasonable to change the query to something like:
sum by(code, verb) (increase(apiserver_request_total{job="apiserver"}[30d:1h]))
so that we're not evaluating every data point? We might lose some precision every time an apiserver gets restarted, but that's pretty infrequent in my experience.
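
Expressed as a recording rule, that proposal would look roughly like the following; only the range selector changes, and the 1h subquery resolution means increase() works on one sample per hour instead of every raw sample in the 30d window:

- record: code_verb:apiserver_request_total:increase30d
  expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[30d:1h]))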

@metalmatze
Member

In those recording rules we don't have subqueries like that. It's literally just summing up the counts of requests. For 28d that might still be too many data points.

@dsexton

dsexton commented Jul 9, 2020

@metalmatze's updated rules have been synced downstream to the Helm chart, and I'm still seeing a couple of errors about too many samples.

For example:

level=warn ts=2020-07-09T03:34:57.417Z caller=manager.go:577 component="rule manager" group=kube-apiserver-availability.rules msg="Evaluating rule failed" rule="record: code_verb:apiserver_request_total:increase30d\nexpr: sum by(code, verb) (increase(apiserver_request_total{code=~\"2..\",job=\"apiserver\",verb=\"LIST\"}[30d]))\n" err="query processing would load too many samples into memory in query execution"

@paulfantom
Member

Closing as this appears to be already fixed. Please reopen if this is still an issue.
