code_verb:apiserver_request_total:increase30d is failing to evaluate #503

Closed
j-mie opened this issue Apr 18, 2020 · 11 comments

Comments

@j-mie

j-mie commented Apr 18, 2020

What happened?
I upgraded from an older version from a few months ago to the latest 0.5 release with jb and Prometheus is failing rule evaluations starting to fire immediately. I checked Prometheus and https://github.com/coreos/kube-prometheus/blob/dcc46c8aa8c242b845024188a66171b5f08b8513/manifests/prometheus-rules.yaml#L393 is in ERR state: query processing would load too many samples into memory in query execution

Did you expect to see something different?
The included rule shouldn't be failing.

How to reproduce it (as minimally and precisely as possible):

Environment
GKE

  • Prometheus Operator version:

    v0.38.1

  • Kubernetes version information:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-26T06:16:15Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-gke.5", GitCommit:"a5bf731ea129336a3cf32c3375317b3a626919d7", GitTreeState:"clean", BuildDate:"2020-03-31T02:49:49Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

    GKE

level=warn ts=2020-04-18T11:01:09.892Z caller=manager.go:525 component="rule manager" group=kube-apiserver.rules msg="Evaluating rule failed" rule="record: code_verb:apiserver_request_total:increase30d\nexpr: sum by(code, verb) (increase(apiserver_request_total{job=\"apiserver\"}[30d]))\n" err="query processing would load too many samples into memory in query execution"
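
For readability, the rule embedded in that log line is:

record: code_verb:apiserver_request_total:increase30d
expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[30d]))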

@valeneiko

I am having the same issue after updating from d1c9062 to dcc46c8.

Prometheus is deployed on Azure AKS

@brancz
Collaborator

brancz commented Apr 21, 2020

cc @metalmatze @povilasv

@smoke

smoke commented Apr 22, 2020

The Helm stable/prometheus-operator v8.13.0 is also affected by this issue since helm/charts#22003 was merged.

In addition, I can confirm this calculation leads to increased CPU load :(

@povilasv

I've seen this before; you need to bump the max samples limit.

I think I did something like --query.max-samples=100000000, which did it for me.
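
For reference, a minimal sketch of that change. With a self-managed Prometheus it is the --query.max-samples command-line flag; with the Prometheus Operator the equivalent setting (assuming your operator version exposes spec.query.maxSamples) would look roughly like this, with the metadata names being illustrative:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s            # illustrative name/namespace
  namespace: monitoring
spec:
  query:
    maxSamples: 100000000   # raise the per-query sample limit (Prometheus default is 50000000)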

@j-mie
Author

j-mie commented Apr 22, 2020

I did find the max-samples solution, but if it's required it should be set by kube-prometheus so that the default rules don't fail.

@smoke

smoke commented Apr 23, 2020

In addition, increasing max-samples increases the CPU load and the time needed to calculate such a metric, so it is not a straightforward option.

@metalmatze
Member

In addition, I can confirm this calculation leads to increased CPU load :(

One option to improve this would be to move the availability calculations into their own rule group. We would then be able to set a higher evaluation interval, like every 3min (taking the 5m staleness into account); a sketch follows below.
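
A rough sketch of that idea (the manifest metadata, group name, and exact interval are illustrative; the rule expression is the one failing above):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-apiserver-availability
  namespace: monitoring
spec:
  groups:
    - name: kube-apiserver-availability.rules
      interval: 3m   # evaluate the expensive 30d rules less often than the default
      rules:
        - record: code_verb:apiserver_request_total:increase30d
          expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[30d]))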

@XNorith

XNorith commented Apr 28, 2020

Might it be reasonable to change the query to something like:
sum by(code, verb) (increase(apiserver_request_total{job="apiserver"}[30d:1h]))
so that we're not evaluating every data point? We might lose some precision every time an apiserver gets restarted, but that's pretty infrequent in my experience.
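
Expressed as a recording rule, that proposal would look roughly like the following; only the range selector changes, and the 1h subquery resolution means increase() works on one sample per hour instead of every raw sample in the 30d window:

- record: code_verb:apiserver_request_total:increase30d
  expr: sum by (code, verb) (increase(apiserver_request_total{job="apiserver"}[30d:1h]))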

@metalmatze
Member

In those recording rules we don't have subqueries like that. It's literally just summing up the counts of requests. For 28d that might still be too many data points.

@dsexton

dsexton commented Jul 9, 2020

@metalmatze's updated rules have been synced downstream to the Helm chart, and I'm still seeing a couple of errors about too many samples.

For example:

level=warn ts=2020-07-09T03:34:57.417Z caller=manager.go:577 component="rule manager" group=kube-apiserver-availability.rules msg="Evaluating rule failed" rule="record: code_verb:apiserver_request_total:increase30d\nexpr: sum by(code, verb) (increase(apiserver_request_total{code=~\"2..\",job=\"apiserver\",verb=\"LIST\"}[30d]))\n" err="query processing would load too many samples into memory in query execution"

@paulfantom
Member

Closing as this appears to be already fixed. Please reopen if this is still an issue.
