cilium_bpf_map_pressure show incorrect map utilization #32311

Open
2 of 3 tasks
spal-x opened this issue May 2, 2024 · 7 comments
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. info-completed The GH issue has received a reply from the author kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@spal-x

spal-x commented May 2, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We encountered errors when trying to create a service:
2024-04-24T18:48:56.681579908+03:00 level=error msg="Error while inserting service in LB map" error="Unable to update service entry 10.237.56.27:4699 => 225157 0 (19535) [0x0 0x0]: Unable to update element for LB bpf map: You can resize it with the flag \"--bpf-lb-map-max\". The resizing might break existing connections to services" k8sNamespace=dev-webinar-bl k8sSvcName=webinar-bl subsys=k8s-watcher
Following the documentation, we checked the metric cilium_bpf_map_pressure with the label map_name="lb4_services_v2" and saw a value of 0.420758.
The configured maximum size is 65536.
A quick check of the actual map size with bpftool on the nodes confirmed that the map is full and holds 65536 entries.
So the question is: why does the metric report the wrong value?
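
To make the gap concrete, here is a minimal sketch of the arithmetic, assuming the metric is meant to be the ratio of used entries to the configured maximum (that reading is an assumption, and the helper names are hypothetical, not part of Cilium):

```python
def implied_entries(pressure: float, max_entries: int) -> int:
    """Entry count that a given pressure value would correspond to."""
    return round(pressure * max_entries)


def expected_pressure(entries: int, max_entries: int) -> float:
    """Pressure we would expect for a given entry count (assumed ratio semantics)."""
    return entries / max_entries


reported_pressure = 0.420758  # value of cilium_bpf_map_pressure from this report
max_entries = 65536           # configured maximum from this report
observed_entries = 65536      # entry count seen with bpftool in this report

print(implied_entries(reported_pressure, max_entries))   # 27575
print(expected_pressure(observed_entries, max_entries))  # 1.0
```

Under that assumption the metric corresponds to roughly 27575 entries, while a full map should report 1.0.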

Cilium Version

Client: 1.13.2 8cb94c7 2023-04-17T23:19:21+02:00 go version go1.19.8 linux/amd64
Daemon: 1.13.2 8cb94c7 2023-04-17T23:19:21+02:00 go version go1.19.8 linux/amd64

Kernel Version

Linux 5.14.0-168.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 23 11:43:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

1.27.1

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@spal-x spal-x added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 2, 2024
@youngnick
Contributor

Thanks for this issue @spal-x, but unfortunately without some extra information like the zip file produced by the process at https://docs.cilium.io/en/stable/operations/troubleshooting/#automatic-log-state-collection, it will be difficult to answer your question. Could you upload a sysdump and I will mark this for the correct team's attention?

@youngnick youngnick added the need-more-info More information is required to further debug or fix the issue. label May 3, 2024
@spal-x
Author

spal-x commented May 3, 2024

> Thanks for this issue @spal-x, but unfortunately without some extra information like the zip file produced by the process at https://docs.cilium.io/en/stable/operations/troubleshooting/#automatic-log-state-collection, it will be difficult to answer your question. Could you upload a sysdump and I will mark this for the correct team's attention?

Unfortunately, I can only provide cilium bugtool output. Is that enough?
One more question: do I understand correctly that the metric value should show the ratio of the current map size to the configured maximum?
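
For comparison, the same ratio can be computed independently from a map dump. This is only a sketch: it assumes the dump was saved in bpftool's JSON output mode and that the JSON is a top-level array with one element per map entry. The function name and the sample data are hypothetical.

```python
import json


def pressure_from_dump(dump_json: str, max_entries: int) -> float:
    """Ratio of entries in a JSON map dump to the configured maximum.

    Assumes the dump is a JSON array with one element per map entry.
    """
    entries = json.loads(dump_json)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of map entries")
    return len(entries) / max_entries


# Synthetic stand-in for a real dump: three entries in a map sized for four.
sample_dump = json.dumps([{"key": [i], "value": [0]} for i in range(3)])
print(pressure_from_dump(sample_dump, 4))  # 0.75
```

Running this against a dump of the real map with max_entries=65536 would give a figure to compare with the exported metric.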

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 3, 2024
@youngnick
Contributor

Anything you can give will help, yes.

And yes, I believe your understanding of the metric is correct.

@youngnick youngnick added the need-more-info More information is required to further debug or fix the issue. label May 6, 2024
@spal-x
Author

spal-x commented May 6, 2024

> Anything you can give will help, yes.
>
> And yes, I believe your understanding of the metric is correct.

cilium-bugtool-20240503-091354.767+0000-UTC-1826744592.zip

I deleted some large table dumps (such as connection tracking and SNAT).

@github-actions github-actions bot removed the need-more-info More information is required to further debug or fix the issue. label May 6, 2024
@youngnick youngnick added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. labels May 6, 2024
@spal-x
Author

spal-x commented May 17, 2024

Any info?

@julianwiedmann
Member

> Client: 1.13.2 8cb94c7 2023-04-17T23:19:21+02:00 go version go1.19.8 linux/amd64

I'd suggest reproducing on a current release. Even for v1.13 you're already missing a full year of fixes...

@spal-x
Author

spal-x commented May 27, 2024

> > Client: 1.13.2 8cb94c7 2023-04-17T23:19:21+02:00 go version go1.19.8 linux/amd64
>
> I'd suggest reproducing on a current release. Even for v1.13 you're already missing a full year of fixes...

Sorry for the late response. I observe the same behavior on version 1.15.2 in our test environment.

3 participants