Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf: Use nproc --all for __NR_CPUS__ #12121

Merged
merged 1 commit into from Jun 17, 2020
Merged

Conversation

gandro
Copy link
Member

@gandro gandro commented Jun 16, 2020

This uses -D__NR_CPUS__=$(nproc --all) (or GetNumPossibleCPUs when
invoked from Go) to compile the datapath.

This fixes an issue where cilium monitor fails to report any events
on AKS, due to the perf_event_array map duplicates being created
with different max_entries sizes, presumably causing the datapath
to write to the first one, while the agent is reading from the second
one.

This bug occurs for example on AKS due to the present/possible cpuset on
the VMs. The default Standard_D2s_v3 node size has 2 present CPUs, but
128 possible CPUs in /sys/devices/system/cpu.

Fixes: #12070

This uses `-D__NR_CPUS__=$(nproc --all)` (or `GetNumPossibleCPUs` when
invoked from Go) to compile the datapath.

This fixes an issue where cilium monitor fails to report any events
on AKS, due to the `perf_event_array` map duplicates being created
with different max_entries sizes, presumably causing the datapath
to write to the first one, while the agent is reading from the second
one.

This bug occurs for example on AKS due to the present/possible cpuset on
the VMs. The default Standard_D2s_v3 node size has 2 present CPUs, but
128 possible CPUs in /sys/devices/system/cpu.

Fixes: #12070

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro added priority/release-blocker sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jun 16, 2020
@gandro gandro requested a review from a team June 16, 2020 20:52
@gandro gandro requested review from a team as code owners June 16, 2020 20:52
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Jun 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.0 Jun 16, 2020
@gandro gandro marked this pull request as draft June 16, 2020 20:52
@gandro
Copy link
Member Author

gandro commented Jun 16, 2020

test-me-please

@gandro
Copy link
Member Author

gandro commented Jun 16, 2020

This is an alternative to #12119 - I have just validated that this fixes the issue on AKS - marking ready for review.

@gandro gandro marked this pull request as ready for review June 16, 2020 21:02
@gandro
Copy link
Member Author

gandro commented Jun 16, 2020

Please check out #12070 (comment) for @pchaigno's great explanation of why this was not causing more troubles beforehand.

Copy link
Member

@borkmann borkmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@pchaigno
Copy link
Member

There's two bugs fixed here in a way:

  1. We're not using the same map sizes everywhere. That bug was introduced in daemon: Create all global maps in cilium-agent #10626 (v1.8.0-rc1).
  2. We're using present CPUs in datapath. That bug has been there for a very long time (ex. cc74306, at least v0.10.1).

Do we want to backport this to v1.7 or even v1.6?

@borkmann
Copy link
Member

(Given this is only events and signals map, this shouldn't have upgrade implications.)

@joestringer
Copy link
Member

Do we want to backport this to v1.7 or even v1.6?

I believe that the core issue here where cilium doesn't report any flows is unique to v1.8 because v1.8 began opening (creating) this map prior to datapath provisioning.

However if someone were to hotplug CPUs on v1.7 or earlier, they could plausibly also hit this. The fix itself looks pretty harmless, v1.7 backport is reasonable to me.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.6 Jun 16, 2020
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.01%) to 37.035% when pulling 6370441 on pr/gandro/bpf-use-present-cpuset into 496624b on master.

@pchaigno
Copy link
Member

The K8s-1.17-Kernel-4.19 failure happened before on master. I filled #12127 for it. Since other required builds are passing, I think we're fine to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 17, 2020
@borkmann borkmann merged commit 8191b16 into master Jun 17, 2020
1.8.0 automation moved this from In progress to Merged Jun 17, 2020
@borkmann borkmann deleted the pr/gandro/bpf-use-present-cpuset branch June 17, 2020 07:16
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.0 Jun 18, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.0 Jun 18, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.6 Jul 1, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.6 Jul 1, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.6 Jul 2, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.6 Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
1.7.6
Backport done to v1.7
1.8.0
  
Merged
1.8.0
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

cilium monitor broken on AKS 1.15 (engine 0.47)
6 participants