daemon: Dump ipcache without probing for support #19501

Merged

Conversation

@jrajahalme (Member) commented Apr 20, 2022

Ipcache SupportDump() and SupportsDelete() open the map to probe for
support if the map is not already open, and also schedule the
bpf-map-sync-cilium_ipcache controller. If the controller runs before
initMaps(), initMaps() will fail, as the controller leaves the map open
and initMaps() assumes this is not the case.

Solve this by not probing for dump support at all: simply attempt the dump
and check whether it succeeds (see the sketch at the end of this
description). This is exactly the logic the dump support detection would
have used anyway, and it fixes the Cilium agent crash on kernels that do
not support ipcache dump operations, when certain Cilium features are
enabled and a slow machine causes the scheduled controller to run too
soon.

Fixes: #19360
Fixes: #19495
Signed-off-by: Jarno Rajahalme jarno@isovalent.com

Fixed Cilium agent regression causing a crash due to ipcache controller being scheduled too soon.
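
A minimal sketch of the "just try the dump" approach described above, written against the cilium/ebpf library rather than the daemon's own map wrapper; the function name and the key/value sizes here are illustrative, not the actual Cilium code:

```go
package ipcachedump

import (
	"fmt"

	"github.com/cilium/ebpf"
)

// dumpOrError walks the already-opened ipcache map once. If the kernel does
// not support dumping this map type, the iteration itself fails and the error
// is returned to the caller, which can then fall back to other restore paths
// instead of probing for support (and opening the map) ahead of time.
func dumpOrError(m *ebpf.Map) (int, error) {
	var (
		key   [24]byte // illustrative sizes only, not the real ipcache layout
		value [8]byte
	)
	entries := 0
	iter := m.Iterate()
	for iter.Next(&key, &value) {
		entries++
	}
	if err := iter.Err(); err != nil {
		// Dump is unsupported or failed; the caller handles the fallback.
		return 0, fmt.Errorf("ipcache dump failed: %w", err)
	}
	return entries, nil
}
```

If the iteration errors out, the caller simply skips the ipcache restore path, which is the same outcome the old support probe would have produced, minus the early map open.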

@jrajahalme jrajahalme added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.10 labels Apr 20, 2022
@jrajahalme jrajahalme requested review from a team and ldelossa April 20, 2022 11:23
@jrajahalme (Member Author)

/test

@jrajahalme (Member Author)

/test-1.23-net-next

@jrajahalme (Member Author)

/test-1.21-5.4

@jrajahalme (Member Author)

/test-1.16-4.9

@jrajahalme (Member Author)

A bunch of VM provisioning failures like this one:

14:40:50  ==> k8s1-1.16: Running 'pre-boot' VM customizations...
14:40:52  A customization command failed:
14:40:52  
14:40:52  ["natnetwork", "add", "--netname", "natnet1", "--network", "fd08::/64", "--ipv6", "on", "--enable"]
14:40:52  
14:40:52  The following error was experienced:
14:40:52  
14:40:52  #<Vagrant::Errors::VBoxManageError: There was an error while executing `VBoxManage`, a CLI used by Vagrant
14:40:52  for controlling VirtualBox. The command and stderr is shown below.
14:40:52  
14:40:52  Command: ["natnetwork", "add", "--netname", "natnet1", "--network", "fd08::/64", "--ipv6", "on", "--enable"]
14:40:52  
14:40:52  Stderr: VBoxManage: error: NATNetwork server already exists
14:40:52  >
14:40:52  
14:40:52  Please fix this customization and try again.

Likely depends on which runner the Jenkins job lands on?

@jrajahalme (Member Author)

/test-1.21-5.4

@jrajahalme (Member Author)

/test-1.16-4.9

@jrajahalme jrajahalme force-pushed the ipcache-close-after-no-dump-support branch from 12a035d to 891919a Compare April 20, 2022 14:28
@jrajahalme jrajahalme requested a review from a team April 20, 2022 14:28
@jrajahalme (Member Author)

/test

@jrajahalme (Member Author)

/test-1.16-4.9

@jrajahalme (Member Author)

/test-1.21-5.4

@jrajahalme (Member Author)

/test-1.22-4.19

@jrajahalme (Member Author)

/test-1.23-net-next

@jrajahalme (Member Author)

Restarted tests due to VM provisioning failures.

@jrajahalme (Member Author) commented Apr 20, 2022

/test-1.22-4.19

Job 'Cilium-PR-K8s-GKE' failed:

Test Name: K8sDatapathConfig DirectRouting Check connectivity with direct routing

Failure Output: FAIL: Pods are not ready in time: timed out waiting for pods with filter  to be ready: 4m0s timeout expired

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@jrajahalme jrajahalme changed the title from "daemon: close ipcache map also after SupportDump returns false" to "daemon: Dump ipcache without probing for support" Apr 20, 2022
@jrajahalme jrajahalme added release-blocker/1.10 release-blocker/1.11 This issue will prevent the release of the next version of Cilium. release-blocker/1.12 This issue will prevent the release of the next version of Cilium. labels Apr 20, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.5 Apr 20, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.11 Apr 20, 2022
@jrajahalme (Member Author)

/test

@jrajahalme (Member Author)

/ci-awscni

@joestringer (Member) commented Apr 25, 2022

ConformanceAKS is failing during cleanup, but this job is known to be failing and is not currently marked as required.
Conformance AWS-CNI appears to be failing consistently with the test code error @jrajahalme posted above. Note that the same failure includes other, more specific details (quoted below) that point to some bad interaction with the CLI. This is unrelated to the current PR, which fixes a known code issue in the daemon.

Unable to parse the provided version "-ci:346a508e8254881ba4575a57c3d66c054c33318f", assuming v1.11.3 for ConfigMap compatibility

All other test suites are passing.

Bypassing to merge this fix.

@joestringer joestringer merged commit b61a347 into cilium:master Apr 25, 2022
@jrajahalme jrajahalme added backport-pending/1.10 and removed needs-backport/1.10 release-blocker/1.12 This issue will prevent the release of the next version of Cilium. labels Apr 26, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.5 Apr 26, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.11 Apr 26, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.11 Apr 29, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.11 Apr 29, 2022
@joestringer joestringer moved this from Backport pending to v1.10 to Backport pending to v1.11 in 1.11.5 Apr 29, 2022
@joestringer joestringer added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels May 4, 2022
joamaki added a commit to joamaki/cilium that referenced this pull request May 5, 2022
This makes SupportsDelete() use a separate map for the probing so that it
works even if the map hasn't been pinned yet, allowing it to be used early
during init and hopefully preventing issues like the one fixed by cilium#19501.

Fixes: cilium#19619
Signed-off-by: Jussi Maki <jussi@isovalent.com>
kaworu pushed a commit that referenced this pull request May 6, 2022
This makes SupportsDelete() use a separate map for the probing so that it
works even if the map hasn't been pinned yet, allowing it to be used early
during init and hopefully preventing issues like the one fixed by #19501.

Fixes: #19619
Signed-off-by: Jussi Maki <jussi@isovalent.com>
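
For context, a minimal sketch of the separate-map probing idea from the commit above, again using the cilium/ebpf library with illustrative names (the real SupportsDelete() helper in the Cilium tree differs in detail): create a throwaway LPM trie, attempt a delete on it, and never touch the pinned ipcache map during the probe.

```go
package probe

import (
	"errors"

	"github.com/cilium/ebpf"
	"golang.org/x/sys/unix"
)

// supportsLPMDelete creates a scratch LPM trie (the ipcache map type) and
// issues a delete on a key that cannot exist. A "key not found" error means
// the kernel implements the delete operation; any other error suggests it
// does not (older kernels reject deletes on LPM tries).
func supportsLPMDelete() (bool, error) {
	scratch, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.LPMTrie,
		KeySize:    8, // 4-byte prefix length + 4 bytes of key data (illustrative)
		ValueSize:  1,
		MaxEntries: 1,
		Flags:      unix.BPF_F_NO_PREALLOC, // required for LPM tries
	})
	if err != nil {
		// Could not even create the scratch map; report the error upward.
		return false, err
	}
	defer scratch.Close()

	var key [8]byte
	err = scratch.Delete(&key)
	switch {
	case errors.Is(err, ebpf.ErrKeyNotExist):
		return true, nil // delete is wired up, the key just isn't there
	case err != nil:
		return false, nil // the delete operation itself is unsupported
	default:
		return true, nil // unexpected success on an empty map, but delete works
	}
}
```

Because the probe operates on its own scratch map, it can run before the pinned ipcache map exists and cannot leave it open for initMaps() to trip over.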
@aanm aanm moved this from Backport pending to v1.11 to Backport done to v1.11 in 1.11.5 May 10, 2022
Labels
backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. kind/bug This is a bug in the Cilium logic. release-blocker/1.11 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.10.11
Backport done to v1.10
1.11.5
Backport done to v1.11
Development

Successfully merging this pull request may close these issues.

Cilium Upgrade to 1.11.4 fails
4 participants