
vagrant: bump all Vagrant box versions #17394

Merged 2 commits into master from pr/update-packer-ci-build on Oct 11, 2021
Conversation

@tklauser (Member) commented Sep 14, 2021

Mostly to pick up Go 1.17

Supersedes #17352, which used the wrong branch name; see #17352 (comment). Otherwise no changes to the PR contents.

@tklauser tklauser added the area/CI (Continuous Integration testing issue or flake) and release-note/ci (This PR makes changes to the CI) labels Sep 14, 2021
@tklauser tklauser requested a review from a team as a code owner September 14, 2021 15:34
@tklauser (Member Author)

test-me-please

@aanm (Member) commented Sep 15, 2021

test-1.16-netnext

@tklauser (Member Author)

test-1.16-netnext

Looks like this is timing out waiting for the kubelet to become ready: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1445/. I think this is also what happened on the previous run and on the previous PR #17352.

I've tried to reproduce this locally but so far wasn't able to (i.e. the VMs always come up as expected with the new VM image).

@tklauser (Member Author) commented Sep 16, 2021

test-me-please

test-1.16-netnext

> Looks like this is timing out waiting for the kubelet to become ready: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1445/. I think this is also what happened on the previous run and on the previous PR #17352.
>
> I've tried to reproduce this locally but so far wasn't able to (i.e. the VMs always come up as expected with the new VM image).

I was using the dev VM instructions instead of the CI VM instructions (thanks @nbusseneau for pointing this out). Using `K8S_VERSION=1.16 K8S_NODES=1 NETNEXT=1 ./vagrant-local-start.sh` I was able to reproduce the VM provisioning failure. It looks like kubelet doesn't come up:

```
vagrant@k8s1:~$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Thu 2021-09-16 13:21:21 UTC; 86ms ago
     Docs: https://kubernetes.io/docs/home/
  Process: 11042 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
 Main PID: 11042 (code=exited, status=255)
vagrant@k8s1:~$ journalctl -xeu kubelet
[...]
Sep 16 13:22:13 k8s1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit kubelet.service has finished starting up.
-- 
-- The start-up result is RESULT.
Sep 16 13:22:13 k8s1 kubelet[11384]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See htt
Sep 16 13:22:13 k8s1 kubelet[11384]: Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See http
Sep 16 13:22:13 k8s1 kubelet[11384]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See htt
Sep 16 13:22:13 k8s1 kubelet[11384]: Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See http
Sep 16 13:22:13 k8s1 kubelet[11384]: I0916 13:22:13.995746   11384 server.go:410] Version: v1.16.15
Sep 16 13:22:13 k8s1 kubelet[11384]: I0916 13:22:13.996551   11384 plugins.go:100] No cloud provider specified.
Sep 16 13:22:13 k8s1 kubelet[11384]: I0916 13:22:13.996614   11384 server.go:773] Client rotation is on, will bootstrap in background
Sep 16 13:22:14 k8s1 kubelet[11384]: I0916 13:22:14.009873   11384 certificate_store.go:129] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Sep 16 13:22:14 k8s1 kubelet[11384]: I0916 13:22:14.234620   11384 server.go:636] --cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /
Sep 16 13:22:14 k8s1 kubelet[11384]: F0916 13:22:14.235267   11384 server.go:271] failed to run Kubelet: strconv.Atoi: parsing "1\n\x00": invalid syntax
Sep 16 13:22:14 k8s1 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Sep 16 13:22:14 k8s1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
```

The following looks like the culprit for failing to start kubelet:

Sep 16 13:22:14 k8s1 kubelet[11384]: F0916 13:22:14.235267   11384 server.go:271] failed to run Kubelet: strconv.Atoi: parsing "1\n\x00": invalid syntax

Not sure where that `1\n\x00` is coming from. Maybe a command-line argument or a config file? Will try to dig further...
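For reference, the exact error is easy to reproduce in isolation. This is a minimal standalone Go sketch (illustrative only, not kubelet code) showing that strconv.Atoi rejects the trailing newline and NUL byte:

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// The raw value read from sysfs on the broken kernel ends in "\n\x00".
	_, err := strconv.Atoi("1\n\x00")
	fmt.Println(err) // strconv.Atoi: parsing "1\n\x00": invalid syntax

	// The same value without the trailing bytes parses fine.
	n, err := strconv.Atoi("1")
	fmt.Println(n, err) // 1 <nil>
}
```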

@tklauser (Member Author) commented Sep 16, 2021

Starting kubelet with --v=8 unfortunately doesn't give any further clues.

But it smells a bit like that value could come from a sysctl or sysfs file. And indeed, checking for uses of strconv.Atoi in the code around the last log messages, I noticed this:

https://github.com/kubernetes/kubernetes/blob/2adc8d7091e89b6e3ca8d048140618ec89b39369/pkg/kubelet/cm/cpuset/cpuset.go#L258-L297

which gets called from here:

https://github.com/kubernetes/kubernetes/blob/2adc8d7091e89b6e3ca8d048140618ec89b39369/pkg/kubelet/cm/cpumanager/topology/topology.go#L310-L325
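For illustration, here is a heavily simplified sketch of that parsing approach (an approximation, not the actual kubelet implementation): each range bound from the cpulist string is passed straight to strconv.Atoi, so a single stray trailing byte makes the whole parse fail.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList is a simplified stand-in for kubelet's cpuset parsing:
// split the cpulist on "," and "-" and convert each bound with strconv.Atoi.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	for _, r := range strings.Split(s, ",") {
		bounds := strings.SplitN(r, "-", 2)
		first, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, err
		}
		last := first
		if len(bounds) == 2 {
			// Fails with "invalid syntax" if the string ends in "\n\x00".
			if last, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, err
			}
		}
		for cpu := first; cpu <= last; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	fmt.Println(parseCPUList("0-1"))       // [0 1] <nil>
	fmt.Println(parseCPUList("0-1\n\x00")) // [] strconv.Atoi: parsing "1\n\x00": invalid syntax
}
```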

Comparing the content of that file on the net-next kernel:

```
vagrant@k8s1:~$ xxd /sys/devices/system/node/node0/cpulist
00000000: 302d 310a 00                             0-1..
```

with my development machine running kernel 5.11:

```
tklauser@orbital:~ $ xxd /sys/devices/system/node/node0/cpulist
00000000: 302d 370a                                0-7.
```

Notice the trailing null byte 00 in the output on the net-next kernel (the difference in the CPU numbers is expected as these are different systems). This looks like a kernel regression to me.
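To double-check this without xxd, a small Go sketch (illustrative only, assuming the sysfs file exists on the node) can dump the raw bytes and show that stripping the trailing newline/NUL is enough for the value to parse again:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	raw, err := os.ReadFile("/sys/devices/system/node/node0/cpulist")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// On the broken net-next kernel this prints something like "0-1\n\x00".
	fmt.Printf("raw bytes: %q\n", raw)

	// Stripping the trailing newline and NUL byte lets strconv.Atoi succeed.
	cleaned := strings.TrimRight(string(raw), "\n\x00")
	last := cleaned[strings.LastIndex(cleaned, "-")+1:]
	n, err := strconv.Atoi(last)
	fmt.Println(n, err) // e.g. 1 <nil>
}
```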

@tklauser (Member Author)

> Notice the trailing null byte 00 in the output on the net-next kernel (the difference in the CPU numbers is expected as these are different systems). This looks like a kernel regression to me.

It's indeed a kernel regression, which incidentally breaks kubelet. It looks like the regression was introduced by the series at https://lore.kernel.org/all/20210806110251.560-1-song.bao.hua@hisilicon.com/

I've sent a potential fix upstream: https://lore.kernel.org/lkml/20210916222705.13554-1-tklauser@distanz.ch/T/#u

@tklauser tklauser added the dont-merge/waiting-for-upstream label (Only merge once upstream library or kernel features have landed) Sep 20, 2021
@tklauser (Member Author)

> I've sent a potential fix upstream: https://lore.kernel.org/lkml/20210916222705.13554-1-tklauser@distanz.ch/T/#u

Patch was applied to the driver-core tree and should hopefully hit Linus' tree for 5.15-rc3.

Marking this PR as draft until that is the case and bpf-next has picked up the fix as well, at which point we'll need to build new net-next VM images to pick up in this PR.

@tklauser tklauser marked this pull request as draft September 23, 2021 11:05
tklauser added a commit to cilium/packer-ci-build that referenced this pull request Oct 4, 2021
Add a patch [1] to the bpf-next kernel build which fixes a regression
which causes kubelet to error out on startup.

See [2] and following comments for details.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c86a2d9058c5a4a05d20ef89e699b7a6b2c89da6
[2] cilium/cilium#17394 (comment)

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser (Member Author) commented Oct 4, 2021

> Notice the trailing null byte 00 in the output on the net-next kernel (the difference in the CPU numbers is expected as these are different systems). This looks like a kernel regression to me.
>
> It's indeed a kernel regression, which incidentally breaks kubelet. It looks like the regression was introduced by the series at https://lore.kernel.org/all/20210806110251.560-1-song.bao.hua@hisilicon.com/
>
> I've sent a potential fix upstream: https://lore.kernel.org/lkml/20210916222705.13554-1-tklauser@distanz.ch/T/#u

Quick status update: The patch has made it upstream into Linus' tree but hasn't made it into the bpf-next tree yet. In order to unblock this PR and PRs depending on a Vagrant VM bump, I've opened cilium/packer-ci-build#284, which applies said patch manually before building the kernel for the bpf-next kernel image. Once that is approved and merged and a new VM image has been built, we can finally move forward with this PR.

@tklauser tklauser removed the dont-merge/waiting-for-upstream label (Only merge once upstream library or kernel features have landed) Oct 5, 2021
@tklauser tklauser marked this pull request as ready for review October 5, 2021 09:39
@tklauser tklauser requested a review from a team as a code owner October 5, 2021 09:39
@tklauser (Member Author) commented Oct 5, 2021

/test

Updated to the latest VM image versions (namely to include the kernel patch mentioned above) and included a revert of commit 1c42ae9 as suggested by Paul.

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:

Test Name: K8sDatapathConfig Host firewall With native routing
Failure Output: FAIL: Test suite timed out after 3h0m0s

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@tklauser (Member Author) commented Oct 5, 2021

/test

@tklauser tklauser closed this Oct 5, 2021
@tklauser tklauser reopened this Oct 5, 2021
@tklauser tklauser closed this Oct 5, 2021
@tklauser tklauser reopened this Oct 5, 2021
@tklauser (Member Author) commented Oct 5, 2021

/test

@tklauser tklauser force-pushed the pr/update-packer-ci-build branch 2 times, most recently from 7c71eb5 to c4d8872, October 6, 2021 14:10
@tklauser (Member Author) commented Oct 6, 2021

/test

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:

Test Name: K8sDatapathConfig Etcd Check connectivity
Failure Output: FAIL: Pods are not ready in time: timed out waiting for pods with filter  to be ready: 4m0s timeout expired

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@tklauser (Member Author) commented Oct 7, 2021

/test

@tklauser (Member Author) commented Oct 7, 2021

test-runtime

Failed to import image: https://jenkins.cilium.io/job/Cilium-PR-Runtime-net-next/303/

Job 'Cilium-PR-K8s-1.20-kernel-4.19' failed and has not been observed before, so may be related to your PR:

Test Name: K8sDatapathConfig Host firewall With VXLAN
Failure Output: FAIL: Failed to reach 192.168.36.11:80 from testclient-69hxt

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.19 so I can create a new GitHub issue to track it.

@tklauser (Member Author) commented Oct 7, 2021

ci-eks

Failed to create cluster: https://github.com/cilium/cilium/runs/3824163280

Pick up Go 1.17 and ipset.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
This reverts commit 1c42ae9.

Reason for revert: ipset is now provided in the vagrant images

Suggested-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser (Member Author)

/test

@tklauser tklauser added the ready-to-merge label (This PR has passed all tests and received consensus from code owners to merge) Oct 11, 2021
@tklauser (Member Author)

CI finally passed 🎉 (except for AKS which is disabled) and reviews are in. Marked as ready to merge.

@aanm aanm merged commit fda0d84 into master Oct 11, 2021
@aanm aanm deleted the pr/update-packer-ci-build branch October 11, 2021 20:26
twpayne added a commit that referenced this pull request Oct 12, 2021
Now that the CI infrastructure runs Go 1.17 everywhere (#17394), we can
use Go 1.17 features like math.MaxInt.

Signed-off-by: Tom Payne <tom@isovalent.com>
aanm pushed a commit that referenced this pull request Oct 12, 2021
Now that the CI infrastructure runs Go 1.17 everywhere (#17394), we can
use Go 1.17 features like math.MaxInt.

Signed-off-by: Tom Payne <tom@isovalent.com>
Labels
area/CI (Continuous Integration testing issue or flake), ready-to-merge (This PR has passed all tests and received consensus from code owners to merge), release-note/ci (This PR makes changes to the CI)