fix ceph errors on the 6.0 kernel #1393

Closed
roysjosh opened this issue Jan 27, 2023 · 9 comments

@roysjosh

Describe the bug

I have 2 master nodes and 1 worker node, all running Fedora CoreOS. All 3 are VMs. The two master nodes each have two SR-IOV VFs from the host's ConnectX-3 NIC: a LAN interface and a BGP peering interface to an HA router pair via kube-vip. The worker node only has a LAN VF, also SR-IOV but from an Intel I350 NIC. k8s was set up with kubespray and has been upgraded to kubespray v2.20.0 / k8s v1.24.6. It is managed with flux2, and rook v1.10.10 is deployed to provide storage via Ceph RBD using an NVMe drive on each node.

When one of my nodes upgraded via zincati to 37.20230110.3.1, I noticed that anything relying on storage was no longer functioning. There were many different errors from different points in the system.

The kernel said:
Jan 27 03:08:43 k8s-w-node3.lan kernel: libceph: osd2 (1)10.233.66.114:6800 socket closed (con state V1_BANNER)

I checked cilium thoroughly at this point, running through all of the troubleshooting steps to see if any traffic was being dropped. I ended up doing a pcap via toolbox on port 6800. There was plenty of traffic but occasionally a client would connect with a "v027" banner and get disconnected quickly. I don't know enough about ceph to comment further on this.
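For reference, the capture was roughly along these lines, run from a toolbox container on the affected node (the interface name here is just a placeholder):

# eth0 is a placeholder for the node's actual interface
tcpdump -ni eth0 -w /tmp/ceph-6800.pcap 'tcp port 6800'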

Although ceph said it was HEALTH_OK, one OSD was reported as having "slow ops". It seemed to match up with the OSD on the upgraded node. Hopping into the rbdplugin container, I noticed a bunch of hung fsck commands; I no longer have that output.
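If it helps anyone trying to reproduce this, the slow ops state can be checked from the rook toolbox with something like this (the namespace is a guess based on my CSI driver name; the deployment name is the usual rook-ceph-tools one):

# run from a host with kubectl access to the cluster
kubectl -n system-rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
kubectl -n system-rook-ceph exec deploy/rook-ceph-tools -- ceph osd perf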

Here are some bits of the kernel log from when things went bad:

Jan 25 21:48:39 k8s-m-helium.lan lvm[904]:   pvscan[904] PV /dev/vdb online, VG ceph-3d0a6c8c-04f1-4bcc-b000-65d11621cdbc is complete.
Jan 25 21:48:39 k8s-m-helium.lan lvm[904]:   pvscan[904] VG ceph-3d0a6c8c-04f1-4bcc-b000-65d11621cdbc run autoactivation.
Jan 25 21:48:39 k8s-m-helium.lan lvm[904]:   1 logical volume(s) in volume group "ceph-3d0a6c8c-04f1-4bcc-b000-65d11621cdbc" now active
Jan 25 21:48:39 k8s-m-helium.lan lvm[840]:   1 logical volume(s) in volume group "ceph-3d0a6c8c-04f1-4bcc-b000-65d11621cdbc" monitored
Jan 25 21:50:04 k8s-m-helium.lan kernel: libceph (807fd24b-1225-4cdb-87b4-622a9bd3365f e10056): osd0 up
Jan 26 19:31:40 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.252:6801 socket closed (con state OPEN)
Jan 26 19:31:42 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.252:6801 socket closed (con state V1_BANNER)
Jan 26 19:32:07 k8s-m-helium.lan kernel: libceph (807fd24b-1225-4cdb-87b4-622a9bd3365f e10061): osd2 down
Jan 26 19:34:20 k8s-m-helium.lan kernel: libceph (807fd24b-1225-4cdb-87b4-622a9bd3365f e10064): osd2 up
Jan 26 19:34:23 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
Jan 26 19:34:23 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
Jan 26 19:34:23 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
Jan 26 19:34:25 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
Jan 26 19:34:27 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
Jan 26 19:34:31 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
Jan 26 19:34:39 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
Jan 26 19:34:55 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
Jan 26 19:35:11 k8s-m-helium.lan kernel: libceph: osd2 (1)10.233.66.58:6800 socket closed (con state V1_BANNER)
[many, many more]
Jan 26 22:47:05 k8s-m-helium.lan kernel: libceph (807fd24b-1225-4cdb-87b4-622a9bd3365f e10071): osd0 down
Jan 26 22:47:26 k8s-m-helium.lan kernel: libceph (807fd24b-1225-4cdb-87b4-622a9bd3365f e10074): osd0 up
Jan 26 23:27:16 k8s-m-helium.lan kernel: libceph (807fd24b-1225-4cdb-87b4-622a9bd3365f e10078): osd1 down
Jan 26 23:37:26 k8s-m-helium.lan kubelet[1572]: E0126 23:37:26.518996    1572 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/system-rook-ceph.rbd.csi.ceph.com^0001-0010-system-rook-ceph-0000000000000001-fccad5b9-3ced-11ec-b7a2-72ca10426e5e podName: nodeName:}" failed. No retries permitted until 2023-01-26 23:37:30.518984562 +0000 UTC m=+92919.370504673 (durationBeforeRetry 4s). Error: MountVolume.MountDevice failed for volume "pvc-912adffd-da3d-4422-9e87-06cbeed69dda" (UniqueName: "kubernetes.io/csi/system-rook-ceph.rbd.csi.ceph.com^0001-0010-system-rook-ceph-0000000000000001-fccad5b9-3ced-11ec-b7a2-72ca10426e5e") pod "vaultwarden-69cd55c5cf-m25ms" (UID: "14159946-01c5-4af0-a7ae-60620257495d") : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0010-system-rook-ceph-0000000000000001-fccad5b9-3ced-11ec-b7a2-72ca10426e5e already exists
Jan 26 23:36:37 k8s-m-helium.lan kernel: libceph (807fd24b-1225-4cdb-87b4-622a9bd3365f e10081): osd1 up
Jan 27 00:03:40 k8s-m-helium.lan kernel: libceph (807fd24b-1225-4cdb-87b4-622a9bd3365f e10083): osd0 down

I'd be happy to send a full log directly to someone or do any leg work to help debug this.

Reproduction steps

  1. upgrade any node, worker or master, to 37.20230110.3.1 (see the command below)
  2. k8s is essentially nonfunctional
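(If you don't want to wait for zincati, I believe a manual upgrade should reproduce it just as well:)

# manual upgrade to the latest stable release
sudo rpm-ostree upgrade --reboot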

Expected behavior

Unicorns and rainbows

Actual behavior

Storage issues preventing proper operation/creation of pods

System details

$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: inactive
Deployments:
  fedora:fedora/x86_64/coreos/stable
                  Version: 37.20230110.3.1 (2023-01-24T16:21:39Z)
               BaseCommit: fc214c16d248686d4cf2bb3050b59c559f091692d7af3b07ef896f1b8ab2f161
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
                     Diff: 51 upgraded, 1 removed
          LayeredPackages: conntrack-tools ipset python python3-libselinux

● fedora:fedora/x86_64/coreos/stable
                  Version: 37.20221225.3.0 (2023-01-09T18:38:02Z)
               BaseCommit: d477f98d52bf707d4282f6835b85bed3d60e305a0cf6eb8effd4db4b89607f05
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
          LayeredPackages: conntrack-tools ipset python python3-libselinux

Ignition config

No response

Additional information

No response

@darknao

darknao commented Jan 27, 2023

I can confirm I have pretty much the same issue, with a slightly different setup:
My setup is a k3s cluster with Ceph 17.2.3 (managed by Rook v1.10.10) on top.
1 control-plane & 1 worker are VMs; the hpnodeX nodes are bare metal with Ceph OSDs.

As soon as I upgrade an OSD node to 37.20230110.3.1, the Ceph cluster becomes unresponsive.
So far, I haven't noticed any issues on upgraded nodes without an OSD.

NAME          STATUS                     ROLES                  AGE      VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                        KERNEL-VERSION           CONTAINER-RUNTIME
fcos          Ready                      control-plane,master   2y130d   v1.25.5+k3s2   192.168.0.84   <none>        Fedora CoreOS 37.20230110.3.1   6.0.18-300.fc37.x86_64   containerd://1.6.14-k3s1
fcos-worker   Ready                      <none>                 2y125d   v1.25.5+k3s2   192.168.0.83   <none>        Fedora CoreOS 37.20230110.3.1   6.0.18-300.fc37.x86_64   containerd://1.6.14-k3s1
hpnode1       Ready                      <none>                 346d     v1.25.5+k3s2   192.168.0.80   <none>        Fedora CoreOS 37.20221225.3.0   6.0.15-300.fc37.x86_64   containerd://1.6.14-k3s1
hpnode2       Ready                      <none>                 346d     v1.25.5+k3s2   192.168.0.81   <none>        Fedora CoreOS 37.20221225.3.0   6.0.15-300.fc37.x86_64   containerd://1.6.14-k3s1
hpnode3       Ready,SchedulingDisabled   <none>                 147d     v1.25.6+k3s1   192.168.0.82   <none>        Fedora CoreOS 37.20230110.3.1   6.0.15-300.fc37.x86_64   containerd://1.6.15-k3s1

Thanks to @dustymabe for the idea, I tried the 6.0.15-300 kernel with 37.20230110.3.1 (on hpnode3), and so far it seems to work just fine. So it sounds like it's kernel-related.

In case you want to try that yourself, the command to override the kernel is:

sudo systemctl stop zincati && sudo rpm-ostree override replace https://bodhi.fedoraproject.org/updates/FEDORA-2022-90162a1d88 --reboot
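And to undo the override later, something like this should work (I haven't tested this exact sequence):

# zincati was only stopped, so it comes back on its own after the reboot
sudo rpm-ostree override reset --all
sudo systemctl reboot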

@dustymabe
Member

hey @darknao - thanks for the update and for testing out the 37.20230110.3.1 content with a kernel-6.0.15-300.fc37 override.

This tells us that there was a problem introduced in this transition:

kernel 6.0.15-300.fc37.x86_64 → 6.0.18-300.fc37.x86_64

To narrow down the search for the problem commit, could you possibly try the 6.0.16 and 6.0.17 kernels too?

  • kernel-6.0.16-300.fc37
    • sudo rpm-ostree override replace https://bodhi.fedoraproject.org/updates/FEDORA-2023-1051069c3b
  • kernel-6.0.17-300.fc37
    • sudo rpm-ostree override replace https://bodhi.fedoraproject.org/updates/FEDORA-2023-ffa70d2433

Sometimes with issues like these we also find that the problem has already been found and fixed upstream. If you could also try the testing stream (with kernel-6.1.6-200.fc37), that would help too.
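For reference, switching a node over to the testing stream should be roughly (again with zincati stopped first):

sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/testing
sudo systemctl reboot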

@darknao

darknao commented Jan 27, 2023

Here are the results:

  • with 37.20230110.3.1 (stable)

    • 6.0.18-300.fc37 : FAIL
    • 6.0.17-300.fc37 : FAIL
    • 6.0.16-300.fc37 : FAIL
    • 6.0.15-300.fc37 : OK
  • with 37.20230122.2.0 (testing)

    • 6.1.6-200.fc37 : OK
  • with 37.20230122.1.1 (next)

    • 6.1.6-200.fc37 : OK

I'll keep one node on testing and report back if I see any issues, but so far so good.

@dustymabe
Member

Oh interesting. So it's possible a new commit sometime between 6.0.16 and 6.1.6 fixes the issue.

Thank you for testing this and reporting the information!

@dustymabe
Member

What's crazy is that there aren't any commits upstream to ceph between 6.0.15 and 6.0.16:

[dustymabe@media linux ((v6.0.16))]$ git log v6.0.15..v6.0.16 fs/ceph/ net/ceph/ include/linux/ceph/
[dustymabe@media linux ((v6.0.16))]$ 
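The same kind of check can be widened to the whole known-bad range and to the generic networking code, just as a guess at where else to look:

git log --oneline v6.0.15..v6.0.18 -- net/ include/net/ fs/ceph/ net/ceph/ include/linux/ceph/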

@dustymabe dustymabe changed the title 37.20230110.3.1 takes my k8s cluster down ceph errors on kernel-6.0.18-300.fc37 Jan 27, 2023
@dustymabe dustymabe added the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Jan 27, 2023
@punycode

What's crazy is that there aren't any commits upstream to ceph between 6.0.15 and 6.0.16:

[dustymabe@media linux ((v6.0.16))]$ git log v6.0.15..v6.0.16 fs/ceph/ net/ceph/ include/linux/ceph/
[dustymabe@media linux ((v6.0.16))]$ 

Not crazy at all. We ran into a strikingly similar problem when the FCOS update hit, except we did not watch the problem from the client (kernel RBD) side. We had 100% of PGs missing after the upgrade, with OSDs constantly toggling between UP and DOWN. I don't have the logs anymore (stupid me), but what we observed was this:

  • Ceph clients and Ceph OSD nodes are completely separate in our setup
  • OSD processes were starting up (on the host network) and tried to reach their peers for their respective PGs
  • They did not manage to bind their respective ports, except for one per host (ports starting at 6800)
  • netstat -tanp | grep ceph-osd was filled with TIME_WAIT
  • OSDs never became healthy, since they could not reach their assigned peers (except maybe when the OSDs that won the race for port 6800 on two hosts were actually paired for some PGs)

My best guess is that it's not a problem in the Ceph client parts of the kernel, but something related to the way OSDs do networking. Maybe they could not figure out which ports were already in use and all tried to bind the same base port without success.
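In case anyone wants to dig into this on an affected node, the relevant knobs and state can be checked with something like the following (a sketch, not exactly what we ran at the time; osd.0 is just an example id):

# default OSD bind port range should be 6800-7300
ceph config get osd.0 ms_bind_port_min
ceph config get osd.0 ms_bind_port_max
# which ports the OSD daemons actually managed to bind
netstat -tlnp | grep ceph-osd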

In the end we rolled back the OSD nodes to the previous release, while all other nodes followed through with the update. No problems since then, so we definitely would not blame the Ceph client kernel code.
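For reference, rolling back should just be the usual (assuming the previous deployment is still on disk):

sudo rpm-ostree rollback --reboot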

@rrayst

rrayst commented Jan 30, 2023

@punycode it sounds like we have the same problem. We are still running FCOS 37.20230110.3.1, so this can still be investigated if anyone is interested.

@roysjosh
Author

@dustymabe
Member

The fix for this went into stable stream release 37.20230122.3.0.

@dustymabe dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Feb 22, 2023