[1632.2.1] kernel deathspin on iSCSI session logout without deleting scsi devices before #2357
Shows that 21248 is upset
Somewhat related: https://patchwork.kernel.org/patch/7987711/
Difference between "stable" nodes:
It seems I can reliably reproduce the test case. All that is needed is to run a pod with a PersistentVolumeClaim, which then dynamically provisions a PersistentVolume (we use netapp/trident for that), then delete the pod so that no more iSCSI volumes are used on the servers. AFAIK kubelet then triggers an iSCSI session disconnect, which might be triggering some bug in the kernel.
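Roughly, the reproduction flow looks like this (a minimal sketch; the StorageClass name trident-iscsi and the PVC/pod names are placeholders, not taken from the report):

```sh
# Hypothetical reproduction sketch: create a PVC that trident provisions over
# iSCSI, start a pod using it, then delete the pod so kubelet tears the
# session down. Names and the StorageClass are assumptions.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: trident-iscsi
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

# Deleting the pod is what makes kubelet unmount the volume and log out of
# the iSCSI session.
kubectl delete pod test-pod
```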
Doesn't show any iSCSI commands that differ from a freshly started and cordoned node (i.e. no iSCSI volumes mounted, but iSCSI discovery is completed and disks are seen in multipath -ll):
Enabling full SCSI logging (via https://github.com/hreinecke/sg3_utils/blob/master/scripts/scsi_logging_level) shows no commands going to the iSCSI devices, just occasional sda writes, so it is not some sort of retry or SAN error.
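A sketch of how that logging can be toggled; the exact flags accepted by scsi_logging_level may vary between sg3_utils versions, so treat this as an assumption rather than the commands used here:

```sh
# Raise all SCSI logging facilities, watch dmesg, then turn logging back off.
# The script just writes a bitmask to /proc/sys/dev/scsi/logging_level.
./scsi_logging_level -s -a 3
dmesg -w          # observe whether any commands hit the iSCSI devices
./scsi_logging_level -s -a 0
```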
kubelet errors:
Remaining mounts on the host:
The rkt pod is kubelet.
Running the command which kubelet tries to run:
iscsid logs:
I deleted all scsi devices from that session manually:
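The commands themselves aren't preserved above; deleting the disks by hand would look roughly like this (device names are placeholders; list the real ones first with `iscsiadm -m session -P 3` or `lsscsi`):

```sh
# Remove each SCSI disk belonging to the session via its sysfs delete attribute.
for d in sdb sdc sdd; do
    echo 1 > /sys/block/$d/device/delete
done
```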
Now the session looks like the following:
It seems that nothing needs to be mounted. I masked kubelet and rebooted the node. A simple logout run (debug output):
Logging out from all targets sends 4 kworkers into an infinite spin, same as above.
Found a safe sequence to log out, which doesn't cause problems and succeeds:
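A sketch of that sequence, based on the order described further down in the thread (multipath -F, delete the scsi devices, iscsiadm logout); device names are placeholders:

```sh
multipath -F                          # flush unused multipath maps first
for d in sdb sdc sdd; do              # then delete every disk from the session
    echo 1 > /sys/block/$d/device/delete
done
iscsiadm -m node --logoutall=all      # only then log out of all targets
```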
Asked for help on the open-iscsi mailing list: https://groups.google.com/forum/#!topic/open-iscsi/Tc6ERb1QOBg
"Bisecting":
Interesting observation: on 1632.2.1 (didn't try other versions), after a clean logout (multipath -F, delete the scsi devices, iscsiadm logout) the ALUA reference count is still 92, which is exactly 23 (number of LUNs) * 4 (number of portals). So even though the scsi devices are gone, the ALUA reference count didn't change.
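I'm not certain which command produced that count in the original comment, but the module use count can be read like this (a guess, not the exact command from the thread):

```sh
lsmod | grep scsi_dh_alua             # last column is the module use count
cat /sys/module/scsi_dh_alua/refcnt   # same number via sysfs
```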
Hm, interesting! masking
With
"Bisecting": |
Reproduced on Fedora. Fedora kernels print more info to dmesg:
git bisect points to fbce4d97fd ("scsi: fixup kernel warning during rmmod()"); posted findings to LKML: https://lkml.org/lkml/2018/2/18/89
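For reference, the bisect workflow is roughly the following; the endpoints here are illustrative (4.14.20 is bad per the next comment, the good point is an assumption, not from the report):

```sh
git bisect start
git bisect bad  v4.14.20     # hangs on iSCSI logout
git bisect good v4.14.15     # assumed-good earlier point, for illustration only
# at each step: build the kernel, boot it, test the logout, then mark it
git bisect good              # or: git bisect bad
```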
Cherry-picking commit 81b6c9998979 ('scsi: core: check for device state in __scsi_remove_target()') on top of 4.14.20 fixes the problem for me.
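A sketch of applying that fix locally (assumes a clone that already contains the commit, e.g. a full mainline or stable tree):

```sh
git checkout -b 4.14.20-iscsi-fix v4.14.20
git cherry-pick 81b6c9998979   # scsi: core: check for device state in __scsi_remove_target()
```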
Thanks for tracking this down. I'll cherry-pick the commit onto our 4.14.20 branch so it should be in the next releases.
I suppose there is no reason why this patch can't be cherry-picked into the 4.14 stable branch by upstream. I'll ask them; let's see what they say.
@dm0- patch was pulled into the 4.14-stable branch and will be released in 4.14.21: https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/?id=a9ca15212fc81273f4f9f5efdececda0d4c06808
Thanks, it looks like it will be released Friday, so I will update to it then to get it into next week's alpha.
Any chance the kernel can be updated in stable without the rest of the updates?
Yes, we update the LTS kernel in all channels. The next stable minor release isn't scheduled, though, so the kernel update will be queued for release whenever there is an important security issue or bug fix that warrants it. (Unless a beta gets promoted to stable first, which happens in about five weeks.)
@dm0- , is there an easy way to rebuild the stable image with the 4.14.21 kernel locally?
You could run this in the SDK:
Then use the
Oh, and I forgot our test images are public, so you could download this image (which also has a signature): http://builds.developer.core-os.net/boards/amd64-usr/1632.3.0%2Bjenkins2-build-1632%2Blocal-1195/coreos_production_image.bin.bz2 . Note that it is stable with 4.14.22.
Interesting! So I don't need to build anything? :) Where is the full index of test images? http://builds.developer.core-os.net/boards/amd64-usr/ doesn't list 1632.3.0, even though your full URL works.
Sorry, indexes are not generated for the development bucket, and unfortunately the directory names are unpredictable due to Jenkins build numbers being included in the OS version. There also is no easy way to determine the changes in a development build without access to an internal CI interface. (You could technically look up the build number in https://github.com/coreos/manifest-builds/tags and trace back manifest commits if you were so determined.) Images for production releases such as 1632.3.0 are in a different bucket (which does generate indexes): https://stable.release.core-os.net/amd64-usr/1632.3.0
The bucket doesn't have up-to-date HTML indexes generated, but it's a Google Cloud Storage bucket that allows anonymous reads. It can be accessed via the Google Cloud web console (https://console.cloud.google.com/storage/browser/builds.developer.core-os.net/boards/amd64-usr) or via the API. The console link requires some form of Google login, but the API doesn't (e.g. the JSON list of objects works with no auth). Of course, none of the above is documented... maybe we should generate indexes again...
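For example, an anonymous listing via the GCS JSON API would look roughly like this (the prefix is just an example):

```sh
curl -s "https://www.googleapis.com/storage/v1/b/builds.developer.core-os.net/o?prefix=boards/amd64-usr/&fields=items/name"
```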
I'm going to close this issue since the fixed kernel version is queued in all channels, and beta and alpha are releasing tomorrow.
We are hitting exactly the same issue. Big thanks to @redbaron for tracking this down.
@r7vme Fixed kernels are in every channel now, so you might be seeing a different problem. Do you have error messages or backtraces?
@dm0- We are on an older kernel.
Bug
Container Linux Version
Environment
OEM=vmware_raw
iSCSI mounted disks
Actual Behavior
No new containers can be started. Already running containers seem to be fine.
Other Information
top output
I'll add more info once I have it. The server is live and I am not rebooting it, so if you want me to collect some diagnostics, I can run commands for you.