Stable 3227.2.2 randomly causes processes to hang on I/O related operations #847
Comments
Thanks for reporting. In https://www.flatcar.org/releases#release-3227.2.2 we have links to the kernel changelogs for each bugfix release since the last Flatcar stable release. We need to find suspicious changes and see whether the problem goes away if they are reverted.
Could not reproduce it using QEMU so far. I tried plenty of reboots and let the VM run for a while using one of our Ignition templates. Also, starting with 3227.2.1 and updating did not trigger it. :(
We think we're having this same problem. It seems to manifest most often as sshd hanging. We'll keep a test server running things and hopefully get some idea about what's going on, but probably can't help with a reproduction case if QEMU isn't working.
We are still seeing lock-ups of nodes with 3227.2.2 on a daily basis in our fleet, but still fail to reproduce the error with any consistency. Affected nodes are hanging; login via SSH or console just hangs. We always see stack traces of "hung task timeout" errors that are I/O- or filesystem-related.
The main node we see hanging has several CephFS mounts. This kernel module has been problematic in the past: does anyone else with this problem use CephFS?
We experienced the issue with ext4 on VMware disks.
My company sees these issues as well under VMware with 3227.2.2. We will be ready to provide more info if necessary.
The servers in our company have the same symptoms after an update to 3227.2.2. We can reproduce it as soon as we generate a lot of I/O on the hard disk. Unfortunately I don't see any errors in systemd or the kernel ring buffer. As a temporary solution we rolled back to 3227.2.1 for now.
Anything specific you executed for that? I shuffled around some 100 GBs using …
I could reproduce hung task errors on bare metal by running … There were no further negative consequences, especially after shutting down stress: no more errors AFAICS and the load is back to normal.
We tried to reproduce the case again today in our test cluster. We installed one node with Flatcar version 3227.2.1 and the other with 3227.2.2. Both nodes have the following hardware:
We used dd, stress as well as stress-ng, unsuccessfully! We had high load, but the 3227.2.2 node did not crash. Then we crafted the following K8s resource with a small Golang tool, and this actually managed to crash the 3227.2.2 node after about 5 minutes; SSH and other services were no longer accessible. On the 3227.2.1 node the script ran until the filesystem reported that it was out of inodes, but the node was still reachable.
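The tool essentially creates files until the filesystem runs out of inodes. A rough, illustrative sketch of such a workload in Go (not the exact tool: the fsync per file and the target directory are assumptions, and on Kubernetes it would be wrapped in a Job or DaemonSet writing to a hostPath volume on the node's ext4 root filesystem):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"sync"
)

func main() {
	// Hypothetical target directory; in a cluster this would be a hostPath
	// volume backed by the node's ext4 root filesystem.
	dir := "/var/tmp/io-hammer"
	if err := os.MkdirAll(dir, 0o755); err != nil {
		log.Fatal(err)
	}

	const workers = 8
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; ; i++ {
				name := filepath.Join(dir, fmt.Sprintf("w%d-%d", id, i))
				f, err := os.Create(name)
				if err != nil {
					// Typically fails once the filesystem runs out of inodes or space.
					log.Printf("worker %d stopping: %v", id, err)
					return
				}
				f.WriteString("x")
				f.Sync() // force journal/metadata writes (assumption, not from the original tool)
				f.Close()
			}
		}(w)
	}
	wg.Wait()
}
```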
When the 3227.2.2 node crashed, the kubelet process was marked as a zombie and in the kernel ring buffer we saw the following errors:
cc (@damoon)
We are also experiencing this problem, but on 3227.2.1 with some packages updated: the only updates we made were containerd 1.6.8 and kernel 5.15.62. When this issue occurs, our CPU metrics show that containerd jumps to 90% utilisation of the node's CPU and some services such as sshd stop responding. As stated in previous replies, the node becomes entirely unresponsive and Kubernetes reports the node as unreachable. The node doesn't recover unless force-restarted or replaced by the cloud provider. I have tried downgrading containerd to 1.6.6 (which worked fine for us on a previous Flatcar version), and I have also tried upgrading the kernel to 5.15.68; neither resolved the problem. Comparing the kernel versions of upstream 3227.2.1 and our current stable Flatcar version (3139.2.3), I see 5.15.58 and 5.15.54 respectively, so potentially some kernel version after these introduced the issue?
@Champ-Goblem since you have decent testing capabilities, would you mind also trying to reproduce this with the 5.15.70 kernel? It looks like it will be in the next stable release: flatcar-archive/coreos-overlay@ea82f18
I was not able to reproduce this with the reproducer in my own testing, but an infrastructure VM that we have on the stable channel has also hit this (14 days of uptime, no particular I/O load). Log:
State of our research so far:
The suspicious commit has not been reverted yet and is still in mainline (and therefore also in the 5.15 LTS kernel series). We will continue to closely monitor the situation.
@Champ-Goblem Since we still have trouble reproducing this on our end, would it be possible for you to apply the patch below to 5.15.63 (i.e. 3227.2.2) and check whether you can still reproduce the issue? The patch reverts the suspicious commit 51ae846cff5. We have just started a discussion on linux-ext4 (replying to the syzbot thread) to raise awareness of potential issues with 51ae846cff5. A clear result with the patch reverted (and no issues caused) would help a lot in this discussion.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 98f381f6fc18..072a9bc931fc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3139,15 +3139,13 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
{
struct inode *inode = mapping->host;
journal_t *journal;
- sector_t ret = 0;
int err;
- inode_lock_shared(inode);
/*
* We can get here for an inline file via the FIBMAP ioctl
*/
if (ext4_has_inline_data(inode))
- goto out;
+ return 0;
if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
test_opt(inode->i_sb, DELALLOC)) {
@@ -3186,14 +3184,10 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
jbd2_journal_unlock_updates(journal);
if (err)
- goto out;
+ return 0;
}
- ret = iomap_bmap(mapping, block, &ext4_iomap_ops);
-
-out:
- inode_unlock_shared(inode);
- return ret;
+ return iomap_bmap(mapping, block, &ext4_iomap_ops);
}
static int ext4_readpage(struct file *file, struct page *page)
@t-lo Yep, I can give this a go, although I may not be able to provide results until some time next week.
Should we help by building the patched image?
I have already started a build on our infra for the patched image, so that should be fine; it will probably take an hour and a bit to build. I'll look at getting it rolled out and start testing by the afternoon, so hopefully I'll have some results either later today or some time tomorrow.
Okay, I think the patch provided above fixes the problem. I've deployed two nodes, one running stock 5.15.63 and a second running with the patch.
Awesome, thank you very much for testing! We're discussing this issue upstream with Jan Kara, an ext4 maintainer (https://www.spinics.net/lists/linux-ext4/msg85417.html ff.; messages from today will become available in the archive tomorrow).
@Champ-Goblem would you be able to rebuild and test an image with the 5.15.63 kernel, without the revert but with this PR applied: flatcar-archive/coreos-overlay#2196? This should fix the issue with the stack traces being slightly off, and help upstream come up with a proper fix.
I have rebuilt Flatcar with the above PR; below should be the logs from the system during the failure:
Thanks, @Champ-Goblem, I'll forward that to the mailing list. Any chance you can also provide the following:
@jepio I think I have managed to get some of the list of hung tasks. I can't be sure how complete the list is, but I think it should be a good starting point nonetheless.
We are also experiencing this. Any idea if and when this will be solved? Is it certain that this is a vanilla kernel issue? If so, other distros should have the same problem as well.
There's a fix available now (https://lore.kernel.org/linux-ext4/20221122115715.kxqhsk2xs4nrofyb@quack3/T/#ma65f0113b4b2f9259b8f7d8af16b8cb351728d20) and I've tested it on 5.15 with the reproducer in this thread; I can no longer reproduce the hang. This should now get merged upstream and subsequently backported. Then it will be included in a Flatcar stable release.
Thanks for all the work that went into fixing this! Any advice on where I can track whether the patch has landed upstream? We have stopped Flatcar OS upgrades for 3 months now and our PCI compliance is not looking too good... 😬
@databus23 We have worked around this by changing the root FS from ext4 to XFS. Our tests look good so far, but we haven't rolled it out to production yet. In case you need to upgrade Flatcar for some reason, this might be an alternative for you until the ext4 fix has made its way into a Flatcar stable release (see the config sketch below).
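A minimal sketch of the provisioning side, assuming Butane with the flatcar variant (field names and the exact variant/version should be verified against the current Flatcar documentation for reconfiguring the root filesystem; wipe_filesystem destroys existing root data, so this is only for freshly provisioned nodes):

```yaml
# Hedged sketch only: verify against the Flatcar/Butane docs before use.
# Reformats the ROOT partition as XFS on first boot instead of the default ext4.
variant: flatcar
version: 1.0.0
storage:
  filesystems:
    - device: /dev/disk/by-label/ROOT
      format: xfs
      wipe_filesystem: true
      label: ROOT
```

The Butane config is transpiled to Ignition and passed to the node at first boot as usual.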
We have been testing the new patch [1], which we have applied against 5.15.80 in a custom build, and it is currently looking stable again on our development clusters with Flatcar 3227.2.2.
Unfortunately there hasn't been any response from the ext4 maintainer; I've poked them this week. In the meantime we'll likely take the backport into the next Alpha/Beta Flatcar release next week. @Champ-Goblem thanks for all the help with testing and reproducing; it's good to have a data point confirming that no new issues pop up with the patch.
When will it be in Stable?
Didn't mean to close this. It will land in Stable in 1-2 release cycles (next time Beta promotes), or when the bugfix lands in the stable kernel trees. If you're hitting this issue and have been holding off on updates, do update to the Beta release that will come this week.
Is this still not in Stable? I thought it already was and updated to …
Hi @schmitch |
No, it's added to Beta in …
Correct. The fix is not in Stable yet, only in Beta and Alpha. The good news is that the ext4 deadlock fix was recently merged into the mainline kernel, though it is not yet included in any kernel release. Looks like it needs a little more time.
I can confirm that this is still an issue on the stable track, as it is still on version …
Looking forward to the next stable channel release to solve this problem.
No more tears with Beta 3432.1.0.
The fix is now released with kernel 6.1.4.
And I'm working on getting it into 5.15.x: https://lore.kernel.org/stable/1672844851195248@kroah.com/
The fix is queued up for 5.15.87: https://lore.kernel.org/stable/5035a51d-2fb3-9044-7b70-1df33af37e5f@linuxfoundation.org/T/#m39683920478da269a295cc907332a5f20e6122f5
Yesterday's release did not include the bugfix for this! Was that intentional? I am really waiting for this. If it will take longer, I would need to temporarily switch to the Beta channel, but that is a lot of work to switch everything :-(
5.15.87 got released with the patch included! Could you please release a new version of Flatcar Linux including the new kernel release?
We'll make sure the fix (either 5.15.87 or the backport) is part of the next stable release scheduled for 2023-01-23.
Flatcar Stable 3374.2.3 was released with the bugfix: kernel 5.15.86 with the backport.
Looks good now.
@mduerr3cgi that does not look like the same issue. Can you open a new issue, paste full logs, and test the two more recent Beta releases?
I assume that this issue was resolved in Stable 3374.2.3.
Description
We've seen multiple nodes (across different regions and environments) stalling completely (unresponsive kubelet and containerd, failing journald and timesyncd units, no login possible, ...) on the 3227.2.2 release. This seems to happen mostly on the reboot after the update, but we have also seen it occur at random.
Impact
This causes Kubernetes nodes to become NotReady before being drained, which means volumes cannot be moved and therefore causes service interruptions.

Environment and steps to reproduce

There are task blocked for more than 120 seconds errors and related call stacks, see the screenshot.

Expected behavior
The nodes do not stall completely.
Additional information
We have the feeling that we may be hitting a kernel bug, as we only see this on the 3227.2.2 release, where basically only the kernel was changed. Do you have ideas on how we can diagnose this further? Thanks.
cc @databus23 @defo89 @jknipper