Kernel panics with use_after_free on 2765.2.4 #427
Thanks for the report. AFAIK, the upstream stable v5.10 tree has no fix for this issue yet. However, the mentioned fix and its relevant commits, like 1 and 2, have already been included since kernel ~v5.7 (credit to @alban). My theory is that the remaining refcount issue had been hidden for some time and was recently uncovered by other changes.
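For readers unfamiliar with the refcount pattern being discussed, here is a minimal userspace sketch of this class of bug. All names here are illustrative, not taken from the kernel sources: it only shows how a shared object freed on its last reference drop can be used after free if one user forgets to take its own reference.

```c
/* Minimal sketch of a refcounted shared object. An object is freed
 * when its refcount drops to zero; dereferencing it without first
 * taking a reference is a use-after-free. Illustrative names only. */
#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>

struct sock_obj {
    atomic_int refcount;
    int data;
};

static struct sock_obj *obj_alloc(void)
{
    struct sock_obj *o = malloc(sizeof(*o));
    atomic_init(&o->refcount, 1);   /* caller holds the first reference */
    o->data = 42;
    return o;
}

static void obj_get(struct sock_obj *o)
{
    atomic_fetch_add(&o->refcount, 1);
}

static void obj_put(struct sock_obj *o)
{
    /* Free only when the last reference is dropped. */
    if (atomic_fetch_sub(&o->refcount, 1) == 1)
        free(o);
}

int main(void)
{
    struct sock_obj *o = obj_alloc();

    obj_get(o);                 /* second user takes its own reference...  */
    obj_put(o);                 /* ...so the owner's put does not free it  */
    printf("%d\n", o->data);    /* still safe: our reference keeps o alive */
    obj_put(o);                 /* last put frees the object               */

    /* Omitting the obj_get() above would make the printf a
     * use-after-free: the class of bug this panic suggests, where a
     * missing or unbalanced get/put stays hidden until timing or
     * surrounding code changes expose it. */
    return 0;
}
```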
Thanks for the time spent looking at this @dongsupark. I've opened a bug report, which is here for reference: https://bugzilla.kernel.org/show_bug.cgi?id=213783
Hi @glitchcrab, did you get a chance to reproduce the issue with the latest Flatcar releases? They ship kernel 5.15; it would be interesting to see if you still have these panics...
@tormath1 I'm afraid not; we were experiencing the issues on a platform of ours which is now EOL, so we aren't investing any more time into upgrade work.
Closing now; please tell us if this is still an issue on the latest Stable.
Description
We have observed a kernel panic in the networking stack on 5.10.37-flatcar:

Impact
Impact is currently limited to large clusters with high pod counts and high levels of pod churn, and is relatively low: in a cluster with 40 workers we see the issue occurring on 5 workers.
Environment and steps to reproduce
Expected behavior
The kernel panic should not occur.
Additional information
The symptoms seem to differ slightly from when we saw this bug on AWS; in this case the networking stack seems to end up semi-broken rather than completely dead.
The issue appears to manifest on larger clusters (both clusters where we saw it have 40+ nodes) with a large number of pods (both have 1400+ pods). Additionally, the worst-affected cluster also has high pod churn: pods often have 1000+ restarts in 24 hours.
Observed symptoms
QEMU args
Below is the full command line which launches the machine; however, I don't think it will be that helpful, as we previously experienced what appears to be the same bug on AWS.