panic in xfs_trans_free_items -- null pointer #2592
We're seeing panics in the XFS write path during page faults. This is on hosts running Cassandra in a fairly large C* cluster. So, in general, these are writes to mmap'd files on a large XFS partition (~3.5TB).
oops/panic pstore logs here: https://gist.github.com/grepory/cf0972eacf257ea4a3a4986589708d6a
These hosts were previously running CentOS with the same version of Cassandra/JVM and firmware without issue. They were recently reprovisioned with CoreOS and have since begun panicking.
There are two stack traces that we've observed:
Container Linux Version
We are running on a mixture of Dell PowerEdge R730s and R740xds.
I have tried a number of fio configurations to reproduce this, but have failed miserably. I also have not seen this behavior with Cassandra on a single node with cassandra-stress running against it. It's very probably related to the load these machines are under: load averages of ~50 on machines with 2 CPUs x 12 cores (the R730s) or 2 CPUs x 22 cores (the R740xds). Hyperthreading is enabled.
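For anyone else trying to reproduce this, here is a minimal sketch of the access pattern described above: several threads dirtying random pages of a shared mmap'd file. The function names, file size, thread count, and write count are all hypothetical placeholders, and nothing here is known to trigger the panic. It only illustrates the page-fault write path that fio did not seem to exercise for us.

```python
import mmap
import os
import random
import threading

PAGE = mmap.PAGESIZE


def dirty_pages(path, size, writes, seed):
    """Map `path` shared and write one byte into `writes` random pages."""
    rng = random.Random(seed)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_WRITE) as m:
            for _ in range(writes):
                offset = rng.randrange(size // PAGE) * PAGE
                m[offset] = 0xFF  # store into the mapping; a clean page takes a write fault
            m.flush()  # msync the dirty pages back to the file


def run(path, size=64 * PAGE, workers=4, writes=1000):
    """Create a file of `size` bytes and dirty it from `workers` threads."""
    with open(path, "wb") as f:
        f.truncate(size)  # extends the file without writing data (typically sparse)
    threads = [
        threading.Thread(target=dirty_pages, args=(path, size, writes, i))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return os.path.getsize(path)
```

Calling `run("/path/on/xfs/test.bin", size=..., workers=...)` against a file on the XFS partition would be the obvious way to use it. A real attempt would need far larger files and many more writers than these defaults to get anywhere near the load these hosts are under.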
Thanks for the report.
Do you know if older versions had this problem (i.e. the problem is new with 2079.3.0), or is this the only version you've run?
You're on a slightly out of date version of CL; you should update to the latest stable and see if the issue persists. You also might try alpha, as it has a newer kernel that might have a fix. If it does, that makes the problem much easier to track down and fix.
This is the only version we have tried thus far. I can suggest we attempt to use a different version. I don't know how comfortable people will be with running alpha in production, but I mean, hey, it's already crashing, right? :) Will update when I can.
We are working on updating CoreOS to try a newer kernel version. For some additional context, before CoreOS these hosts ran CentOS with kernel versions:
I know it’s kind of apples and oranges outside of the kernel, but wanted to make a note of that.
Yeah. We don’t really know how to install specific versions of CoreOS. I only saw some user group posts discussing it and that it requires the enterprise management stuff. If you can give us some guidance on how to install specific versions that would be helpful.
Also, if you know how to make the kernel dump core, I would appreciate being able to dig in a bit deeper. Can I just add a systemd-coredump service unit?
If you're using the
As for getting kernel core dumps I'm not sure. Looks like we do ship the kdump tool, so maybe look into that?
Hope that helps.
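On the kernel core dump question: systemd-coredump only captures dumps of userspace processes, so a unit for it would not help with a kernel panic. kdump generally needs memory reserved with a `crashkernel=` kernel parameter and a crash kernel loaded through kexec. Below is a rough sketch for checking those two prerequisites on a host. The helper names are made up for illustration, and how to set this up on Container Linux specifically is not something this thread establishes.

```python
from pathlib import Path


def crashkernel_param(cmdline):
    """Return the value of crashkernel= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def read_text(path):
    """Read a small procfs/sysfs file, returning None if it is unavailable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def kdump_status(cmdline=None, loaded=None, size=None):
    """Summarize kdump prerequisites; arguments override the live system values."""
    if cmdline is None:
        cmdline = read_text("/proc/cmdline") or ""
    if loaded is None:
        loaded = read_text("/sys/kernel/kexec_crash_loaded")
    if size is None:
        size = read_text("/sys/kernel/kexec_crash_size")
    return {
        "crashkernel": crashkernel_param(cmdline),
        "crash_kernel_loaded": loaded == "1",
        "reserved_bytes": int(size) if size and size.isdigit() else None,
    }
```

If `kdump_status()` reports no `crashkernel` value or `crash_kernel_loaded` is false, a panic will not produce a vmcore no matter what tooling is installed.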
We're working on rolling back to a known working version of CoreOS (1967.6.0).
I haven't been able to get a crash kernel to work or obtain a core dump, but I am starting to look into diffs between 4.14.96 and 4.19.34. I dug through the patches CoreOS applied to the kernel between these two releases, but I found nothing changed along the path where our crashes are happening.
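Besides the CoreOS patches, the upstream changes between the two kernels can be narrowed to the XFS code. This is a small sketch, assuming a local clone of the linux-stable tree that has the `v4.14.96` and `v4.19.34` tags; the helper names and the `fs/xfs` path filter are my own choices, not something from this thread.

```python
import subprocess


def xfs_log_command(old_tag, new_tag, path="fs/xfs"):
    """Build a git command listing commits in new_tag but not old_tag that touch `path`."""
    return ["git", "log", "--oneline", "--no-merges",
            f"{old_tag}..{new_tag}", "--", path]


def xfs_changes(repo_dir, old_tag="v4.14.96", new_tag="v4.19.34"):
    """Run the command in `repo_dir` and return one summary line per commit."""
    result = subprocess.run(
        xfs_log_command(old_tag, new_tag),
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

Because the two tags sit on different stable branches, the list will be long. Fixes backported to 4.14.y also show up under their original upstream commits, so expect some noise.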