No system logs to identify the issue #2226
Comments
In short, timedated and timesyncd become unresponsive, so systemd tries to kill them -- and fails. This implies that the system as a whole is already quite bogged down, so it's not surprising that it soon stops responding altogether. This could be a symptom of a kernel problem or of extremely heavy I/O. Are there any hints earlier in the logs?
This is another instance that happened today. This time there are no signs of systemd trying to kill unresponsive services; logging simply stopped. There is no heavy I/O in our environment: at the time the VM was unresponsive, average disk usage was 313 KBps with 40 ms highest latency. How can I get more logs from the kernel to pinpoint the point of failure?
Maybe add more memory and see what happens.
From the logs, it looks as though the VM has 60 GB of RAM? You should be able to have ESXi connect the VM's first serial port to a file. Any kernel logs that don't make it to disk should still be printed to the serial port.
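For reference, the serial-to-file setup is just a few lines in the VM's `.vmx` configuration (the log file name here is arbitrary, and this is a sketch rather than the full procedure -- the VM should be powered off while editing):

```
serial0.present = "TRUE"
serial0.fileType = "file"
serial0.fileName = "serial0.log"
```

Pair this with `console=ttyS0,115200` on the kernel command line so the kernel actually sends its messages to that port.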
@glevand We were using 17211820 KB (about 17 GB) of memory out of 60 GB at the time the VM hung.
The negative CPU consumption is strange. Also, that's a very high load average. Are there many processes in D state? The best way to obtain more troubleshooting information would be to collect kernel logs from the VM's virtual serial port at the time of the crash.
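To answer the D-state question, a quick check with standard procps tools (nothing CoreOS-specific) might look like this:

```shell
# Show tasks in uninterruptible sleep (state D) along with the kernel
# function they are blocked in; a pile of these usually means stuck I/O.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```

A healthy system typically prints only the header line; rows that persist across repeated runs point at the stuck subsystem via the WCHAN column.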
@bgilbert, Thanks for the suggestion. We have been seeing this issue from the 4.13.x kernel onwards. We have rolled back a couple of our VMs to 4.12.14 and have not seen a crash for a week.
The lack of log messages from the guest could imply a hypervisor problem. Does ESXi produce any relevant logs?
We have checked all the VMware logs. There are no errors/backtraces/warnings in the vm*.log files. Have you heard of any freezes in CoreOS with the 4.13.x kernel?
I haven't heard of any other complete lockups of the kind you've described.
Do these make any sense?
@bgilbert Do you have any update regarding this issue? We are hitting it frequently in our environment. Since we have verified all the other components in our setup (VMware, ESXi host, applications), the only thing left is CoreOS. And we are sure that we did not see this issue on 4.12.x; nodes started to freeze from 4.13.x onwards.
Unfortunately, without logs or debug info of some sort, there's not an obvious path to make progress on this issue. The alpha and beta channels have been updated to 4.14.x; are you able to try the current beta?
Unfortunately, we cannot try the 4.14.x kernel in our product. We are in sync with the stable channel but have stopped updating our nodes past 4.12.x until this freeze issue is resolved. We have a couple of nodes running the 4.13.x kernel to provide you with debug logs. I've enabled kernel module debug logs for the vmw_pvscsi and vmw_vmci modules. At the time of the freeze we see no errors, only these messages, and the node froze after the last one:
@bgilbert We have seen another freeze this morning. Could you please suggest more ways to get appropriate logs? Do you need debug logs from any specific kernel module? Below are the modules, just for reference:
@bgilbert Could you please let me know your thoughts on this issue?
@Eshakk You could try changing the kernel log level to include all debug messages. Otherwise I currently have no further thoughts beyond #2226 (comment).
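For what it's worth, a minimal sketch of raising the log level (8 enables everything up to KERN_DEBUG; the exact value is a suggestion):

```shell
# The four numbers are: console_loglevel, default_message_loglevel,
# minimum_console_loglevel, default_console_loglevel.
cat /proc/sys/kernel/printk

# Route all kernel messages, including debug, to the console (needs root);
# the same effect can be had by booting with loglevel=8 or debug.
sudo dmesg -n 8
```

Combined with the serial-port-to-file setup, this maximizes the chance that the last messages before a freeze are captured outside the guest.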
@bgilbert We have generated a memory dump of our VM at the time of a freeze. It seems we need debugging symbols to read this memory image. Could you please guide us on how to read the memory dump?
@bgilbert We have created vmss (state) and vmem files by suspending the VM when it froze. We were able to construct a core file from the vmss and are attaching it to this issue. These are our findings:
Please let me know if you find anything in this state file.
@bgilbert I believe that from 1520 onwards the CoreOS image should have debugging symbols. Why does the uncompressed kernel image show that there are none?
Could you please help me with how to read the memory image of a CoreOS VM?
Debug symbols are not used in production images. They are available in binary packages, e.g. http://builds.developer.core-os.net/boards/amd64-usr/1576.4.0/pkgs/sys-kernel/coreos-kernel-4.13.16-r2.tbz2
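A sketch of pulling that package apart; the exact path of the symbol-carrying vmlinux inside the archive is an assumption, which is why the listing step comes first:

```shell
pkg=coreos-kernel-4.13.16-r2.tbz2
curl -LO "http://builds.developer.core-os.net/boards/amd64-usr/1576.4.0/pkgs/sys-kernel/$pkg"
tar tjf "$pkg" | grep -i vmlinux   # locate the image that carries symbols
tar xjf "$pkg"                     # then extract and point gdb/crash at it
```

The extracted vmlinux can then be handed to gdb or the crash utility alongside the memory image.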
@dm0- Thank you very much, David. Will try this.
@dm0- After using the above-mentioned packages:
It is returning with status 1. Do I need to use specific options while starting toolbox?
@dm0- Hello David, hope you enjoyed your holidays. Could you please help me with the toolbox issue I've posted above?
From the source, it looks like that error message means it thinks your core file is an invalid netdump file.
@dm0- We have generated the vmss/vmem files using suspend in VMware, and the core file using the vmss2core tool. Is there any other standard way to do this?
Sorry, I am not familiar with this. VMware is a community-supported platform, so maybe someone on the coreos-user mailing list or the #coreos IRC channel would be knowledgeable about the issue.
We've been experiencing this issue on bare metal servers on both 1576.4.0 and 1576.5.0. We have no logs at all; the kernel watchdogs do not do or log anything either. We've set up the system to panic when we send an
This has happened 2 times within 1 day of putting production workload on them, on 2 different servers. We began rolling out this update last week. At this point we're looking at trying 1465.8.0 like @Eshakk, but whether it works or not, our only way forward is to move off CoreOS if we can't find a solution. We're running on Supermicro X9DRFF-i+ boards with 2 Intel(R) Xeon(R) CPU E5-2680 v2 and 64 GB of RAM. Let me know how I can provide any additional info.
@mathpl Based on that information, it's hard to tell whether your issue is the same. It might be better to file it as a new bug, since the hardware is so different (ESXi vs. bare metal). It could be worth trying the current alpha as well, to see whether the Intel microcode updates help with the issue. It could also be possible to run a kernel bisect if you can find a reliable enough way to trigger it, though that's certainly easier said than done. Unfortunately, without system logs or a reliable reproduction, it's difficult to do more than guess at what it might be.
WFM, will create a new issue.
We have identified printk as the cause of the kernel hang. After analyzing the memory image (with help from VMware), we were able to find the stack trace of the task running at the time of the freeze:
The above stack trace indicates that a bogus pointer passed to netdev_rx_csum_fault is causing a general protection fault, and printk then hangs without releasing the logbuf_lock lock. Now, bad things happen all the time in production, but I do not think a kernel should hang on an error message like this. @bgilbert @euank @dm0- or anyone,
Please let me know if you have any questions.
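The failure mode described above is an ordinary non-reentrant-lock deadlock: the fault path re-enters printk while logbuf_lock is still held by its own caller. A toy shell illustration of the same shape, using flock (file name arbitrary, and `-w 1` standing in for "hangs forever"):

```shell
# The outer flock plays the role of printk taking logbuf_lock; the inner
# flock is the fault handler's nested printk, which can never acquire the
# lock its own caller already holds.
lockfile=/tmp/logbuf.lock
touch "$lockfile"
(
  flock 9                       # "printk" takes logbuf_lock
  flock -w 1 "$lockfile" true \
    || echo "deadlock: lock already held"
) 9>"$lockfile"
```

In the real kernel there is no timeout on the spinlock, so the CPU spins forever with logging disabled, which matches the sudden silence in the journal.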
Issue Report
Bug
We have been seeing this issue since 1520.6.0.
All of a sudden, our VM stops responding, and a hard reset from the hypervisor is the only way to get it back online. When we check the logs after the hard reset, surprisingly, there are no system logs at all: logging just cuts off while the VM is unresponsive. Below is a snippet of such a scenario:
sudo journalctl
There are no logs just before the -- Reboot -- marker in journalctl. Could someone please tell me what exactly is happening at the time of the reboot?
Why are timedated and timesyncd being killed?
Under what circumstances will SIGKILL and SIGTERM fail in Linux?
What does "Watchdog timeout" mean? Does it mean that the kernel is trying to protect the CPU?
Container Linux Version
Environment
We are running our VM on an ESXi 6.0.0 server, and this is the only VM on that server.
Expected Behavior
There have to be system logs showing what is happening.
Actual Behavior
Logs get cut off, and timedated and timesyncd are killed.
Reproduction Steps
Unknown. It happens randomly on some VMs.