VM crash with UKSM #25

eczema · 2017-09-08T15:48:45Z

Hi,

I'm the main developer of eve-ng and we have integrated UKSM into our kernel.
We currently use Ubuntu kernel 4.9.40 and we observe many crashes on large QEMU VMs.

Running 6 large VMs (2 vCPUs + 8 GB of RAM each, with heavy interrupt load inside the VM) is unstable and not safe at all.

I understand that you need more information, so could you please list what you require for the investigation?

We could also communicate via mail ( eczema@ecze.com )

eczema · 2017-09-08T15:53:42Z

Sample of the error shown on the VM:

!!!! X64 Exception Type - 00(#DE - Divide Error) CPU Apic ID - 00000000 !!!!
RIP - 00000000BD7DA279, CS - 0000000000000038, RFLAGS - 0000000000010202
RAX - 0000000000001000, RCX - 000000000000000C, RDX - 0000000000000000
RBX - 0000000000001000, RSP - 000000007FBFC930, RBP - 000000007FBFC970
RSI - 0000000000000000, RDI - 00000000BD8BCA98
R8 - 00000000707D3800, R9 - 0000000000000000, R10 - 0000000000000000
R11 - 0000000000000018, R12 - 00000000BD9C3C60, R13 - 00000000BD9C3C68
R14 - 00000000BFB33620, R15 - 00000000BD8C8A98
DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
GS - 0000000000000030, SS - 0000000000000030
CR0 - 0000000080000033, CR2 - 0000000000000000, CR3 - 00000000BFABA000
CR4 - 0000000000000668, CR8 - 0000000000000000
DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 00000000BFAA8A98 0000000000000047, LDTR - 0000000000000000
IDTR - 00000000BF1DE018 0000000000000FFF, TR - 0000000000000000
FXSAVE_STATE - 000000007FBFC590
!!!! Find PE image (No PDB) (ImageBase=00000000BD7C7000, EntryPoint=00000000BD7D121C) !!!!

naixia · 2017-09-09T03:22:40Z

Hi, this looks like an arithmetic error. Does it come from the guest OS inside QEMU or from the host OS?
More detailed crash information would be helpful; see this previously closed issue for an example:

#18

eczema · 2017-09-09T11:28:50Z

The problem occurs on a VM provided by Cisco (Nexus 9000v, a Cisco-customised Linux). The issues are not always the same: sometimes a process crashes with SIG11 (not always the same process), sometimes it is a kernel failure. With UKSM disabled, we never see any issue.

Is there any recommendation regarding kernel compile options on the host? As we can't tune the guest kernel/OS, we can only tune the host kernel or QEMU options.

Alain
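
The comparison described above (crashes with UKSM on, none with it off) can be repeated on a host by toggling UKSM at runtime. A minimal sketch, assuming the UKSM patch exposes its controls under `/sys/kernel/mm/uksm` with a `run` file holding 0 (off) or 1 (on), much as mainline KSM does under `/sys/kernel/mm/ksm`; the helper names `uksm_enabled` and `set_uksm` are hypothetical:

```python
from pathlib import Path

# Assumption: a UKSM-patched kernel exposes its sysfs controls here,
# with a "run" file holding 0 (off) or 1 (on).
UKSM_DIR = Path("/sys/kernel/mm/uksm")


def uksm_enabled(uksm_dir: Path = UKSM_DIR) -> bool:
    """Return True if the 'run' file reports UKSM as active."""
    return (uksm_dir / "run").read_text().strip() == "1"


def set_uksm(enabled: bool, uksm_dir: Path = UKSM_DIR) -> None:
    """Write 1 or 0 to the 'run' file (needs root on a real host)."""
    (uksm_dir / "run").write_text("1" if enabled else "0")


if __name__ == "__main__":
    if (UKSM_DIR / "run").exists():
        print("UKSM enabled:", uksm_enabled())
    else:
        print("No UKSM sysfs directory found; is this a UKSM-patched kernel?")
```

Recording the state of this switch alongside each crash report makes it easier to tell whether a given failure happened with page merging active.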


naixia · 2017-09-11T04:57:55Z

This looks like data corruption caused by KSM/UKSM.
Recently, a KSM data corruption bug was found in the upstream kernel that may also affect UKSM.
I backported the fix to v4.9 so you can try it, though I am not sure you are hitting the same bug.
Apply the patch to a UKSM-patched kernel tree:

ksm-data-corruption-v4.9.patch.txt

eczema · 2017-09-14T22:47:03Z

Here is a scheduler crash (hung task) we observed:
Aug 13 15:34:04 eve-ng kernel: INFO: task qemu-system-x86:24307 blocked for more than 120 seconds.
Aug 13 15:34:04 eve-ng kernel: Not tainted 4.9.40-eve-ng-ukms+ #2
Aug 13 15:34:04 eve-ng kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 13 15:34:04 eve-ng kernel: qemu-system-x86 D 0 24307 12933 0x00000000
Aug 13 15:34:04 eve-ng kernel: ffff8bb6c5c5ec80 ffff8bba1a143e00 ffff8bba1a3d2d80 ffff8bbd4b875b00
Aug 13 15:34:04 eve-ng kernel: ffff8bba226d9300 ffffa68347d93dd8 ffffffffabe99952 ffff8bba1cfa60a0
Aug 13 15:34:04 eve-ng kernel: 0000000000000246 0000000000000003 0000000000000001 ffff8bbd4b875b00
Aug 13 15:34:04 eve-ng kernel: Call Trace:
Aug 13 15:34:04 eve-ng kernel: [] ? __schedule+0x232/0x6f0
Aug 13 15:34:04 eve-ng kernel: [] schedule+0x36/0x80
Aug 13 15:34:04 eve-ng kernel: [] jbd2_log_wait_commit+0x98/0x120
Aug 13 15:34:04 eve-ng kernel: [] ? wake_atomic_t_function+0x60/0x60
Aug 13 15:34:04 eve-ng kernel: [] jbd2_complete_transaction+0x5c/0xa0
Aug 13 15:34:04 eve-ng kernel: [] ext4_sync_file+0x1ef/0x3e0
Aug 13 15:34:04 eve-ng kernel: [] vfs_fsync_range+0x4b/0xb0
Aug 13 15:34:04 eve-ng kernel: [] ? SyS_futex+0x81/0x180
Aug 13 15:34:04 eve-ng kernel: [] do_fsync+0x3d/0x70
Aug 13 15:34:04 eve-ng kernel: [] SyS_fdatasync+0x13/0x20
Aug 13 15:34:04 eve-ng kernel: [] entry_SYSCALL_64_fastpath+0x1e/0xad
Aug 13 15:39:02 eve-ng CRON[25770]: pam_unix(cron:session): session opened for user root by (uid=0)

eczema · 2017-09-14T22:48:49Z

I have not tested your patch yet. I will compile a kernel and test it.

eczema · 2017-09-14T22:58:03Z

Could this bug be related to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1680513 ?

eczema · 2017-09-15T07:57:16Z

Currently testing the latest patch.
It seems OK: 7 VMs x 8 GB RAM running with total memory used reduced to 18 GB, and no more process crashes. :-)
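
As a rough check of the figures above, the arithmetic can be sketched as follows. This assumes the 18 GB is host memory in use and that each of the 7 VMs is configured with 8 GB; part of the gap may simply be guest RAM that was never touched, so it is only an upper bound on what UKSM merged. The function name `memory_saving` is hypothetical:

```python
def memory_saving(vm_count, ram_per_vm_gb, host_used_gb):
    """Return (configured GB, GB not resident on the host, fraction saved)."""
    configured = vm_count * ram_per_vm_gb
    saved = configured - host_used_gb
    return configured, saved, saved / configured


configured, saved, ratio = memory_saving(7, 8, 18)
# 7 * 8 = 56 GB configured; 56 - 18 = 38 GB not resident; 38 / 56 is about 68%
print(f"{configured} GB configured, {saved} GB not resident ({ratio:.0%})")
```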

naixia · 2017-09-16T05:56:54Z

Glad to hear that. This fix will be included in UKSM for v4.13 and later versions.

eczema · 2017-09-17T07:56:58Z

I confirm.
This bug was the culprit behind the multiple issues we observed. Right now, UKSM is just incredibly effective:
no more system instability, process freezes, QEMU crashes, etc. Everything now runs smoothly. It is awesome.

Regards, and hats off!

anudeep404 · 2018-05-26T11:48:10Z

Looks like I hit the same bug. I'm using EVE-NG version 2.0.3-86 and QEMU version 2.4.0.
root@eve-ng:/var/log# uname -a
Linux eve-ng 4.9.40-eve-ng-ukms-2+ #4 SMP Fri Sep 15 02:07:02 CEST 2017 x86_64 x86_64 x86_64 GNU/Linux

I'm using a VM with 64 vCPUs and 128 gigs of RAM.

When I try to run more than 10 Nexus NK9 VMs, I see the crashes and the 11th VM goes into a boot loop. How do I fix this?

dolohow · 2018-05-26T15:17:34Z

Please create a new bug report, with a stack trace and more information attached.

anudeep404 · 2018-05-26T19:24:40Z

I upgraded the kernel to 4.14.44 and I'm still seeing the problem. I will open a new bug. Thanks.

root@eve-ng:~# uname -a
Linux eve-ng 4.14.44 #1 SMP Sat May 26 20:12:35 EEST 2018 x86_64 x86_64 x86_64 GNU/Linux

Laynvedb · 2018-10-14T06:21:13Z

> Currently testing the last patch...
> Seems ok... 7 x 8go RAM running with total memory used reduced to 18 Go... and no more process crash.... :-)

Can you tell me how to apply it? Which patch is it?
I ran into the same problem. My kernel is 4.9.40-eve-ng-ukms-2+.

xinbinhan · 2019-04-28T02:21:30Z

Has this issue been resolved?

naixia closed this as completed Sep 16, 2017

anudeep404 mentioned this issue May 26, 2018

VM running on QEMU crashes #34

Closed
