This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

VM crash with UKSM #25

Closed
eczema opened this issue Sep 8, 2017 · 15 comments

Comments

@eczema

eczema commented Sep 8, 2017

Hi,

I'm the main developer of eve-ng, and we have integrated UKSM into our kernel.
We currently use Ubuntu kernel 4.9.40 and observe a lot of crashes on big QEMU VMs.

Indeed, running 6 big VMs (2 vCPUs + 8 GB of RAM each, with heavy interrupt use inside the VM) is unstable and not safe at all.

I understand that you need more information, so could you please list what you need for the investigation?

We could also communicate via mail ( eczema@ecze.com )

@eczema
Author

eczema commented Sep 8, 2017

Sample of the error shown on the VM:

!!!! X64 Exception Type - 00(#DE - Divide Error) CPU Apic ID - 00000000 !!!!
RIP - 00000000BD7DA279, CS - 0000000000000038, RFLAGS - 0000000000010202
RAX - 0000000000001000, RCX - 000000000000000C, RDX - 0000000000000000
RBX - 0000000000001000, RSP - 000000007FBFC930, RBP - 000000007FBFC970
RSI - 0000000000000000, RDI - 00000000BD8BCA98
R8 - 00000000707D3800, R9 - 0000000000000000, R10 - 0000000000000000
R11 - 0000000000000018, R12 - 00000000BD9C3C60, R13 - 00000000BD9C3C68
R14 - 00000000BFB33620, R15 - 00000000BD8C8A98
DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
GS - 0000000000000030, SS - 0000000000000030
CR0 - 0000000080000033, CR2 - 0000000000000000, CR3 - 00000000BFABA000
CR4 - 0000000000000668, CR8 - 0000000000000000
DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 00000000BFAA8A98 0000000000000047, LDTR - 0000000000000000
IDTR - 00000000BF1DE018 0000000000000FFF, TR - 0000000000000000
FXSAVE_STATE - 000000007FBFC590
!!!! Find PE image (No PDB) (ImageBase=00000000BD7C7000, EntryPoint=00000000BD7D121C) !!!!

@naixia
Collaborator

naixia commented Sep 9, 2017

Hi, this looks like an arithmetic error. Is it raised by the guest OS inside QEMU or by the host OS?
More detailed crash information would be helpful; here is an example from a previously closed issue:

#18
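
For readers preparing such a report, the host-side details could be gathered with something like the sketch below. It is only an illustration: the helper name `collect_crash_info` and the report file name are hypothetical, and the two search strings are simply the ones that appear in the kernel log quoted later in this thread.

```shell
# Hypothetical helper: gather basic host details for a bug report.
collect_crash_info() {
    out="$1"   # report file to write (name chosen by the caller)
    {
        echo "== kernel =="
        uname -a
        echo "== hung tasks and call traces from the kernel log =="
        # dmesg may need root; ignore failures so the report is still written.
        dmesg 2>/dev/null | grep -E -A 20 'blocked for more than|Call Trace' || true
    } > "$out"
}
```

Usage might be `collect_crash_info uksm-report.txt`, run on the host shortly after a VM crash. Guest-side output, such as the exception dump above, still has to be copied from the VM console by hand.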

@eczema
Author

eczema commented Sep 9, 2017 via email

@naixia
Collaborator

naixia commented Sep 11, 2017

This looks like data corruption caused by KSM/UKSM.
A KSM data corruption bug was recently found in the upstream kernel, and it may also affect UKSM.
I backported the fix to v4.9 so you can try it, although I am not sure you hit the same bug.
Apply the patch to a UKSM-patched kernel tree:

ksm-data-corruption-v4.9.patch.txt
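
For readers following along, applying the attached patch could look like the sketch below. It assumes that the `patch` utility is installed with `--dry-run` support (as in GNU patch), that the patch uses the usual `a/` and `b/` path prefixes (hence `-p1`), and that the attachment was saved under the file name shown above. The helper name `apply_patch_checked` and the tree path in the usage line are hypothetical.

```shell
# Hypothetical helper: dry-run a patch against a source tree, then apply it.
apply_patch_checked() {
    tree="$1"       # path to the UKSM-patched kernel source tree (assumption)
    patchfile="$2"  # path to the downloaded patch file
    # --dry-run reports whether every hunk applies, without touching any file.
    if ! patch -d "$tree" -p1 --dry-run < "$patchfile" > /dev/null; then
        echo "patch does not apply cleanly to $tree" >&2
        return 1
    fi
    # The dry run succeeded, so apply the patch for real.
    patch -d "$tree" -p1 < "$patchfile"
}
```

Usage might be `apply_patch_checked ~/src/linux-4.9.40 ksm-data-corruption-v4.9.patch.txt`. Running the dry run first means a mismatched tree is reported before any file is modified; the kernel still has to be rebuilt and booted afterwards.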

@eczema
Author

eczema commented Sep 14, 2017

Here is a hung-task warning we observed (the QEMU process blocked in fdatasync):
Aug 13 15:34:04 eve-ng kernel: INFO: task qemu-system-x86:24307 blocked for more than 120 seconds.
Aug 13 15:34:04 eve-ng kernel: Not tainted 4.9.40-eve-ng-ukms+ #2
Aug 13 15:34:04 eve-ng kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 13 15:34:04 eve-ng kernel: qemu-system-x86 D 0 24307 12933 0x00000000
Aug 13 15:34:04 eve-ng kernel: ffff8bb6c5c5ec80 ffff8bba1a143e00 ffff8bba1a3d2d80 ffff8bbd4b875b00
Aug 13 15:34:04 eve-ng kernel: ffff8bba226d9300 ffffa68347d93dd8 ffffffffabe99952 ffff8bba1cfa60a0
Aug 13 15:34:04 eve-ng kernel: 0000000000000246 0000000000000003 0000000000000001 ffff8bbd4b875b00
Aug 13 15:34:04 eve-ng kernel: Call Trace:
Aug 13 15:34:04 eve-ng kernel: [] ? __schedule+0x232/0x6f0
Aug 13 15:34:04 eve-ng kernel: [] schedule+0x36/0x80
Aug 13 15:34:04 eve-ng kernel: [] jbd2_log_wait_commit+0x98/0x120
Aug 13 15:34:04 eve-ng kernel: [] ? wake_atomic_t_function+0x60/0x60
Aug 13 15:34:04 eve-ng kernel: [] jbd2_complete_transaction+0x5c/0xa0
Aug 13 15:34:04 eve-ng kernel: [] ext4_sync_file+0x1ef/0x3e0
Aug 13 15:34:04 eve-ng kernel: [] vfs_fsync_range+0x4b/0xb0
Aug 13 15:34:04 eve-ng kernel: [] ? SyS_futex+0x81/0x180
Aug 13 15:34:04 eve-ng kernel: [] do_fsync+0x3d/0x70
Aug 13 15:34:04 eve-ng kernel: [] SyS_fdatasync+0x13/0x20
Aug 13 15:34:04 eve-ng kernel: [] entry_SYSCALL_64_fastpath+0x1e/0xad
Aug 13 15:39:02 eve-ng CRON[25770]: pam_unix(cron:session): session opened for user root by (uid=0)

@eczema
Author

eczema commented Sep 14, 2017

I have not tested your patch yet. I will start a build and test it.

@eczema
Author

eczema commented Sep 14, 2017

Could this bug be related to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1680513 ?

@eczema
Author

eczema commented Sep 15, 2017

Currently testing the latest patch.
It seems OK: 7 VMs with 8 GB of RAM each are running, with total memory used reduced to 18 GB, and no more process crashes. :-)

@naixia
Collaborator

naixia commented Sep 16, 2017

Glad to hear that. This fix will be included in UKSM for v4.13 and later versions.

naixia closed this as completed Sep 16, 2017

@eczema
Author

eczema commented Sep 17, 2017

I confirm.
This bug was the culprit behind multiple issues we observed. Right now, UKSM is just incredibly effective:
no more system instability, process freezes, QEMU crashes, and so on. Everything now runs very smoothly. It is awesome.

Regards, and hats off!

@anudeep404

anudeep404 commented May 26, 2018

It looks like I hit the same bug. I'm using EVE-NG version 2.0.3-86 and QEMU version 2.4.0.
root@eve-ng:/var/log# uname -a
Linux eve-ng 4.9.40-eve-ng-ukms-2+ #4 SMP Fri Sep 15 02:07:02 CEST 2017 x86_64 x86_64 x86_64 GNU/Linux

I'm using a VM with 64 vCPUs and 128 GB of RAM.

When I try to run more than 10 Nexus NK9s, I see crashes and the 11th VM goes into a boot loop. How do I fix this?

@dolohow
Owner

dolohow commented May 26, 2018

Please create a new bug report with a stack trace and more information attached.

@anudeep404

I upgraded the kernel to 4.14.44 and I'm still seeing the problem. I will open a new bug. Thanks.

root@eve-ng:~# uname -a
Linux eve-ng 4.14.44 #1 SMP Sat May 26 20:12:35 EEST 2018 x86_64 x86_64 x86_64 GNU/Linux

@Laynvedb

> Currently testing the last patch...
> Seems ok... 7 x 8go RAM running with total memory used reduced to 18 Go... and no more process crash.... :-)

Can you tell me how to apply it, and which patch it is?
I encountered the same problem. My kernel is 4.9.40-eve-ng-ukms-2+.

@xinbinhan

Has this problem been solved?
