Calling skynet.abort() to shut down the process occasionally causes a segmentation fault #1458

Closed
RiceCN opened this issue Aug 18, 2021 · 12 comments

Comments

RiceCN commented Aug 18, 2021

Program terminated with signal 11, Segmentation fault.
#0 0x000000000047506f in je_arena_dalloc_promoted ()
Missing separate debuginfos, use: debuginfo-install libgcc-4.4.7-23.el6.x86_64
(gdb) bt
#0 0x000000000047506f in je_arena_dalloc_promoted ()
#1 0x000000000046627d in je_free_default () at include/jemalloc/internal/arena_inlines_b.h:284
#2 0x000000000042205b in free () at skynet-src/malloc_hook.c:205
#3 0x00007ff83f15f689 in _dl_deallocate_tls (tcb=0x7ff83a3d7700, dealloc_tcb=false) at dl-tls.c:478
#4 0x00007ff83ef37b3d in __free_stacks (limit=41943040) at allocatestack.c:283
#5 0x00007ff83ef37c4c in queue_stack (pd=) at allocatestack.c:311
#6 __deallocate_stack (pd=) at allocatestack.c:747
#7 0x00007ff83ef39124 in pthread_join (threadid=140704074254080, thread_return=0x0) at pthread_join.c:110
#8 0x000000000041bb20 in skynet_start () at skynet-src/skynet_start.c:227
#9 0x00000000004182f1 in main () at skynet-src/skynet_main.c:166

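Note: with core dump monitoring enabled for the process, calling skynet.abort() does normally kill the process as expected; this problem only occurs intermittently.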

Contributor

cloudfreexiao commented Aug 18, 2021

This looks somewhat similar to the issue below. Is your code up to date, and is the configuration correct?
#1314

Owner

cloudwu commented Aug 18, 2021

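A memory bug in any C module can corrupt the heap, so the actual problem cannot be inferred from the information above.

  1. Has the code been updated to the latest version?
  2. Try Valgrind to find more clues.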

Author

RiceCN commented Aug 19, 2021

The code is at the latest version.

Author

RiceCN commented Aug 19, 2021

After running under valgrind --tool=memcheck --leak-check=full, the log output once the segmentation fault occurred was as follows:
==19247== Parent PID: 19104
==19247==
==19247== Thread 6:
==19247== Syscall param write(buf) points to uninitialised byte(s)
==19247== at 0x4E3F7BD: ??? (syscall-template.S:82)
==19247== by 0x41C546: send_request (socket_server.c:1753)
==19247== by 0x41F5F2: socket_server_listen (socket_server.c:1998)
==19247== by 0xF8108CE: llisten (lua-socket.c:491)
==19247== by 0x429375: luaD_precall (ldo.c:532)
==19247== by 0x43E170: luaV_execute (lvm.c:1626)
==19247== by 0x42855A: unroll (ldo.c:685)
==19247== by 0x427F79: luaD_rawrunprotected (ldo.c:144)
==19247== by 0x42A002: lua_resume (ldo.c:788)
==19247== by 0x7C237CB: lua_resumeX (service_snlua.c:90)
==19247== by 0x7C237CB: auxresume (service_snlua.c:146)
==19247== by 0x7C237CB: timing_resume (service_snlua.c:198)
==19247== by 0x7C23ACF: luaB_coresume (service_snlua.c:217)
==19247== by 0x429375: luaD_precall (ldo.c:532)
==19247== Address 0xb5fe888 is on thread 6's stack
==19247== in frame #2, created by socket_server_listen (socket_server.c:1984)
==19247==
==19247== Syscall param write(buf) points to uninitialised byte(s)
==19247== at 0x4E3F7BD: ??? (syscall-template.S:82)
==19247== by 0x41C546: send_request (socket_server.c:1753)
==19247== by 0x41F6C1: socket_server_start (socket_server.c:2020)
==19247== by 0xF80FCAC: lstart (lua-socket.c:618)
==19247== by 0x429375: luaD_precall (ldo.c:532)
==19247== by 0x43E170: luaV_execute (lvm.c:1626)
==19247== by 0x42855A: unroll (ldo.c:685)
==19247== by 0x427F79: luaD_rawrunprotected (ldo.c:144)
==19247== by 0x42A002: lua_resume (ldo.c:788)
==19247== by 0x7C237CB: lua_resumeX (service_snlua.c:90)
==19247== by 0x7C237CB: auxresume (service_snlua.c:146)
==19247== by 0x7C237CB: timing_resume (service_snlua.c:198)
==19247== by 0x7C23ACF: luaB_coresume (service_snlua.c:217)
==19247== by 0x429375: luaD_precall (ldo.c:532)
==19247== Address 0xb5fe8ac is on thread 6's stack
==19247== in frame #2, created by socket_server_start (socket_server.c:2016)
==19247==
==19247== Thread 12:
==19247== Syscall param write(buf) points to uninitialised byte(s)
==19247== at 0x4E3F7BD: ??? (syscall-template.S:82)
==19247== by 0x41C546: send_request (socket_server.c:1753)
==19247== by 0x41EEDC: socket_server_connect (socket_server.c:1790)
==19247== by 0xF810990: lconnect (lua-socket.c:463)
==19247== by 0x429375: luaD_precall (ldo.c:532)
==19247== by 0x43E170: luaV_execute (lvm.c:1626)
==19247== by 0x429848: ccall (ldo.c:577)
==19247== by 0x429848: luaD_call (ldo.c:587)
==19247== by 0x423FCE: lua_pcallk (lapi.c:1071)
==19247== by 0x4478FF: luaB_pcall (lbaselib.c:456)
==19247== by 0x429375: luaD_precall (ldo.c:532)
==19247== by 0x43E170: luaV_execute (lvm.c:1626)
==19247== by 0x42855A: unroll (ldo.c:685)
==19247== Address 0xf802702 is on thread 12's stack
==19247== in frame #2, created by socket_server_connect (socket_server.c:1785)
==19247==
==19247== Thread 1:
==19247== Invalid read of size 8
==19247== at 0x47303F: edata_szind_set (edata.h:458)
==19247== by 0x47303F: arena_prof_demote (arena.c:1158)
==19247== by 0x47303F: je_arena_dalloc_promoted (arena.c:1174)
==19247== by 0x46424C: arena_dalloc_large (arena_inlines_b.h:284)
==19247== by 0x46424C: arena_dalloc (arena_inlines_b.h:333)
==19247== by 0x46424C: idalloctm (jemalloc_internal_inlines_c.h:120)
==19247== by 0x46424C: ifree (jemalloc.c:2765)
==19247== by 0x46424C: je_free_default (jemalloc.c:2892)
==19247== by 0x4011688: _dl_deallocate_tls (dl-tls.c:478)
==19247== by 0x4E37B3C: __free_stacks (allocatestack.c:283)
==19247== by 0x4E37C4B: queue_stack (allocatestack.c:311)
==19247== by 0x4E37C4B: __deallocate_stack (allocatestack.c:747)
==19247== by 0x4E39123: pthread_join (pthread_join.c:110)
==19247== by 0x41AF1F: start (skynet_start.c:227)
==19247== by 0x41AF1F: skynet_start (skynet_start.c:289)
==19247== by 0x418280: main (skynet_main.c:166)
==19247== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==19247==
==19247==
==19247== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==19247== Access not within mapped region at address 0x0
==19247== at 0x47303F: edata_szind_set (edata.h:458)
==19247== by 0x47303F: arena_prof_demote (arena.c:1158)
==19247== by 0x47303F: je_arena_dalloc_promoted (arena.c:1174)
==19247== by 0x46424C: arena_dalloc_large (arena_inlines_b.h:284)
==19247== by 0x46424C: arena_dalloc (arena_inlines_b.h:333)
==19247== by 0x46424C: idalloctm (jemalloc_internal_inlines_c.h:120)
==19247== by 0x46424C: ifree (jemalloc.c:2765)
==19247== by 0x46424C: je_free_default (jemalloc.c:2892)
==19247== by 0x4011688: _dl_deallocate_tls (dl-tls.c:478)
==19247== by 0x4E37B3C: __free_stacks (allocatestack.c:283)
==19247== by 0x4E37C4B: queue_stack (allocatestack.c:311)
==19247== by 0x4E37C4B: __deallocate_stack (allocatestack.c:747)
==19247== by 0x4E39123: pthread_join (pthread_join.c:110)
==19247== by 0x41AF1F: start (skynet_start.c:227)
==19247== by 0x41AF1F: skynet_start (skynet_start.c:289)
==19247== by 0x418280: main (skynet_main.c:166)
==19247== If you believe this happened as a result of a stack
==19247== overflow in your program's main thread (unlikely but
==19247== possible), you can try to increase the size of the
==19247== main thread stack using the --main-stacksize= flag.
==19247== The main thread stack size used in this run was 10485760.
==19247==
==19247== HEAP SUMMARY:
==19247== in use at exit: 74 bytes in 1 blocks
==19247== total heap usage: 1 allocs, 0 frees, 74 bytes allocated
==19247==
==19247== LEAK SUMMARY:
==19247== definitely lost: 0 bytes in 0 blocks
==19247== indirectly lost: 0 bytes in 0 blocks
==19247== possibly lost: 0 bytes in 0 blocks
==19247== still reachable: 74 bytes in 1 blocks
==19247== suppressed: 0 bytes in 0 blocks
==19247== Reachable blocks (those to which a pointer was found) are not shown.
==19247== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==19247==
==19247== For counts of detected and suppressed errors, rerun with: -v
==19247== Use --track-origins=yes to see where uninitialised values come from
==19247== ERROR SUMMARY: 12 errors from 4 contexts (suppressed: 0 from 0)

As I understand it, this line is the key: Access not within mapped region at address 0x0, but I still don't understand what causes it.

Owner

cloudwu commented Aug 19, 2021

https://github.com/cloudwu/skynet/wiki/MemoryHook You can build with -D MEMORY_CHECK and see whether the built-in check catches the problem.

Also, "The main thread stack size used in this run was 10485760." The thread stack used 10 MB of memory, which is not normal.
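For readers unfamiliar with this kind of built-in check, here is a generic sketch of a tag-based allocation check, the sort of technique a memory-check build flag commonly enables. It is not copied from skynet's malloc_hook.c: the names checked_malloc, checked_free, block_header and both tag values are made up for illustration. Note that detecting a true double free this way means reading a header that has already been released, which is itself undefined behavior, so such checks are best-effort only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tag values for this sketch; not taken from skynet. */
#define TAG_LIVE  0x600DF00Du
#define TAG_FREED 0x0BADF00Du

/* Header stored in front of every block handed out by checked_malloc. */
typedef union {
    struct {
        uint32_t tag;   /* TAG_LIVE while allocated, TAG_FREED once released */
        size_t   size;  /* user-visible size, kept for diagnostics */
    } info;
    max_align_t align;  /* keeps the user pointer suitably aligned */
} block_header;

/* Allocate size bytes and mark the block as live. */
void *checked_malloc(size_t size) {
    block_header *h = malloc(sizeof *h + size);
    if (h == NULL) {
        return NULL;
    }
    h->info.tag = TAG_LIVE;
    h->info.size = size;
    return h + 1;  /* user memory starts right after the header */
}

/* Release a block. Returns 0 on success, or -1 if the header does not
 * carry the live tag (a double free, a foreign pointer, or a header
 * overwritten by an out-of-bounds write). */
int checked_free(void *ptr) {
    if (ptr == NULL) {
        return 0;
    }
    block_header *h = (block_header *)ptr - 1;
    if (h->info.tag != TAG_LIVE) {
        return -1;  /* refuse to pass a suspicious pointer to free() */
    }
    h->info.tag = TAG_FREED;
    free(h);
    return 0;
}
```

A check like this only catches corruption that touches the header or a repeated release of the same pointer; damage elsewhere in the heap still needs a tool such as Valgrind.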

Owner

cloudwu commented Aug 19, 2021

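Since there are no other similar reports, you should also pay special attention to which other C modules are used in your environment and whether there is any unusual usage.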

firedtoad commented Aug 19, 2021 via email

Author

RiceCN commented Aug 19, 2021

I have never used the clang compiler. Does "switch to the clang compiler and enable -fsanitize=address" mean compiling skynet with clang and then debugging it?

firedtoad commented Aug 19, 2021 via email

Author

RiceCN commented Sep 22, 2021

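I found the cause: at runtime we call an external interface (to get a unique string). If this method is not called, the server shuts down normally.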
static int
lgetStrUUID(lua_State *L) {
    uuid_t u;
    char buf[40];
    memset(buf, 0, 40);
    uuid_generate(u);
    uuid_unparse(u, buf);
    lua_pushstring(L, buf);
    return 1;
}

RiceCN closed this as completed Sep 22, 2021

Owner

cloudwu commented Sep 23, 2021

lua_pushstring(L, buf);

This should be lua_pushlstring(L, buf, 40); because lua_pushstring requires a 0-terminated string, and the code above does not guarantee that.
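The distinction above can be illustrated without pulling in Lua: lua_pushstring finds the end of the text by scanning for '\0', while lua_pushlstring takes an explicit length and copies exactly that many bytes, embedded zeros included. The sketch below assumes a hypothetical helper, bounded_len, that computes a length without ever reading past the buffer. As far as the libuuid manual describes it, uuid_unparse writes a 36-byte string plus a trailing '\0', so for the buffer in this thread the helper would be expected to return 36; passing the full buffer size of 40 would also push the zero padding as part of the Lua string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Lua or libuuid): number of bytes
 * before the first '\0' in buf, looking at no more than cap bytes.
 * Returns cap when no '\0' is found within the buffer. */
size_t bounded_len(const char *buf, size_t cap) {
    const char *end = memchr(buf, '\0', cap);
    return end != NULL ? (size_t)(end - buf) : cap;
}

/* Possible use inside a Lua C function (Lua headers assumed, not
 * included in this sketch):
 *
 *   lua_pushlstring(L, buf, bounded_len(buf, sizeof buf));
 */
```

With an explicit, bounded length the push never depends on a terminator being present, and the resulting Lua string carries no trailing padding.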

Author

RiceCN commented Sep 24, 2021

@cloudwu Lesson learned. I'm embarrassed; my skills fell short here.
