core dump under macosx #99

Closed
flashjay opened this issue Apr 26, 2014 · 3 comments

Comments

@flashjay
Contributor

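I wrote a simple test that creates and destroys agents. With 8 threads configured, it starts 8 services, each of which launches and destroys agents in a loop (2^24 times in total). I wanted to find out what happens once the service node id exceeds 24 bits.
Every run core dumps about halfway through, as shown below: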
$ ulimit -c unlimited
$ ./skynet example/config
......
[:17dc074] LAUNCH snlua agent
[:17dc075] LAUNCH snlua agent
[:17dc076] LAUNCH snlua agent
[:17dc077] LAUNCH snlua agent
[:17dc078] LAUNCH snlua agent
[:17dc079] LAUNCH snlua agent
[:17dc07a] LAUNCH snlua agent
[:17dc07b] LAUNCH snlua agent
[:100000e] KILL :17dc074
[:17dc074] exit
[:100001e] KILL :17dc075
[:17dc075] exit
[:1000010] KILL :17dc076
[:17dc076] exit
[:100000f] KILL :17dc077
[:17dc077] exit
[:100001d] K./start.sh: line 3: 684 Segmentation fault: 11 (core dumped) ./skynet example/config
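
A note on the 24-bit question: the handles in the log, such as :17dc074, look like a harbor (node) id in the high byte followed by a 24-bit local index. The sketch below assumes that layout. The names `sketch_harbor`, `sketch_index`, and `sketch_next_handle` are hypothetical and are not skynet's own functions. Under that assumption the index wraps within its 24 bits rather than spilling into the harbor byte. A real allocator would also have to skip indices that are still in use.

```c
#include <stdint.h>

/* Assumed layout (illustrative, not copied from skynet):
 * high 8 bits = harbor (node) id, low 24 bits = local service index. */
#define SKETCH_INDEX_BITS 24
#define SKETCH_INDEX_MASK 0xffffffu

/* Extract the harbor id from a 32-bit handle. */
static uint32_t sketch_harbor(uint32_t handle) {
    return handle >> SKETCH_INDEX_BITS;
}

/* Extract the 24-bit local index from a 32-bit handle. */
static uint32_t sketch_index(uint32_t handle) {
    return handle & SKETCH_INDEX_MASK;
}

/* Compute the following handle on the same harbor. The index wraps to 0
 * after 0xffffff instead of carrying into the harbor byte. */
static uint32_t sketch_next_handle(uint32_t handle) {
    uint32_t next_index = (sketch_index(handle) + 1u) & SKETCH_INDEX_MASK;
    return (sketch_harbor(handle) << SKETCH_INDEX_BITS) | next_index;
}
```

The last handle launched in the log, :17dc07b, has index 0x7dc07b, which is just under half of 2^24, so the crash happens well before any wrap-around would occur.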

=========== core dump follows =============
(lldb) bt all

* thread #1: tid = 0x0000, 0x00007fff8b0e2a3a libsystem_kernel.dylib`__semwait_signal + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff8b0e2a3a libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff96ce17f3 libsystem_pthread.dylib`pthread_join + 433
    frame #2: 0x000000010a6c8b7a skynet`_start(thread=<unavailable>) + 442 at skynet_start.c:174
    frame #3: 0x000000010a6c8956 skynet`skynet_start(config=0x00007fff5553a7f8) + 214 at skynet_start.c:222
    frame #4: 0x000000010a6c61f6 skynet`main(argc=, argv=) + 1158 at skynet_main.c:131

  thread #2: tid = 0x0001, 0x00007fff8b0e2a3a libsystem_kernel.dylib`__semwait_signal + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e2a3a libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff97fd6dc0 libsystem_c.dylib`nanosleep + 200
    frame #2: 0x00007fff97fd6c1f libsystem_c.dylib`sleep + 42
    frame #3: 0x000000010a6c8cc5 skynet`_monitor(p=0x000000010b186020) + 101 at skynet_start.c:91
    frame #4: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #5: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #3: tid = 0x0002, 0x00007fff8b0e2a3a libsystem_kernel.dylib`__semwait_signal + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e2a3a libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff97fd6dc0 libsystem_c.dylib`nanosleep + 200
    frame #2: 0x00007fff97fd6cb2 libsystem_c.dylib`usleep + 54
    frame #3: 0x000000010a6c8d2b skynet`_timer(p=0x000000010b186020) + 59 at skynet_start.c:105
    frame #4: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #5: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #4: tid = 0x0003, 0x00007fff8b0e364a libsystem_kernel.dylib`kevent + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e364a libsystem_kernel.dylib`kevent + 10
    frame #1: 0x000000010a6ca7c3 skynet`socket_server_poll [inlined] sp_wait(max=<unavailable>) + 5 at socket_kqueue.h:70
    frame #2: 0x000000010a6ca7be skynet`socket_server_poll(ss=0x000000010b400000, result=0x000000010aa0fe60, more=0x000000010aa0fe5c) + 382 at socket_server.c:835
    frame #3: 0x000000010a6c9bad skynet`skynet_socket_poll + 45 at skynet_socket.c:75
    frame #4: 0x000000010a6c8db9 skynet`_socket(p=0x000000010b186020) + 89 at skynet_start.c:54
    frame #5: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #6: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #5: tid = 0x0004, 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff96cdfc3b libsystem_pthread.dylib`_pthread_cond_wait + 727
    frame #2: 0x000000010a6c8e36 skynet`_worker(p=) + 102 at skynet_start.c:127
    frame #3: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #4: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #6: tid = 0x0005, 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff96cdfc3b libsystem_pthread.dylib`_pthread_cond_wait + 727
    frame #2: 0x000000010a6c8e36 skynet`_worker(p=) + 102 at skynet_start.c:127
    frame #3: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #4: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #7: tid = 0x0006, 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff96cdfc3b libsystem_pthread.dylib`_pthread_cond_wait + 727
    frame #2: 0x000000010a6c8e36 skynet`_worker(p=) + 102 at skynet_start.c:127
    frame #3: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #4: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #8: tid = 0x0007, 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff96cdfc3b libsystem_pthread.dylib`_pthread_cond_wait + 727
    frame #2: 0x000000010a6c8e36 skynet`_worker(p=) + 102 at skynet_start.c:127
    frame #3: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #4: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #9: tid = 0x0008, 0x000000010a6cca9e skynet`skynet_lalloc [inlined] skynet_free(ptr=0x000000010e051be0) + 111 at malloc_hook.c:173, stop reason = signal SIGSTOP
    frame #0: 0x000000010a6cca9e skynet`skynet_lalloc [inlined] skynet_free(ptr=0x000000010e051be0) + 111 at malloc_hook.c:173
    frame #1: 0x000000010a6cca2f skynet`skynet_lalloc(ud=0x00000000017dc19c, ptr=0x000000010e051be0, osize=4470229392, nsize=<unavailable>) + 31 at malloc_hook.c:194
    frame #2: 0x000000010a6d8d87 skynet`luaM_realloc_ + 39
    frame #3: 0x000000010a6d5c45 skynet`sweeplist + 405
    frame #4: 0x000000010a6d5a96 skynet`luaC_freeallobjects + 230
    frame #5: 0x000000010a6dd6f2 skynet`close_state + 34
    frame #6: 0x000000010a8867e1 snlua.so`snlua_release(l=0x000000010e0658a0) + 17 at service_snlua.c:277
    frame #7: 0x000000010a6c7994 skynet`skynet_context_release [inlined] _delete_context(ctx=0x000000010e082be0) + 12 at skynet_server.c:152
    frame #8: 0x000000010a6c7988 skynet`skynet_context_release(ctx=0x000000010e082be0) + 24 at skynet_server.c:161
    frame #9: 0x000000010a6c6471 skynet`skynet_handle_retire(handle=25018780) + 113 at skynet_handle.c:79
    frame #10: 0x000000010a6c8367 skynet`skynet_command [inlined] handle_exit(context=, handle=) + 5 at skynet_server.c:287
    frame #11: 0x000000010a6c8362 skynet`skynet_command(context=<unavailable>, cmd=<unavailable>, param=<unavailable>) + 1538 at skynet_server.c:372
    frame #12: 0x000000010ab9c63f skynet.so`_command(L=0x000000010c49d2c0) + 95 at lua-skynet.c:85
    frame #13: 0x000000010a6d3b08 skynet`luaD_precall + 520
    frame #14: 0x000000010a6e153b skynet`luaV_execute + 1915
    frame #15: 0x000000010a6d44c0 skynet`unroll + 160
    frame #16: 0x000000010a6d35d6 skynet`luaD_rawrunprotected + 86
    frame #17: 0x000000010a6d4153 skynet`lua_resume + 83
    frame #18: 0x000000010a6e6482 skynet`auxresume + 82
    frame #19: 0x000000010a6e61b9 skynet`luaB_coresume + 73
    frame #20: 0x000000010aba2bdd profile.so`lresume(L=0x000000010c41f200) + 189 at lua-profile.c:105
    frame #21: 0x000000010a6d3b08 skynet`luaD_precall + 520
    frame #22: 0x000000010a6e153b skynet`luaV_execute + 1915
    frame #23: 0x000000010a6d40c2 skynet`luaD_call + 66
    frame #24: 0x000000010a6d35d6 skynet`luaD_rawrunprotected + 86
    frame #25: 0x000000010a6d4588 skynet`luaD_pcall + 56
    frame #26: 0x000000010a6cf157 skynet`lua_pcallk + 215
    frame #27: 0x000000010a6e524c skynet`luaB_pcall + 76
    frame #28: 0x000000010a6d3b08 skynet`luaD_precall + 520
    frame #29: 0x000000010a6e153b skynet`luaV_execute + 1915
    frame #30: 0x000000010a6d40c2 skynet`luaD_call + 66
    frame #31: 0x000000010a6d35d6 skynet`luaD_rawrunprotected + 86
    frame #32: 0x000000010a6d4588 skynet`luaD_pcall + 56
    frame #33: 0x000000010a6cf157 skynet`lua_pcallk + 215
    frame #34: 0x000000010ab9c8d6 skynet.so`_cb(context=0x000000010c498160, ud=0x000000010c41f200, type=, session=708227, source=16777222, msg=0x000000010c489250, sz=) + 182 at lua-skynet.c:33
    frame #35: 0x000000010a6c7c0d skynet`skynet_context_message_dispatch [inlined] _dispatch_message(ctx=<unavailable>, msg=0x000ace8301000006) + 68 at skynet_server.c:205
    frame #36: 0x000000010a6c7bc9 skynet`skynet_context_message_dispatch(sm=0x000000010b0104e0) + 217 at skynet_server.c:241
    frame #37: 0x000000010a6c8e08 skynet`_worker(p=<unavailable>) + 56 at skynet_start.c:121
    frame #38: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #39: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #10: tid = 0x0009, 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff96cdfc3b libsystem_pthread.dylib`_pthread_cond_wait + 727
    frame #2: 0x000000010a6c8e36 skynet`_worker(p=) + 102 at skynet_start.c:127
    frame #3: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #4: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #11: tid = 0x000a, 0x000000010a6ddae8 skynet`luaS_newlstr + 232, stop reason = signal SIGSTOP
    frame #0: 0x000000010a6ddae8 skynet`luaS_newlstr + 232
    frame #1: 0x000000010a6ce37e skynet`lua_getglobal + 62
    frame #2: 0x000000010a6c9968 skynet`skynet_getenv(key=) + 56 at skynet_env.c:26
    frame #3: 0x000000010a6c8392 skynet`skynet_command(context=<unavailable>, cmd=<unavailable>, param=0x000000010d932718) + 1586 at skynet_server.c:409
    frame #4: 0x000000010ab9c63f skynet.so`_command(L=0x000000010d499700) + 95 at lua-skynet.c:85
    frame #5: 0x000000010a6d3b08 skynet`luaD_precall + 520
    frame #6: 0x000000010a6e1587 skynet`luaV_execute + 1991
    frame #7: 0x000000010a6d40c2 skynet`luaD_call + 66
    frame #8: 0x000000010a6cf059 skynet`lua_callk + 73
    frame #9: 0x000000010a6eda49 skynet`ll_require + 489
    frame #10: 0x000000010a6d3b08 skynet`luaD_precall + 520
    frame #11: 0x000000010a6e153b skynet`luaV_execute + 1915
    frame #12: 0x000000010a6d40c2 skynet`luaD_call + 66
    frame #13: 0x000000010a6d35d6 skynet`luaD_rawrunprotected + 86
    frame #14: 0x000000010a6d4588 skynet`luaD_pcall + 56
    frame #15: 0x000000010a6cf157 skynet`lua_pcallk + 215
    frame #16: 0x000000010a8866b6 snlua.so`_init(l=, ctx=0x000000010d4a3890, args=) + 566 at service_snlua.c:223
    frame #17: 0x000000010a886341 snlua.so`_launch(context=0x000000010d4a3890, ud=0x000000010d448f80, type=<unavailable>, session=<unavailable>, source=<unavailable>, msg=0x000000010cda1370, sz=6) + 113 at service_snlua.c:242
    frame #18: 0x000000010a6c7c0d skynet`skynet_context_message_dispatch [inlined] _dispatch_message(ctx=, msg=0x00000000017dc1a7) + 68 at skynet_server.c:205
    frame #19: 0x000000010a6c7bc9 skynet`skynet_context_message_dispatch(sm=0x000000010b010520) + 217 at skynet_server.c:241
    frame #20: 0x000000010a6c8e08 skynet`_worker(p=) + 56 at skynet_start.c:121
    frame #21: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #22: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

  thread #12: tid = 0x000b, 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fff8b0e2716 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff96cdfc3b libsystem_pthread.dylib`_pthread_cond_wait + 727
    frame #2: 0x000000010a6c8e36 skynet`_worker(p=) + 102 at skynet_start.c:127
    frame #3: 0x00007fff96cdd899 libsystem_pthread.dylib`_pthread_body + 138
    frame #4: 0x00007fff96cdd72a libsystem_pthread.dylib`_pthread_start + 137

@cloudwu
Owner

cloudwu commented Apr 27, 2014

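Today I tested the latest master on Linux. It launched agents 2^25 times and ran to completion without problems.
I ran the same test on a Mac mini, and it does indeed core dump.

Judging from the core information above: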

thread #9: tid = 0x0008, 0x000000010a6cca9e skynet`skynet_lalloc [inlined] skynet_free(ptr=0x000000010e051be0) + 111 at malloc_hook.c:173, stop reason = signal SIGSTOP
  frame #0: 0x000000010a6cca9e skynet`skynet_lalloc [inlined] skynet_free(ptr=0x000000010e051be0) + 111 at malloc_hook.c:173
  frame #1: 0x000000010a6cca2f skynet`skynet_lalloc(ud=0x00000000017dc19c, ptr=0x000000010e051be0, osize=4470229392, nsize=<unavailable>) + 31 at malloc_hook.c:194
  frame #2: 0x000000010a6d8d87 skynet`luaM_realloc_ + 39
  frame #3: 0x000000010a6d5c45 skynet`sweeplist + 405
  frame #4: 0x000000010a6d5a96 skynet`luaC_freeallobjects + 230

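This is where the crash occurs. The ud passed to Lua's lalloc can only be 0, so the value 0x00000000017dc19c here suggests that some other code overwrote memory. The odd part is that in Lua's global state structure, the frealloc function pointer and ud are laid out next to each other. The frealloc pointer is evidently still correct, yet ud has been changed to a non-zero value.

osize is also far too large to be a valid value. (Most likely a wrong string length was read while a string was being freed.)

If a core file is available, it is worth checking what overwrote the global state structure. That may reveal the root cause.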
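
To illustrate the allocator contract discussed above, here is a minimal sketch of a Lua-style allocation function that checks ud on every call. It assumes, as stated in the comment, that the Lua state was created with a NULL ud. The type `sketch_alloc_fn` mirrors the shape of Lua's `lua_Alloc` but is declared locally so the example builds without Lua headers, and `sketch_guarded_alloc` is a hypothetical name, not skynet's `skynet_lalloc`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Same shape as Lua's lua_Alloc, declared locally for this sketch. */
typedef void *(*sketch_alloc_fn)(void *ud, void *ptr, size_t osize, size_t nsize);

/* Hypothetical guarded allocator. The state is assumed to have been
 * created with ud == NULL, so any other value indicates corruption
 * and is caught here instead of deep inside the allocator. */
static void *sketch_guarded_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
    (void)osize;                 /* this sketch does not track block sizes */
    assert(ud == NULL);
    if (nsize == 0) {
        free(ptr);               /* nsize == 0 means "free"; free(NULL) is a no-op */
        return NULL;
    }
    return realloc(ptr, nsize);  /* covers both fresh allocation and resize */
}
```

A guard like this would not fix the underlying bug, but in a debug build it could move the failure closer to the point where the state is first seen in a bad condition.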

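Starting and destroying services is a very simple workload, so the problem should be easy to track down. I tried switching from jemalloc back to malloc, and the problem seems to go away.

Confirming the cause will take some more time. (Things to try include swapping the allocator and turning off -O optimization.)

If possible, please help find the bug.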

@cloudwu
Owner

cloudwu commented Apr 27, 2014

I tried several more times afterwards, and it no longer core dumps :(

@cloudwu
Owner

cloudwu commented Apr 28, 2014

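This problem is most likely caused by jemalloc, because skynet does not use jemalloc's malloc zone directly (on Mac OS X).

Without the malloc zone's force_lock and force_unlock calls, je_malloc does not seem to work correctly in a multithreaded environment.

The current workaround is to disable jemalloc on Mac OS X. With the standard library's malloc, the problem does not occur.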
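
As a rough sketch of what such a workaround can look like at the source level, the wrapper below picks the allocator at compile time and always falls back to the standard library on Apple platforms. `SKETCH_USE_JEMALLOC`, `sketch_malloc`, and `sketch_free` are hypothetical names for illustration; this is not skynet's actual build logic. The jemalloc branch additionally assumes jemalloc was built with the `je_` symbol prefix.

```c
#include <stddef.h>
#include <stdlib.h>

/* SKETCH_USE_JEMALLOC is a hypothetical build flag. On Apple platforms
 * the jemalloc branch is never taken, whatever the flag says. */
#if defined(SKETCH_USE_JEMALLOC) && !defined(__APPLE__)
#include <jemalloc/jemalloc.h>
#define SKETCH_ALLOCATOR_NAME "jemalloc"
/* Assumes jemalloc was configured with the je_ prefix. */
static void *sketch_malloc(size_t n) { return je_malloc(n); }
static void sketch_free(void *p) { je_free(p); }
#else
#define SKETCH_ALLOCATOR_NAME "malloc"
/* Standard library allocator: the fallback used on Mac OS X. */
static void *sketch_malloc(size_t n) { return malloc(n); }
static void sketch_free(void *p) { free(p); }
#endif
```

Keeping the choice behind one pair of functions means the rest of the code does not need to know which allocator is active.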

@cloudwu cloudwu closed this as completed Apr 28, 2014