Using skynet.memlimit causes a core dump #494

Closed
Jexocn opened this issue May 10, 2016 · 6 comments

Jexocn commented May 10, 2016

Modify test/testmemlimit.lua as follows to reproduce the crash:

local skynet = require "skynet"

local names = {"cluster", "dns", "mongo", "mysql", "redis", "sharedata", "socket", "sproto"}

-- set sandbox memory limit to 1M, must set here (at start, out of skynet.start)
skynet.memlimit(1 * 1024 * 1024)

skynet.start(function()
    local a = {}
    local limit
    local ok, err = pcall(function()
        for i=1, 12355 do
            limit = i
            table.insert(a, {})
        end
    end)
    local libs = {}
    for k,v in ipairs(names) do
        libs[v] = require(v)
    end
    skynet.error(limit, err)
    skynet.exit()
end)

The core backtrace is as follows:

Core was generated by `./skynet examples/config'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000004102df in cloneproto ()
(gdb) bt
#0  0x00000000004102df in cloneproto ()
#1  0x0000000000411919 in lua_clonefunction ()
#2  0x00000000004232ca in luaL_loadfilex ()
#3  0x00000000004305fa in searcher_Lua ()
#4  0x0000000000413bef in luaD_precall ()
#5  0x0000000000413fc3 in luaD_call ()
#6  0x0000000000414021 in luaD_callnoyield ()
#7  0x00000000004116c9 in lua_callk ()
#8  0x000000000042fc3f in findloader ()
#9  0x000000000042fd70 in ll_require ()
#10 0x0000000000413bef in luaD_precall ()
#11 0x000000000041f36e in luaV_execute ()
#12 0x0000000000413fcf in luaD_call ()
#13 0x000000000041180e in lua_pcallk ()
#14 0x00000000004265df in luaB_xpcall ()
#15 0x0000000000413bef in luaD_precall ()
#16 0x000000000041f0c6 in luaV_execute ()
#17 0x000000000041340c in luaD_rawrunprotected ()
#18 0x0000000000414080 in lua_resume ()
#19 0x00000000004274a7 in auxresume ()
#20 0x00000000004277d7 in luaB_coresume ()
#21 0x0000000000413bef in luaD_precall ()
#22 0x000000000041f36e in luaV_execute ()
#23 0x0000000000413fcf in luaD_call ()
#24 0x0000000000414021 in luaD_callnoyield ()
#25 0x000000000041340c in luaD_rawrunprotected ()
#26 0x000000000041426d in luaD_pcall ()
#27 0x000000000041178c in lua_pcallk ()
#28 0x00000000004266b0 in luaB_pcall ()
#29 0x0000000000413bef in luaD_precall ()
#30 0x000000000041f36e in luaV_execute ()
#31 0x0000000000413fcf in luaD_call ()
#32 0x0000000000414021 in luaD_callnoyield ()
#33 0x000000000041340c in luaD_rawrunprotected ()
#34 0x000000000041426d in luaD_pcall ()
#35 0x000000000041178c in lua_pcallk ()
#36 0x00007f11881f6e29 in _cb (context=0x7f118ea40070, ud=0x7f118ea17c08, 
    type=1, session=1, source=0, msg=0x0, sz=0) at lualib-src/lua-skynet.c:50
#37 0x000000000040a008 in dispatch_message (ctx=0x7f118ea40070, 
    msg=0x7f118c7f6e40) at skynet-src/skynet_server.c:259
#38 0x000000000040aaa0 in skynet_context_message_dispatch (
    sm=sm@entry=0x7f118ea15840, q=q@entry=0x7f118a410100, 
    weight=weight@entry=-1) at skynet-src/skynet_server.c:313
#39 0x000000000040b19d in thread_worker (p=<optimized out>)
    at skynet-src/skynet_start.c:133
#40 0x00007f118fa14182 in start_thread (arg=0x7f118c7f7700)
    at pthread_create.c:312
#41 0x00007f118f02f47d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

davidxifeng commented May 10, 2016

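I tested it and reproduced the problem. My initial guess is that, under the memory limit, a memory allocation in the shared function prototype code fails without being checked, and the process then crashes.

Perhaps every memory allocation in the shared-proto code that cloudwu added needs such a check?

With debug symbols enabled, it crashes on this line: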
https://github.com/cloudwu/skynet/blob/master/3rd/lua/lapi.c#L1036

  f->p=luaM_newvector(L,n,struct Proto *);
  for (i=0; i<n; i++) f->p[i]=NULL;
  for (i=0; i<n; i++) {
    f->p[i]=cloneproto(L, src->p[i]); // crashes here
  }

cloudwu added a commit that referenced this issue May 10, 2016

cloudwu commented May 10, 2016

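Thanks. This is a very subtle bug. Please take a look at my new commit and help review it :)

A brief description of the problem:

This code was written with allocation failure in mind, but because it was never actually tested, one case was missed.

A proto object must be linked into its parent structure before its internal data is filled in. Otherwise, when a memory allocation fails, the Lua GC runs a collection pass to try to reclaim unused memory. Because the proto object has not been attached yet, the newly allocated object is collected immediately. Memory then becomes sufficient again, so the allocation returns normally, but the object allocated earlier has already been freed, which leaves the f->p pointer null.

The fix is to move the luaF_newproto call out of cloneproto: assign the new proto first, then call cloneproto recursively.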
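
The ordering problem described above can be reproduced in a toy model. The sketch below is not the skynet patch and does not use any Lua API: `Node`, `new_node`, `collect` and `child_survives` are hypothetical names, and the "memory limit" is just a cap on the number of live nodes. It only assumes that an over-budget allocation triggers an emergency collection that frees everything unreachable from the root.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: Node, new_node, collect and child_survives are
   hypothetical names, not Lua or skynet APIs. */
enum { MAX_NODES = 8, MAX_KIDS = 2 };

typedef struct Node {
    int live;                     /* 1 while allocated, 0 once collected */
    int marked;                   /* set by the mark phase */
    struct Node *kids[MAX_KIDS];
} Node;

static Node pool[MAX_NODES];
static int next_slot;             /* slots are never reused, so stale pointers stay detectable */
static int budget;                /* maximum number of live nodes (the "memory limit") */
static Node *root;                /* the only GC root */

static int live_count(void) {
    int n = 0;
    for (int i = 0; i < next_slot; i++) n += pool[i].live;
    return n;
}

static void mark(Node *n) {
    if (n == NULL || n->marked) return;
    n->marked = 1;
    for (int i = 0; i < MAX_KIDS; i++) mark(n->kids[i]);
}

/* Collect every node that is not reachable from root. */
static void collect(void) {
    for (int i = 0; i < next_slot; i++) pool[i].marked = 0;
    mark(root);
    for (int i = 0; i < next_slot; i++)
        if (!pool[i].marked) pool[i].live = 0;
}

/* Allocate a node; when over budget, run an emergency collection and retry. */
static Node *new_node(void) {
    if (live_count() >= budget) collect();
    if (live_count() >= budget || next_slot >= MAX_NODES) return NULL;
    Node *n = &pool[next_slot++];
    memset(n, 0, sizeof *n);
    n->live = 1;
    return n;
}

/* Returns 1 if the cloned child survives, 0 if the collector freed it. */
int child_survives(int anchor_first) {
    memset(pool, 0, sizeof pool);
    next_slot = 0;
    budget = 3;
    root = NULL;
    root = new_node();
    (void)new_node();                         /* unreachable garbage */
    Node *child = new_node();                 /* the heap is now full */
    assert(root != NULL && child != NULL);
    if (anchor_first) root->kids[0] = child;  /* anchor before allocating more */
    Node *grandchild = new_node();            /* triggers the emergency collection */
    child->kids[0] = grandchild;
    if (!anchor_first) root->kids[0] = child; /* too late: child may already be freed */
    return child->live;
}
```

In this model `child_survives(1)` returns 1 and `child_survives(0)` returns 0: when the child is attached only after the next allocation, the emergency collection frees it together with the real garbage, and the parent ends up pointing at a dead object.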

@davidxifeng

Nice!

I need to hurry up and study the Lua source code.


Jexocn commented May 11, 2016

Take this code in the test script:

    for k,v in ipairs(names) do
        libs[v] = require(v)
    end

Change it to:

    for k,v in ipairs(names) do
        local ok, m = pcall(require, v)
        if ok then
            libs[v] = m
        end
    end

A core dump still occurs.
The backtrace is as follows:

Core was generated by `./skynet examples/config'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000415b08 in propagatemark ()
(gdb) bt
#0  0x0000000000415b08 in propagatemark ()
#1  0x0000000000416571 in singlestep ()
#2  0x0000000000416c38 in luaC_fullgc ()
#3  0x0000000000416cff in luaM_realloc_ ()
#4  0x00000000004117ce in cloneproto ()
#5  0x00000000004118fa in cloneproto ()
#6  0x0000000000411973 in lua_clonefunction ()
#7  0x000000000042330a in luaL_loadfilex ()
#8  0x000000000043063a in searcher_Lua ()
#9  0x0000000000413c2f in luaD_precall ()
#10 0x0000000000414003 in luaD_call ()
#11 0x0000000000414061 in luaD_callnoyield ()
#12 0x0000000000411589 in lua_callk ()
#13 0x000000000042fc7f in findloader ()
#14 0x000000000042fdb0 in ll_require ()
#15 0x0000000000413c2f in luaD_precall ()
#16 0x0000000000414003 in luaD_call ()
#17 0x00000000004116ce in lua_pcallk ()
#18 0x00000000004266f0 in luaB_pcall ()
#19 0x0000000000413c2f in luaD_precall ()
#20 0x000000000041f3ae in luaV_execute ()
#21 0x000000000041400f in luaD_call ()
#22 0x00000000004116ce in lua_pcallk ()
#23 0x000000000042661f in luaB_xpcall ()
#24 0x0000000000413c2f in luaD_precall ()
#25 0x000000000041f106 in luaV_execute ()
#26 0x000000000041344c in luaD_rawrunprotected ()
#27 0x00000000004140c0 in lua_resume ()
#28 0x00000000004274e7 in auxresume ()
#29 0x0000000000427817 in luaB_coresume ()
#30 0x0000000000413c2f in luaD_precall ()
#31 0x000000000041f3ae in luaV_execute ()
#32 0x000000000041400f in luaD_call ()
#33 0x0000000000414061 in luaD_callnoyield ()
#34 0x000000000041344c in luaD_rawrunprotected ()
#35 0x00000000004142ad in luaD_pcall ()
#36 0x000000000041164c in lua_pcallk ()
#37 0x00000000004266f0 in luaB_pcall ()
#38 0x0000000000413c2f in luaD_precall ()
#39 0x000000000041f3ae in luaV_execute ()
#40 0x000000000041400f in luaD_call ()
#41 0x0000000000414061 in luaD_callnoyield ()
#42 0x000000000041344c in luaD_rawrunprotected ()
#43 0x00000000004142ad in luaD_pcall ()
#44 0x000000000041164c in lua_pcallk ()
#45 0x00007f147adf5e29 in _cb (context=0x7f1478a28000, ud=0x7f1478a1e008, 
    type=1, session=1, source=0, msg=0x0, sz=0) at lualib-src/lua-skynet.c:50
#46 0x000000000040a038 in dispatch_message (ctx=0x7f1478a28000, 
    msg=0x7f147b7fae40) at skynet-src/skynet_server.c:259
#47 0x000000000040aad0 in skynet_context_message_dispatch (
    sm=sm@entry=0x7f1481615920, q=q@entry=0x7f1478a131c0, 
    weight=weight@entry=0) at skynet-src/skynet_server.c:313
#48 0x000000000040b1cd in thread_worker (p=<optimized out>)
    at skynet-src/skynet_start.c:133
#49 0x00007f14824ed182 in start_thread (arg=0x7f147b7fb700)
    at pthread_create.c:312
#50 0x00007f1481b0847d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

wangyi0226 added a commit to wangyi0226/skynet that referenced this issue May 11, 2016

cloudwu commented May 11, 2016

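I looked into it this time, and it seems the bug is not caused by my modifications to the Lua VM but is a bug in the Lua GC itself :)

I'll first write a test case against stock Lua to reproduce it.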

cloudwu added a commit that referenced this issue May 11, 2016

cloudwu commented May 11, 2016

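Sorry, it was my problem after all.
Because the proto structure is now shared, sizek (the number of constants) and sizep (the number of child function prototypes) are shared as well.

Stock Lua assigns f->sizek and f->sizep only after f->k and f->p have been allocated successfully, so during GC mark, if a pointer is null, the loop length is also 0 and nothing goes wrong.

In the modified version, f->sp->sizek and f->sp->sizep are already non-zero, so an extra check is needed for whether f->k and f->p are null pointers.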
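
A minimal sketch of the extra check, assuming simplified stand-in structs rather than the real Lua definitions: constants are plain ints, and `SharedProto`, `Proto`, `mark_proto` and `demo_mark` here are hypothetical and model only the fields mentioned in this thread. It is not the code from the actual commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real Lua structs: constants are plain
   ints and only the fields discussed in this thread are modelled. */
typedef struct SharedProto {
    int sizek;                /* number of constants, shared by all clones */
    int sizep;                /* number of child prototypes, shared by all clones */
} SharedProto;

typedef struct Proto {
    const SharedProto *sp;
    int *k;                   /* per-clone constants; NULL until allocated */
    struct Proto **p;         /* per-clone children; NULL until allocated */
    int marked;
} Proto;

/* Mark f and everything it owns; returns the number of constants visited. */
int mark_proto(Proto *f) {
    int visited = 0;
    assert(f != NULL && f->sp != NULL);
    f->marked = 1;
    if (f->k != NULL)         /* the shared size says nothing about this clone's array */
        for (int i = 0; i < f->sp->sizek; i++) { (void)f->k[i]; visited++; }
    if (f->p != NULL)
        for (int i = 0; i < f->sp->sizep; i++)
            if (f->p[i] != NULL) visited += mark_proto(f->p[i]);
    return visited;
}

/* Marks either a half-built clone (arrays still NULL) or a complete one. */
int demo_mark(int complete) {
    static const SharedProto parent_sp = { 3, 1 };
    static const SharedProto child_sp = { 2, 0 };
    int parent_k[3] = { 1, 2, 3 };
    int child_k[2] = { 4, 5 };
    Proto child = { &child_sp, child_k, NULL, 0 };
    Proto *kids[1] = { &child };
    Proto parent = { &parent_sp, NULL, NULL, 0 };
    if (complete) { parent.k = parent_k; parent.p = kids; }
    return mark_proto(&parent);
}
```

With the null checks in place, `demo_mark(0)` marks a clone whose shared sizes are 3 and 1 but whose arrays are still NULL and returns 0 instead of dereferencing a null pointer; `demo_mark(1)` visits all 5 constants of the parent and its child.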

wangyi0226 added a commit to wangyi0226/skynet that referenced this issue May 12, 2016
@Jexocn Jexocn closed this as completed Sep 27, 2017