CPU usage suddenly drops very low #350

Closed
yibiaochen opened this issue Oct 7, 2015 · 8 comments

@yibiaochen
Contributor

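[image]
云风, after our project had been running in production for a while, this happened twice: the skynet process's CPU usage suddenly dropped to around 1%~2%, where 200%~300% is normal. The process did not crash; it just seemed to stop handling almost all messages. Could you help me figure out how to track this down?

In the screenshot, 1.3 is CPU and 55.6 is memory.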

@cloudwu
Owner

cloudwu commented Oct 8, 2015

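You can modify the source code and add some logging. The main functions of all the worker threads are in https://github.com/cloudwu/skynet/blob/master/skynet-src/skynet_start.c , and that file is not complicated.

P.S. First confirm whether the network has stopped receiving messages.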

@yibiaochen
Contributor Author

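OK, I will rule out the network first and then add logging to skynet_start.
In principle, if the network had stopped receiving messages, CPU usage should not drop that quickly: this machine hosts 3500 players, and even with no incoming messages they would keep auto-battling for a while. I also noticed that the backend stopped printing any logs at all (normally we make an agent that has lost contact with the server for a long time exit, which prints a KILL self line).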

@yibiaochen
Contributor Author

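This happened a third time today. I confirmed that it is not a case of network packets not being received, and then I noticed that every occurrence came right after a hot update.
I checked the hot-update code and found that during a hot update, every service that has loaded the target file runs codecache.clear() once.
My current guess at the most likely cause: several services call codecache.clear(), and if one of them exits in the middle of the call without releasing the lock (for example because the player went offline), the other threads that call clear keep waiting for that lock.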

@cloudwu
Owner

cloudwu commented Oct 12, 2015

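  1. There is no chance of a deadlock inside codecache.clear. https://github.com/cloudwu/skynet/blob/master/3rd/lua/lauxlib.c#L985-L993
  2. codecache.clear is not meant to be used this way. If you need to load a file without caching its bytecode, write your own load function: open the file, read its contents, then load that string.
  3. The signature of a spin lock deadlock is the CPU spinning at full usage.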
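
As a rough illustration of point 2, the "open the file and read it out" step could look like the C sketch below. The helper name `read_whole_file` is hypothetical and is not part of skynet or Lua; the returned buffer could then be handed to the standard Lua C API call `luaL_loadbuffer`, which compiles a chunk directly from memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a malloc'ed buffer.
 * On success returns the buffer and stores its length in *out_size.
 * On failure returns NULL. The caller must free() the result. */
char *read_whole_file(const char *path, size_t *out_size) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) { fclose(f); return NULL; }

    for (;;) {
        if (len == cap) {                      /* buffer full: double it */
            size_t new_cap = cap * 2;
            char *grown = realloc(buf, new_cap);
            if (grown == NULL) { free(buf); fclose(f); return NULL; }
            buf = grown;
            cap = new_cap;
        }
        size_t n = fread(buf + len, 1, cap - len, f);
        if (n == 0) break;                     /* end of file or read error */
        len += n;
    }

    int failed = ferror(f);
    fclose(f);
    if (failed) { free(buf); return NULL; }

    *out_size = len;
    /* Next step (not shown): luaL_loadbuffer(L, buf, len, path) */
    return buf;
}
```

In a Lua service the same idea would be io.open, read("a"), and load on the resulting string.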

@cloudwu
Owner

cloudwu commented Oct 12, 2015

If you suspect a deadlock, just attach to the process with gdb and look at the call stacks to confirm it.

@yibiaochen
Contributor Author

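This has happened twice more in the past two weeks. After last week's occurrence I looked at the call stacks: one thread was stuck reading the hosts file while resolving a domain name, and most of the other threads were stuck in the socket's send_request. So we handed domain name resolution over to an nginx reverse proxy, which also got rid of the blocking DNS resolution entirely.
In the occurrence just now, the threads I looked at were all stuck opening files, either lua files or pb files: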
Thread 2 (Thread 0x7f522f3f0700 (LWP 27859)):
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f523c4d0b01 in _L_lock_423 () at genops.c:1320
#2 0x00007f523c4ced68 in __GI__IO_link_in (fp=fp@entry=0x7f46873c2a80) at genops.c:105
#3 0x00007f523c4cdbe2 in _IO_new_file_init (fp=fp@entry=0x7f46873c2a80) at fileops.c:150
#4 0x00007f523c4c2493 in __fopen_internal (filename=0x7f45e92e33e8 "script/war/perform/summon/p12029.lua", mode=0x45d6b4 "r", is32=1) at iofopen.c:86
#5 0x000000000041f1fc in luaL_loadfilex ()
#6 0x0000000000420b87 in luaL_loadfilex ()
#7 0x0000000000424ca6 in luaB_loadfile ()
#8 0x0000000000412048 in luaD_precall ()
#9 0x000000000041cf54 in luaV_execute ()
#10 0x00000000004122c0 in unroll ()
#11 0x00000000004119ac in luaD_rawrunprotected ()
#12 0x00000000004124b0 in lua_resume ()
#13 0x0000000000424fe7 in auxresume ()
#14 0x0000000000425317 in luaB_coresume ()
#15 0x0000000000412048 in luaD_precall ()
#16 0x000000000041cf54 in luaV_execute ()
#17 0x000000000041243c in luaD_call ()
#18 0x00000000004119ac in luaD_rawrunprotected ()
#19 0x000000000041269d in luaD_pcall ()
#20 0x000000000040fd3c in lua_pcallk ()
#21 0x0000000000424130 in luaB_pcall ()
#22 0x0000000000412048 in luaD_precall ()
#23 0x000000000041cf54 in luaV_execute ()
#24 0x000000000041243c in luaD_call ()
#25 0x00000000004119ac in luaD_rawrunprotected ()
#26 0x000000000041269d in luaD_pcall ()
#27 0x000000000040fd3c in lua_pcallk ()

Thread 4 (Thread 0x7f52303f2700 (LWP 27857)):
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f523c4d0b01 in _L_lock_423 () at genops.c:1320
#2 0x00007f523c4ced68 in __GI__IO_link_in (fp=fp@entry=0x7f38f7942c00) at genops.c:105
#3 0x00007f523c4cdbe2 in _IO_new_file_init (fp=fp@entry=0x7f38f7942c00) at fileops.c:150
#4 0x00007f523c4c2493 in __fopen_internal (filename=0x7f3c9052a958 "script/protocol/sysbuy.pb", mode=0x7f3d780a0618 "rb", is32=1) at iofopen.c:86
#5 0x0000000000427224 in io_open ()
#6 0x0000000000412048 in luaD_precall ()
#7 0x000000000041cf54 in luaV_execute ()
#8 0x00000000004122c0 in unroll ()
#9 0x00000000004119ac in luaD_rawrunprotected ()
#10 0x00000000004124b0 in lua_resume ()
#11 0x0000000000424fe7 in auxresume ()
#12 0x0000000000425317 in luaB_coresume ()
#13 0x0000000000412048 in luaD_precall ()
#14 0x000000000041cf54 in luaV_execute ()
#15 0x000000000041243c in luaD_call ()
#16 0x00000000004119ac in luaD_rawrunprotected ()
#17 0x000000000041269d in luaD_pcall ()
#18 0x000000000040fd3c in lua_pcallk ()
#19 0x0000000000424130 in luaB_pcall ()
#20 0x0000000000412048 in luaD_precall ()
#21 0x000000000041cf54 in luaV_execute ()
#22 0x000000000041243c in luaD_call ()
#23 0x00000000004119ac in luaD_rawrunprotected ()
#24 0x000000000041269d in luaD_pcall ()
#25 0x000000000040fd3c in lua_pcallk ()

Given this, is it possible that after codecache.clear(), the files were locked when each VM reopened them?

@cloudwu
Owner

cloudwu commented Nov 11, 2015

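  1. skynet does not write files on its own initiative; you can look into why the file is locked.
  2. I do not think codecache.clear itself contains any deadlock logic. It does not clear the old cache; it only creates a new one.
  3. Once again: do not call codecache.clear in business logic. It is only for online debugging and cannot be used as a means of implementing business logic. The wiki explains this in detail: https://github.com/cloudwu/skynet/wiki/CodeCache (Note: the current version provides a simpler cache management mode.)
  4. skynet already provides an asynchronous DNS query module: https://github.com/cloudwu/skynet/wiki/Socket#%E5%9F%9F%E5%90%8D%E6%9F%A5%E8%AF%A2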

@yibiaochen
Contributor Author

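Found it: the cause is io.popen.
1. io.popen forks a new process, and as you mentioned in an earlier blog post, the fork copies the locks over into the child.
2. Our hot update writes the names of the files to update into a txt file, then reads that txt every minute on a timer and applies the hot update. The colleague who wrote this code probably found io.popen a convenient way to run a shell script, so the place that reads the txt uses io.popen.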
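
The general hazard in point 1 can be reproduced in isolation. The sketch below is only an illustration under stated assumptions (Linux with glibc, a default non-recursive mutex); it is not the code path inside glibc's fopen or popen, and the names `hold_lock` and `child_inherits_locked_mutex` are made up for the demo. A helper thread takes a mutex, the main thread forks, and the child, which contains only the forking thread, finds the mutex already locked with no owner left to release it. Calling pthread functions in the child of a multithreaded process is not something to rely on in real code; it is done here only to observe the copied lock state.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stands in for any thread that happens to hold a lock when fork() runs:
 * it takes the mutex and exits without ever unlocking it. */
static void *hold_lock(void *arg) {
    pthread_mutex_lock((pthread_mutex_t *)arg);
    return NULL;
}

/* Returns 1 if the forked child sees the mutex as already locked,
 * 0 if the child could take it, and -1 on setup errors.
 * The mutex is deliberately left locked and never destroyed. */
int child_inherits_locked_mutex(void) {
    pthread_mutex_t lock;
    pthread_t holder;

    if (pthread_mutex_init(&lock, NULL) != 0) return -1;
    if (pthread_create(&holder, NULL, hold_lock, &lock) != 0) return -1;
    if (pthread_join(holder, NULL) != 0) return -1;   /* lock is now held */

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: memory (including the locked mutex) was copied,
         * but the thread that owns the lock does not exist here. */
        int rc = pthread_mutex_trylock(&lock);
        _exit(rc == EBUSY ? 1 : 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (!WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

If the child had called a blocking pthread_mutex_lock instead of pthread_mutex_trylock, it would wait forever, which is why forking from a multithreaded server (as io.popen does) is risky, and why reading the txt with plain file I/O avoids the problem.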
