Gateway sigsegv's when cleaning up channels using ca_clear_channel #1

Open
ralphlange opened this issue Mar 19, 2016 · 7 comments

@ralphlange
Contributor

Original LaunchPad Bug #1279147 reported by Murali Shankar on 2014-02-12:

At LCLS, the archiver appliances connect to the IOCs through a CA gateway. The gateway crashes once in a while. This does not seem to be related to an “out-of-memory” issue or a “Gateway has been running for a long time” issue. Instead, it seems to be related to the gateway cleaning up PVs (Feb 07 04:42) from an IOC that is CPU-overloaded and keeps disconnecting (Feb 07 02:41).

From the gateway logs...

>> Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting
>> Feb 07 02:21:23 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068

>> Feb 07 02:21:23 !!! Errlog message received (message is above)
>> Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting

>> Feb 07 02:41:49 !!! Errlog message received (message is above)
>> Feb 07 02:41:49 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068
>> Feb 07 04:42:32 PV Gateway Aborting (SIGSEGV)

I have core dumps and can examine the variables; the gateway is indeed trying to clean up the PVs from this IOC using ca_clear_channel. However, the crash occurs in a fundamental part of EPICS base (tsDLList.h:238). I can provide more details or the core files if needed.

Regards,
Murali

(gdb) bt
#0 0x0016c410 in __kernel_vsyscall ()
#1 0x0086de30 in raise () from /lib/libc.so.6
#2 0x0086f741 in abort () from /lib/libc.so.6
#3 0x080513a4 in sig_end (sig=11) at ../gateway.cc:300
#4 <signal handler called>
#5 0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
#6 tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
#7 0x007512b7 in nciu::destroy (this=0x17e24b88, guard=...) at ../nciu.cpp:93
#8 0x00768347 in oldChannelNotify::destructor (this=0x17e179f0, guard=...) at ../oldChannelNotify.cpp:71
#9 0x00749039 in ca_clear_channel (pChan=0x17e179f0) at ../access.cpp:386
#10 0x080582e0 in gatePvData::~gatePvData (this=0x157f79b0, __in_chrg=<value optimized out>) at ../gatePv.cc:240
#11 0x08062064 in gatePvNode::destroy (this=0x1ca02110) at ../gateServer.h:69
#12 0x0805d6e7 in gateServer::inactiveDeadCleanup (this=0x925af40) at ../gateServer.cc:1490
#13 0x08060fc8 in gateServer::mainLoop (this=0x925af40) at ../gateServer.cc:285
#14 0x0804ef18 in startEverything (prefix=0xbfd7bbe2 "GWLCLSARCH") at ../gateway.cc:656
#15 0x080511a8 in main (argc=16, argv=0xbfd7b494) at ../gateway.cc:1299
……
(gdb) up
#4 <signal handler called>
(gdb) up
#5 0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
238 prevNode.pNext = theNode.pNext;
(gdb) print theNode
$1 = (tsDLNode<nciu> &) @0x17e24b98: {pNext = 0x17d44d68, pPrev = 0x0}
(gdb) up
#6 tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
1981    this->createReqPend.remove ( chan );
(gdb) print chan
$2 = (nciu &) @0x17e24b88: {<cacChannel> = {_vptr.cacChannel = 0x781168, static priorityMax = 99, static priorityMin = 0, static priorityDefault = 0, static priorityLinksDB = 99,
    static priorityArchive = 49, static priorityOPI = 0, callback = @0x17e179f0}, <chronIntIdRes<nciu>> = {<chronIntId> = {<intId<unsigned int, 8u, 32u>> = {
        id = 833073}, <No data fields>}, <tsSLNode<nciu>> = {pNext = 0x0}, <No data fields>}, <channelNode> = {<tsDLNode<nciu>> = {pNext = 0x17d44d68, pPrev = 0x0},
    listMember = cs_createReqPend}, <privateInterfaceForIO> = {_vptr.privateInterfaceForIO = 0x7811d8}, eventq = {pFirst = 0x0, pLast = 0x0, itemCount = 0}, accessRightState = {
    f_readPermit = false, f_writePermit = false, f_operatorConfirmationRequest = false}, cacCtx = @0x925e2d8, pNameStr = 0x1c5838a8 "BLM:UND1:MP01:XILINX_CELS.LOW", piiu = 0xaf728260,
  sid = 4294967295, count = 0, retry = 1, nameLength = 30, typeCode = 65535, priority = 0 '\000'}
(gdb) quit

More information
This is PV Gateway Version 2.0.3.0 [Mar 2 2012 09:46:57]
Gateway is built against base-R3-14-12 with a few patches applied (I can provide a full list if needed).
IOC eioc-und1-mp01 runs on RTEMS-4.9.4-slac_p0 on top of EPICS R3.14.12-SLAC_1 $Date 2010/11/27
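
The frame #5 inspection above shows the faulting statement `prevNode.pNext = theNode.pNext;` together with a node whose `pPrev` is `0x0` while `pNext` is non-null. The following is a minimal sketch of how that combination can fault, assuming the list's `remove` follows the usual intrusive doubly-linked pattern (special-case the head and tail, otherwise dereference the neighbours). `Node`, `List`, `removeWouldFault` and `staleNodeWouldFault` are hypothetical names for illustration; this is not the EPICS `tsDLList` code.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-ins for an intrusive list node and list (hypothetical, not EPICS code).
struct Node {
    Node *pNext = nullptr;
    Node *pPrev = nullptr;
};

struct List {
    Node *pFirst = nullptr;
    Node *pLast = nullptr;
    std::size_t itemCount = 0;

    // Append at the tail.
    void add(Node &n) {
        n.pNext = nullptr;
        n.pPrev = pLast;
        if (pLast) pLast->pNext = &n; else pFirst = &n;
        pLast = &n;
        ++itemCount;
    }

    // True if remove(n) would dereference a null neighbour pointer.
    bool removeWouldFault(const Node &n) const {
        const bool readsNext = (pLast != &n);   // remove() touches *n.pNext
        const bool readsPrev = (pFirst != &n);  // remove() touches *n.pPrev
        return (readsNext && n.pNext == nullptr) ||
               (readsPrev && n.pPrev == nullptr);
    }

    // Unchecked removal: only valid if n really is a member of this list.
    void remove(Node &n) {
        if (pLast == &n) pLast = n.pPrev; else n.pNext->pPrev = n.pPrev;
        if (pFirst == &n) pFirst = n.pNext; else n.pPrev->pNext = n.pNext;
        --itemCount;
    }
};

// A node that heads some other list has pPrev == nullptr and pNext != nullptr.
// Asking an empty list to remove it takes the "not the head" branch and would
// write through the null pPrev.
bool staleNodeWouldFault() {
    List createReqPend;  // empty: the list the node is assumed to be on
    List other;          // the list that actually holds the nodes
    Node a, b;
    other.add(a);
    other.add(b);
    return createReqPend.removeWouldFault(a);
}
```

In this sketch `staleNodeWouldFault()` returns `true`: the node is not `pFirst` of the list it is being removed from, so the unchecked `remove` would go through its null `pPrev`. That is consistent with the values printed for `theNode`, though it does not by itself say how the node and the list got out of step.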

@ralphlange
Contributor Author

Murali Shankar (mshankar) wrote on 2014-02-12:

Results of thread apply all bt in a core.
backtrace_log.txt

@ralphlange ralphlange self-assigned this Mar 19, 2016
@marciodo

Is there an update on this issue?

@ralphlange
Contributor Author

I'm afraid not. I was never able to reproduce the issue, and the reporters never followed up on it.
Why? Do you experience the problem?

@marciodo

We had 3 events like this last month at SLAC. I'll try to narrow down the possible causes. So far I have identified that when the Gateway calls ca_clear_channel, the code in tcpiiu.cpp tries to remove an item from the nciu linked list, but at that point the list is already empty. So it looks like some other function hits a condition that empties this list first.

Currently, we are using Gateway R2.1.2.0 and EPICS R7.0.3.1.
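
To illustrate the scenario described in this comment, here is a minimal sketch in which a channel carries a membership tag (the gdb dump earlier in the thread shows `listMember = cs_createReqPend`) and some other code path empties the list without resetting that tag. Everything here is a hypothetical model: `Chan`, `ChanList`, `clearWithoutResettingTags` and `tagIsStale` are illustrative names, not EPICS API, and the scan is a debugging aid rather than a proposed fix.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model; names echo the gdb dump but this is not EPICS code.
enum class ListMember { none, createReqPend, connected };

struct Chan {
    Chan *pNext = nullptr;
    Chan *pPrev = nullptr;
    ListMember listMember = ListMember::none;
};

struct ChanList {
    Chan *pFirst = nullptr;
    Chan *pLast = nullptr;
    std::size_t itemCount = 0;

    // Append at the tail and record the membership tag on the channel.
    void add(Chan &c, ListMember tag) {
        c.pNext = nullptr;
        c.pPrev = pLast;
        if (pLast) pLast->pNext = &c; else pFirst = &c;
        pLast = &c;
        c.listMember = tag;
        ++itemCount;
    }

    // Linear scan, O(n): intended for debug builds or post-mortem checks only.
    bool contains(const Chan &c) const {
        for (const Chan *p = pFirst; p != nullptr; p = p->pNext) {
            if (p == &c) return true;
        }
        return false;
    }

    // Models a code path that empties the list but leaves the per-channel
    // tags and link pointers untouched.
    void clearWithoutResettingTags() {
        pFirst = nullptr;
        pLast = nullptr;
        itemCount = 0;
    }
};

// True when the tag claims membership in createReqPend but the list disagrees.
bool tagIsStale(const ChanList &createReqPend, const Chan &c) {
    return c.listMember == ListMember::createReqPend &&
           !createReqPend.contains(c);
}

bool demoStaleTag() {
    ChanList createReqPend;
    Chan a, b;
    createReqPend.add(a, ListMember::createReqPend);
    createReqPend.add(b, ListMember::createReqPend);
    createReqPend.clearWithoutResettingTags();
    // a still has pPrev == nullptr and pNext == &b, and still claims membership.
    return tagIsStale(createReqPend, a);
}
```

In this model the former head of the list ends up with a null `pPrev`, a non-null `pNext` and a tag that still points at the emptied list, so a removal dispatched on the tag alone would fault. Whether the real crash arises this way is an open question in this thread; the sketch only shows that the observed state is reachable under that assumption.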

@anjohnson
Member

Hi Márcio, Murali's original report from 2014 said the IOC involved was running on RTEMS 4.9.4. Does this crash also happen with IOCs running on other OSs?

There are over 1000 threads in the full back-trace attached above, so the gateway was connected to over 500 IOCs at the time. I don't see any obvious smoking guns in that, but I wasn't really expecting to.

@anjohnson
Member

We've just seen this issue on our most heavily loaded gateway, although I believe it was running an older version of the GW code (2.0.something) and probably Base 3.15.5. I may still have access to the core file, but I don't have time to investigate it in detail right now, so I'm leaving this comment as a marker.

Core was generated by `/home/helios/GATEWAY/gateway/ctlapps4-vm/pvgatemain1 -log gateway.log -putlog g'.
Program terminated with signal 6, Aborted.
#0  0x00007f4c50eab387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 readline-6.2-11.el7.x86_64
(gdb) where
#0  0x00007f4c50eab387 in raise () from /lib64/libc.so.6
#1  0x00007f4c50eaca78 in abort () from /lib64/libc.so.6
#2  0x000000000040b903 in sig_end (sig=11) at ../gateway.cc:311
#3  <signal handler called>
#4  remove (item=..., this=0x7f4c446fb570) at ../../../include/tsDLList.h:238
#5  tcpiiu::uninstallChan (this=0x7f4c446fb310, guard=..., chan=...) at ../tcpiiu.cpp:1978
#6  0x00000000004713f9 in nciu::destroy (this=0x7f4b89d72030, callbackGuard=...,
    mutualExcusionGuard=...) at ../nciu.cpp:95
#7  0x0000000000466839 in oldChannelNotify::destructor (this=0x37db7e0, cbGuard=...,
    mutexGuard=...) at ../oldChannelNotify.cpp:72
#8  0x000000000045dbf6 in ca_clear_channel (pChan=0x37db7e0) at ../access.cpp:391
#9  0x0000000000411726 in gatePvData::~gatePvData (this=0x1b67d40,
    __in_chrg=<optimized out>) at ../gatePv.cc:240
#10 0x0000000000418473 in destroy (this=0x34ea780) at ../gateServer.h:69
#11 gateServer::inactiveDeadCleanup (this=0xaf28c0) at ../gateServer.cc:1500
#12 0x0000000000418903 in gateServer::mainLoop (this=0xaf28c0) at ../gateServer.cc:296
#13 0x000000000040c219 in startEverything (prefix=0x7fff2774e10b "GW:401:MAIN1")
    at ../gateway.cc:685
#14 0x000000000040e317 in main (argc=20, argv=0x7fff2774dbb8) at ../gateway.cc:1349

@anjohnson
Member

Apparently it was running Gateway version 2.0 built against base-3.14.12.5-static (on RHEL-6 or RHEL-7). We've bumped it to a newer version since.
