This repository has been archived by the owner on Jan 31, 2022. It is now read-only.

The ipbus daemon never closes the TCP connections #126

Closed
1 of 2 tasks
lpetre-ulb opened this issue Jul 23, 2019 · 4 comments

Comments

@lpetre-ulb
Contributor

Brief summary of issue

After killing the ipbus daemon on the CTP7, the program cannot be restarted immediately; one has to wait some time. During that period the bind syscall fails with the "Address already in use" (EADDRINUSE) error. The netstat tool shows TCP connections stuck in the LAST_ACK state which progressively time out.
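As an aside, the restart symptom by itself is the usual bind()/EADDRINUSE behaviour while old connections are being torn down. Below is a minimal sketch of the standard mitigation, assuming the daemon sets up its listening socket in the conventional way (the actual socket setup code is not shown in this issue, and this does not address the underlying connection leak):

#include <sys/socket.h>

int listenfd = socket(AF_INET, SOCK_STREAM, 0);
int reuse = 1;
// Let bind() succeed immediately after a daemon restart, even while old
// connections linger in LAST_ACK/TIME_WAIT.
setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));
// ... followed by the usual bind() and listen() calls ...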

Types of issue

  • [x] Bug report (report an issue with the code)
  • [ ] Feature request (request for change which adds functionality)

Expected Behavior

The ipbus daemon should not keep TCP connections open once the client disconnects.

Current Behavior

While the daemon is running, closed TCP connections remain stuck in the CLOSE_WAIT state, and the ipbus daemon consumes one full CTP7 CPU core as soon as the first TCP connection is closed by the client.

Steps to Reproduce (for bugs)

  1. Start the ipbus daemon on the CTP7
  2. Connect to and disconnect from the CTP7 IPBus endpoint (using testConnectivity.py for example)
  3. Look at the CTP7 CPU usage (e.g. using top)
  4. Look at the non-closed TCP connections on the CTP7 (e.g. netstat -an; when writing this issue, 377 such connections are open on eagle23, which is used for QC7)

Possible Solution (for bugs)

When the recv call returns 0, the peer has closed the connection and the socket must be closed. The current code never handles that case, which would explain both the lingering CLOSE_WAIT connections and the busy-looping CPU core:

ssize_t readcount = recv(this->fd, buf, 128, MSG_DONTWAIT);
if (readcount < 0 && errno != EAGAIN)
    return false; // Error or disconnect.
if (readcount)
    this->ibuf += std::string(buf, readcount);
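
A minimal sketch of the suggested handling, assuming the same member names (fd, ibuf) as in the snippet above and that returning false makes the caller close and drop the connection:

char buf[128];
ssize_t readcount = recv(this->fd, buf, sizeof(buf), MSG_DONTWAIT);
if (readcount < 0 && errno != EAGAIN)
    return false; // Genuine error: drop the connection.
if (readcount == 0)
    return false; // Peer closed the connection: drop it as well so the socket gets closed.
if (readcount > 0)
    this->ibuf += std::string(buf, readcount);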

Also @mexanick, I was wondering why you implemented this commit? If it was because connections were impossible once the maximum number of clients was reached, fixing this issue should have the same effect.

Your Environment

@jsturdy
Contributor

jsturdy commented Jul 23, 2019

Also @mexanick, I was wondering why you implemented this commit? If it was because connections were impossible once the maximum number of clients was reached, fixing this issue should have the same effect.

I would tend to agree, as @lpetre-ulb is going to be submitting a PR to the upstream, we should try to have the version we are using as vanilla as possible, especially if the patch put in upstream solves these (old) issues.

One question/request for @lpetre-ulb: can you do a quick investigation of a uhal call that sends multiple (~10-100) transactions within a single dispatch?
I have seen that when more than 5-10 requests are bundled together, the CTP7 IPBus server seems to choke. If you like, I can probably dig up the script I was using to test this.
Jes had suggested doing some network traffic packet analysis, which is probably similar to how you investigated this issue, and if possible/simple, it would be good to get a fix for that included in any upstream PR.

@lpetre-ulb
Contributor Author

One question/request for @lpetre-ulb: can you do a quick investigation of a uhal call that sends multiple (~10-100) transactions within a single dispatch?
I have seen that when more than 5-10 requests are bundled together, the CTP7 IPBus server seems to choke. If you like, I can probably dig up the script I was using to test this.
Jes had suggested doing some network traffic packet analysis, which is probably similar to how you investigated this issue, and if possible/simple, it would be good to get a fix for that included in any upstream PR.

Yes, of course. Is there a specific metric I should look at? Transactions/second, words/second, packets/second, latency, ...? Also, does the type of transaction matter?

If the ipbus daemon is overloaded with dead TCP connections and if the IPBus packet is bigger than the recv buffer (128 bytes ~ 15 single read/write transactions), it is possible that the latency would be greatly increased.

@jsturdy
Contributor

jsturdy commented Jul 25, 2019

One question/request for @lpetre-ulb: can you do a quick investigation of a uhal call that sends multiple (~10-100) transactions within a single dispatch?
I have seen that when more than 5-10 requests are bundled together, the CTP7 IPBus server seems to choke. If you like, I can probably dig up the script I was using to test this.
Jes had suggested doing some network traffic packet analysis, which is probably similar to how you investigated this issue, and if possible/simple, it would be good to get a fix for that included in any upstream PR.

Yes, of course. Is there a specific metric I should look at? Transactions/second, words/second, packets/second, latency, ...? Also, does the type of transaction matter?

If the ipbus daemon is overloaded with dead TCP connections and if the IPBus packet is bigger than the recv buffer (128 bytes ~ 15 single read/write transactions), it is possible that the latency would be greatly increased.

I don't remember if I tried playing with the recv buffer, because that definitely sounds like the type of limitation that was being hit...
The main thing is that I could bundle 1000s of transactions into a single dispatch() when communicating with a GLIB, but on a CTP7, this always errored when the number of transactions was somewhere between 5 and 20.

The most basic test would be something like:

import uhal

# set up hwdevice as a uhal HwInterface beforehand (e.g. via uhal.ConnectionManager or uhal.getDevice)

# list of register names to read, where the length of the list is more than 10
reglist = ["REG_A", "REG_B", "REG_C"]  # placeholder node names
regvals = [hwdevice.getNode(r).read() for r in reglist]
hwdevice.dispatch()
for k, r in zip(reglist, regvals):
    print("{:s}: 0x{:08x}".format(k, r.value()))

You can extend this to mix reads and writes in the same dispatch:

import random

# reuse reglist from above (any list of more than 10 register names)
wvals = [random.randint(0x0, 0xffffffff) for r in reglist]
regvals = []
for w, r in zip(wvals, reglist):
    hwdevice.getNode(r).write(w)
    regvals.append(hwdevice.getNode(r).read())
hwdevice.dispatch()
for k, r, w in zip(reglist, regvals, wvals):
    print("{:s}: 0x{:08x} (expected 0x{:08x})".format(k, r.value(), w))

This could even be done for just a single register, by replacing reglist with a single register that you then read/write multiple times in succession.
This is what the scripts I was using (back during the slice test to debug a specific FW issue) were doing, and they can now be (temporarily) found here (uhal) and here (rwreg).
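
For reference, a rough sketch of that single-register variant via the uhal C++ API; the connection file, device id, and register name below are placeholders, not taken from this repository:

#include <cstdint>
#include <iostream>
#include <vector>
#include "uhal/uhal.hpp"

int main()
{
    uhal::ConnectionManager manager("file://connections.xml");  // placeholder connection file
    uhal::HwInterface hw = manager.getDevice("my.device");      // placeholder device id

    // Queue many transactions on a single register, then send them in one dispatch().
    std::vector< uhal::ValWord<uint32_t> > results;
    for (int i = 0; i < 50; ++i)
        results.push_back(hw.getNode("SOME.REG").read());       // placeholder register name
    hw.dispatch();

    for (unsigned int i = 0; i < results.size(); ++i)
        std::cout << "read " << i << ": 0x" << std::hex << results[i].value() << std::endl;
    return 0;
}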

I don't want you to put too much (of your valuable) time into understanding the performance differences between uhal and native memsvc reads/writes, as this ship, I think, has sailed... Though I am still interested, so if you can find out why the "multiple dispatch" doesn't work, I may try to continue some performance testing on my own, or create a task for a newcomer.

@lpetre-ulb
Contributor Author

The development branch moved to a templated, RPC-only solution.
