This repository has been archived by the owner on Jan 31, 2022. It is now read-only.

The ipbus daemon never closes the TCP connections #126

Closed
1 of 2 tasks
lpetre-ulb opened this issue Jul 23, 2019 · 4 comments

Comments

@lpetre-ulb
Contributor

Brief summary of issue

After killing the ipbus daemon on the CTP7, the program cannot be restarted immediately; one has to wait some time. During that period the bind syscall fails with the "Address already in use" (EADDRINUSE) error. The netstat tool shows TCP connections stuck in the LAST_ACK state which progressively time out.
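As an aside, the restart symptom by itself is the usual bind()/EADDRINUSE behaviour while old connections are being torn down. Below is a minimal sketch of the standard mitigation, assuming the daemon sets up its listening socket in the conventional way (the actual socket setup code is not shown in this issue, and this does not address the underlying connection leak):

#include <sys/socket.h>

int listenfd = socket(AF_INET, SOCK_STREAM, 0);
int reuse = 1;
// Let bind() succeed immediately after a daemon restart, even while old
// connections linger in LAST_ACK/TIME_WAIT.
setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));
// ... followed by the usual bind() and listen() calls ...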

Types of issue

  • [x] Bug report (report an issue with the code)
  • [ ] Feature request (request for change which adds functionality)

Expected Behavior

The ipbus daemon should not keep TCP connections open once the client disconnects.

Current Behavior

While the daemon is running, closed TCP connections remain stuck in the CLOSE_WAIT state, and the ipbus daemon consumes one full CTP7 CPU core as soon as the first TCP connection is closed by the client.

Steps to Reproduce (for bugs)

  1. Start the ipbus daemon on the CTP7
  2. Connect to and disconnect from the CTP7 IPBus endpoint (using testConnectivity.py for example)
  3. Look at the CTP7 CPU usage (e.g. using top)
  4. Look at the non-closed TCP connections on the CTP7 (e.g. netstat -an; when writing this issue, 377 such connections are open on eagle23, which is used for QC7)

Possible Solution (for bugs)

When the recv call returns 0, the peer has closed the connection and the socket must be closed. The current code never handles that case, which would explain both the lingering CLOSE_WAIT connections and the busy-looping CPU core:

ssize_t readcount = recv(this->fd, buf, 128, MSG_DONTWAIT);
if (readcount < 0 && errno != EAGAIN)
    return false; // Error or disconnect.
if (readcount)
    this->ibuf += std::string(buf, readcount);
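
A minimal sketch of the suggested handling, assuming the same member names (fd, ibuf) as in the snippet above and that returning false makes the caller close and drop the connection:

char buf[128];
ssize_t readcount = recv(this->fd, buf, sizeof(buf), MSG_DONTWAIT);
if (readcount < 0 && errno != EAGAIN)
    return false; // Genuine error: drop the connection.
if (readcount == 0)
    return false; // Peer closed the connection: drop it as well so the socket gets closed.
if (readcount > 0)
    this->ibuf += std::string(buf, readcount);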

Also @mexanick, I was wondering why you implemented this commit? If it was because connections were impossible once the maximum number of clients was reached, fixing this issue should have the same effect.

Your Environment

@jsturdy
Contributor

jsturdy commented Jul 23, 2019

Also @mexanick, I was wondering why you implemented this commit? If it was because connections were impossible once the maximum number of clients was reached, fixing this issue should have the same effect.

I would tend to agree, as @lpetre-ulb is going to be submitting a PR to the upstream, we should try to have the version we are using as vanilla as possible, especially if the patch put in upstream solves these (old) issues.

One question/request for @lpetre-ulb: can you do a quick investigation of a uhal call that sends multiple (~10-100) transactions within a single dispatch?
I have seen that when more than 5-10 requests are bundled together, the CTP7 IPBus server seems to choke. If you like, I can probably dig up the script I was using to test this.
Jes had suggested doing some network traffic packet analysis, which is probably similar to how you investigated this issue, and if possible/simple, it would be good to get a fix for that included in any upstream PR.

@lpetre-ulb
Contributor Author

One question/request for @lpetre-ulb: can you do a quick investigation of a uhal call that sends multiple (~10-100) transactions within a single dispatch?
I have seen that when more than 5-10 requests are bundled together, the CTP7 IPBus server seems to choke. If you like, I can probably dig up the script I was using to test this.
Jes had suggested doing some network traffic packet analysis, which is probably similar to how you investigated this issue, and if possible/simple, it would be good to get a fix for that included in any upstream PR.

Yes, of course. Is there a specific metric I should look at? Transactions/second, words/second, packets/second, latency, ...? Also, does the type of transaction matter?

If the ipbus daemon is overloaded with dead TCP connections and if the IPBus packet is bigger than the recv buffer (128 bytes ~ 15 single read/write transactions), it is possible that the latency would be greatly increased.

@jsturdy
Contributor

jsturdy commented Jul 25, 2019

One question/request for @lpetre-ulb: can you do a quick investigation of a uhal call that sends multiple (~10-100) transactions within a single dispatch?
I have seen that when more than 5-10 requests are bundled together, the CTP7 IPBus server seems to choke. If you like, I can probably dig up the script I was using to test this.
Jes had suggested doing some network traffic packet analysis, which is probably similar to how you investigated this issue, and if possible/simple, it would be good to get a fix for that included in any upstream PR.

Yes, of course. Is there a specific metric I should look at? Transactions/second, words/second, packets/second, latency, ...? Also, does the type of transaction matter?

If the ipbus daemon is overloaded with dead TCP connections and if the IPBus packet is bigger than the recv buffer (128 bytes ~ 15 single read/write transactions), it is possible that the latency would be greatly increased.

I don't remember if I tried playing with the recv buffer, because that definitely sounds like the type of limitation that was being hit...
The main thing is that I could bundle 1000s of transactions into a single dispatch() when communicating with a GLIB, but on a CTP7, this always errored when the number of transactions was somewhere between 5 and 20.

The most basic test would be something like:

import uhal

# set up hwdevice as a uhal HwInterface beforehand (e.g. via uhal.ConnectionManager or uhal.getDevice)

# list of register names to read, where the length of the list is more than 10
reglist = ["REG_A", "REG_B", "REG_C"]  # placeholder node names
regvals = [hwdevice.getNode(r).read() for r in reglist]
hwdevice.dispatch()
for k, r in zip(reglist, regvals):
    print("{:s}: 0x{:08x}".format(k, r.value()))

You can extend this to mix reads and writes in the same dispatch:

import random

# reuse reglist from above (any list of more than 10 register names)
wvals = [random.randint(0x0, 0xffffffff) for r in reglist]
regvals = []
for w, r in zip(wvals, reglist):
    hwdevice.getNode(r).write(w)
    regvals.append(hwdevice.getNode(r).read())
hwdevice.dispatch()
for k, r, w in zip(reglist, regvals, wvals):
    print("{:s}: 0x{:08x} (expected 0x{:08x})".format(k, r.value(), w))

This could even be done for just a single register, by replacing reglist with a single register that you then read/write multiple times in succession.
This is what the scripts I was using (back during the slice test to debug a specific FW issue) were doing, and they can now be (temporarily) found here (uhal) and here (rwreg).
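
For reference, a rough sketch of that single-register variant via the uhal C++ API; the connection file, device id, and register name below are placeholders, not taken from this repository:

#include <cstdint>
#include <iostream>
#include <vector>
#include "uhal/uhal.hpp"

int main()
{
    uhal::ConnectionManager manager("file://connections.xml");  // placeholder connection file
    uhal::HwInterface hw = manager.getDevice("my.device");      // placeholder device id

    // Queue many transactions on a single register, then send them in one dispatch().
    std::vector< uhal::ValWord<uint32_t> > results;
    for (int i = 0; i < 50; ++i)
        results.push_back(hw.getNode("SOME.REG").read());       // placeholder register name
    hw.dispatch();

    for (unsigned int i = 0; i < results.size(); ++i)
        std::cout << "read " << i << ": 0x" << std::hex << results[i].value() << std::endl;
    return 0;
}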

I don't want you to put too much (of your valuable) time into understanding the performance differences between uhal and native memsvc reads/writes, as this ship, I think, has sailed... Though I am still interested, so if you can find out why the "multiple dispatch" doesn't work, I may try to continue some performance testing on my own, or create a task for a newcomer.

@lpetre-ulb
Contributor Author

The development branch moved to a templated, RPC-only solution.
