Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP server implementation #46

Merged
merged 4 commits into from
Apr 10, 2021
Merged

Conversation

fmauch
Copy link
Collaborator

@fmauch fmauch commented Mar 14, 2021

This implements an event-based tcp server for easier use. This could be a solution for #45

In this commit I only implemented it for the ScriptSender, but it could also
be used (probably with modifications) inside the ReverseInterface.

@urmahp
Copy link
Collaborator

urmahp commented Mar 15, 2021

This implements an event-based tcp server for easier use. This could be a solution for #45

In general I think this is a really good solution and I think the event-based tcp is a good approach. I have tested that it works with the ScriptSender, it works very well and it solves #45.

In this commit I only implemented it for the ScriptSender, but it could also be used (probably with modifications) inside the ReverseInterface.

As you mention I also think that it should be used inside the ReverseInterface. I could look into implementing it in the reverse interface if needed?

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 15, 2021

As you mention I also think that it should be used inside the ReverseInterface. I could look into implementing it in the reverse interface if needed?

I went ahead and implemented it, but you could test it a bit more, if you'd like.

Currently, stopping the ROS driver with CTRL-C results in a RTDE pipeline producer overflow that I still need to hunt down

@urmahp
Copy link
Collaborator

urmahp commented Mar 15, 2021

I went ahead and implemented it, but you could test it a bit more, if you'd like.

I will test that it works with the reverse interface as well.

Currently, stopping the ROS driver with CTRL-C results in a RTDE pipeline producer overflow that I still need to hunt down

It is related to the pipeline running in the RTDEClient stopping the pipeline, when the object is destroyed, should fix that issue. Below example could be a solution:

RTDEClient::~RTDEClient()
{
  pipeline_.stop();
}

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 15, 2021

It is related to the pipeline running in the RTDEClient stopping the pipeline, when the object is destroyed, should fix that issue. Below example could be a solution:

RTDEClient::~RTDEClient()
{
  pipeline_.stop();
}

Yep, that does the trick :-)

@fmauch fmauch marked this pull request as ready for review March 15, 2021 13:37
@fmauch fmauch changed the title WIP: tcp server implementation TCP server implementation Mar 15, 2021
@fmauch
Copy link
Collaborator Author

fmauch commented Mar 15, 2021

@urmahp I'm done with implementing this ATM, please go ahead and test it as much as you can :-)

Before merging, I'd like to also refactor the tcp_client a bit more so it is more understandable and introduce unit tests for the server and client. Let me know, if you'd like to participate in one of these.

@urmahp
Copy link
Collaborator

urmahp commented Mar 15, 2021

@urmahp I'm done with implementing this ATM, please go ahead and test it as much as you can :-)

I will be happy to test it.

Before merging, I'd like to also refactor the tcp_client a bit more so it is more understandable and introduce unit tests for the server and client. Let me know, if you'd like to participate in one of these.

I can work on some unit test for the server and client, while testing it.

src/comm/tcp_server.cpp Outdated Show resolved Hide resolved
src/comm/tcp_server.cpp Outdated Show resolved Hide resolved
@fmauch fmauch self-assigned this Mar 16, 2021
@urmahp
Copy link
Collaborator

urmahp commented Mar 16, 2021

I have tested that it works with both the ROS driver and the Isaac driver and overall it is working very well.

However I see one issue once in a while, it is not occuring every time, only once in a while.
When stopping the urcap program running on the robot, it seems like the client is not correctly disconnected from the server, because we keep sending data, but there is no client connected. Also the handle_program_state function isn't called.
When playing the program again I get the error that the client is already connected, so that also points towards the client not being disconnected correctly. I have inserted the log output below:

INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 73: New client connected at FD 12. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 85: Robot requested program 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 123: Sent program to robot 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 78: Client at FD 12 disconnected. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/reverse_interface.h 155: Robot connected to reverse interface. Ready to receive control commands. 
Program running: true

ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 305: Sending data through socket failed. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/reverse_interface.h 168: Connection to reverse interface dropped. 
Program running: false

INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 73: New client connected at FD 12. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 85: Robot requested program 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 123: Sent program to robot 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 78: Client at FD 12 disconnected. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/reverse_interface.h 155: Robot connected to reverse interface. Ready to receive control commands. 
Program running: true

ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 267: recv() failed 
ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 305: Sending data through socket failed. 
ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 305: Sending data through socket failed. 
ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 305: Sending data through socket failed. 

Unfortunately I haven't been able to track down the issue completely, but I believe it might have something to do with that the disconnect callback functions isn't called, because the server maybe is in a specific state at the point of disconnect. The error message ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 267: recv() failed is only printed once the error occurs.

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 16, 2021

I'll try to look into it tonight.

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 16, 2021

Hm, I once hit

[ INFO] [1615925511.466963882]: Client at FD 23 disconnected.
[ INFO] [1615925511.491965322]: Robot connected to reverse interface. Ready to receive control commands.
[ WARN] [1615925514.072904746]: Connection attempt on port 50001 while maximum number of clients (1) is already connected. Closing connection.
[ERROR] [1615925514.225187393]: recv() failed
[ERROR] [1615925514.225742741]: Sending data through socket failed.
[ERROR] [1615925514.233970460]: Sending data through socket failed.
[... more of those]

I finally found a way to reproduce this:

  1. Play the program on the TP -> Wireshark shows me that the machines are exchanging [PSH, ACK] messages
  2. Pause the program on the TP -> Communication is stopped on a TCP level, where the robot sends a last [ACK] package.
  3. Stop the program on the TP -> ´recv() failed` on the reverse interface. -> This will close the socket connection.
    Wireshark shows me that the robot sends a [RST, ACK] package on tcp level.
  4. Play the program on the TP -> Sending data fails. The connection attempt from the robot is declined (closed again immediately) leaving us with no active socket on both ends. I guess, the script hangs in this line as we didn't see the log message "ExternalControl: External control active" on the TP)

Open questions as for now:

  1. Is this different with the current master branch (I assume, the answer is "yes")
    Edit: -> Yes. The master branch works fine. On a TCP package level things behave the same (Why wouldn't they?) and the connection gets closed properly when pressing "stop" on the TP.
  2. Why is it that way?
    Edit2: -> This line returns false, as UrServer.readLine() fails due to ::read fails here. All this will end [this loop] reconstructing the complete ReverseInterface. I was quite happy to get rid of this loop, destruct, reset structure... Anyway, in my new implementation we do close the socket in case of a read failure, but we don't call the disconnectCallback here. Therefore, the ReverseInterface doesn't know it's not connected anymore.
  3. Where does that [RST, ACK] message come from?
    Edit2: -> Honestly I don't care, as it should be working with that, anyway.
    Edit3: -> Actually, I do care, as this produces a read failure on our side of the socket.
  4. How do we handle this?
    Edit3: -> 341b34f should fix it. However, in that case we still get an error output that reading from the socket failed which is related to question 3 above.
    Edit 4: -> By excluding errno ECONNRESET we can suppress the output in that case.

@urmahp
Copy link
Collaborator

urmahp commented Mar 17, 2021

Thank you very much @fmauch for looking into the issue and fixing it.

I apologize that I haven't time to look into the unit test yet, but I can look into it today if needed. What did you have in mind with the unit test?
Was it something similar to the test_rtde_client test, testing that a UrDriver object can be created and testing that connection can be established between a robot and the client library or was it to test the tcp_server class you have implemented?

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 17, 2021

I apologize that I haven't time to look into the unit test yet, but I can look into it today if needed. What did you have in mind with the unit test?

Nothing to apologize.
I'd like to test the complete server structure. Do connects / disconnects get handled correctly? Do the correct callbacks get called? Are messages correctly transmitted? Are all error cases handled properly?

Additionally, the current keepalive implementation is not really satisfying. If we connect to the reverse interface, it will be happy as long as we don't disconnect. We should check whether we actually receive a keepalive signal. We should get a keepalive signal (basically any message currently) from the robot before we send the next message. I'd like to implement this using a unit test that should trigger a disconnect event when the keepalive signal is missing. @urmahp would you like to implement this?

@urmahp
Copy link
Collaborator

urmahp commented Mar 22, 2021

I will start developing some unit test to the server now, I will create a pr to your branch once I have some test. I will look into the keepalive implementation as well.

@fmauch fmauch linked an issue Mar 24, 2021 that may be closed by this pull request
@fmauch
Copy link
Collaborator Author

fmauch commented Mar 24, 2021

@urmahp I'll have some time to work on this today and tomorrow. If you didn't start with the keepalive implementation, I'll start doing that.

@urmahp
Copy link
Collaborator

urmahp commented Mar 24, 2021

@urmahp I'll have some time to work on this today and tomorrow. If you didn't start with the keepalive implementation, I'll start doing that.

I haven't started the keepalive implementation.

I have added a PR to your branch on some unit test for the tcp server.

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 24, 2021

I've created a draft implementation of the keepalive handling here. Let's keep discussions on this there.

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 25, 2021

@urmahp If you could take another review over the PR that would be really nice.

@urmahp
Copy link
Collaborator

urmahp commented Mar 25, 2021

I will test how it is working and I will also look into the sigpipe error which made one of the tcp server test fail on this build.

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 25, 2021

I will test how it is working and I will also look into the sigpipe error which made one of the tcp server test fail on this build.

I think, I found the error:

When running the test standalone with output in rare cases (I needed to run it up to 2500 times to trigger) I get this failure:

[ RUN      ] TCPServerTest.client_connections
/home/mauch/robot_folders/checkout/ur_robot_driver_split/catkin_ws/src/Universal_Robots_Client_Library/tests/test_tcp_server.cpp:319: Failure
Value of: server.write(client1_fd, data, len, written)
  Actual: true
Expected: false

I think, that this is the problematic code block:

void TCPServer::handleDisconnect(const int fd)
{
  URCL_LOG_DEBUG("%d disconnected.", fd);
  close(fd);
  if (disconnect_callback_)
  {
    disconnect_callback_(fd);
  }
  FD_CLR(fd, &masterfds_);

 [...]
}

Calling the disconnect_callback before closing the socket might lead to a situation where an application thinks that the socket is closed when it is actually not. Therefore, this check returns true, while it should return false. The actual send command seems to be happily feeding things in and returning success, but this actually results in a SIGPIPE event.

I'll leave the test looping over night and it should not have failed tomorrow morning.

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 26, 2021

95000 successful tests later I would say that this was indeed the problem.

README.md Outdated Show resolved Hide resolved
@urmahp
Copy link
Collaborator

urmahp commented Mar 26, 2021

95000 successful tests later I would say that this was indeed the problem.

Nice thanks for solving the issue.

I have tested that it works, and it is working very well 😃

I believe these three variable declaration could be deleted as they aren't used anymore in the UrDriver.

I think that when merging this line should be removed from the ROS driver, else it would log a warning every control cycle, because the ROS driver is still using the script from the resource folder in the ROS driver.

@fmauch
Copy link
Collaborator Author

fmauch commented Mar 26, 2021

95000 successful tests later I would say that this was indeed the problem.

Nice thanks for solving the issue.

It was me introducing the issue, anyway ;-) Thanks for writing the tests that showed that :-)

I believe these three variable declaration could be deleted as they aren't used anymore in the UrDriver.

Addressed in 587a584

I think that when merging this line should be removed from the ROS driver, else it would log a warning every control cycle, because the ROS driver is still using the script from the resource folder in the ROS driver.

Addressed in UniversalRobots/Universal_Robots_ROS_Driver#346

fmauch and others added 4 commits March 26, 2021 11:10
This implements an event-based tcp server for easier use. It is
decoupled from the TCPSocket class to remove one layer of complexity.

Also, the event-based approach allows interrupting this server as it
won't be blocking inside an accept or recv call if not applicable.
@fmauch
Copy link
Collaborator Author

fmauch commented Mar 26, 2021

I cleaned up the branch, from my side this is ready to merge.

@urmahp
Copy link
Collaborator

urmahp commented Mar 26, 2021

I cleaned up the branch, from my side this is ready to merge.

I agree.

@fmauch fmauch merged commit 7437f6e into UniversalRobots:master Apr 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Terminate called without an active exception
2 participants