Version
We are using the ACE ORB supported by OCI, https://theaceorb.com/, which is TAO 2.2. But looking at the code here on GitHub, it appears to be the same as the code on the master branch here, i.e. TAO 3.1.
Host machine and operating system
Built on an AlmaLinux 9.2 Docker image, run on Red Hat Enterprise Linux release 9.3
Compiler name and version (including patch level)
g++ (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
The $ACE_ROOT/ace/config.h file
The $ACE_ROOT/include/makeinclude/platform_macros.GNU file
exceptions=1
optimize=0
threads=1
ssl=1
zlib=1
include $(ACE_ROOT)/include/makeinclude/platform_linux.GNU
buildbits = 64
Contents of $ACE_ROOT/bin/MakeProjectCreator/config/default.features
ssl=1
zlib = 1
Our application acts as a server for various other programs, but it also delegates a few functions to another service. In that interaction, the application acts as a client towards the other service, and multiple threads can perform requests concurrently. The ORBs use an svc config file.
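As an illustration only (a sketch with placeholder key and certificate paths, not our exact file), a client-side svc.conf that loads the SSLIOP protocol factory and selects the exclusive transport mux strategy looks roughly like this:

# Illustrative svc.conf sketch, not the actual configuration:
# load the SSLIOP protocol factory (placeholder key/cert paths)...
dynamic SSLIOP_Factory Service_Object * TAO_SSLIOP:_make_TAO_SSLIOP_Protocol_Factory() "-SSLAuthenticate SERVER_AND_CLIENT -SSLPrivateKey PEM:client_key.pem -SSLCertificate PEM:client_cert.pem"
# ...make it the default protocol factory...
static Resource_Factory "-ORBProtocolFactory SSLIOP_Factory"
# ...and select the transport mux strategy (EXCLUSIVE or MUXED).
static Client_Strategy_Factory "-ORBTransportMuxStrategy EXCLUSIVE"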
We've seen a few threads remain stuck waiting for replies, together with errors like this in the logs (the numbers are mismatched GIOP request ids):
Exclusive_TMS::dispatch_reply - <2 != 1>
I've raised the log levels and investigated the issue. When the SSLIOP connector creates a new Transport, it is added to the Transport Cache Manager with the state ENTRY_IDLE_AND_PURGABLE. Under heavy load, a different thread can fetch the same Transport from the cache while the first thread is still using it to send a request.
There are a few consequences:
- When using the exclusive mux strategy, the code discards one of the responses, because the situation violates the constraint that there can be only one pending operation at a time. This is what causes our threads to remain stuck.
- When using the muxed strategy, replies are not discarded, but we see a few SSL errors in the logs, because the effect is not just interleaved requests and responses: requests can be sent concurrently on the same transport.
Analyzing the code, I found a big difference between how IIOP and SSLIOP connections are handled.
Both TAO_IIOP_Connector and TAO::SSLIOP::Connector inherit from TAO_Connector. In the former case, transport creation goes through the base class method TAO_Connector::connect, which wraps all of its work in an infinite while loop. If an idle transport is found in the cache, it is returned. When a new Transport is created, it is not returned to the caller; instead it is cached, the loop runs once again, and hopefully fetches the transport from the cache before a different thread snatches it.
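As a rough sketch of that pattern (paraphrased and simplified, not the verbatim TAO source; names and signatures are approximate):

while (true)
  {
    // Try to claim an idle transport from the cache. On a hit the
    // cache manager flips the entry to ENTRY_BUSY under its lock,
    // so only one thread can win a given transport.
    if (transport_cache.find_transport (desc, transport) == 0)
      return transport;

    // Nothing claimable was cached: open a new connection.
    // make_connection inserts the new transport with the default
    // state ENTRY_IDLE_AND_PURGABLE instead of returning it...
    this->make_connection (resolver, *desc, timeout);

    // ...so the loop retries the lookup, hoping to claim the new
    // entry before a different thread snatches it.
  }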
The Transport is created by TAO_IIOP_Connector::make_connection, which adds the transport to the cache manager using the default state, ENTRY_IDLE_AND_PURGABLE (a default parameter). The state of the transport is changed to ENTRY_BUSY later, by the method Transport_Cache_Manager_T::find_transport, which ends up calling find_i.
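For reference, the caching call has roughly this shape (paraphrased from Transport_Cache_Manager_T; treat the exact parameter types as approximate):

// Paraphrased declaration, not the exact TAO code: the third
// parameter defaults to ENTRY_IDLE_AND_PURGABLE, so any caller that
// omits it publishes the transport as immediately claimable.
int cache_transport (transport_descriptor_type *prop,
                     transport_type *transport,
                     TAO::Cache_Entries_State state =
                       TAO::ENTRY_IDLE_AND_PURGABLE);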
On the other hand, TAO::SSLIOP::Connector overrides the connect method (SSLIOP_Connector.cpp line 83), which internally calls ssliop_connect (line 190). Neither connect nor ssliop_connect has the loop of the base class. Instead, the transport is cached with the default state ENTRY_IDLE_AND_PURGABLE, and then returned and used as is:
int retval =
  this->orb_core ()->
    lane_resources ().transport_cache ().cache_transport (desc,
                                                          transport);
With this pattern, it is possible that a different thread finds the same transport in the cache, in the idle state, and starts using it concurrently.
I tried a quick patch, adding the state ENTRY_BUSY as a third parameter to the cache_transport invocation in SSLIOP_Connector.cpp line 713, and our problem seems to be solved. Everything I saw hints at a bug, but the code is 13 years old, so maybe there's something wrong in our configuration instead.
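Concretely, the quick patch changes the call above to pass the state explicitly, along these lines:

// Sketch of the patched call in ssliop_connect: cache the freshly
// created transport as ENTRY_BUSY so no other thread can claim it
// while this thread is still using it to send the request.
int retval =
  this->orb_core ()->
    lane_resources ().transport_cache ().cache_transport (desc,
                                                          transport,
                                                          TAO::ENTRY_BUSY);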