Crashing with iceoryx sharedmemory enabled. #1026
@jwrl7 I have to take a closer look at the error logs, but the first one appears to be running out of chunks, even though you only need very small chunks of a little more than 64 bytes.
We need a little more than 64 bytes because of additional internal information stored in each chunk (it is unfortunate that this is not transparent; there will be improvements in this regard). The second error log (a separate run, I assume) looks like the consequence of RouDi terminating, but I would need more details. The applications then fail to communicate with RouDi
and terminate as well. To my knowledge the internal socket communication with RouDi breaks down and this is the result. RouDi is therefore a single point of failure, as it controls the shared memory communication. It is up for debate what should happen when RouDi terminates, but as it is now all other applications relying on it are compromised and will output errors or terminate. So you can optimize your memory config by adding a mempool entry along the lines of the sketch below
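For example, a minimal sketch of such an entry in the RouDi TOML config (the file name and the surrounding layout are only assumed here to follow the standard iceoryx example config; the size of 128 and the chunk count are the numbers discussed below):

```toml
# Sketch of a RouDi config (e.g. roudi_config.toml); layout follows the
# standard iceoryx example config. Adjust the count to your system.
[general]
version = 1

[[segment]]

[[segment.mempool]]
size = 128       # chunk size in bytes, must be divisible by 32
count = 1000000  # number of chunks of this size
```

The file is then handed to RouDi at startup (with the standalone iox-roudi binary this should be its config-file option, if I remember correctly).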
That gives you 1000000 chunks (set this number according to your system) of size 128 at your disposal, which is enough to store 64 payload bytes plus the hidden extra information. Note that the size should be divisible by 32 for technical reasons. I think you only changed the size, but what is more important if you run out of chunks is to increase the corresponding chunk count. Note that the smallest chunk into which the data fits will always be used, and if there is none you run into the error in the log. There are considerations to improve the ability to configure this. We plan to also add the option to just define how much total shared memory should be used (say 1 GB) and let the system allocate it in a (semi-)optimal way on its own. Then there would be no need to specify individual sizes. It is a matter of time and priorities though (it is high on my priority list). Regarding the options:
The extra options will soon disappear in ROS 2 Rolling; they should be derived internally from the QoS settings. If you need more information about this I can elaborate further, but as I said those options will disappear. Finally, I ran into a phenomenon with the ROS 2 executor (which implicitly runs) myself: processing inside the subscription callback is a problem because used memory chunks in iceoryx are not recycled until the ROS executor calls a specific function, which only happens after the subscription callback returns. If in the meantime (expensive computation) many new samples arrive, those chunks are not available for sending new data. I can elaborate more; the user-side solution is not ideal, and ideally the internal queues do all the work with KeepLast. After this is fixed in Rolling (in 2 weeks maybe), using ROS 2 Rolling would be an option as well. Let me know whether you need further assistance.
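To illustrate the mechanism only (not the internal fix mentioned above), here is a minimal user-side sketch, assuming a plain rclcpp node with a placeholder topic and message type: the subscription callback just copies the sample and returns, and the expensive work runs later from a timer, so the chunk backing the sample can be recycled as soon as the callback and the executor's post-processing are done.

```cpp
// Hypothetical sketch: node name, topic and message type are placeholders.
#include <chrono>
#include <deque>
#include <memory>
#include <mutex>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

class DeferredWorkNode : public rclcpp::Node
{
public:
  DeferredWorkNode()
  : rclcpp::Node("deferred_work_node")
  {
    // Cheap callback: copy the payload and return immediately, so the
    // chunk loaned for this sample can be handed back right after the
    // callback and the executor's post-processing.
    sub_ = create_subscription<std_msgs::msg::String>(
      "chatter", rclcpp::QoS(10),
      [this](const std_msgs::msg::String::SharedPtr msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        backlog_.push_back(*msg);
      });

    // The expensive processing happens here, decoupled from the callback.
    timer_ = create_wall_timer(
      std::chrono::milliseconds(100),
      [this]() {
        std::deque<std_msgs::msg::String> work;
        {
          std::lock_guard<std::mutex> lock(mutex_);
          work.swap(backlog_);
        }
        for (const auto & m : work) {
          RCLCPP_INFO(get_logger(), "processing: %s", m.data.c_str());
        }
      });
  }

private:
  rclcpp::Subscription<std_msgs::msg::String>::SharedPtr sub_;
  rclcpp::TimerBase::SharedPtr timer_;
  std::deque<std_msgs::msg::String> backlog_;
  std::mutex mutex_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<DeferredWorkNode>());
  rclcpp::shutdown();
  return 0;
}
```

Whether copying is acceptable depends on the message size; for large messages the proper fix is the internal queue/KeepLast handling mentioned above.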
I really appreciate the elaborate response. These all make sense and I will give the recommended settings a try. While I was typing this, and before making the recommended changes, I got another crash, so here is more info:
ghost@nvidia-desktop:~$ tail -f iox_roudi.log
ghost@nvidia-desktop:~$ sudo systemctl status iox_roudi.service
Nov 11 10:07:58 nvidia-desktop systemd[1]: Starting Ghost ICE ORYX...
@MatthiasKillat @sumanth-nirmal I think this could be the issue we discussed this week. When the reader cache has an overflow, the cleanup callback provided by rmw_cyclone might not properly release the loan.
I am closing this because no new information is forthcoming and it seems this was not an issue with Cyclone but with iceoryx or the ROS 2 RMW. Feel free to open an issue on the relevant project if you still run into issues.
I think this is related to the issue of chunks not being freed correctly in (for purposes of tracking).
On several occasions I have seen iceoryx crash the system due to what look like memory-access or out-of-memory errors.
In the latest test I doubled the default segment.mempool from size = 16448 to 32896 and ran into the error below. The log is from after the system had been running for about 6 hours. The system is an Nvidia Xavier running JetPack 4.4.1 and ROS 2 Galactic.
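For reference, that change presumably corresponds to an entry like this in the RouDi TOML config (the count of 32768 is taken from the error log below; the surrounding layout is assumed):

```toml
# Assumed segment.mempool entry after doubling the payload size
[[segment]]

[[segment.mempool]]
size = 32896   # doubled from the default of 16448
count = 32768  # ChunkCount reported in the error log below
```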
Using default
First basic question: what is the best way to change the default parameters based on our system? Would different parameters help in these cases?
[mav_external_node-6] 2021-11-10 16:56:51.211 [Warning]: ICEORYX error! MEPOO__MEMPOOL_GETCHUNK_POOL_IS_RUNNING_OUT_OF_CHUNKS
[ghost_connector_can-11] Mempool [m_chunkSize = 32936, numberOfChunks = 32768, used_chunks = 32768 ] has no more space left
[ghost_connector_can-11] 2021-11-10 16:56:51.211 [ Error ]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 64
The following mempools are available: MemPool [ ChunkSize = 32936, ChunkPayloadSize = 32896, ChunkCount = 32768 ]
[ghost_connector_can-11] 2021-11-10 16:56:51.211 [Warning]: ICEORYX error! MEPOO__MEMPOOL_GETCHUNK_POOL_IS_RUNNING_OUT_OF_CHUNKS
[component_container-1] Mempool [m_chunkSize = 32936, numberOfChunks = 32768, used_chunks = 32768 ] has no more space left
[component_container-1] 2021-11-10 16:56:51.211 [ Error ]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 64
The following mempools are available: MemPool [ ChunkSize = 32936, ChunkPayloadSize = 32896, ChunkCount = 32768 ]
[component_container-1] 2021-11-10 16:56:51.211 [Warning]: ICEORYX error! MEPOO__MEMPOOL_GETCHUNK_POOL_IS_RUNNING_OUT_OF_CHUNKS
[ros2_mscl_node-2] 2021-11-10 16:56:51.209 [ Error ]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 64
The following mempools are available: MemPool [ ChunkSize = 32936, ChunkPayloadSize = 32896, ChunkCount = 32768 ]
[ros2_mscl_node-2] 2021-11-10 16:56:51.212 [Warning]: ICEORYX error! MEPOO__MEMPOOL_GETCHUNK_POOL_IS_RUNNING_OUT_OF_CHUNKS
[mav_external_node-6] Mempool [m_chunkSize = 32936, numberOfChunks = 32768, used_chunks = 32768 ] has no more space left
[mav_external_node-6] 2021-11-10 16:56:51.212 [ Error ]: MemoryManager: unable to acquire a chunk with a chunk-payload size of 64
The following mempools are available: MemPool [ ChunkSize = 32936, ChunkPayloadSize = 32896, ChunkCount = 32768 ]
[mav_external_node-6] 2021-11-10 16:56:51.212 [Warning]: ICEORYX error! MEPOO__MEMPOOL_GETCHUNK_POOL_IS_RUNNING_OUT_OF_CHUNKS
[ghost_connector_can-11] Mempool [m_chunkSize = 32936, numberOfChunks = 32768, used_chunks = 32768 ] has no more space left
Here are some other examples of system crashes when the segment.mempool size = 16448. Basically, to replicate, I just let the system run and eventually this happens.
ghost@nvidia-desktop:~$ ros2 topic echo /gx5/nav/odom
1636141734.074318 [217] ros2: using network interface eth0 (udp/192.168.168.105) selected arbitrarily from: eth0, eth1, docker0
/home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/shared_memory_object/shared_memory.cpp:149 { bool iox::posix::SharedMemory::open(int, mode_t, uint64_t) } ::: [ 2 ] No such file or directory
/home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/shared_memory_object/shared_memory.cpp:149 { bool iox::posix::SharedMemory::open(int, mode_t, uint64_t) } ::: [ 2 ] No such file or directory
Shared Memory does not exist. Unable to create shared memory with the following properties [ name = /iceoryx_mgmt, access mode = AccessMode::READ_WRITE, ownership = OwnerShip::OPEN_EXISTING, mode = 0000, sizeInBytes = 60138536 ]
Unable to create SharedMemoryObject since we could not acquire a SharedMemory resource
Unable to create a shared memory object with the following properties [ name = /iceoryx_mgmt, sizeInBytes = 60138536, access mode = AccessMode::READ_WRITE, ownership = OwnerShip::OPEN_EXISTING, baseAddressHint = 0, permissions = 0000 ]
2021-11-05 15:48:54.080 [ Error ]: ICEORYX error! POSH__SHM_APP_MAPP_ERR
python3: /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/error_handling/error_handling.cpp:56: static void iox::ErrorHandler::ReactOnErrorLevel(iox::ErrorLevel, const char*): Assertion `false' failed.
Aborted (core dumped)
2021-11-04 16:54:57.696 [Warning]: Error in sending keep alive
[ghost_connector_can-11] internal logic error in unix domain socket "/tmp/roudi" occurred
::posix::IpcChannelError> iox::posix::UnixDomainSocket::timedSend(const string&, const iox::units::Duration&) const } ::: [ 107 ] Transport endpoint is not connected
[static_transform_publisher-12] internal logic error in unix domain socket "/tmp/roudi" occurred
[static_transform_publisher-12] 2021-11-04 16:54:57.712 [Warning]: Error in sending keep alive
[async_mav_comms_node-4] internal logic error in unix domain socket "/tmp/roudi" occurred
[async_mav_comms_node-4] 2021-11-04 16:54:57.728 [Warning]: Error in sending keep alive
[mpc_ros_planner-7] internal logic error in unix domain socket "/tmp/roudi" occurred
[mpc_ros_planner-7] 2021-11-04 16:54:57.732 [Warning]: Error in sending keep alive
[ros2_mscl_node-2] /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/unix_domain_socket.cpp:254 { iox::cxx::expected<iox::posix::IpcC
[ros2_mscl_node-2] internal logic error in unix domain socket "/tmp/roudi" occurred
[ros2_mscl_node-2] 2021-11-04 16:54:57.736 [Warning]: Error in sending keep alive
[gps_to_utm_publisher_node-10] internal logic error in unix domain socket "/tmp/roudi" occurred
[gps_to_utm_publisher_node-10] 2021-11-04 16:54:57.737 [Warning]: Error in sending keep alive
[mission_control_node-9] /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/unix_domain_socket.cpp:254 { iox::cxx::expected<iox::posix :IpcChannelError> iox::posix::UnixDomainSocket::timedSend(const string&, const iox::units::Duration&) const } ::: [ 107 ] Transport endpoint is not connected
[mission_bridge_node-8] /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/unix_domain_socket.cpp:254 { iox::cxx::expected<iox::posix:
[ghost_runner_node-3] /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/unix_domain_socket.cpp:254 { iox::cxx::expected<iox::posix::IpcChannelError> iox::posix::UnixDomainSocket::timedSend(const string&, const iox::units::Duration&) const } ::: [ 107 ] Transport endpoint is not connected
[mav_external_node-6] /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/unix_domain_socket.cpp:254 { iox::cxx::expected<iox::posix::IpcChannelError> iox::posix::UnixDomainSocket::timedSend(const string&, const iox::units::Duration&) const } ::: [ 107 ] Transport endpoint is not connected
[mpc_ros_planner-7] [sdkUpdateA] ERROR!! ERROR!! updateAtime was too long: 4876 out of 2000
[component_container-1] /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/unix_domain_socket.cpp:254 { iox::cxx::expected<iox::posix:
[component_container-1] 2021-11-04 16:54:57.948 [Warning]: Error in sending keep alive
[mpc_ros_planner-7] [sdkUpdateA] ERROR!! ERROR!! updateAtime was too long: 2636 out of 2000
[async_mav_control_node-5] internal logic error in unix domain socket "/tmp/roudi" occurred
[async_mav_control_node-5] 2021-11-04 16:54:57.989 [Warning]: Error in sending keep alive
[ghost_connector_can-11] internal logic error in unix domain socket "/tmp/roudi" occurred
[ghost_connector_can-11] 2021-11-04 16:54:57.995 [Warning]: Error in sending keep alive
internal logic error in unix domain socket "/tmp/roudi" occurred
2021-11-04 16:54:58.003 [Warning]: Error in sending keep alive
[static_transform_publisher-12] internal logic error in unix domain socket "/tmp/roudi" occurred
[static_transform_publisher-12] 2021-11-04 16:54:58.013 [Warning]: Error in sending keep alive
[ghost_connector_can-11] [INFO] [1636059298.016157833] [ghost_connector_can_node]: ghost_bms_interfaces.msg.GhostBMSInfo(state_of_charge=254, temperature=26, voltage=40494, current=-8910, sys_stat=128)
[mpc_ros_planner-7] /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/unix_domain_socket.cpp:254 { iox::cxx::expected<iox::posix::IpcChannelError> iox::posix::UnixDomainSocket::timedSend(const string&, const iox::units::Duration&) const } ::: [ 107 ] Transport endpoint is not connected
[mpc_ros_planner-7] internal logic error in unix domain socket "/tmp/roudi" occurred
[mpc_ros_planner-7] 2021-11-04 16:54:58.033 [Warning]: Error in sending keep alive
[ros2_mscl_node-2] /home/ghost/builds/uWbfTNKP/0/ghostrobotics/autonomy/galactic_ws/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/unix_domain_socket.cpp:254 { iox::cxx::expected<iox::posix::IpcC
[ros2_mscl_node-2] internal logic error in unix domain socket "/tmp/roudi" occurred
:posix::IpcChannelError> iox::posix::UnixDomainSocket::timedSend(const string&, const iox::units::Duration&) const } ::: [ 107 ] Transport endpoint is not connected
[gps_to_utm_publisher_node-10] internal logic error in unix domain socket "/tmp/roudi" occurred
[mission_control_node-9] internal logic error in unix domain socket "/tmp/roudi" occurred
[mission_bridge_node-8] internal logic error in unix domain socket "/tmp/roudi" occurred