Shared memory transport design documents <master> [7932] #859

Merged (10 commits, Mar 16, 2020)
@@ -0,0 +1,105 @@
# Interprocess Shared Memory model

## Overview

This document describes a model for interprocess shared memory communication between software components that interchange data messages. The objective of the model is to provide a shared memory based transport layer suitable for a Real-Time Publish-Subscribe DDS (Data Distribution Service) implementation.

## Context

eProsima has received commercial proposals to implement a shared memory transport as an improvement of its Fast-RTPS product. The goals listed in this document are extracted from the customers' proposals.

### Improvements over the standard network transports

* Reduce OS kernel calls: these are unavoidable for transports like UDP / TCP (even over the loopback interface).
* Large message support: network stacks need to fragment large packets.
* Avoid the data serialization / deserialization process: this is not possible in heterogeneous networks, but it is possible in shared memory interprocess communication.
* Reduce memory copies: A buffer can be shared directly with several readers without additional copies.

### Objectives

* Increase the communications performance of processes in the same machine.
**Review comment (Contributor):**

What is the definition of "machine"? Does it include containers that are actually on the same machine?

This is also related to how the transport can detect whether the communication is inter-process or not; it would be good to mention that as well.

**Reply (Contributor):**

It's true that "machine" is not a precise definition. The idea is that shared memory will improve communication between processes capable of sharing local memory, that is, processes running in the same operating system instance on a machine: this includes processes running in the same container or virtual machine, but not across containers or virtual machines.
How could we define this idea well in a phrase?

* Create a portable shared memory library (Windows / Linux / MacOS).
* Documentation.
* Tests.
* Examples / Tutorials.

## Architecture

### Design concepts

* **Segment**: A block of shared memory of fixed size that can be accessed from different processes. Shared memory segments have a global name, so any process that knows the name can open the segment and map it into its address space.

* **SegmentId**: A name that uniquely identifies a shared memory segment; these names are 16-character UUIDs.

* **Shared memory buffer**: A buffer allocated inside a shared memory segment.

* **Buffer descriptor**: Shared memory buffers can be referenced by buffer descriptors. These descriptors act like pointers to a buffer and can be copied between processes at minimal cost. A descriptor contains the SegmentId and the offset of the data from the base of the segment.

* **Shared memory port**: A communication channel identified by a port_id (a uint32_t number). Buffer descriptors are sent to other processes through this channel. The port has a ring-buffer, stored in shared memory, where the descriptors are placed. The same port can be opened by several processes for reading and writing. Multiple listeners can be registered on a port to be notified when descriptors are pushed to the ring-buffer, and multiple data producers can push descriptors to the same port. The port holds an atomic counter with the number of listeners registered on it; each position in the ring-buffer also has a counter, initialized to the number of listeners. As listeners read a descriptor they decrement its counter, so a ring-buffer position is considered free when its counter reaches zero. The port also has an interprocess condition variable where listeners wait for incoming descriptors.

* **Listener**: A listener receives the descriptors pushed to a port. It provides an interface for the application to wait for and access the data referenced by those descriptors. When a consumer pops a descriptor from the port's listener, it looks at the descriptor's SegmentId field to open the origin shared memory segment (if it is not already open in that process); once the origin segment is mapped locally, the consumer can access the data using the offset field contained in the descriptor.

* **SharedMemoryManager**: Applications instantiate this object in order to access the shared memory resources described above. At least one manager per process is required. The manager lets the application create shared memory segments, allocate data buffers in those segments, push buffer descriptors to shared memory ports, and create listeners associated with a port.
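
To make these concepts concrete, here is a minimal C++ sketch of the data structures described above. All type names and field layouts are illustrative assumptions for this document, not the actual Fast-RTPS implementation.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch only; the real Fast-RTPS types and layouts may differ.

// Uniquely identifies a shared memory segment (the document specifies a
// 16-character UUID as the global name).
struct SegmentId
{
    char uuid[16];
};

// A buffer descriptor acts like a cross-process pointer: which segment the
// buffer lives in and its offset from the segment base.
struct BufferDescriptor
{
    SegmentId segment_id;
    uint32_t offset; // offset of the buffer from the segment base
    uint32_t size;   // payload size (assumed field, for completeness)
};

// One cell of a port's ring-buffer. The per-cell counter starts at the number
// of registered listeners; each listener that consumes the descriptor
// decrements it, and the cell is free again when it reaches zero.
struct RingBufferCell
{
    BufferDescriptor descriptor;
    std::atomic<uint32_t> pending_listeners;
};
```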

### Example scenario
![](interprocess_shared_mem1.png)

Let's study the above example. There are three processes, each with its own SharedMemManager; every process creates its own shared memory segment intended to store the buffers that will be shared with other processes.

Processes P1, P2 and P3 are participants in an RTPS-DDS environment. Participant discovery is done using multicast, so shared memory port0 has been selected as the "multicast" port for discovery; therefore, the first thing all participants do is open shared memory port0. Every participant also creates a listener attached to port0 to read the incoming descriptors.

Each participant opens a port to receive unicast messages, ports 1, 2 and 3 respectively, and creates a listener associated with that port.

The first message the participants send is the multicast discovery message: "I'm here, and I am listening on portN". So each one allocates a buffer in its local segment, writes that information to the buffer, and pushes the buffer's descriptor through port0. Observe how port0's ring-buffer stores the descriptors for Data1a (P1), Data2a (P2) and Data3a (P3) after all processes have pushed their discovery descriptors.

After the discovery phase, participants know the other participants and their "unicast" ports, so they can send messages to a specific participant by pushing to that participant's unicast port.

Finally, observe how P1 shares Data1c with P2 and P3: this is done by pushing the buffer descriptor to the P2 and P3 unicast ports. This way one shared memory buffer can be shared with several processes without copying the buffer itself (only the descriptors are copied). This is an important improvement over transports like UDP and TCP.

### Design considerations

* **Minimize global interprocess locks**: Interprocess locks are dangerous because, if one of the involved processes crashes while holding an interprocess lock, all the collaborating processes could be affected. In this design, pushing data to a port and reading data from the port are lock-free operations. For performance reasons, waiting for data on an empty port requires interprocess locking mechanisms such as mutexes and condition variables. This is especially dangerous on multicast ports because, if one listener crashes while waiting, it could block the port for the rest of the listeners. More complex design features could be added to make the library fault-tolerant, but possibly at the cost of losing performance.

* **Scalable number of processes**: Every process creates its own shared memory segments to store the locally generated messages. This is more scalable than having a global segment shared between all the involved processes.

* **Per application / process customizable memory usage**: Again, per-process shared memory segments make it possible to adapt the size of the segments to the requirements of the application. Imagine, for example, an application sending two types of data: video and status info. It could create a big segment to store video frames and a small segment for the status messages.
**Review comment (Contributor):**

Is there already a plan for how the interface to configure the memory space will be provided? An API, or a configuration file to load?

**Reply (Contributor):**

The memory space can be configured through the transport descriptor, either in the application's C++ code or with an external XML file using the fastrtps XML profiles feature.
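
As an illustration of the C++ path, the following sketch shows how such a descriptor could be registered with a participant. It assumes the values listed later in this document (segment_size, port_queue_capacity, max_message_size) are exposed as public members and that the descriptor plugs into the existing userTransports mechanism; the header path, namespace and exact accessors are assumptions and may differ in the final API.

```cpp
// Illustrative sketch only. The SharedMemTransportDescriptor header path,
// namespace and member access are assumptions; the final API may expose these
// values through setters instead of public members.
#include <memory>

#include <fastrtps/attributes/ParticipantAttributes.h>
#include <fastrtps/transport/SharedMemTransportDescriptor.h> // hypothetical header

int main()
{
    auto shm_transport = std::make_shared<eprosima::fastrtps::rtps::SharedMemTransportDescriptor>();

    // Values taken from the "SharedMemTransportDescriptor" list below.
    shm_transport->segment_size = 2 * 1024 * 1024; // 2 MiB per-participant segment
    shm_transport->port_queue_capacity = 128;      // descriptors per port ring-buffer
    shm_transport->max_message_size = 64 * 1024;   // larger payloads are fragmented

    eprosima::fastrtps::ParticipantAttributes attributes;
    attributes.rtps.userTransports.push_back(shm_transport);

    // The participant is then created as usual with these attributes.
    return 0;
}
```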


### Future improvements

* **Fault tolerance**: As stated in the design considerations, the possibility of a process crashing while holding interprocess resources is real. Implementing fault tolerance for these cases is not an easy task. Timeouts, keep-alive checks and garbage collectors are some of the things that could be added to the initial design in order to achieve fault tolerance. This will be considered in future revisions.

### Mapping the design to FastRTPS

#### Transport Layer

* **Locators**: LOCATOR_KIND_SHM is defined to identify shared memory transport endpoints. Locator_t fields are filled in as follows (see the sketch after this list):
  * kind: 16 (in the RTPS vendor-specific range).
  * port: the locator's port contains the shared memory port_id.
  * address: the whole address is set to 0, except for the first byte, which marks unicast (address[0]='U') or multicast (address[0]='M').

* **SharedMemTransportDescriptor**: The following values can be customized:
  * segment_size: size of the shared memory segment reserved by the transport.
  * port_queue_capacity: size, in number of messages, of the shared memory port message queue.
  * port_overflow_policy: DISCARD or FAIL.
  * segment_overflow_policy: DISCARD or FAIL.
  * max_message_size: by default, max_message_size will be segment_size, but it is possible to specify a value <= segment_size. In that case, fragmentation could occur.

* **Default metatraffic multicast locator**: One locator; the port will be selected by the participant (it will be the same as in the RTPS standard for UDP).

* **Default metatraffic unicast locator**: One locator; the port will be selected by the participant (the default port_id will be the same as in the RTPS standard for UDP).

* **Default output locator**: There will be no default output locator.

* **OpenInputChannel**: A SharedMemChannelResource instance will be created. A vector of opened channels is maintained; if the same input locator is opened several times, the channel instance is reused.

* **OpenOutputChannel**: There will be only one SharedMemSenderResource instance per transport instance; the SharedMemSenderResource object will maintain an unordered_map of opened ports in order to match destination locators to shared memory ports.
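
As a minimal illustration of the locator rules above (referenced from the **Locators** item), the following standalone sketch fills a Locator_t-like structure for an SHM unicast endpoint. The struct and constant here are simplified for illustration and are not necessarily the final Fast-RTPS definitions.

```cpp
#include <cstdint>
#include <cstring>

// Vendor-specific RTPS kind value chosen in this document for SHM locators.
constexpr int32_t LOCATOR_KIND_SHM = 16;

// Simplified locator layout: kind, port and a 16-byte address field.
struct Locator_t
{
    int32_t  kind;
    uint32_t port;
    uint8_t  address[16];
};

// Builds a unicast SHM locator for a given shared memory port_id.
Locator_t make_shm_unicast_locator(uint32_t port_id)
{
    Locator_t locator{};
    locator.kind = LOCATOR_KIND_SHM;
    locator.port = port_id;                           // shared memory port_id
    std::memset(locator.address, 0, sizeof(locator.address));
    locator.address[0] = 'U';                         // 'M' would mark multicast
    return locator;
}
```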

## Class design
![](interprocess_shared_mem2.png)

#### RTPS Layer

In FastRTPS, transports are associated with participants, so the new SHM transport can be used by adding an instance of the SharedMemTransportDescriptor class to the participant's list of user transports.
There is therefore one shared memory segment per participant, shared by all of that participant's publishers.

Transport selection: As an RTPSParticipant is able to have multiple transports, a transport selection mechanism is necessary when communicating with other participants reachable by several transports. The defined behaviour here is: if the participants involved are on the same host and both have the SHM transport configured, then the SHM transport is selected for all communication between those participants. A minimal sketch of this rule follows.
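
The sketch below expresses the selection rule using toy structures rather than the real participant proxy data; the field names are assumptions for illustration only.

```cpp
#include <cstdint>

// Toy stand-in for the information the selection rule needs about a participant.
struct ParticipantInfo
{
    uint64_t host_id;           // derived from the participant's GUID prefix
    bool     has_shm_transport; // whether SHM transport is configured
};

// SHM is preferred only when both participants run on the same host and both
// have the shared memory transport configured.
bool use_shm_transport(const ParticipantInfo& local, const ParticipantInfo& remote)
{
    return local.host_id == remote.host_id &&
           local.has_shm_transport && remote.has_shm_transport;
}
```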
@@ -0,0 +1,68 @@
```plantuml
@startuml

hide empty members

!define struct(x) class x << (S,CadetBlue) >>

class SharedMemTransport {
..Channel handling..
+OpenOutputChannel()
+OpenInputChannel()
+Send()
..
-input_channels_
}

struct(SharedMemTransportDescriptor) {
..Interface implementation..
+min_send_buffer_size()
+max_message_size()
..Additional configuration..
+segment_size
+port_queue_capacity
+port_overflow_policy
}

class ChannelResource {

}

class SharedMemChannelResource {
+Receive()
}

class SenderResource {
}

class SharedMemSenderResource {
-locators_ports_map
}

class SharedMemManager {
+create_segment()
+open_port()
}

class SharedMemManager::Port {
+push()
+create_listener()
}

class SharedMemManager::Listener {
+pop()
}

TransportInterface ()-- SharedMemTransport
TransportDescriptorInterface ()-- SharedMemTransportDescriptor
ChannelResource <|-- SharedMemChannelResource
SharedMemTransport o-- SharedMemManager
SharedMemTransport "1" *-- "n" SharedMemChannelResource
SenderResource <|-- SharedMemSenderResource
SharedMemTransport "1" *-- "1" SharedMemSenderResource
SharedMemChannelResource "1" *-- "1" SharedMemManager::Listener : listen on
SharedMemManager::Listener "1" o-- "1" SharedMemManager::Port
SharedMemSenderResource "1" *-- "n" SharedMemManager::Port : send to

@enduml
```
@@ -0,0 +1,127 @@
# Overview

DDSI-RTPS (Real-Time Publish-Subscribe) is the **protocol** that enables support for multiple vendor implementations; RTPS specifies the implementation behaviour, the message packet format and other details as well. This proposal describes how we could integrate **shared memory as a transport** for the RTPS layer.

# Objective

Implement a shared memory transport using the current Fast-RTPS transport API and architecture: the transport will be used like any other transport (including serialization and RTPS encapsulation). This approach is used in other DDS implementations, such as the commercial releases of OpenSplice and RTI Connext DDS. This work is mostly motivated by the ROS2 RMW transport, but it can be used in any DDS application or product.

# Requirements

- Multi-Platform support (Windows, Linux and MacOS) along with ROS2.
- Design and Architecture Description
- Source Code [Fast-RTPS](https://github.com/eProsima/Fast-RTPS)
- User manual and tutorial documents.
- ROS2/RMW compatible interface (do not break ROS2 userland)
- Latency/Bandwidth improvement results.
- Intra/Inter-Process Communication

# Architecture / Design

## Interoperability

There is no guarantee that the whole system is constructed with eProsima Fast-RTPS, so multiple DDS implementations may be used in a distributed system.
To keep interoperability between different vendor DDS implementations, the shared memory transport feature has to be implemented with the following:

- All communications must take place using RTPS Messages.
- All implementations must implement the RTPS Message Receiver.
- Simple Participant and Endpoint Discovery Protocols.

## Discovery

Participant and Endpoint Discovery (PDP and EDP, i.e. metatraffic transmission) will be kept as is in the current code, since this negotiation does not require shared memory (all of the RTPS protocol stays put).

## Memory Management

- Dynamic Configuration

The shared memory transport should be dynamically configured when appropriate, that is, when the reader exists on the same host system as the writer. This is mandatory to avoid unnecessary message passing via shared memory.

This can be detected using the GuidPrefix_t information in the locator, which includes vendor id, host id, process id and participant id.

```
struct RTPS_DllAPI GuidPrefix_t
```
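
A minimal sketch of how same-host detection could use the GuidPrefix_t: the exact byte layout (which octets carry the host id) is an assumption for illustration, taken from the field list above, not a normative definition.

```cpp
#include <cstdint>
#include <cstring>

// Simplified stand-in for the 12-octet RTPS GUID prefix:
// vendor id | host id | process id | participant id.
struct GuidPrefix_t
{
    uint8_t value[12];
};

// Two participants are candidates for shared memory only when the host-id
// portion of their GUID prefixes match (assumed here to be octets 2-3).
bool same_host(const GuidPrefix_t& a, const GuidPrefix_t& b)
{
    return std::memcmp(&a.value[2], &b.value[2], 2) == 0;
}
```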

- Lifecycle

The shared memory lifecycle is managed by the writer, in line with its HistoryCache and, most likely, its QoS settings.

- Owner

Shared memory is owned by the writer. The writer is responsible for managing the shared memory (creation and removal) according to the aforementioned lifecycle.

- Version

Versioning must be used to check if the implementation supports the shared memory feature or not.

## Shared Memory Framework

Since this whole project is targeted to be multi-platform, it is mandatory not to use system-dependent shared memory frameworks or primitives.

- C/C++ implementation
  - Primitive baseline implementation that can be used on every platform.
  - But most parts would need to be implemented on our own.
  - Also a good fit for intra-process use: just use std::shared_ptr.
- POSIX IPC shared memory (shm_open)
  - Only available on Unix systems, so not a good fit for Windows/Mac.
- **boost::interprocess** (a minimal usage sketch follows this list)
  - Definitely multi-platform, with generic and up-to-date interfaces.
  - Generic shared memory interface.
  - Emulation layers available for Windows and System V (XSI) if necessary.
  - Memory-mapped files (this can be useful with ramfs, refreshing state on system restart or reboot).
  - The shared memory range can be truncated dynamically.
  - Container mapping with allocators.
  - File locking (this could be useful to provide exclusive access to the memory-mapped shared memory).
  - The writer has read_write mode and readers only have read_only mode.
  - ***But it cannot be extremely optimized via system calls, such as hugetlb, to reduce TLB misses and page faults.***
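
A minimal boost::interprocess usage sketch (referenced from the list above): a writer creates a named managed segment and places a buffer in it, and a reader opens the same segment by name. Segment and object names are illustrative, not part of any Fast-RTPS design.

```cpp
#include <cstring>
#include <iostream>

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/shared_memory_object.hpp>

namespace bip = boost::interprocess;

int main()
{
    // Writer side: create (or open) a 64 KiB segment and place a named buffer in it.
    bip::managed_shared_memory segment(bip::open_or_create, "demo_segment", 64 * 1024);
    char* buffer = segment.construct<char>("demo_buffer")[256](0);
    std::strcpy(buffer, "hello from the writer process");

    // Reader side (normally a different process): open the segment and look up
    // the buffer by name.
    bip::managed_shared_memory reader(bip::open_only, "demo_segment");
    auto found = reader.find<char>("demo_buffer");
    if (found.first)
    {
        std::cout << found.first << std::endl;
    }

    // Cleanup so repeated runs start from a clean state.
    bip::shared_memory_object::remove("demo_segment");
    return 0;
}
```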


## Event Notification

The writer needs to send a notification right after the shared memory is ready to be read on the reader side, so the reader knows that data is ready to be read out.

- RTPS extensible message
  - Submessages with IDs 0x80 to 0xff (inclusive) are vendor-specific.
  - But we should probably avoid sending notification messages via the network.
- **boost::interprocess** (a minimal sketch follows this list)
  - Condition: after the data is set by the writer, notify the subscribers via a condition variable in shared memory (named after the topic name?).
  - Semaphore: after the data is set by the writer, post to notify the subscribers via a semaphore in shared memory. Can be used to control the number of subscribers that read, via the wait API (named after the topic name?).
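
A minimal sketch of the condition-variable option (referenced from the list above), with the mutex, condition variable and ready flag placed in a boost::interprocess managed segment; object names are illustrative.

```cpp
#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>

namespace bip = boost::interprocess;

// Synchronization state shared between writer and readers.
struct Notification
{
    bip::interprocess_mutex mutex;
    bip::interprocess_condition data_ready;
    bool ready = false;
};

// Writer: signal that data has been placed in shared memory.
void notify_readers(bip::managed_shared_memory& segment)
{
    Notification* n = segment.find_or_construct<Notification>("notification")();
    bip::scoped_lock<bip::interprocess_mutex> lock(n->mutex);
    n->ready = true;
    n->data_ready.notify_all();
}

// Reader: block until the writer signals that data is ready.
void wait_for_data(bip::managed_shared_memory& segment)
{
    Notification* n = segment.find_or_construct<Notification>("notification")();
    bip::scoped_lock<bip::interprocess_mutex> lock(n->mutex);
    while (!n->ready)
    {
        n->data_ready.wait(lock);
    }
}
```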

## Security

- Encryption / Decryption:
  - Shared memory will be used as a transport only, so encrypted data will be stored in shared memory. Therefore, this does not affect anything.
- Access Permission:
  - Can we support access permissions, like file-system based access permissions? This needs to be considered.
  - Since security implementations differ among platforms, boost::interprocess does not try to standardize access permission control; this is the responsibility of the implementation.

## Quality of Service

Since this is just a new implementation of the transport layer for shared memory, where all the rest of the RTPS protocol stays put, we expect full compatibility of configuration with Fast-RTPS.

Nevertheless, this needs to be further considered during the detailed design and implementation.

# Considerations

## Intra Process Communication

There is an intra-process communication feature in the ROS2 rclcpp and rcl layers (basically zero-copy based). This has nothing to do with the DDS implementation, but it needs to be considered to ensure the feature doesn't break when implementing the DDS shared memory transport. Refer to [Intra-Process-Communication](https://index.ros.org//doc/ros2/Tutorials/Intra-Process-Communication/).

Intra-process communication does not support the full QoS specification; e.g. RMW_QOS_POLICY_HISTORY_KEEP_ALL and durabilities other than RMW_QOS_POLICY_DURABILITY_VOLATILE are not supported.

Intra-process communication uses an internally implemented ring buffer: [mapped_ring_buffer.hpp](https://github.com/ros2/rclcpp/blob/master/rclcpp/include/rclcpp/mapped_ring_buffer.hpp).

**This implementation will stay in rclcpp for other DDS implementations, but will basically be deprecated for eProsima Fast-RTPS use cases.**

## Container Boundary

Using containers basically means dividing user space into independent sections, so shared memory should not be used across the container boundary.

This can actually be done just by checking IP addresses to see whether both participants are on the same host. So we don't expect additional requirements; the container just appears as another network interface.

# Reference

- [ROSConJP 2019 Lightning Talk Presentation](https://discourse.ros.org/uploads/short-url/1SbbxgRCiM6NH2BuSCqNAe0aogx.pdf)
- [Boost.Interprocess](https://www.boost.org/doc/libs/1_71_0/doc/html/interprocess.html)