
Fast-DDS fails to transmit messages between containers on same Kubernetes pod [10123] #1633

Closed
stevewolter opened this issue Dec 10, 2020 · 10 comments · Fixed by #1637


stevewolter commented Dec 10, 2020

Setup: We run two processes in two different containers in the same pod in Kubernetes. A Kubernetes pod is a set of Docker containers with the same external IP. Each Kubernetes container has its own PID namespace. Process A in container A is a DDS publisher, process B in container B is a subscriber.

Expected Behavior

Subscriber and publisher should be able to exchange messages.

Current Behavior

Participant matching fails because both participants end up with the same GUID. This can be tracked down to both processes ending up with the same PID, because each container in Kubernetes has its own PID namespace. The GUID is created in FastRTPS from:

  • the MD5 of the host's external network interfaces (same in both processes, because they share a pod)
  • the process ID (same in both processes, because each is PID 2 in its container)
  • the participant ID (same in both processes, because they're assigned sequentially)

The problem is even more insidious when the participant IDs don't happen to match: in that case both FastDDS instances switch to intra-process delivery (because the host MD5 and the process ID match), and no messages are transmitted even though PDP succeeds.
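To make the failure mode concrete, here is a minimal, self-contained C++ sketch (not the Fast-DDS source; host_id_from_interfaces, GuidPrefixLike, and the interface string are invented for illustration) showing how combining the three inputs above yields identical values for two containers in the same pod:

```cpp
// Illustrative sketch only -- NOT the actual Fast-DDS implementation.
// It mimics the three inputs described above to show why two containers
// in the same Kubernetes pod can end up with identical GUID prefixes.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-in for the MD5 over the host's external interfaces:
// every container in the pod sees the same interfaces, so this value matches.
uint32_t host_id_from_interfaces(const std::string& interfaces)
{
    return static_cast<uint32_t>(std::hash<std::string>{}(interfaces));
}

struct GuidPrefixLike
{
    uint32_t host_id;        // same across the pod
    uint32_t process_id;     // e.g. PID 2 inside each container's own PID namespace
    uint32_t participant_id; // assigned sequentially, so often 0 in both

    bool operator==(const GuidPrefixLike& other) const
    {
        return host_id == other.host_id && process_id == other.process_id &&
               participant_id == other.participant_id;
    }
};

int main()
{
    // Both "containers" see the same interfaces and both believe they are PID 2.
    GuidPrefixLike container_a{host_id_from_interfaces("eth0:10.0.0.5"), 2, 0};
    GuidPrefixLike container_b{host_id_from_interfaces("eth0:10.0.0.5"), 2, 0};

    std::cout << (container_a == container_b ? "GUID collision\n" : "GUIDs differ\n");
    return 0;
}
```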

The problem goes away when switching to host PIDs (shareProcessNamespace: true in the pod YAML).

Why this is a problem

Kubernetes and other Docker-based orchestrators are becoming more and more common, and with PID namespaces the PID is no longer a unique identifier within a single kernel.

We fixed the problem internally by replacing GetPID() in RTPSDomain.cpp with a once-per-process call to rand(). I wanted to quickly give you a heads-up and make this issue visible to others. ROS also ran into the same issue (ros2/rmw_fastrtps#349) without fully understanding it.
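For reference, here is a minimal sketch of the kind of internal fix described above, i.e. a once-per-process random value used in place of the PID. This is not the actual change made in RTPSDomain.cpp or in #1637; process_identifier() is a hypothetical helper name, and it uses std::mt19937 seeded from std::random_device rather than rand():

```cpp
// Sketch of the workaround described above: use a once-per-process random
// value instead of the PID when building the GUID. Hypothetical helper name;
// not the actual Fast-DDS patch.
#include <cstdint>
#include <random>

uint32_t process_identifier()
{
    // Initialized exactly once per process (thread-safe since C++11), on first use.
    static const uint32_t id = []() {
        std::random_device rd;                       // non-deterministic seed source
        std::mt19937 gen(rd());
        std::uniform_int_distribution<uint32_t> dist;
        return dist(gen);                            // random 32-bit value
    }();
    return id;
}
```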

@fujitatomoya
Contributor

Ah, I see the reason now. ros2/rmw_fastrtps#349 is actually related to Kubernetes rather than plain containers. I had only verified the plain-container case, which works fine because the MD5 of the host's external network interfaces is different in each container. But once it runs on Kubernetes with containers in the same pod, as you mentioned, the same GUID gets assigned.

@fujitatomoya
Contributor

We've confirmed this problem. @stevewolter, thanks for the heads-up!

@MiguelCompany changed the title to "Fast-DDS fails to transmit messages between containers on same Kubernetes pod [10123]" on Dec 11, 2020
@MiguelCompany self-assigned this on Dec 11, 2020
@MiguelCompany
Member

@stevewolter @fujitatomoya Please check #1637 for a possible solution

@fujitatomoya
Contributor

@MiguelCompany Thanks for the quick response 👍 We will try that out and get back to you.

@fujitatomoya
Contributor

@MiguelCompany

We confirmed this PR works with ros:foxy on Kubernetes (talker and listener in the same pod can communicate).

Note: this PR is based on the master branch; there is a merge conflict against v2.0.x (https://github.com/ros2/ros2/blob/32b29971204aee3b10afdfd4bc9dd4efa708439b/ros2.repos#L38-L41), so we cannot use the v2.0.x branch at the moment.

@stevewolter
Author

Thanks for the super-quick work, looks great!

@MiguelCompany
Member

> We confirmed this PR works with ros:foxy on Kubernetes (talker and listener in the same pod can communicate).

@fujitatomoya Great! Thanks for checking!

> Note: this PR is based on the master branch; there is a merge conflict against v2.0.x (https://github.com/ros2/ros2/blob/32b29971204aee3b10afdfd4bc9dd4efa708439b/ros2.repos#L38-L41), so we cannot use the v2.0.x branch at the moment.

Yeah, we would have to backport the changes to 2.0.x, but I cannot give you an ETA now.

@MiguelCompany
Member

I will keep this open till we have the backport in place.

@MiguelCompany reopened this on Dec 15, 2020
@MiguelCompany
Member

Closing via #1643 and #1648

@fujitatomoya
Contributor

thanks for the effort @MiguelCompany 👍
