@lukebakken
Collaborator

@lukebakken lukebakken commented Nov 28, 2025

The current Dockerfile / docker-entrypoint.sh results in the following PIDs within a running RabbitMQ container:

PID CMD
  1 /bin/sh /opt/rabbitmq/sbin/rabbitmq-server
 20 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/beam.smp -W w ...
 26 erl_child_setup 1024
 65 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/inet_gethost 4
 66 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/inet_gethost 4
 76 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/epmd -daemon
121 /bin/sh -s rabbit_disk_monitor

Note that the rabbitmq-server script remains running and is PID 1.

There was a long discussion about what effects this particular setup could have in heavily loaded k8s environments. This prompted me to look at the rabbitmq-server script, where I found that its behavior can be controlled via several environment variables:

https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/scripts/rabbitmq-server#L97

Most notably, if RUNNING_UNDER_SYSTEMD is set to a value, the script will exec the Erlang VM. This PR sets that value, which results in the following PIDs in a container:

PID CMD
  1 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/beam.smp -W w ...
 25 erl_child_setup 1024
 64 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/inet_gethost 4
 65 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/inet_gethost 4
 75 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/epmd -daemon
120 /bin/sh -s rabbit_disk_monitor

The Erlang VM already gracefully stops RabbitMQ on SIGTERM, so there is no change in behavior.
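
To illustrate why `exec` changes which process ends up as PID 1, here is a minimal shell sketch. The function names are hypothetical and the inner `sh -c` merely stands in for the Erlang VM; this is not the actual rabbitmq-server script, only a demonstration of fork versus exec.

```shell
#!/bin/sh
# Hypothetical demo, not the real rabbitmq-server script: the inner
# `sh -c` stands in for the Erlang VM.

# Wrapper starts the child normally, so the child gets a new PID.
# The trailing `:` keeps the shell from optimizing the child into an exec.
run_without_exec() {
    sh -c 'echo "$$"; sh -c "echo \$\$"; :'
}

# Wrapper execs the child, so the child takes over the wrapper's PID.
run_with_exec() {
    sh -c 'echo "$$"; exec sh -c "echo \$\$"'
}

echo "without exec (wrapper PID, then child PID):"
run_without_exec
echo "with exec (wrapper PID, then child PID):"
run_with_exec
```

In the first case the two PIDs differ and the wrapper stays alive as the parent; in the second they are identical because the wrapper has been replaced. Inside a container, the same mechanism would be what lets beam.smp take over PID 1 from the script.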

Collaborator

@michaelklishin michaelklishin left a comment

RabbitMQ server script PID being 1 can lead to zombie processes, and RabbitMQ certainly can and should run under a regular PID.

@lukebakken lukebakken marked this pull request as draft November 28, 2025 21:32
@lukebakken
Collaborator Author

lukebakken commented Nov 28, 2025

Thanks @michaelklishin. I marked this as a draft so I'd have some time to dig into that $detached variable in the rabbitmq-server script. Nothing actually sets it in the script so I'm just wondering about its provenance.

UPDATE: this commit removed code that used to detect if -detached was passed.

@lukebakken lukebakken force-pushed the lukebakken/use-detached branch from 77e86f8 to 9143b5d Compare November 28, 2025 22:33
@lukebakken lukebakken changed the title Set detached=true so beam.smp is PID 1 Set RUNNING_UNDER_SYSTEMD=true so beam.smp is PID 1 Nov 28, 2025
@lukebakken lukebakken marked this pull request as ready for review November 28, 2025 22:36
@lukebakken
Collaborator Author

Actually, I think exporting RUNNING_UNDER_SYSTEMD=true makes more sense:

  • It's an all-caps variable, which usually indicates it comes from the environment.
  • It has been part of the rabbitmq-server script as long as detached has.

See this related PR - rabbitmq/rabbitmq-server#15036
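
As a rough sketch of the kind of branch being discussed, the snippet below gates `exec` on `RUNNING_UNDER_SYSTEMD`. The `launch` function is a hypothetical stand-in written for illustration; the real rabbitmq-server script differs in its details, and the fallback path here (background child plus `trap` and `wait`) is an assumption about what a non-exec wrapper typically does.

```shell
#!/bin/sh
# Hypothetical stand-in for the branch in the rabbitmq-server script.
# Usage: launch CMD [ARGS...]
launch() {
    if [ -n "${RUNNING_UNDER_SYSTEMD:-}" ]; then
        # Replace the current shell with the command; nothing after
        # this line runs in the wrapper.
        exec "$@"
    else
        # Assumed fallback: keep the wrapper alive, forward signals,
        # and wait for the child to finish.
        "$@" &
        child=$!
        trap 'kill -TERM "$child" 2>/dev/null' TERM INT
        wait "$child"
    fi
}

# Each demo runs in a subshell so that exec does not replace this script.
( RUNNING_UNDER_SYSTEMD=true; launch echo "vm running"; echo "wrapper still alive" )
( unset RUNNING_UNDER_SYSTEMD; launch echo "vm running"; echo "wrapper still alive" )
```

With the variable set, only the command's output appears because the wrapper no longer exists to print its own line. Without it, the wrapper outlives the command, which is the window the Kubernetes edge case depends on.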

@michaelklishin
Collaborator

The rabbitmq-server part of the proposed changes was merged.

@Zerpet

Zerpet commented Dec 1, 2025

> RabbitMQ server script PID being 1 can lead to zombie processes, and RabbitMQ certainly can and should run under a regular PID.

This should not be a concern in containers, thanks to PID namespaces. Quoting from the Linux man page pid_namespaces(7):

> If the "init" process of a PID namespace terminates, the kernel
> terminates all of the processes in the namespace via a SIGKILL
> signal.

The change proposed by Luke aims to avoid an edge case in Kubernetes, where the Erlang VM has already exited but the shell process hasn't, and there's a delay before the shell process exits. This causes the kubelet to report the container as "still running", which then translates to a Pod "stuck" in the Terminating state.

Correct me if I'm wrong, but I believe that Erlang VM reaps its own zombies, at least to some extent.

@michaelklishin
Collaborator

Erlang/OTP largely avoids zombie child processes thanks to a dedicated process used to start such children: erl_child_setup.

In addition, the BEAM VM is PID 1 aware to at least some extent and sets SIGCHLD when running as PID 1.

@michaelklishin
Collaborator

@lukebakken I asked around, and FWIW some core team members at VMware find this to be a reasonable change for an OCI image.

Most do not have an opinion ;)

@lukebakken lukebakken force-pushed the lukebakken/use-detached branch from 9143b5d to d13ebdc Compare December 1, 2025 16:45
The current Dockerfile / docker-entrypoint.sh results in the following
PIDs within a running RabbitMQ container:

```
PID CMD
  1 /bin/sh /opt/rabbitmq/sbin/rabbitmq-server
 20 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/beam.smp -W w ...
 26 erl_child_setup 1024
 65 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/inet_gethost 4
 66 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/inet_gethost 4
 76 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/epmd -daemon
121 /bin/sh -s rabbit_disk_monitor
```

Note that the `rabbitmq-server` script remains running and is PID 1.

There was a [long discussion](rabbitmq/cluster-operator#2012)
about what effects this particular setup could have in heavily loaded
k8s environments. This prompted me to look at the `rabbitmq-server`
script, where I found that its behavior can be controlled via several
environment variables:

https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/scripts/rabbitmq-server#L97

Most notably, if `RUNNING_UNDER_SYSTEMD` is set to a value, the script
will `exec` the Erlang VM. This PR sets that value, which results in the
following PIDs in a container:

```
PID CMD
  1 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/beam.smp -W w ...
 25 erl_child_setup 1024
 64 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/inet_gethost 4
 65 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/inet_gethost 4
 75 /opt/erlang/lib/erlang/erts-15.2.7.4/bin/epmd -daemon
120 /bin/sh -s rabbit_disk_monitor
```

The Erlang VM already gracefully stops RabbitMQ on `SIGTERM`, so there
is no change in behavior.
@lukebakken lukebakken force-pushed the lukebakken/use-detached branch from d13ebdc to 72351b5 Compare December 1, 2025 16:46