Mosquitto 1.5.1 becomes unresponsive after max open files is exceeded #948

Closed
wiebeytec opened this Issue Sep 4, 2018 · 18 comments

wiebeytec commented Sep 4, 2018

Mosquitto 1.4.15 would log something like 'access denied by tcpd' when you exceeded the maximum number of open files (ulimit -n). Mosquitto 1.5(.1) seems to have the same limit, but no such log message appears. Instead, it jumps to 100% CPU and messages stop getting through. The threshold is quite sudden: from almost no CPU to a lot. Is this related to the EPOLL change?

I used this test script:

#!/bin/bash

# Number of parallel mosquitto_sub clients to start.
SUBSCRIBERS=1500

[ -z "$wiebepassword" ] && echo "set variable wiebepassword" && exit 1

set -u

echo "PID: $$"

declare -a pids

# Kill every subscriber we started.
kill_all_children()
{
  for p in "${pids[@]}"; do
    kill -SIGTERM "$p"
  done
}

# Install the trap after the handler is defined, so an early ctrl-c still works.
trap kill_all_children SIGINT SIGTERM

for i in $(seq 1 "$SUBSCRIBERS"); do
  echo "Starting subscriber $i"
  timeout 300 mosquitto_sub -v -h mqtt.myserver.com -u wiebe@example.nl -P "$wiebepassword" --port 18883 --cafile /home/wiebe/mqtt/ca.crt -t "N/$RANDOM/#" > /dev/null & pid=$!
  pids+=("$pid")
  #sleep 0.03
done

echo "$SUBSCRIBERS mosquittos started. Waiting for ctrl-c..."
wait

Without publishing anything, you can already see the jump to 100% CPU. If you start only 950 subscribers (below the default file limit of 1024), there is very little CPU usage.

To also test publishing (just an example):

mosquitto_pub -h mqtt.myserver.com -u 'wiebe@example.nl' -P "$mqttpassword" --port 18883 --cafile ca.crt -t 'N/905932684246/in' -m "message"

(my user is admin, so I see all publishes)

Open file limit exhaustion may need to be handled more gracefully, but the ulimit may also need to be set better. Perhaps the process could raise it (based on a config option) before switching to the mosquitto user? For now I had to hack the init script (also for 1.4.15), and set it in Supervisor as well (I run two daemons).
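Roughly what I mean, as a hypothetical sketch (not existing Mosquitto code; 'max_open_files' is an assumed config value): raise RLIMIT_NOFILE with setrlimit() while still privileged, then drop to the unprivileged user:

/* Hypothetical sketch: raise the fd limit before dropping privileges.
 * "max_open_files" is an assumed config value, not an existing Mosquitto option. */
#include <pwd.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

static int raise_fd_limit_and_drop_privs(rlim_t max_open_files, const char *user)
{
    struct rlimit rl = { .rlim_cur = max_open_files, .rlim_max = max_open_files };
    struct passwd *pw;

    /* Must happen while still root (or within the existing hard limit). */
    if(setrlimit(RLIMIT_NOFILE, &rl) != 0){
        perror("setrlimit(RLIMIT_NOFILE)");
        return 1;
    }

    /* Then switch to the unprivileged user, e.g. "mosquitto". */
    pw = getpwnam(user);
    if(!pw || setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0){
        perror("dropping privileges");
        return 1;
    }
    return 0;
}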

toast-uz commented Sep 6, 2018

Try compiling with WITH_EPOLL=no to help clarify the cause.
Or could the max_connections directive in the conf be a remedy for you?

wiebeytec commented Sep 7, 2018

I expected to get the old behaviour back without EPOLL ('access denied by tcpd'), but the behaviour is actually the same: it goes to 100% CPU when the max open files limit is exceeded. Probably error handling is missing when opening a socket, and it keeps retrying?

My point about a config directive is that Mosquitto needs to be aware of the max open files limit, so it can give useful log messages and instructions/documentation to the user.

Take Nginx, which really has been engineered for this: it has worker_rlimit_nofile, and distro packaging provides files like /etc/default/nginx to set ulimit -n before the daemon is started.

toast-uz commented Sep 8, 2018

Thanks. Understood that EPOLL is not the cause.
But I still don't understand why max_connections wouldn't be a remedy for you. I think max_connections would keep your system safe under your test scenario. Have I misunderstood?

toast-uz commented Sep 8, 2018

P.S.
Nginx has both worker_rlimit_nofile and worker_connections, maybe because of the more complicated features of Nginx and the HTTP protocol. In fact the recommendation is to set worker_rlimit_nofile to two or three times worker_connections. Because the features of Mosquitto and the MQTT protocol are simpler, I guess max_connections by itself would be a sufficient solution.

wiebeytec commented Sep 8, 2018

Adding max_connections doesn't solve anything in and of itself. If you set it to 10000, you'll still hit the 100% CPU issue if you don't also raise ulimit -n. It needs some additional logic: failing to start if max_connections is set higher than ulimit -n, useful log messages, etc.
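Something along these lines, as a hypothetical sketch (not existing Mosquitto code; the headroom of 20 fds is my own guess):

/* Hypothetical startup sanity check: warn when the configured connection
 * limit cannot fit within the process fd limit. */
#include <stdio.h>
#include <sys/resource.h>

static int check_connection_limit(long max_connections)
{
    struct rlimit rl;

    if(getrlimit(RLIMIT_NOFILE, &rl) != 0){
        perror("getrlimit(RLIMIT_NOFILE)");
        return 0;
    }
    /* Leave some headroom for listeners, log files, persistence, etc. */
    if(max_connections > 0 && (rlim_t)max_connections + 20 > rl.rlim_cur){
        fprintf(stderr,
            "Warning: max_connections (%ld) does not fit within the open file limit (%llu). "
            "Raise 'ulimit -n' or lower max_connections.\n",
            max_connections, (unsigned long long)rl.rlim_cur);
        return 0;
    }
    return 1;
}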

karlp commented Sep 8, 2018

I agree that Mosquitto should give you a better error message if it can detect that it has run out of file descriptors, but I don't think it's any of Mosquitto's business to go around validating ulimits against its own connection settings. Mosquitto should limit itself to what it's been configured to do; what the external container rules are is none of its concern.

Certainly, simply disabling ulimits wholesale in init scripts goes against the entire purpose of ulimits.

toast-uz commented Sep 8, 2018

I am on @karlp's side here. Mosquitto's behaviour after hitting the max open files limit is an issue, but a minor one, because setting max_connections properly is the remedy for @wiebeytec's concern.

wiebeytec commented Sep 8, 2018

> Certainly, simply disabling ulimits wholesale in init scripts goes against the entire purpose of ulimits.

I disagree. What if you want to serve 2000 clients with the default Mosquitto init script? You simply can't without hacking it. Setting the ulimit any other way is next to impossible for a daemon, because limits are set at login. That's why the Nginx packaging in Ubuntu/Debian provides /etc/default/nginx with a way to make the init script set the ulimit.

toast-uz commented Sep 9, 2018

@wiebeytec, could you clarify which of these you are asking for in this issue now?

  1. A bug report about the behaviour after the max open files limit is hit. That was your original point; we keep the subject of this issue as it is.
  2. A feature request for adding an option such as max_openfiles. That is your current point; we change the subject of this issue.

I will then add labels to this issue, although I'm afraid I cannot promise a fix.

karlp commented Sep 9, 2018

@wiebeytec how to customize init scripts is simply out of scope. By all means distro packaging should make it easy to set ulimits, but Mosquitto shouldn't be trying to circumvent distro mechanisms for setting ulimits (and cgroups and all the other system-wide limits and restrictions).

wiebeytec commented Sep 11, 2018

@toast-uz the bug is that Mosquitto spins at 100% CPU when you exceed your open file limit. It should detect this and present the user with log messages; otherwise it's mysterious behaviour. Whether to also implement the max_connections checks, or treat that as a separate feature request, I'll leave to you, but to me it's part of the same issue.

As for the ulimits: the Mosquitto debs do provide an init script, which is where the ulimit needs to be set, based on something configurable (like /etc/default/nginx). This is how the Nginx init scripts do it too. This is not circumventing distro mechanisms; it is the only way, because the init script is provided by the deb. As I said before, you simply cannot run more than 1024 clients (the default limit in most cases) if the Mosquitto init script doesn't facilitate this (aside from hacking it, or running it under Supervisor instead, or something like that).

karlp commented Sep 12, 2018

@wiebeytec what I mean is: take that up with Debian packaging; it's out of scope for Mosquitto itself. Of course your distro should let you change these settings, I'm certainly not saying otherwise.

wiebeytec commented Sep 12, 2018

But the deb, and its init scripts, are provided by the Mosquitto authors here, albeit somewhat outdated at the moment.

toast-uz commented Sep 13, 2018

@wiebeytec, thanks. I've added the labels.

ralight added this to the 1.5.2 milestone Sep 18, 2018

ralight commented Sep 18, 2018

It's interesting that there is this difference in behaviour. The difference you're seeing comes down to compiling with WITH_WRAP=yes or no; the default straight from the repo/tarball is no. When using WITH_WRAP=no, a new incoming connection proceeds like this:

  • The socket is accepted with accept(). This succeeds unless there are no more fds available, in which case it fails.
  • If it fails, we return an error, but the new connection is still on the listening socket queue, so we very quickly loop back round to try again, which fails again and again, hence the high CPU usage.

If we have WITH_WRAP=yes, connections look like this:

  • The socket is accepted with accept(), and this succeeds (all other errors ignored).
  • We then attempt a tcp-wrappers check. This fails if the last accept() call took the last available fd. If it fails, we close the new socket and return with an error. The same situation as above, but with the important difference that the incoming connection has been removed from the listener queue.

I've replicated this behaviour by keeping a spare socket around in case of fd exhaustion. If I get EMFILE when trying to accept(), then I close the spare socket, accept the incoming connection, close the incoming connection, then reopen my spare socket ready for the next time.
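In outline, that spare-fd approach looks something like this (a simplified sketch of the technique, not the exact code from the fix):

/* Simplified sketch of the spare-fd trick. 'spare_fd' is reserved at
 * startup, e.g. spare_fd = open("/dev/null", O_RDONLY); */
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

static int spare_fd = -1;

int accept_with_emfile_guard(int listensock)
{
    int new_sock = accept(listensock, NULL, NULL);

    if(new_sock == -1 && (errno == EMFILE || errno == ENFILE)){
        /* Out of fds: free the spare one, accept the pending connection so it
         * is removed from the listen queue, then immediately drop it. */
        close(spare_fd);
        new_sock = accept(listensock, NULL, NULL);
        if(new_sock != -1){
            close(new_sock);
        }
        /* Re-reserve the spare fd for next time. */
        spare_fd = open("/dev/null", O_RDONLY);
        return -1; /* connection refused due to fd exhaustion */
    }
    return new_sock;
}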

ralight commented Sep 18, 2018

Could you take a look and close this if you're happy?

ralight added a commit that referenced this issue Sep 18, 2018

ralight closed this in ee8e20d Sep 19, 2018

wiebeytec commented Sep 19, 2018

So the builds provided by the Mosquitto deb repo do have tcp wrapping? What is the reasoning behind that?

I do know that 1.5.1 (built from source) has significantly higher performance than the pre-built 1.4 release. Connecting thousands of idle clients took 50% CPU with Mosquitto 1.4; with 1.5.1 it's about 5%. EPOLL is faster, obviously, but is TCP wrapping likely also detrimental to performance?

ralight commented Sep 20, 2018

The tcp wrappers library is just a different mechanism for allowing/denying access with /etc/hosts.allow and /etc/hosts.deny. It only comes into play when a new connection is made, so unless you have a high reconnection rate it shouldn't have any impact on performance.
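For reference, a typical libwrap check after accept() looks roughly like this (a generic sketch, not necessarily Mosquitto's exact code; link with -lwrap, and the daemon name "mosquitto" is assumed):

/* Generic sketch of a tcp-wrappers (libwrap) check after accept(). */
#include <netdb.h>
#include <syslog.h>
#include <sys/socket.h>
#include <tcpd.h>
#include <unistd.h>

/* libwrap expects the application to define these. */
int allow_severity = LOG_INFO;
int deny_severity = LOG_WARNING;

int accept_with_wrap_check(int listensock)
{
    struct sockaddr_storage addr;
    socklen_t addrlen = sizeof(addr);
    char host[NI_MAXHOST];
    int new_sock = accept(listensock, (struct sockaddr *)&addr, &addrlen);

    if(new_sock == -1){
        return -1;
    }
    /* Get the peer address, then consult /etc/hosts.allow and /etc/hosts.deny. */
    if(getnameinfo((struct sockaddr *)&addr, addrlen, host, sizeof(host),
                   NULL, 0, NI_NUMERICHOST) != 0
            || !hosts_ctl("mosquitto", STRING_UNKNOWN, host, STRING_UNKNOWN)){
        close(new_sock); /* denied by tcpd, or address lookup failed */
        return -1;
    }
    return new_sock;
}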
