New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running with a websockets listener mosquitto triggers a segfault in libwebsockets #278

Open
mtreinish opened this Issue Sep 27, 2016 · 16 comments

Comments

Projects
None yet
4 participants
@mtreinish

mtreinish commented Sep 27, 2016

I'm running mosquitto 1.4.8 (from the ubuntu xenial packages) and running with a websocket listener configured. After establishing a number of connections over the websocket port I eventually hit a segfault on mosquitto. I haven't found the exact trigger yet so I just establish a few hundred websocket subscribers at once and run a couple daemons to publish messages and just wait for it to crash.

From dmesg I get:

mosquitto[15317] general protection ip:7f096c270a29 sp:7ffdac0c6530 error:0 in libwebsockets.so.7[7f096c268000+1f000]

and running it under gdb shows:

#0  0x00007ffff69fa072 in ?? () from /usr/lib/x86_64-linux-gnu/libwebsockets.so.7
#1  0x00007ffff69ee072 in lws_callback_on_writable () from /usr/lib/x86_64-linux-gnu/libwebsockets.so.7
#2  0x0000000000412b74 in _mosquitto_packet_queue (mosq=<optimized out>, packet=<optimized out>) at ../lib/net_mosq.c:169
#3  0x0000000000418d97 in _mosquitto_send_real_publish (mosq=<optimized out>, mid=<optimized out>, topic=<optimized out>, payloadlen=<optimized out>, payload=0x0, qos=<optimized out>, retain=false, dup=false) at ../lib/send_mosq.c:281
#4  0x0000000000418fc3 in _mosquitto_send_publish (mosq=mosq@entry=0xab9530, mid=<optimized out>, topic=0x9412a0 "gerrit/openstack/neutron/ref-replicated", payloadlen=184, payload=0xbb3970, qos=0, retain=false, dup=false) at ../lib/send_mosq.c:176
#5  0x000000000040cc9c in mqtt3_db_message_write (db=0x6268e0 <int_db>, context=0xab9530) at database.c:848
#6  0x0000000000419b04 in _subs_process (db=db@entry=0x6268e0 <int_db>, hier=hier@entry=0x66a5c0, source_id=source_id@entry=0x800760 "paho/34D87591DAA2E0874A", topic=<optimized out>, topic@entry=0x647ee0 "gerrit/openstack/neutron/ref-replicated", qos=qos@entry=0, retain=retain@entry=0, stored=0x7a6bc0, set_retain=false) at subs.c:135
#7  0x0000000000419d23 in _sub_search (db=db@entry=0x6268e0 <int_db>, subhier=subhier@entry=0x66a6e0, tokens=0xaba2a0, source_id=source_id@entry=0x800760 "paho/34D87591DAA2E0874A", topic=topic@entry=0x647ee0 "gerrit/openstack/neutron/ref-replicated", qos=qos@entry=0, retain=0, stored=0x7a6bc0, set_retain=true) at subs.c:422
#8  0x0000000000419c6f in _sub_search (db=db@entry=0x6268e0 <int_db>, subhier=subhier@entry=0x645120, tokens=tokens@entry=0x7d5370, source_id=source_id@entry=0x800760 "paho/34D87591DAA2E0874A", topic=topic@entry=0x647ee0 "gerrit/openstack/neutron/ref-replicated", qos=qos@entry=0, retain=0, stored=0x7a6bc0, set_retain=true) at subs.c:413
#9  0x000000000041a336 in mqtt3_db_messages_queue (db=db@entry=0x6268e0 <int_db>, source_id=0x800760 "paho/34D87591DAA2E0874A", topic=0x647ee0 "gerrit/openstack/neutron/ref-replicated", qos=qos@entry=0, retain=retain@entry=0, stored=stored@entry=0x7fffffffdd10) at subs.c:533
#10 0x0000000000413cad in mqtt3_handle_publish (db=0x6268e0 <int_db>, context=0xa1e630) at read_handle.c:233
#11 0x0000000000412d79 in _mosquitto_packet_read (db=db@entry=0x6268e0 <int_db>, mosq=mosq@entry=0xa1e630) at ../lib/net_mosq.c:1035
#12 0x000000000040e0f9 in loop_handle_reads_writes (pollfds=0x660320, db=0x6268e0 <int_db>) at loop.c:493
#13 mosquitto_main_loop (db=db@entry=0x6268e0 <int_db>, listensock=listensock@entry=0x6455e0, listensock_count=listensock_count@entry=2, listener_max=listener_max@entry=5) at loop.c:326
#14 0x0000000000404ab6 in main (argc=<optimized out>, argv=<optimized out>) at mosquitto.c:382

The libwebsockets version installed is 1.7.1 (also from the ubuntu xenial package)

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Sep 28, 2016

oh, I probably should have looked more carefully at the open bug list. This is likely a dup of #253

mtreinish commented Sep 28, 2016

oh, I probably should have looked more carefully at the open bug list. This is likely a dup of #253

@ralight

This comment has been minimized.

Show comment
Hide comment
@ralight

ralight Sep 28, 2016

Contributor

Possibly... but you are using an up to date version of libwebsockets which makes a difference. Maybe I shouldn't have been dismissive of the problem in #253.

I don't suppose you'd be able to install the debug symbols for libwebsockets as well and run the broker under valgrind:

valgrind --log-file=vglog -v mosquitto -c mosquitto.conf # or whatever your mosquitto command line is

Contributor

ralight commented Sep 28, 2016

Possibly... but you are using an up to date version of libwebsockets which makes a difference. Maybe I shouldn't have been dismissive of the problem in #253.

I don't suppose you'd be able to install the debug symbols for libwebsockets as well and run the broker under valgrind:

valgrind --log-file=vglog -v mosquitto -c mosquitto.conf # or whatever your mosquitto command line is

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Sep 28, 2016

Sure thing, I installed the debug symbols and ran it under valgrind, I've attached the valgrind log file:

vglog.txt

mtreinish commented Sep 28, 2016

Sure thing, I installed the debug symbols and ran it under valgrind, I've attached the valgrind log file:

vglog.txt

@ralight

This comment has been minimized.

Show comment
Hide comment
@ralight

ralight Sep 28, 2016

Contributor

Thanks, from a quick glance that gives a good hint. I'll try and get a proper look at it tonight.

Contributor

ralight commented Sep 28, 2016

Thanks, from a quick glance that gives a good hint. I'll try and get a proper look at it tonight.

@ralight

This comment has been minimized.

Show comment
Hide comment
@ralight

ralight Sep 28, 2016

Contributor

I believe this is failing within libwebsockets, but I don't know the root cause yet. The culprit looks to some http2 support that was added to lws. but it's not entirely clear. I might not be setting some options.

I haven't reproduced it yet, but I imagine I'm just doing things differently.

Would you be willing to try compiling a different version of libwebsockets? If you use 2.x you'll also need to recompile mosquitto.

Contributor

ralight commented Sep 28, 2016

I believe this is failing within libwebsockets, but I don't know the root cause yet. The culprit looks to some http2 support that was added to lws. but it's not entirely clear. I might not be setting some options.

I haven't reproduced it yet, but I imagine I'm just doing things differently.

Would you be willing to try compiling a different version of libwebsockets? If you use 2.x you'll also need to recompile mosquitto.

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Sep 29, 2016

Sure, I'll work on running with a newer libwebsockets built from source tomorrow morning and see if I can reproduce it again.

mtreinish commented Sep 29, 2016

Sure, I'll work on running with a newer libwebsockets built from source tomorrow morning and see if I can reproduce it again.

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Sep 29, 2016

FWIW, if you wanted to mirror my exact setup that I'm reproducing this in, it's pretty easy to mirror my environment. I just spun up a xenial vm and setup mosquitto using:

https://git.openstack.org/cgit/openstack-infra/puppet-mosquitto/

and used a puppet manifest like:

include 'mosquitto'
class {'mosquitto::server':
  infra_service_password => 'good_password',
}

to install and configure mosquitto.

On the host I also run a couple daemons that are constantly publishing messages (germqtt and lpmqtt but those are probably too specific to what I'm working on to be useful here)

then on a different vm or the host I've got a python script:

import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    print("Connected with result code " + str(rc))
    client.subscribe('#')

def on_message(client, userdata, msg):
    print(msg.topic+" "+str(msg.payload))

client = mqtt.Client(transport="websockets")
client.on_connect = on_connect
client.on_message = on_message

client.connect(VM_IPADDR, port=80)
client.loop_forever()

which I launch backgrounded with stdout redirected into /dev/null in a bash for loop with like 500 iterations. Then it's just a matter of waiting and it'll crash eventually

mtreinish commented Sep 29, 2016

FWIW, if you wanted to mirror my exact setup that I'm reproducing this in, it's pretty easy to mirror my environment. I just spun up a xenial vm and setup mosquitto using:

https://git.openstack.org/cgit/openstack-infra/puppet-mosquitto/

and used a puppet manifest like:

include 'mosquitto'
class {'mosquitto::server':
  infra_service_password => 'good_password',
}

to install and configure mosquitto.

On the host I also run a couple daemons that are constantly publishing messages (germqtt and lpmqtt but those are probably too specific to what I'm working on to be useful here)

then on a different vm or the host I've got a python script:

import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    print("Connected with result code " + str(rc))
    client.subscribe('#')

def on_message(client, userdata, msg):
    print(msg.topic+" "+str(msg.payload))

client = mqtt.Client(transport="websockets")
client.on_connect = on_connect
client.on_message = on_message

client.connect(VM_IPADDR, port=80)
client.loop_forever()

which I launch backgrounded with stdout redirected into /dev/null in a bash for loop with like 500 iterations. Then it's just a matter of waiting and it'll crash eventually

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Sep 29, 2016

I've been running mosquitto at v1.4.10 and libwebsockets at v2.0.3 for the past 2.5 hours in my reproduce scenario and I haven't seen any crashes from the segfault. There are a large number of socket errors being logged by the broker on the websockets clients, (that's likely unrelated to this though) but nothing that's bringing down the broker like before. So I'm guessing it's probably an issue with the libwebsockets version packaged by ubuntu in xenial

mtreinish commented Sep 29, 2016

I've been running mosquitto at v1.4.10 and libwebsockets at v2.0.3 for the past 2.5 hours in my reproduce scenario and I haven't seen any crashes from the segfault. There are a large number of socket errors being logged by the broker on the websockets clients, (that's likely unrelated to this though) but nothing that's bringing down the broker like before. So I'm guessing it's probably an issue with the libwebsockets version packaged by ubuntu in xenial

@ralight

This comment has been minimized.

Show comment
Hide comment
@ralight

ralight Sep 29, 2016

Contributor

Ok, that's good at least. I propose that we leave this open for the moment and I'll try and reproduce and narrow the problem down.

Thanks for the code!

Contributor

ralight commented Sep 29, 2016

Ok, that's good at least. I propose that we leave this open for the moment and I'll try and reproduce and narrow the problem down.

Thanks for the code!

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Sep 29, 2016

Actually looking a bit more closely at the socket errors I'm getting on the client connections they seem to happen pretty reliablly on all the websocket clients and just hang the client scripts I'm using.

I think that might be a separate bug that probably has to be dug into. After I do some more digging on it, I'll likely open another bug if I can tie down the cause

mtreinish commented Sep 29, 2016

Actually looking a bit more closely at the socket errors I'm getting on the client connections they seem to happen pretty reliablly on all the websocket clients and just hang the client scripts I'm using.

I think that might be a separate bug that probably has to be dug into. After I do some more digging on it, I'll likely open another bug if I can tie down the cause

@cordovapolymer

This comment has been minimized.

Show comment
Hide comment
@cordovapolymer

cordovapolymer Oct 23, 2016

As I noticed, the segfault occurs when processing very large messages, try with 100KB+ payload.

cordovapolymer commented Oct 23, 2016

As I noticed, the segfault occurs when processing very large messages, try with 100KB+ payload.

@hlandry

This comment has been minimized.

Show comment
Hide comment
@hlandry

hlandry Nov 23, 2016

I have the same problem. We do not send large payload so I don't think it's related to message size.

I use the default ubuntu 16.04 install.

hlandry commented Nov 23, 2016

I have the same problem. We do not send large payload so I don't think it's related to message size.

I use the default ubuntu 16.04 install.

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Apr 17, 2018

So, an update here we recently re-enabled the websockets listener on our broker (after disabling it do to this bug for a long time) without a second websockets tls listener. It ran that way without issue for a couple months. Then yesterday we turned back on the second websockets listener with tls enabled on it. As soon as we did that we hit the same general protection fault as before:

Apr 17 06:25:01 firehose01 kernel: [8333467.773948] traps: mosquitto[14869] general protection ip:7f27420f7072 sp:7ffdfea92490 error:0 in libwebsockets.so.7[7f27420e2000+1f000]

the interesting thing is the timing corresponded exactly to the log message for "Reloading config." (which was also the same time our cron.daily log rotation job is run) which reminds of the problem described in #253

mtreinish commented Apr 17, 2018

So, an update here we recently re-enabled the websockets listener on our broker (after disabling it do to this bug for a long time) without a second websockets tls listener. It ran that way without issue for a couple months. Then yesterday we turned back on the second websockets listener with tls enabled on it. As soon as we did that we hit the same general protection fault as before:

Apr 17 06:25:01 firehose01 kernel: [8333467.773948] traps: mosquitto[14869] general protection ip:7f27420f7072 sp:7ffdfea92490 error:0 in libwebsockets.so.7[7f27420e2000+1f000]

the interesting thing is the timing corresponded exactly to the log message for "Reloading config." (which was also the same time our cron.daily log rotation job is run) which reminds of the problem described in #253

@ralight

This comment has been minimized.

Show comment
Hide comment
@ralight

ralight Apr 18, 2018

Contributor

Is that still with 1.4.8 and default libwebsockets?

Contributor

ralight commented Apr 18, 2018

Is that still with 1.4.8 and default libwebsockets?

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Apr 18, 2018

@ralight we upgraded to using 1.4.15 from the mosquitto ppa after CVE-2017-7651 was disclosed. We didn't do anything special with libwebsockets though, so we're still using the default version in ubuntu 16.04. (which is 1.7.1 iirc)

mtreinish commented Apr 18, 2018

@ralight we upgraded to using 1.4.15 from the mosquitto ppa after CVE-2017-7651 was disclosed. We didn't do anything special with libwebsockets though, so we're still using the default version in ubuntu 16.04. (which is 1.7.1 iirc)

@mtreinish

This comment has been minimized.

Show comment
Hide comment
@mtreinish

mtreinish Apr 24, 2018

This happened again yesterday. We pulled the coredump from the crash file this time and it's definitely failing in the SIGHUP handling during the log rotation. The backtrace from the coredump was:

Core was generated by `/usr/sbin/mosquitto -c /etc/mosquitto/mosquitto.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f27420f7072 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libwebsockets.so.7
(gdb) bt
#0  0x00007f27420f7072 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libwebsockets.so.7
#1  0x00007f27420eb072 in lws_callback_on_writable ()
   from /usr/lib/x86_64-linux-gnu/libwebsockets.so.7
#2  0x000000000040e2e9 in do_disconnect (db=db@entry=0x628960 <int_db>,
    context=context@entry=0x71e760) at loop.c:438
#3  0x00000000004195ba in mosquitto_security_apply_default (
    db=0x628960 <int_db>) at security_default.c:765
#4  0x0000000000417715 in mosquitto_security_apply (
    db=db@entry=0x628960 <int_db>) at security.c:204
#5  0x000000000040f151 in mosquitto_main_loop (db=db@entry=0x628960 <int_db>,
    listensock=listensock@entry=0x6a2ae0,
    listensock_count=listensock_count@entry=4, listener_max=<optimized out>)
    at loop.c:396
#6  0x0000000000404eea in main (argc=<optimized out>, argv=<optimized out>)
    at mosquitto.c:385

We saved the coredump so we can probe it for more info if needed.

Also, comparing this to the original debug info from about 1.5yrs ago this might be a separate (but similar) issue. Should I open this as a separate bug?

mtreinish commented Apr 24, 2018

This happened again yesterday. We pulled the coredump from the crash file this time and it's definitely failing in the SIGHUP handling during the log rotation. The backtrace from the coredump was:

Core was generated by `/usr/sbin/mosquitto -c /etc/mosquitto/mosquitto.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f27420f7072 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libwebsockets.so.7
(gdb) bt
#0  0x00007f27420f7072 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libwebsockets.so.7
#1  0x00007f27420eb072 in lws_callback_on_writable ()
   from /usr/lib/x86_64-linux-gnu/libwebsockets.so.7
#2  0x000000000040e2e9 in do_disconnect (db=db@entry=0x628960 <int_db>,
    context=context@entry=0x71e760) at loop.c:438
#3  0x00000000004195ba in mosquitto_security_apply_default (
    db=0x628960 <int_db>) at security_default.c:765
#4  0x0000000000417715 in mosquitto_security_apply (
    db=db@entry=0x628960 <int_db>) at security.c:204
#5  0x000000000040f151 in mosquitto_main_loop (db=db@entry=0x628960 <int_db>,
    listensock=listensock@entry=0x6a2ae0,
    listensock_count=listensock_count@entry=4, listener_max=<optimized out>)
    at loop.c:396
#6  0x0000000000404eea in main (argc=<optimized out>, argv=<optimized out>)
    at mosquitto.c:385

We saved the coredump so we can probe it for more info if needed.

Also, comparing this to the original debug info from about 1.5yrs ago this might be a separate (but similar) issue. Should I open this as a separate bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment