AppSignal NIF is raising a segfault and killing the VM #113
For the past few days, our production application has been silently shutting down without any error logging or reporting.
Looking at the server syslog, I found these entries very close to the times when the application had shut down (it happened twice yesterday):
My conclusion is that the extension is segfaulting at some unknown point (I don't have stacktraces due to how NIFs operate in the BEAM) and shutting down the whole application.
Wouldn't it be safer if the interaction with the extension went through a Port? Ports fit nicely in a supervision tree, and in a case like this a crash would not bring the entire VM down. Given Erlang/Elixir's focus on building reliable applications that can keep running for ages with almost no downtime, I think this is a pretty big deal.
Very sorry to hear this. We'll investigate this thoroughly. Which version of the package were you running, and on which OS? Did this start after upgrading to a newer version? I'm wondering if our recent changes for Alpine Linux support might have something to do with this.
Could you e-mail the full
We'd love to run this as a port, but the way ports work is not suitable for the way our monitoring works. The monitoring relies on multiple C calls during a transaction, which gather information without influencing the host VM at all. Ports require serialisation, which would rule out this approach. You can read some more about this here: http://docs.appsignal.com/elixir/why-nif.html
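To make the serialisation point a bit more concrete, here is a minimal sketch of the framing an external C program would have to do when talking to the VM through a port opened with the `{packet, 2}` option, where every message is preceded by a 2-byte big-endian length. The function names `frame_packet2` and `unframe_packet2` are hypothetical and chosen for illustration; this is not AppSignal's code, only an assumption-laden picture of the per-message copying a port would add compared to a direct NIF call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: frame a payload for a port opened with {packet, 2}.
 * Writes a 2-byte big-endian length header followed by the payload bytes.
 * Returns the total number of bytes written, or 0 if the payload is too
 * large for a 2-byte length or does not fit in the output buffer. */
size_t frame_packet2(const uint8_t *payload, size_t len,
                     uint8_t *out, size_t out_cap)
{
    if (len > 0xFFFF || out_cap < len + 2) {
        return 0;
    }
    out[0] = (uint8_t)(len >> 8);   /* most significant byte first */
    out[1] = (uint8_t)(len & 0xFF);
    memcpy(out + 2, payload, len);  /* the payload is copied on every call */
    return len + 2;
}

/* Hypothetical helper: read one framed message from a buffer.
 * On success, points *payload at the message body and returns its length.
 * Returns -1 if the buffer does not yet hold a complete message. */
int unframe_packet2(const uint8_t *in, size_t in_len,
                    const uint8_t **payload)
{
    if (in_len < 2) {
        return -1;
    }
    size_t len = ((size_t)in[0] << 8) | (size_t)in[1];
    if (in_len < len + 2) {
        return -1;
    }
    *payload = in + 2;
    return (int)len;
}
```

Every measurement taken during a transaction would need a round trip like this through the port, whereas a NIF call passes its arguments directly.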
We are using Ubuntu 14.04.3 LTS and AppSignal 0.11.2
I'm not sure for how long the shutdowns have been happening, but it's probable that it started after this update.
What caused this?
There was a bug in the C source code for the integration which was introduced in the 0.10.0 release (released on January 17th). Besides not sending all available data to AppSignal, this bug caused a segfault in an application and was reported twice, after which we were able to reproduce it under the following circumstances.
What happened was that instead of sending a pointer to data to the agent, the C code sent a struct. The agent interpreted that as a
What did we do to fix it?
We found and fixed the bug on February 7th, and immediately released 0.11.6. To make sure this fix would solve the issue, we ran a benchmark with debug symbols on 0.11.2 to try to reproduce the issue, leading us to verify that this bug was actually causing the segfault.
What will we do to prevent this in the future?